By thinkmelt in cuda — 31 May 2023

0x002C - CUDA - Clion Development / Stepping off C.

CUDA - Clion Development / Stepping off C.

CLion is an incredibly powerful C development IDE, but it only does some integration with Cuda:

Straight C code is steppable and inspectable.
A "__global__" directive runs on the device (gpu card) and is callable from the host.
A "__device__" directive runs on the device (gpu card) and is callable from the device.
A "__host__" directive runs on the host - which you can simply leave out and treat as straight C code.
CLion cannot step inside the gpu code side - once the program exectuion gets there we rely on copious printf to determine what is happening.

Starting from this excellent guide - Consider the following simple code example of saxpy. Because it has the "__global__" it runs on the GPU

// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

It would be called with following example code template:

VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

Inside the VecAdd function - we have three float arrays float* A, float* B, and float* C.

These are "__device__" side pointers that are created in the cudaMalloc calls in the main code block.
They are always loaded prior to 'warp activation' - defined as the activation of a contiguous block of threads.
Device limitations limit simultaneous threads in a warp to 1024. Typically it is 32. Block sizes are also usually 32. This would activate 1024 threads simultaneously - and if you had a 3060ti capable of launching 4864 simultaneous cores you could run a block x thread setup of << 76, 64 >>
Block counts determine the number of warps.
Dimensional size is determined in the <<<Block_Size, Thread_Size>>> determination.

Using the 3060ti as the benchmark (what I am working in) we have these specs

Shader Units / Cuda Cores: 4864
SM Count: 38 (128 Cores per SM)
16.2 Teraflops of compute power.
Amazingly this is based on the GA104 which some reports put at 7680 cores, with 2860 cores disabled, and other references to 6144 cores with 1280 cores disabled. Marketing or mask yield?

Core GPU computing is more limited as per wikipedia:

This is effectively how it works: The GA100 is allowed 65536 FP-32 (Floating Point 32-bit) variables per SMP. If we want to launch 1024 cores successfully on a single SMP - then we only want to allow 64 floating point variables per core - limited)

Auditing your code for register counts?

To see how many registers the nvcc / ptx assembler used in the compiling of your code add the -Xptxas=“-v” option to nvcc.
This would have to be done separately at the command line outside CLion.

0x002C - CUDA - Clion Development / Stepping off C.

0x002B - Cuda - Clion Setup [PASSING but why?]

0x002D CUDA - Market Simulation Building (The Basics)