The CUDA programming model organizes parallel computation through a hierarchy: Grids, Thread Blocks, and Threads. This structure allows developers to manage and scale parallelism effectively on the GPU.
Grid: A kernel is launched as a grid of thread blocks. A grid can be 1D, 2D, or 3D, allowing a natural mapping of computation to the problem's data structure (e.g., a 2D grid for image processing). All blocks in a grid run the same kernel code.
Thread Block: A thread block is a group of threads that execute concurrently on the same Streaming Multiprocessor (SM). Threads within a block can cooperate by sharing data via fast on-chip shared memory and can synchronize their execution using barriers like __syncthreads(). Blocks can also be 1D, 2D, or 3D.
Thread: The fundamental unit of parallel execution. Each thread executes an instance of the kernel function. Threads are identified within their block by a unique threadIdx (which can be 1D, 2D, or 3D). A global thread ID can be computed using threadIdx and blockIdx (the block's unique ID within the grid).
Visualizing a 2D Grid of 2D Blocks:
The diagram below illustrates how a 2D grid is composed of 2D blocks, and how threads within a block are indexed. This is analogous to Figure 3 in the NVIDIA "CUDA Refresher" blog post.
Grid (e.g., gridDim(3,2)):

```
+----------------+----------------+----------------+
| blockIdx (0,0) | blockIdx (1,0) | blockIdx (2,0) |
+----------------+----------------+----------------+
| blockIdx (0,1) | blockIdx (1,1) | blockIdx (2,1) |
+----------------+----------------+----------------+
```

Inside one block, e.g. blockIdx (1,0), with blockDim(2,2):

```
+--------+--------+
| t(0,0) | t(1,0) |
+--------+--------+
| t(0,1) | t(1,1) |
+--------+--------+
```

Each `blockIdx(x,y)` is a thread block. The expanded block shows its threads `t(x,y)`, where t(x,y) represents `threadIdx(x,y)`.
Developers use CUDA C/C++ to define kernels and launch them. The `dim3` type is crucial for specifying the dimensions of grids and blocks.
Kernel Definition: A C function executed on the GPU, marked with `__global__`.
```cuda
__global__ void myVectorAdd(float* A, float* B, float* C, int N) {
    // Each thread computes its global index and handles one element.
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}
```
Execution Configuration: When launching a kernel, you specify the grid and block dimensions between triple angle brackets: `kernel<<<gridDim, blockDim>>>(args)`.
```cuda
int N = 1024 * 1024;
int threadsPerBlock = 256;
// Round up so every element is covered even if N is not a multiple of the block size.
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
dim3 dimGrid(blocksPerGrid);
dim3 dimBlock(threadsPerBlock);
myVectorAdd<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, N);
```
Built-in Variables: Inside the kernel, threads use variables like:
- `threadIdx` (.x, .y, .z): Thread index within its block.
- `blockIdx` (.x, .y, .z): Block index within the grid.
- `blockDim` (.x, .y, .z): Dimensions of the thread block.
- `gridDim` (.x, .y, .z): Dimensions of the grid.
Warp: The hardware groups threads within a block into Warps of 32 threads. A warp is the fundamental unit of scheduling and execution on an SM. For instance, a block of 256 threads is divided into $256 / 32 = 8$ warps.
SIMT (Single Instruction, Multiple Thread): All 32 threads in a warp execute the same instruction at the same time, but on different data. This is highly efficient if all threads follow the same execution path.
Warp Divergence: If threads within a warp encounter a conditional statement (e.g., `if-else`) and take different paths, the paths are executed serially for that warp. Some threads will be temporarily idle while others execute their path. This can significantly impact performance. Minimizing divergence is a key optimization.
Choosing a block size that is a multiple of 32 (the warp size) is crucial. If it is not, the last warp is only partially populated: the hardware still schedules and allocates resources for a full 32-lane warp, so the unused lanes are pure overhead.
SM Hardware Constraints:
Each Streaming Multiprocessor (SM) has fixed limits that determine occupancy:
- 64 warps maximum (2048 threads) per SM on datacenter GPUs (A100/H100/B100)
- 32 thread blocks maximum per SM
- 4 warp schedulers per SM, each capable of issuing instructions independently
The actual occupancy depends on which limit (warps, blocks, registers, or shared memory) is hit first.
Block Size vs. SM Limits Examples:
The table below shows how block size affects the maximum number of concurrent blocks per SM, constrained by both the warp limit (64) and the block limit (32).
| Block Size | Warps/Block | Max Blocks (warp limit) | Max Blocks (block limit) | Actual Max Blocks |
|---|---|---|---|---|
| 32 threads | 1 | 64 | 32 | 32 (block-limited) |
| 64 threads | 2 | 32 | 32 | 32 (both limits) |
| 128 threads | 4 | 16 | 32 | 16 (warp-limited) |
| 256 threads | 8 | 8 | 32 | 8 (warp-limited) |
| 512 threads | 16 | 4 | 32 | 4 (warp-limited) |
| 1024 threads | 32 | 2 | 32 | 2 (warp-limited) |
Note: at 64 threads per block, both limits are reached simultaneously; it is the smallest block size that can still fill all 64 warps. With 32-thread blocks, the 32-block limit caps the SM at 32 resident warps, i.e. only 50% occupancy.
CUDA programming involves managing memory on both the CPU (host) and the GPU (device). Data must be transferred to the GPU for processing and results transferred back.
Typical CUDA Application Flow:
1. Allocate Host Memory (CPU RAM) for input/output data.
↓
2. Allocate Device Memory (GPU VRAM) using cudaMalloc().
↓ (Host to Device)
3. Copy Input Data from Host to Device using cudaMemcpy().
↓
4. CPU Launches Kernel on GPU: `myKernel<<<...>>>()`.
↓
5. GPU Executes Kernel in Parallel (processes data in Device Memory).
↓ (Device to Host)
6. Copy Results from Device to Host using cudaMemcpy().
↓
7. Process Results on CPU.
↓
8. Free Device Memory using cudaFree().
↓
9. Free Host Memory.
Code Examples for Memory Operations:
```cuda
int N = 1024;
size_t size = N * sizeof(float);

// Step 1: allocate host memory.
float* h_A = (float*)malloc(size);
float* h_B = (float*)malloc(size);
float* h_C = (float*)malloc(size);

// Step 2: allocate device memory.
float *d_A, *d_B, *d_C;
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);

// Step 3: copy inputs host -> device.
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

// Steps 4-5: launch the kernel (defined earlier) on the device data.
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
myVectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

// Step 6: copy results device -> host.
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

// Steps 8-9: free device, then host, memory.
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
free(h_A);
free(h_B);
free(h_C);
```
Error checking for CUDA API calls (e.g., `cudaMalloc`, `cudaMemcpy`) is omitted for brevity but crucial in production code.