The CUDA programming model organizes parallel computation through a hierarchy: Grids, Thread Blocks, and Threads. This structure allows developers to manage and scale parallelism effectively on the GPU.
Grid: A kernel is launched as a grid of thread blocks. A grid can be 1D, 2D, or 3D, allowing a natural mapping of computation to the problem's data structure (e.g., a 2D grid for image processing). All blocks in a grid run the same kernel code.
Thread Block: A thread block is a group of threads that execute concurrently on the same Streaming Multiprocessor (SM). Threads within a block can cooperate by sharing data via fast on-chip shared memory and can synchronize their execution using barriers like __syncthreads(). Blocks can also be 1D, 2D, or 3D.
Thread: The fundamental unit of parallel execution. Each thread executes an instance of the kernel function. Threads are identified within their block by a unique threadIdx (which can be 1D, 2D, or 3D). A global thread ID can be computed using threadIdx and blockIdx (the block's unique ID within the grid).
Visualizing a 2D Grid of 2D Blocks:
The diagram below illustrates how a 2D grid is composed of 2D blocks, and how threads within a block are indexed. This is analogous to Figure 3 in the NVIDIA "CUDA Refresher" blog post.
Grid (e.g., gridDim(3,2)):

```
+----------------+----------------+----------------+
| blockIdx (0,0) | blockIdx (1,0) | blockIdx (2,0) |
+----------------+----------------+----------------+
| blockIdx (0,1) | blockIdx (1,1) | blockIdx (2,1) |
+----------------+----------------+----------------+
```

Inside one block, e.g. blockIdx (1,0), with blockDim(2,2):

```
+--------+--------+
| t(0,0) | t(1,0) |
+--------+--------+
| t(0,1) | t(1,1) |
+--------+--------+
```

Each `blockIdx(x,y)` is a thread block. The expanded block shows its threads `t(x,y)`, where t(x,y) represents `threadIdx(x,y)`.
Developers use CUDA C/C++ to define kernels and launch them. The `dim3` type is crucial for specifying the dimensions of grids and blocks.
Kernel Definition: A C function executed on the GPU, marked with `__global__`.
```cuda
__global__ void myVectorAdd(float* A, float* B, float* C, int N) {
    // Each thread computes its global index and handles one element.
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}
```
Execution Configuration: When launching a kernel, you specify the grid and block dimensions between triple angle brackets: `kernel<<<gridDim, blockDim>>>(args)`.
```cuda
int N = 1024 * 1024;
int threadsPerBlock = 256;
// Round up so every element is covered even if N is not a multiple of the block size.
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
dim3 dimGrid(blocksPerGrid);
dim3 dimBlock(threadsPerBlock);
myVectorAdd<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, N);
```
Built-in Variables: Inside the kernel, threads use variables like:
- `threadIdx` (.x, .y, .z): Thread index within its block.
- `blockIdx` (.x, .y, .z): Block index within the grid.
- `blockDim` (.x, .y, .z): Dimensions of the thread block.
- `gridDim` (.x, .y, .z): Dimensions of the grid.
Warp: The hardware groups threads within a block into Warps of 32 threads. A warp is the fundamental unit of scheduling and execution on an SM. For instance, a block of 256 threads is divided into $256 / 32 = 8$ warps.
SIMT (Single Instruction, Multiple Thread): All 32 threads in a warp execute the same instruction at the same time, but on different data. This is highly efficient if all threads follow the same execution path.
Warp Divergence: If threads within a warp encounter a conditional statement (e.g., `if-else`) and take different paths, the paths are executed serially for that warp. Some threads will be temporarily idle while others execute their path. This can significantly impact performance. Minimizing divergence is a key optimization.
Choosing a block size that is a multiple of 32 (the warp size) is crucial. If it is not, the last warp is only partially populated: the hardware still schedules and allocates resources for a full 32-lane warp, so the unused lanes are pure overhead.
SM Hardware Constraints:
Each Streaming Multiprocessor (SM) has fixed limits that determine occupancy:
- 64 warps maximum (2048 threads) per SM on datacenter GPUs (A100/H100/B100)
- 32 thread blocks maximum per SM
- 4 warp schedulers per SM, each capable of issuing instructions independently
The actual occupancy depends on which limit (warps, blocks, registers, or shared memory) is hit first.
Block Size vs. SM Limits Examples:
The table below shows how block size affects the maximum number of concurrent blocks per SM, constrained by both the warp limit (64) and the block limit (32).
| Block Size | Warps/Block | Max Blocks (warp limit) | Max Blocks (block limit) | Actual Max Blocks |
|---|---|---|---|---|
| 32 threads | 1 | 64 | 32 | 32 (block-limited) |
| 64 threads | 2 | 32 | 32 | 32 (both limits) |
| 128 threads | 4 | 16 | 32 | 16 (warp-limited) |
| 256 threads | 8 | 8 | 32 | 8 (warp-limited) |
| 512 threads | 16 | 4 | 32 | 4 (warp-limited) |
| 1024 threads | 32 | 2 | 32 | 2 (warp-limited) |
Note: at 64 threads per block, both limits are reached simultaneously; it is the smallest block size that can still fill all 64 warps. With 32-thread blocks, the 32-block limit caps the SM at 32 resident warps, i.e. only 50% occupancy.
CUDA programming involves managing memory on both the CPU (host) and the GPU (device). Data must be transferred to the GPU for processing and results transferred back.
Typical CUDA Application Flow:
1. Allocate Host Memory (CPU RAM) for input/output data.
↓
2. Allocate Device Memory (GPU VRAM) using cudaMalloc().
↓ (Host to Device)
3. Copy Input Data from Host to Device using cudaMemcpy().
↓
4. CPU Launches Kernel on GPU: `myKernel<<<...>>>()`.
↓
5. GPU Executes Kernel in Parallel (processes data in Device Memory).
↓ (Device to Host)
6. Copy Results from Device to Host using cudaMemcpy().
↓
7. Process Results on CPU.
↓
8. Free Device Memory using cudaFree().
↓
9. Free Host Memory.
Code Examples for Memory Operations:
```cuda
int N = 1024;
size_t size = N * sizeof(float);

// Step 1: allocate host memory.
float* h_A = (float*)malloc(size);
float* h_B = (float*)malloc(size);
float* h_C = (float*)malloc(size);

// Step 2: allocate device memory.
float *d_A, *d_B, *d_C;
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);

// Step 3: copy inputs host -> device.
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

// Steps 4-5: launch the kernel (defined earlier) on the device data.
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
myVectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

// Step 6: copy results device -> host.
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

// Steps 8-9: free device, then host, memory.
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
free(h_A);
free(h_B);
free(h_C);
```
Error checking for CUDA API calls (e.g., `cudaMalloc`, `cudaMemcpy`) is omitted for brevity but crucial in production code.