GPU Performance Bottlenecks

Diagnosis, Profiler Indicators & Remedies

Based on Table 8-2 from GPU Performance Engineering

Quick Reference Table

Limiting Factor | Description | Profiler Indicators | Key Remedies
Memory Bound | Moving data at peak DRAM bandwidth but not enough compute per byte | High memory BW, low FLOPS | Tiling, fusion, coalescing, caching
Compute Bound | Memory latency hidden; the ALUs are the bottleneck | High FLOPS, low memory utilization | ILP, Tensor Cores, lower precision
Latency Bound | Not enough concurrent work to hide load/store latencies | Low achieved BW, high stalls | Occupancy, ILP, prefetching
Underutilized | SMs not fully occupied; both memory and compute idle | Low occupancy, timeline gaps | More threads, batched work, streams

Memory Bound

You're moving as much data as you can—close to peak DRAM bandwidth—but you don't have enough work per byte to fully utilize the ALUs.

Profiler Indicators

Memory Bandwidth ~80-95%

Near peak DRAM bandwidth utilization

Compute (FLOPS) ~10-30%

Low arithmetic intensity
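This diagnosis can be sanity-checked with the roofline model: a kernel is memory bound when its arithmetic intensity (FLOPs per byte of DRAM traffic) falls below the machine balance (peak FLOPS ÷ peak bandwidth). A minimal Python sketch; the peak figures are assumed A100-class numbers, not measurements:

```python
def machine_balance(peak_flops, peak_bw):
    """FLOPs the GPU can perform per byte moved at peak rates."""
    return peak_flops / peak_bw

def is_memory_bound(flops, dram_bytes, peak_flops, peak_bw):
    """Roofline test: memory bound when the kernel does fewer
    FLOPs per byte than the hardware can sustain."""
    return flops / dram_bytes < machine_balance(peak_flops, peak_bw)

# Assumed peaks: 19.5 TFLOP/s FP32, 1.55 TB/s DRAM (A100-class)
PEAK_FLOPS, PEAK_BW = 19.5e12, 1.55e12

# SAXPY: 2 FLOPs per 12 bytes (read x and y, write y)
print(is_memory_bound(2, 12, PEAK_FLOPS, PEAK_BW))   # True
# Large tiled matmul: ~100+ FLOPs per byte
print(is_memory_bound(128, 1, PEAK_FLOPS, PEAK_BW))  # False
```

The same test, run per kernel from profiler counters, tells you which side of the roofline you are on.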

Detailed Remedies

Tiling (also called blocking) loads data into shared memory or registers once, then reuses it multiple times for computation.

Example: Matrix multiplication with tiles

// Load tiles of A and B into shared memory
__shared__ float As[TILE_SIZE][TILE_SIZE];
__shared__ float Bs[TILE_SIZE][TILE_SIZE];
As[ty][tx] = A[row * K + (t * TILE_SIZE + tx)];
Bs[ty][tx] = B[(t * TILE_SIZE + ty) * N + col];
__syncthreads();
// Reuse each tile element TILE_SIZE times
for (int k = 0; k < TILE_SIZE; ++k)
  sum += As[ty][k] * Bs[k][tx];
__syncthreads();

Impact: Can reduce memory traffic by 10-100x for suitable algorithms. Each byte loaded from global memory is now used for many FLOPs instead of just one.
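That traffic reduction can be estimated before writing any code. For an N×N matmul with square tiles of size T, each input element is loaded from global memory once per tile-pair instead of once per output element, cutting reads by roughly a factor of T. A back-of-the-envelope sketch (Python; counts elements, not bytes):

```python
def matmul_traffic_elems(n, tile=1):
    """Approximate global-memory reads (in elements) for an
    n x n x n matmul with square tiles of size `tile`.
    tile=1 models the naive kernel with no reuse."""
    # Each of the (n/tile)^2 output tiles consumes n/tile
    # tile-pairs, and each pair loads 2 * tile * tile elements.
    tiles_per_dim = n // tile
    return tiles_per_dim ** 3 * 2 * tile * tile

n = 1024
naive = matmul_traffic_elems(n)           # ~2 * n^3 reads
tiled = matmul_traffic_elems(n, tile=32)  # ~2 * n^3 / 32 reads
print(naive // tiled)  # 32: a 32x reduction in DRAM reads
```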

Kernel fusion combines multiple kernels into one, avoiding intermediate writes to global memory; data stays in registers or shared memory between operations.

❌ Separate kernels

kernel1: y = matmul(A, x)
  → write y to DRAM
kernel2: z = relu(y)
  → read y from DRAM

✓ Fused kernel

kernel: z = relu(matmul(A, x))
  → y stays in registers
  → only write z

Tools: PyTorch torch.compile, Triton, custom CUDA kernels, FlashAttention pattern.
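The bytes saved by fusion are easy to quantify: the fused kernel skips one write and one read of the intermediate tensor. A small sketch (Python; the 4096-element output shape is an assumption for illustration):

```python
def fusion_savings_bytes(intermediate_elems, dtype_bytes=4):
    """DRAM traffic eliminated by fusing: the separate-kernel
    version writes the intermediate tensor once and reads it
    back once before the second kernel can run."""
    return 2 * intermediate_elems * dtype_bytes

# y = matmul(A, x) with a 4096-element FP32 output vector
print(fusion_savings_bytes(4096))  # 32768 bytes avoided
```

For elementwise epilogues like ReLU, this round trip is often the entire cost of the second kernel.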

Memory coalescing means ensuring that threads in a warp access consecutive memory addresses, which lets the hardware combine the warp's 32 separate requests into a few transactions.

Coalesced vs non-coalesced access

// ✓ Coalesced: thread i accesses element i
float val = data[threadIdx.x];

// ❌ Strided: threads access non-consecutive elements
float val = data[threadIdx.x * stride];

// ❌ Random: unpredictable access pattern
float val = data[indices[threadIdx.x]];

Check: Use NCU's "Excessive Sectors" metric to identify non-coalesced accesses. A fully coalesced warp-wide FP32 load touches 4 sectors (128 bytes); aim for that minimum.
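The sector arithmetic behind that metric can be checked offline. This sketch counts how many 32-byte sectors a warp's 32 FP32 loads touch under the access patterns above (pure Python; the stride of 33 elements is an arbitrary illustration):

```python
SECTOR = 32  # bytes per DRAM sector

def sectors_touched(byte_addresses):
    """Distinct 32-byte sectors hit by a set of accesses."""
    return len({addr // SECTOR for addr in byte_addresses})

FP32 = 4
lanes = range(32)  # one warp

coalesced = sectors_touched([lane * FP32 for lane in lanes])
strided   = sectors_touched([lane * 33 * FP32 for lane in lanes])

print(coalesced)  # 4:  the minimum for a warp-wide FP32 load
print(strided)    # 32: every lane drags in its own sector
```

Feeding your kernel's actual index expressions into a counter like this is a quick way to predict what NCU will report.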

Cache-friendly access patterns maximize cache hits: data accessed by nearby threads, or repeatedly by the same thread, should fit in cache.

  • L2 Cache: Shared across all SMs (~40-80 MB on modern GPUs). Persist frequently-used data via L2 persistence APIs.
  • L1 / Shared Memory: Per-SM (~128-256 KB). Use for working sets that fit and are reused.
  • Texture Cache: Optimized for 2D spatial locality. Consider for image/volume data.

Tip: Use __ldg() for read-only data to enable caching through texture cache path.
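Whether a working set can live in on-chip memory is a one-line check. A sketch assuming a double-buffered pair of square FP32 tiles and an A100-class 164 KB of shared memory per SM (both figures are assumptions; check your GPU's limits):

```python
def working_set_bytes(tile, num_buffers=2, dtype_bytes=4):
    """Shared-memory footprint of `num_buffers` square tiles."""
    return num_buffers * tile * tile * dtype_bytes

SMEM_PER_SM = 164 * 1024  # assumed A100-class per-SM limit

for tile in (32, 128, 256):
    print(tile, working_set_bytes(tile) <= SMEM_PER_SM)
# 32 and 128 fit; a pair of 256x256 FP32 tiles (512 KB) does not
```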

Compute Bound

You've hidden memory latency and are no longer saturating memory bandwidth. Now the ALUs (CUDA cores and Tensor Cores) are the bottleneck.

Profiler Indicators

Compute (FLOPS) ~70-95%

Approaching GPU peak FLOPS

Memory Bandwidth ~20-50%

Memory not saturated

Detailed Remedies

Increase instruction-level parallelism (ILP): give the compiler more independent instructions to schedule. Modern GPUs can dual-issue certain instruction combinations.

Loop unrolling for ILP

// Before: Sequential dependency
for (int i = 0; i < N; i++)
  sum += a[i] * b[i];

// After: Multiple accumulators
float sum0=0, sum1=0, sum2=0, sum3=0;
for (int i = 0; i < N; i += 4) {
  sum0 += a[i] * b[i];
  sum1 += a[i+1] * b[i+1];
  sum2 += a[i+2] * b[i+2];
  sum3 += a[i+3] * b[i+3];
}
float sum = sum0 + sum1 + sum2 + sum3;

Why it works: Independent accumulators break dependency chains, allowing the compiler to overlap FMA latencies (~4 cycles).
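The effect of extra accumulators can be modeled directly from that latency figure. Assuming each FMA waits ~4 cycles on its accumulator, independent chains divide the critical path (a toy Python model that ignores issue bandwidth):

```python
def chain_cycles(n_fmas, accumulators=1, fma_latency=4):
    """Critical-path cycles when n_fmas are split across
    independent accumulator chains: each chain serializes
    only its own FMAs."""
    per_chain = -(-n_fmas // accumulators)  # ceil division
    return per_chain * fma_latency

print(chain_cycles(1024))                  # 4096 cycles, one chain
print(chain_cycles(1024, accumulators=4))  # 1024 cycles, 4 chains
```

In practice the win saturates once enough chains exist to cover the FMA latency; beyond that, extra accumulators only add register pressure.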

Tensor Cores provide 2-16x more FLOPS than CUDA cores for matrix operations. They require specific data types and sizes.

Tensor Core FLOPS (H100)

  • FP32 CUDA cores: 67 TFLOPS
  • FP16 Tensor Cores: 1,979 TFLOPS (with 2:4 sparsity; ~990 dense)
  • FP8 Tensor Cores: 3,958 TFLOPS (with 2:4 sparsity; ~1,979 dense)

Requirements

  • Matrix dims divisible by 8/16
  • Use wmma or cuBLAS
  • Aligned memory

PyTorch: with torch.autocast('cuda', dtype=torch.bfloat16)
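The divisibility requirement suggests padding awkward shapes up front. A small helper sketch (the function name and the multiple of 16 are illustrative, not a library API):

```python
def pad_to_multiple(dim, multiple=16):
    """Smallest size >= dim divisible by `multiple`, so the
    matrix dimension qualifies for the Tensor Core path."""
    return -(-dim // multiple) * multiple

print(pad_to_multiple(4096))  # 4096 (already aligned)
print(pad_to_multiple(4100))  # 4112 (padded up)
```

The small amount of wasted compute on padding is almost always cheaper than falling back to the CUDA-core path.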

Reduce dependency chains: restructure code to minimize read-after-write dependencies. Each FMA has ~4 cycles of latency, so chaining them serially limits throughput.

  • Reorder operations: Interleave independent computations
  • Use local variables: Give compiler freedom to reorder
  • Check SASS: Look for stall cycles in assembly

NCU metric: Check "Short Scoreboard" stalls—these indicate ALU dependency waits.

Use structured sparsity: Ampere and later GPUs have hardware support for 2:4 structured sparsity, providing up to 2x speedup for suitable sparse matrices.

2:4 sparsity pattern

// 2:4 means at most 2 nonzeros in each aligned group of 4
[a, 0, b, 0] or [0, a, 0, b] or [a, b, 0, 0] etc.
// Hardware skips the zero computations

Use cases: Pruned neural networks, attention masks. Requires cuSPARSELt or specialized libraries.
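Whether a pruned tensor already satisfies the pattern is easy to verify on the host. A sketch (illustrative Python, not the cuSPARSELt API):

```python
def is_2_4_sparse(values):
    """True if every aligned group of 4 elements has at most
    2 nonzeros -- the pattern sparse Tensor Cores accept."""
    if len(values) % 4:
        return False
    groups = (values[i:i + 4] for i in range(0, len(values), 4))
    return all(sum(1 for v in g if v != 0) <= 2 for g in groups)

print(is_2_4_sparse([1, 0, 2, 0, 0, 3, 0, 4]))  # True
print(is_2_4_sparse([1, 2, 3, 0, 0, 0, 0, 0]))  # False: 3 nonzeros in one group
```

Note the constraint is per aligned group: a tensor that is 50% sparse overall can still fail if the zeros cluster unevenly.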


Latency Bound

You're not sustaining enough concurrent work to hide individual load/store latencies, so warps stall waiting on data.

Profiler Indicators

Achieved Bandwidth ~20-40%

Well below peak (a truly memory-bound kernel would sit much higher)

Stall Reasons High %

Dominated by Long Scoreboard and Not Selected stalls

Detailed Remedies

Increase occupancy: more active warps per SM means more opportunities to hide latency. When one warp stalls, another can execute.

Occupancy limiters (check NCU)

  • Registers: Reduce via __launch_bounds__ or -maxrregcount
  • Shared memory: Reduce allocation or split across blocks
  • Block size: Ensure enough threads per block (multiples of 128-256)

Trade-off: Lower registers may spill to local memory, hurting performance. Profile both configurations!
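The interaction of those limiters is just a min() over four ratios. A simplified occupancy model (Python; the per-SM limits are assumptions for an A100-class part, and real hardware rounds register and warp allocations, so trust NCU or cudaOccupancyMaxActiveBlocksPerMultiprocessor for the real number):

```python
def blocks_per_sm(threads, regs_per_thread, smem_per_block,
                  sm_regs=65536, sm_smem=100 * 1024,
                  sm_max_threads=2048, sm_max_blocks=32):
    """Resident blocks per SM: the tightest of four limits
    (simplified; ignores allocation granularity)."""
    by_regs    = sm_regs // (regs_per_thread * threads)
    by_smem    = sm_smem // max(1, smem_per_block)
    by_threads = sm_max_threads // threads
    return min(by_regs, by_smem, by_threads, sm_max_blocks)

def occupancy(threads, regs_per_thread, smem_per_block):
    blocks = blocks_per_sm(threads, regs_per_thread, smem_per_block)
    return blocks * threads / 2048

# 256 threads/block, 64 regs/thread, 16 KB smem: register-limited
print(occupancy(256, 64, 16 * 1024))  # 0.5
```

Changing one input at a time in a model like this shows which resource to attack first.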

Increase memory-level parallelism: even at low occupancy, ILP can hide latency by letting each thread keep multiple independent memory requests in flight.

Memory-level parallelism

// Issue multiple loads before using results
float a = data[i];
float b = data[i + stride];
float c = data[i + 2*stride];
float d = data[i + 3*stride];
// Now use a, b, c, d (all loads in flight)

Key insight: Ampere+ can have ~12 outstanding loads per thread. Issuing just 1 load wastes this capability.
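How many loads per thread you actually need follows from Little's law: bytes in flight = bandwidth × latency. A sketch with assumed round numbers (1.5 TB/s DRAM, ~500 ns latency, 64K resident threads):

```python
def loads_in_flight_per_thread(bw, latency, bytes_per_load, threads):
    """Little's law: sustaining `bw` requires bw * latency bytes
    in flight; spread that across all resident threads."""
    return bw * latency / (bytes_per_load * threads)

# Assumed: 1.5 TB/s, 500 ns, FP32 loads, 64K threads resident
need = loads_in_flight_per_thread(1.5e12, 500e-9, 4, 64 * 1024)
print(round(need, 1))  # ~2.9 outstanding loads per thread, minimum
```

When occupancy drops, the thread count in the denominator shrinks and the per-thread requirement rises, which is exactly why MLP matters most in latency-bound kernels.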

Pipeline the kernel: structure it so that loading, computing, and storing overlap. While processing tile N, prefetch tile N+1.

Double-buffering pattern

// load_async / wait_async are shorthand here for
// cuda::memcpy_async plus a pipeline wait
__shared__ float buf[2][TILE_SIZE];
// Load the first tile
load_async(buf[0], global_ptr);
for (int t = 0; t < num_tiles; t++) {
  // Start loading the next tile
  if (t + 1 < num_tiles)
    load_async(buf[(t + 1) % 2], global_ptr + (t + 1) * TILE_SIZE);
  // Wait for and process the current tile
  wait_async();
  process(buf[t % 2]);
}

CUDA 11+: Use cuda::memcpy_async for hardware-accelerated async copies.
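The payoff of the pattern is visible in a two-stage pipeline model: after the first load, each tile costs max(load, compute) instead of load + compute (a toy Python model with made-up tile times):

```python
def serial_time(n_tiles, load, compute):
    """No overlap: every tile pays load + compute."""
    return n_tiles * (load + compute)

def pipelined_time(n_tiles, load, compute):
    """Double buffering: the first load is exposed, then loads
    hide behind compute (or vice versa), then the last compute."""
    return load + (n_tiles - 1) * max(load, compute) + compute

print(serial_time(100, 10, 10))     # 2000
print(pipelined_time(100, 10, 10))  # 1010 -- nearly 2x faster
```

The speedup is largest when load and compute times are balanced; if one dominates, pipelining only hides the smaller term.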

Prefetch: issue loads for future iterations early, so the data has arrived by the time it is needed.

Prefetch pattern

// Manual prefetch (load into L2/L1)
asm volatile("prefetch.global.L2 [%0];" :: "l"(ptr + prefetch_distance));

// Or just issue the load early
float next_val = data[i + LOOKAHEAD];
// ... do other work ...
// By now next_val is likely ready

When to use: When access pattern is predictable and occupancy is limited. Adds register pressure.


Underutilizing the GPU

You're not fully occupying SMs or launching enough work—both memory and compute resources remain idle.

Profiler Indicators

Occupancy Low
Bandwidth Low
FLOPS Low

Nsys Timeline: Look for gaps between kernels, sparse kernel activity, or kernels not spanning all SMs.

Detailed Remedies

The simplest fix: give the GPU more work. Modern GPUs have 80-144 SMs; each needs multiple blocks to stay busy.

Minimum blocks for full occupancy

// H100: 132 SMs × 16-32 resident blocks/SM → ~2,000-4,000 blocks to fill
// If batch_size = 32 and each item = 1 block:
// only 32 blocks → the vast majority of SMs sit idle

// Solution: increase batch_size or use smaller tiles

ML training: Use gradient accumulation if memory-limited, or data parallelism across GPUs.
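A launch's grid size can be checked against that rule of thumb directly (Python; 132 SMs and 16 blocks/SM are assumptions matching the H100 example above):

```python
def grid_fills_gpu(num_blocks, num_sms=132, blocks_per_sm=16):
    """Rough check: several resident blocks per SM keeps the
    warp schedulers busy while some blocks stall."""
    return num_blocks >= num_sms * blocks_per_sm

print(grid_fills_gpu(32))    # False: most SMs get no work at all
print(grid_fills_gpu(4096))  # True
```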

Adjust kernel launch configuration to expose more parallelism. Each thread can do less work if needed.

❌ Coarse-grained

// 100 blocks, each does 1000 items
kernel<<<100, 256>>>();

✓ Fine-grained

// 10,000 blocks, each does 10 items
kernel<<<10000, 256>>>();

Balance: Too fine = launch overhead; too coarse = underutilization. Profile to find sweet spot.
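Both failure modes show up in a toy cost model: blocks run in waves of whatever the GPU can hold at once, and each block pays a fixed scheduling cost. The constants below are invented purely to illustrate the shape of the curve:

```python
def total_time_us(items, items_per_block, concurrent_blocks=2048,
                  us_per_item=0.01, block_overhead_us=1.0):
    """Toy model: too coarse -> too few blocks, hardware idles
    within the single wave; too fine -> per-block overhead
    dominates across many waves."""
    blocks = -(-items // items_per_block)    # ceil division
    waves = -(-blocks // concurrent_blocks)  # ceil division
    return waves * (block_overhead_us + items_per_block * us_per_item)

for g in (10_000, 500, 10):
    print(g, total_time_us(1_000_000, g))
# coarse (10,000/block) and fine (10/block) both lose to ~500/block
```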

Run multiple independent kernels or overlap compute with memory transfers using streams.

Stream concurrency

cudaStream_t s1, s2, s3;
cudaStreamCreate(&s1); // ... likewise s2, s3

// These can execute concurrently if resources allow
kernel_a<<<grid, block, 0, s1>>>();
kernel_b<<<grid, block, 0, s2>>>();
kernel_c<<<grid, block, 0, s3>>>();

// Overlap compute with a device-to-host transfer
cudaMemcpyAsync(h_out, d_out, size, cudaMemcpyDeviceToHost, s1);
kernel<<<grid, block, 0, s2>>>(d_in2, d_out2);

PyTorch: create a torch.cuda.Stream() and enqueue ops inside a with torch.cuda.stream(stream): block.

Instead of launching many small kernels, launch one kernel that persists and processes a work queue. Eliminates launch overhead.

Persistent kernel pattern

__global__ void persistent_kernel(WorkQueue* queue) {
  while (true) {
    int task = atomicAdd(&queue->head, 1);
    if (task >= queue->total_tasks) break;
    process_task(queue->tasks[task]);
  }
}
// Launch once with enough blocks to fill the GPU
persistent_kernel<<<num_sms * blocks_per_sm, 256>>>(queue);

Use cases: Graph neural networks, dynamic workloads, task queues. Requires careful synchronization.
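The claim-a-task loop is easy to prototype on the host. A Python analogue using threads, with a lock standing in for atomicAdd (illustrative only; the GPU version needs no lock):

```python
import threading

def run_persistent_workers(tasks, num_workers=4):
    """Each worker repeatedly claims the next task index until
    the queue is exhausted -- the atomicAdd loop, in miniature."""
    head = [0]
    lock = threading.Lock()
    results = [None] * len(tasks)

    def worker():
        while True:
            with lock:               # stands in for atomicAdd
                task = head[0]
                head[0] += 1
            if task >= len(tasks):
                return
            results[task] = tasks[task] * 2  # "process" the task

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return results

print(run_persistent_workers([1, 2, 3, 4, 5]))  # [2, 4, 6, 8, 10]
```

Every task is claimed exactly once no matter how the workers interleave, which is the property the GPU pattern relies on.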

If you have many tiny kernels launching sequentially, fuse them into one larger kernel to reduce launch overhead and improve occupancy.

❌ Many small kernels

add<<<1, 256>>>(a, b, c);
mul<<<1, 256>>>(c, d, e);
relu<<<1, 256>>>(e, f);
// 3 launches, gaps between

✓ One fused kernel

fused_add_mul_relu<<<1, 256>>>(a, b, d, f);
// 1 launch, no gaps

Automation: PyTorch torch.compile, TensorRT, Triton auto-fuse many operations.

Diagnosis Decision Flow

Use this flowchart to identify your bottleneck type

1. Check the Nsys timeline: are there gaps between kernels?

   → Yes: Underutilized (launch overhead, small kernels)
   → No: continue

2. Check NCU Speed of Light: memory vs compute utilization

   → Memory ~80%+, compute low: Memory Bound
   → Compute ~80%+, memory low: Compute Bound
   → Both low: continue

3. Check stall reasons: are Long Scoreboard / Not Selected high?

   → Yes: Latency Bound (not enough concurrent work to hide latency)
   → No: Underutilized (check occupancy)

💡 Pro Tip

Most real kernels are mixed—some phases memory-bound, others compute-bound. Profile individual sections or use NCU Source Page to pinpoint bottlenecks at the instruction level.