GPU Performance Bottlenecks

Diagnosis, Profiler Indicators & Remedies

Based on Table 8-2 from GPU Performance Engineering

Quick Reference Table

Limiting Factor | Description | Profiler Indicators | Key Remedies
Memory Bound | Moving data at peak DRAM bandwidth but not enough compute per byte | High memory BW, low FLOPS | Tiling, fusion, coalescing, caching
Compute Bound | Memory latency hidden; the ALUs are the bottleneck | High FLOPS, low memory utilization | ILP, Tensor Cores, lower precision
Latency Bound | Not enough concurrent work to hide load/store latencies | Low achieved BW, high stalls | Occupancy, ILP, prefetching
Underutilized | SMs not fully occupied; both memory and compute idle | Low occupancy, timeline gaps | More threads, batched work, streams

Memory Bound

You're moving as much data as you can—close to peak DRAM bandwidth—but you don't have enough work per byte to fully utilize the ALUs.

Profiler Indicators

Memory Bandwidth ~80-95%

Near peak DRAM bandwidth utilization

Compute (FLOPS) ~10-30%

Low arithmetic intensity
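This diagnosis can be sanity-checked with the roofline model: a kernel is memory bound when its arithmetic intensity (FLOPs per byte of DRAM traffic) falls below the machine balance (peak FLOPS ÷ peak bandwidth). A minimal Python sketch; the peak figures are assumed A100-class numbers, not measurements:

```python
def machine_balance(peak_flops, peak_bw):
    """FLOPs the GPU can perform per byte moved at peak rates."""
    return peak_flops / peak_bw

def is_memory_bound(flops, dram_bytes, peak_flops, peak_bw):
    """Roofline test: memory bound when the kernel does fewer
    FLOPs per byte than the hardware can sustain."""
    return flops / dram_bytes < machine_balance(peak_flops, peak_bw)

# Assumed peaks: 19.5 TFLOP/s FP32, 1.55 TB/s DRAM (A100-class)
PEAK_FLOPS, PEAK_BW = 19.5e12, 1.55e12

# SAXPY: 2 FLOPs per 12 bytes (read x and y, write y)
print(is_memory_bound(2, 12, PEAK_FLOPS, PEAK_BW))   # True
# Large tiled matmul: ~100+ FLOPs per byte
print(is_memory_bound(128, 1, PEAK_FLOPS, PEAK_BW))  # False
```

The same test, run per kernel from profiler counters, tells you which side of the roofline you are on.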

Detailed Remedies

Tiling (also called blocking) loads data into shared memory or registers once, then reuses it multiple times for computation.

Example: Matrix multiplication with tiles

// Load tiles of A and B into shared memory
__shared__ float As[TILE_SIZE][TILE_SIZE];
__shared__ float Bs[TILE_SIZE][TILE_SIZE];
As[ty][tx] = A[row * K + (t * TILE_SIZE + tx)];
Bs[ty][tx] = B[(t * TILE_SIZE + ty) * N + col];
__syncthreads();
// Reuse each tile element TILE_SIZE times
for (int k = 0; k < TILE_SIZE; ++k)
  sum += As[ty][k] * Bs[k][tx];
__syncthreads();

Impact: Can reduce memory traffic by 10-100x for suitable algorithms. Each byte loaded from global memory is now used for many FLOPs instead of just one.
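That traffic reduction can be estimated before writing any code. For an N×N matmul with square tiles of size T, each input element is loaded from global memory once per tile-pair instead of once per output element, cutting reads by roughly a factor of T. A back-of-the-envelope sketch (Python; counts elements, not bytes):

```python
def matmul_traffic_elems(n, tile=1):
    """Approximate global-memory reads (in elements) for an
    n x n x n matmul with square tiles of size `tile`.
    tile=1 models the naive kernel with no reuse."""
    # Each of the (n/tile)^2 output tiles consumes n/tile
    # tile-pairs, and each pair loads 2 * tile * tile elements.
    tiles_per_dim = n // tile
    return tiles_per_dim ** 3 * 2 * tile * tile

n = 1024
naive = matmul_traffic_elems(n)           # ~2 * n^3 reads
tiled = matmul_traffic_elems(n, tile=32)  # ~2 * n^3 / 32 reads
print(naive // tiled)  # 32: a 32x reduction in DRAM reads
```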

Kernel fusion combines multiple kernels into one, avoiding intermediate writes to global memory; data stays in registers or shared memory between operations.

❌ Separate kernels

kernel1: y = matmul(A, x)
  → write y to DRAM
kernel2: z = relu(y)
  → read y from DRAM

✓ Fused kernel

kernel: z = relu(matmul(A, x))
  → y stays in registers
  → only write z

Tools: PyTorch torch.compile, Triton, custom CUDA kernels, FlashAttention pattern.
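The bytes saved by fusion are easy to quantify: the fused kernel skips one write and one read of the intermediate tensor. A small sketch (Python; the 4096-element output shape is an assumption for illustration):

```python
def fusion_savings_bytes(intermediate_elems, dtype_bytes=4):
    """DRAM traffic eliminated by fusing: the separate-kernel
    version writes the intermediate tensor once and reads it
    back once before the second kernel can run."""
    return 2 * intermediate_elems * dtype_bytes

# y = matmul(A, x) with a 4096-element FP32 output vector
print(fusion_savings_bytes(4096))  # 32768 bytes avoided
```

For elementwise epilogues like ReLU, this round trip is often the entire cost of the second kernel.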

Memory coalescing means ensuring that threads in a warp access consecutive memory addresses, which lets the hardware combine the warp's 32 separate requests into a few transactions.

Coalesced vs non-coalesced access

// ✓ Coalesced: thread i accesses element i
float val = data[threadIdx.x];

// ❌ Strided: threads access non-consecutive elements
float val = data[threadIdx.x * stride];

// ❌ Random: unpredictable access pattern
float val = data[indices[threadIdx.x]];

Check: Use NCU's "Excessive Sectors" metric to identify non-coalesced accesses. A fully coalesced warp-wide FP32 load touches 4 sectors (128 bytes); aim for that minimum.
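The sector arithmetic behind that metric can be checked offline. This sketch counts how many 32-byte sectors a warp's 32 FP32 loads touch under the access patterns above (pure Python; the stride of 33 elements is an arbitrary illustration):

```python
SECTOR = 32  # bytes per DRAM sector

def sectors_touched(byte_addresses):
    """Distinct 32-byte sectors hit by a set of accesses."""
    return len({addr // SECTOR for addr in byte_addresses})

FP32 = 4
lanes = range(32)  # one warp

coalesced = sectors_touched([lane * FP32 for lane in lanes])
strided   = sectors_touched([lane * 33 * FP32 for lane in lanes])

print(coalesced)  # 4:  the minimum for a warp-wide FP32 load
print(strided)    # 32: every lane drags in its own sector
```

Feeding your kernel's actual index expressions into a counter like this is a quick way to predict what NCU will report.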

Cache-friendly access patterns maximize cache hits: data accessed by nearby threads, or repeatedly by the same thread, should fit in cache.

  • L2 Cache: Shared across all SMs (~40-80 MB on modern GPUs). Persist frequently-used data via L2 persistence APIs.
  • L1 / Shared Memory: Per-SM (~128-256 KB). Use for working sets that fit and are reused.
  • Texture Cache: Optimized for 2D spatial locality. Consider for image/volume data.

Tip: Use __ldg() for read-only data to enable caching through texture cache path.
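Whether a working set can live in on-chip memory is a one-line check. A sketch assuming a double-buffered pair of square FP32 tiles and an A100-class 164 KB of shared memory per SM (both figures are assumptions; check your GPU's limits):

```python
def working_set_bytes(tile, num_buffers=2, dtype_bytes=4):
    """Shared-memory footprint of `num_buffers` square tiles."""
    return num_buffers * tile * tile * dtype_bytes

SMEM_PER_SM = 164 * 1024  # assumed A100-class per-SM limit

for tile in (32, 128, 256):
    print(tile, working_set_bytes(tile) <= SMEM_PER_SM)
# 32 and 128 fit; a pair of 256x256 FP32 tiles (512 KB) does not
```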

Compute Bound

You've hidden memory latency and are no longer saturating memory bandwidth. Now the ALUs (CUDA cores and Tensor Cores) are the bottleneck.

Profiler Indicators

Compute (FLOPS) ~70-95%

Approaching GPU peak FLOPS

Memory Bandwidth ~20-50%

Memory not saturated

Detailed Remedies

Increase instruction-level parallelism (ILP): give the compiler more independent instructions to schedule. Modern GPUs can dual-issue certain instruction combinations.

Loop unrolling for ILP

// Before: Sequential dependency
for (int i = 0; i < N; i++)
  sum += a[i] * b[i];

// After: Multiple accumulators
float sum0=0, sum1=0, sum2=0, sum3=0;
for (int i = 0; i < N; i += 4) {
  sum0 += a[i] * b[i];
  sum1 += a[i+1] * b[i+1];
  sum2 += a[i+2] * b[i+2];
  sum3 += a[i+3] * b[i+3];
}
float sum = sum0 + sum1 + sum2 + sum3;

Why it works: Independent accumulators break dependency chains, allowing the compiler to overlap FMA latencies (~4 cycles).
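The effect of extra accumulators can be modeled directly from that latency figure. Assuming each FMA waits ~4 cycles on its accumulator, independent chains divide the critical path (a toy Python model that ignores issue bandwidth):

```python
def chain_cycles(n_fmas, accumulators=1, fma_latency=4):
    """Critical-path cycles when n_fmas are split across
    independent accumulator chains: each chain serializes
    only its own FMAs."""
    per_chain = -(-n_fmas // accumulators)  # ceil division
    return per_chain * fma_latency

print(chain_cycles(1024))                  # 4096 cycles, one chain
print(chain_cycles(1024, accumulators=4))  # 1024 cycles, 4 chains
```

In practice the win saturates once enough chains exist to cover the FMA latency; beyond that, extra accumulators only add register pressure.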

Tensor Cores provide 2-16x more FLOPS than CUDA cores for matrix operations. They require specific data types and sizes.

Tensor Core FLOPS (H100)

  • FP32 CUDA cores: 67 TFLOPS
  • FP16 Tensor Cores: 1,979 TFLOPS (with 2:4 sparsity; ~990 dense)
  • FP8 Tensor Cores: 3,958 TFLOPS (with 2:4 sparsity; ~1,979 dense)

Requirements

  • Matrix dims divisible by 8/16
  • Use wmma or cuBLAS
  • Aligned memory

PyTorch: with torch.autocast('cuda', dtype=torch.bfloat16)
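The divisibility requirement suggests padding awkward shapes up front. A small helper sketch (the function name and the multiple of 16 are illustrative, not a library API):

```python
def pad_to_multiple(dim, multiple=16):
    """Smallest size >= dim divisible by `multiple`, so the
    matrix dimension qualifies for the Tensor Core path."""
    return -(-dim // multiple) * multiple

print(pad_to_multiple(4096))  # 4096 (already aligned)
print(pad_to_multiple(4100))  # 4112 (padded up)
```

The small amount of wasted compute on padding is almost always cheaper than falling back to the CUDA-core path.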

Reduce dependency chains: restructure code to minimize read-after-write dependencies. Each FMA has ~4 cycles of latency, so chaining them serially limits throughput.

  • Reorder operations: Interleave independent computations
  • Use local variables: Give compiler freedom to reorder
  • Check SASS: Look for stall cycles in assembly

NCU metric: Check "Short Scoreboard" stalls—these indicate ALU dependency waits.

Use structured sparsity: Ampere and later GPUs have hardware support for 2:4 structured sparsity, providing up to 2x speedup for suitable sparse matrices.

2:4 sparsity pattern

// 2:4 means at most 2 nonzeros in each aligned group of 4
[a, 0, b, 0] or [0, a, 0, b] or [a, b, 0, 0] etc.
// Hardware skips the zero computations

Use cases: Pruned neural networks, attention masks. Requires cuSPARSELt or specialized libraries.
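Whether a pruned tensor already satisfies the pattern is easy to verify on the host. A sketch (illustrative Python, not the cuSPARSELt API):

```python
def is_2_4_sparse(values):
    """True if every aligned group of 4 elements has at most
    2 nonzeros -- the pattern sparse Tensor Cores accept."""
    if len(values) % 4:
        return False
    groups = (values[i:i + 4] for i in range(0, len(values), 4))
    return all(sum(1 for v in g if v != 0) <= 2 for g in groups)

print(is_2_4_sparse([1, 0, 2, 0, 0, 3, 0, 4]))  # True
print(is_2_4_sparse([1, 2, 3, 0, 0, 0, 0, 0]))  # False: 3 nonzeros in one group
```

Note the constraint is per aligned group: a tensor that is 50% sparse overall can still fail if the zeros cluster unevenly.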


Latency Bound

You're not sustaining enough concurrent work to hide individual load/store latencies, so warps stall waiting on data.

Profiler Indicators

Achieved Bandwidth ~20-40%

Well below peak (a truly memory-bound kernel would sit much higher)

Stall Reasons High %

Dominated by Long Scoreboard and Not Selected stalls

Detailed Remedies

Increase occupancy: more active warps per SM means more opportunities to hide latency. When one warp stalls, another can execute.

Occupancy limiters (check NCU)

  • Registers: Reduce via __launch_bounds__ or -maxrregcount
  • Shared memory: Reduce allocation or split across blocks
  • Block size: Ensure enough threads per block (multiples of 128-256)

Trade-off: Lower registers may spill to local memory, hurting performance. Profile both configurations!
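The interaction of those limiters is just a min() over four ratios. A simplified occupancy model (Python; the per-SM limits are assumptions for an A100-class part, and real hardware rounds register and warp allocations, so trust NCU or cudaOccupancyMaxActiveBlocksPerMultiprocessor for the real number):

```python
def blocks_per_sm(threads, regs_per_thread, smem_per_block,
                  sm_regs=65536, sm_smem=100 * 1024,
                  sm_max_threads=2048, sm_max_blocks=32):
    """Resident blocks per SM: the tightest of four limits
    (simplified; ignores allocation granularity)."""
    by_regs    = sm_regs // (regs_per_thread * threads)
    by_smem    = sm_smem // max(1, smem_per_block)
    by_threads = sm_max_threads // threads
    return min(by_regs, by_smem, by_threads, sm_max_blocks)

def occupancy(threads, regs_per_thread, smem_per_block):
    blocks = blocks_per_sm(threads, regs_per_thread, smem_per_block)
    return blocks * threads / 2048

# 256 threads/block, 64 regs/thread, 16 KB smem: register-limited
print(occupancy(256, 64, 16 * 1024))  # 0.5
```

Changing one input at a time in a model like this shows which resource to attack first.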

Increase memory-level parallelism: even at low occupancy, ILP can hide latency by letting each thread keep multiple independent memory requests in flight.

Memory-level parallelism

// Issue multiple loads before using results
float a = data[i];
float b = data[i + stride];
float c = data[i + 2*stride];
float d = data[i + 3*stride];
// Now use a, b, c, d (all loads in flight)

Key insight: Ampere+ can have ~12 outstanding loads per thread. Issuing just 1 load wastes this capability.
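How many loads per thread you actually need follows from Little's law: bytes in flight = bandwidth × latency. A sketch with assumed round numbers (1.5 TB/s DRAM, ~500 ns latency, 64K resident threads):

```python
def loads_in_flight_per_thread(bw, latency, bytes_per_load, threads):
    """Little's law: sustaining `bw` requires bw * latency bytes
    in flight; spread that across all resident threads."""
    return bw * latency / (bytes_per_load * threads)

# Assumed: 1.5 TB/s, 500 ns, FP32 loads, 64K threads resident
need = loads_in_flight_per_thread(1.5e12, 500e-9, 4, 64 * 1024)
print(round(need, 1))  # ~2.9 outstanding loads per thread, minimum
```

When occupancy drops, the thread count in the denominator shrinks and the per-thread requirement rises, which is exactly why MLP matters most in latency-bound kernels.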

Pipeline the kernel: structure it so that loading, computing, and storing overlap. While processing tile N, prefetch tile N+1.

Double-buffering pattern

// load_async / wait_async are shorthand here for
// cuda::memcpy_async plus a pipeline wait
__shared__ float buf[2][TILE_SIZE];
// Load the first tile
load_async(buf[0], global_ptr);
for (int t = 0; t < num_tiles; t++) {
  // Start loading the next tile
  if (t + 1 < num_tiles)
    load_async(buf[(t + 1) % 2], global_ptr + (t + 1) * TILE_SIZE);
  // Wait for and process the current tile
  wait_async();
  process(buf[t % 2]);
}

CUDA 11+: Use cuda::memcpy_async for hardware-accelerated async copies.
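The payoff of the pattern is visible in a two-stage pipeline model: after the first load, each tile costs max(load, compute) instead of load + compute (a toy Python model with made-up tile times):

```python
def serial_time(n_tiles, load, compute):
    """No overlap: every tile pays load + compute."""
    return n_tiles * (load + compute)

def pipelined_time(n_tiles, load, compute):
    """Double buffering: the first load is exposed, then loads
    hide behind compute (or vice versa), then the last compute."""
    return load + (n_tiles - 1) * max(load, compute) + compute

print(serial_time(100, 10, 10))     # 2000
print(pipelined_time(100, 10, 10))  # 1010 -- nearly 2x faster
```

The speedup is largest when load and compute times are balanced; if one dominates, pipelining only hides the smaller term.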

Prefetch: issue loads for future iterations early, so the data has arrived by the time it is needed.

Prefetch pattern

// Manual prefetch (load into L2/L1)
asm volatile("prefetch.global.L2 [%0];" :: "l"(ptr + prefetch_distance));

// Or just issue the load early
float next_val = data[i + LOOKAHEAD];
// ... do other work ...
// By now next_val is likely ready

When to use: When access pattern is predictable and occupancy is limited. Adds register pressure.


Underutilizing the GPU

You're not fully occupying SMs or launching enough work—both memory and compute resources remain idle.

Profiler Indicators

Occupancy Low
Bandwidth Low
FLOPS Low

Nsys Timeline: Look for gaps between kernels, sparse kernel activity, or kernels not spanning all SMs.

Detailed Remedies

The simplest fix: give the GPU more work. Modern GPUs have 80-144 SMs; each needs multiple blocks to stay busy.

Minimum blocks for full occupancy

// H100: 132 SMs × 16-32 resident blocks/SM → ~2,000-4,000 blocks to fill
// If batch_size = 32 and each item = 1 block:
// only 32 blocks → the vast majority of SMs sit idle

// Solution: increase batch_size or use smaller tiles

ML training: Use gradient accumulation if memory-limited, or data parallelism across GPUs.
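A launch's grid size can be checked against that rule of thumb directly (Python; 132 SMs and 16 blocks/SM are assumptions matching the H100 example above):

```python
def grid_fills_gpu(num_blocks, num_sms=132, blocks_per_sm=16):
    """Rough check: several resident blocks per SM keeps the
    warp schedulers busy while some blocks stall."""
    return num_blocks >= num_sms * blocks_per_sm

print(grid_fills_gpu(32))    # False: most SMs get no work at all
print(grid_fills_gpu(4096))  # True
```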

Adjust kernel launch configuration to expose more parallelism. Each thread can do less work if needed.

❌ Coarse-grained

// 100 blocks, each does 1000 items
kernel<<<100, 256>>>();

✓ Fine-grained

// 10,000 blocks, each does 10 items
kernel<<<10000, 256>>>();

Balance: Too fine = launch overhead; too coarse = underutilization. Profile to find sweet spot.
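Both failure modes show up in a toy cost model: blocks run in waves of whatever the GPU can hold at once, and each block pays a fixed scheduling cost. The constants below are invented purely to illustrate the shape of the curve:

```python
def total_time_us(items, items_per_block, concurrent_blocks=2048,
                  us_per_item=0.01, block_overhead_us=1.0):
    """Toy model: too coarse -> too few blocks, hardware idles
    within the single wave; too fine -> per-block overhead
    dominates across many waves."""
    blocks = -(-items // items_per_block)    # ceil division
    waves = -(-blocks // concurrent_blocks)  # ceil division
    return waves * (block_overhead_us + items_per_block * us_per_item)

for g in (10_000, 500, 10):
    print(g, total_time_us(1_000_000, g))
# coarse (10,000/block) and fine (10/block) both lose to ~500/block
```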

Run multiple independent kernels or overlap compute with memory transfers using streams.

Stream concurrency

cudaStream_t s1, s2, s3;
cudaStreamCreate(&s1); // ... likewise s2, s3

// These can execute concurrently if resources allow
kernel_a<<<grid, block, 0, s1>>>();
kernel_b<<<grid, block, 0, s2>>>();
kernel_c<<<grid, block, 0, s3>>>();

// Overlap compute with a device-to-host transfer
cudaMemcpyAsync(h_out, d_out, size, cudaMemcpyDeviceToHost, s1);
kernel<<<grid, block, 0, s2>>>(d_in2, d_out2);

PyTorch: create a torch.cuda.Stream() and enqueue ops inside a with torch.cuda.stream(stream): block.

Instead of launching many small kernels, launch one kernel that persists and processes a work queue. Eliminates launch overhead.

Persistent kernel pattern

__global__ void persistent_kernel(WorkQueue* queue) {
  while (true) {
    int task = atomicAdd(&queue->head, 1);
    if (task >= queue->total_tasks) break;
    process_task(queue->tasks[task]);
  }
}
// Launch once with enough blocks to fill the GPU
persistent_kernel<<<num_sms * blocks_per_sm, 256>>>(queue);

Use cases: Graph neural networks, dynamic workloads, task queues. Requires careful synchronization.
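The claim-a-task loop is easy to prototype on the host. A Python analogue using threads, with a lock standing in for atomicAdd (illustrative only; the GPU version needs no lock):

```python
import threading

def run_persistent_workers(tasks, num_workers=4):
    """Each worker repeatedly claims the next task index until
    the queue is exhausted -- the atomicAdd loop, in miniature."""
    head = [0]
    lock = threading.Lock()
    results = [None] * len(tasks)

    def worker():
        while True:
            with lock:               # stands in for atomicAdd
                task = head[0]
                head[0] += 1
            if task >= len(tasks):
                return
            results[task] = tasks[task] * 2  # "process" the task

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return results

print(run_persistent_workers([1, 2, 3, 4, 5]))  # [2, 4, 6, 8, 10]
```

Every task is claimed exactly once no matter how the workers interleave, which is the property the GPU pattern relies on.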

If you have many tiny kernels launching sequentially, fuse them into one larger kernel to reduce launch overhead and improve occupancy.

❌ Many small kernels

add<<<1, 256>>>(a, b, c);
mul<<<1, 256>>>(c, d, e);
relu<<<1, 256>>>(e, f);
// 3 launches, gaps between

✓ One fused kernel

fused_add_mul_relu<<<1, 256>>>(a, b, d, f);
// 1 launch, no gaps

Automation: PyTorch torch.compile, TensorRT, Triton auto-fuse many operations.

Diagnosis Decision Flow

Use this flowchart to identify your bottleneck type

1. Check the Nsys timeline: are there gaps between kernels?

   → Yes: Underutilized (launch overhead, small kernels)
   → No: continue

2. Check NCU Speed of Light: memory vs compute utilization

   → Memory ~80%+, compute low: Memory Bound
   → Compute ~80%+, memory low: Compute Bound
   → Both low: continue

3. Check stall reasons: are Long Scoreboard / Not Selected high?

   → Yes: Latency Bound (not enough concurrent work to hide latency)
   → No: Underutilized (check occupancy)

💡 Pro Tip

Most real kernels are mixed—some phases memory-bound, others compute-bound. Profile individual sections or use NCU Source Page to pinpoint bottlenecks at the instruction level.