Quick Reference Table
| Limiting Factor | Description | Profiler Indicators | Key Remedies |
|---|---|---|---|
|
Memory Bound
|
Moving data at peak DRAM bandwidth but not enough compute per byte |
High Memory BW
Low FLOPS
|
Tiling, fusion, coalescing, caching |
|
Compute Bound
|
Memory latency hidden, ALUs are the bottleneck |
High FLOPS
Low Memory Util
|
ILP, Tensor Cores, lower precision |
|
Latency Bound
|
Not enough concurrent work to hide load/store latencies |
Low Achieved BW
High Stalls
|
Occupancy, ILP, prefetching |
|
Underutilized
|
SMs not fully occupied, both memory and compute idle |
Low Occupancy
Timeline Gaps
|
More threads, batch work, streams |
Memory Bound
You're moving as much data as you can—close to peak DRAM bandwidth—but you don't have enough work per byte to fully utilize the ALUs.
Profiler Indicators
Near peak DRAM bandwidth utilization
Low arithmetic intensity
Detailed Remedies
Compute Bound
You've hidden memory latency and are no longer saturating memory bandwidth. Now the ALUs (CUDA cores and Tensor Cores) are the bottleneck.
Profiler Indicators
Approaching GPU peak FLOPS
Memory not saturated
Detailed Remedies
Latency Bound
You're not sustaining enough concurrent work to hide individual load/store latencies, so warps stall waiting on data.
Profiler Indicators
Well below peak (should be higher if truly memory-bound)
Detailed Remedies
Underutilizing the GPU
You're not fully occupying SMs or launching enough work—both memory and compute resources remain idle.
Profiler Indicators
Nsys Timeline: Look for gaps between kernels, sparse kernel activity, or kernels not spanning all SMs.
Detailed Remedies
Diagnosis Decision Flow
Use this flowchart to identify your bottleneck type
Check Nsys timeline: Are there gaps between kernels?
Check NCU Speed of Light: Memory vs Compute %
→ Memory ~80%+, Compute low: Memory Bound
→ Compute ~80%+, Memory low: Compute Bound
→ Both low: Continue
Check stall reasons: Long Scoreboard, Not Selected high?
💡 Pro Tip
Most real kernels are mixed—some phases memory-bound, others compute-bound. Profile individual sections or use NCU Source Page to pinpoint bottlenecks at the instruction level.