Speed of Light (SOL) Analysis: Visual Example
Conceptual representation of how NVIDIA Nsight Compute displays SOL metrics for three kernel types. Each shows SM SOL (compute ceiling) and Memory SOL (bandwidth ceiling).
1. Compute-Bound
e.g. Large Matrix Multiply
Observation: SM is almost at theoretical peak.
Interview: “Math-saturated. Check Tensor Cores or FP8/BF16 to raise the ceiling.”
2. Memory-Bound
e.g. Element-wise Addition
Observation: Math units starving; memory bus full.
Interview: “HBM-limited. Try kernel fusion or fix non-coalesced access.”
3. Latency / Under-utilized
e.g. Small Grid or High Sync
Observation: Both lights low; hardware mostly idling.
Interview: “Dead zone. Look at occupancy, global sync, tail effects.”
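The decision rule behind the three cases can be sketched as a small classifier over the two SOL percentages. The 60% threshold below is illustrative, not an Nsight Compute default:

```python
def classify_sol(sm_sol: float, mem_sol: float) -> str:
    """Classify a kernel from its SM and Memory SOL percentages (0-100).

    The 60% cutoff is an illustrative threshold, not an NCU default.
    """
    HIGH = 60.0
    if sm_sol >= HIGH and sm_sol >= mem_sol:
        return "compute-bound"       # math units near peak
    if mem_sol >= HIGH:
        return "memory-bound"        # HBM bandwidth saturated
    return "latency/under-utilized"  # both low: occupancy, sync, tail effects

print(classify_sol(92, 40))  # large matmul -> compute-bound
print(classify_sol(15, 88))  # element-wise add -> memory-bound
print(classify_sol(12, 10))  # tiny grid -> latency/under-utilized
```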
The Roofline Model: A Deep Dive
The roofline model is a simplified, visual model of performance used to quickly determine whether a program is bound by memory bandwidth or arithmetic bandwidth. Two hardware-derived “roofs” put a ceiling on possible performance: the compute roof (peak rate of CUDA Cores or Tensor Cores = arithmetic bandwidth) and the memory roof (peak memory throughput = memory bandwidth). See Modal GPU Glossary: Roofline Model.
The model is drawn on a plane with arithmetic intensity (operations per byte) on the x-axis and performance (operations per second) on the y-axis. The compute roof is a horizontal line at the arithmetic bandwidth; the memory roof is a slanted line whose slope equals the memory bandwidth (rise over run = ops/sec ÷ ops/byte = bytes/sec).
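That geometry collapses to a one-line formula: attainable performance is the minimum of the compute roof and AI times memory bandwidth. A minimal sketch, using H100 SXM peak numbers as illustrative constants:

```python
PEAK_COMPUTE = 989e12   # FLOP/s, H100 SXM dense BF16 Tensor Core peak (illustrative)
PEAK_MEM_BW  = 3.35e12  # Byte/s, H100 SXM HBM3 bandwidth (illustrative)

def attainable_flops(ai: float) -> float:
    """Roofline: performance is capped by whichever roof is lower at this AI."""
    return min(PEAK_COMPUTE, ai * PEAK_MEM_BW)

# An element-wise kernel (AI ~ 0.1 FLOP/Byte) sits under the slanted memory roof:
print(attainable_flops(0.1) / 1e12)    # ~0.335 TFLOP/s
# A dense GEMM with AI ~ 500 FLOP/Byte hits the flat compute roof:
print(attainable_flops(500.0) / 1e12)  # 989 TFLOP/s
```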
Memory-Bound Region
Workloads like LayerNorm or Element-wise ops live here. Limited by HBM bandwidth.
Fix: Kernel fusion, better caching.
Compute-Bound Region
Workloads like Large MatMuls live here. Limited by Tensor Core throughput.
Fix: SASS/PTX tuning, lower precision (FP8).
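To see why these workloads land where they do, count FLOPs and bytes directly. A sketch for an FP32 element-wise add (c = a + b) versus a square FP16 GEMM, with byte counts assuming every operand moves through HBM once (the model's usual simplification, ignoring caching):

```python
def ai_elementwise_add(n: int, bytes_per_elem: int = 4) -> float:
    """c[i] = a[i] + b[i]: 1 FLOP per element, 2 reads + 1 write."""
    flops = n
    bytes_moved = 3 * n * bytes_per_elem
    return flops / bytes_moved  # = 1/12 for FP32, independent of n

def ai_square_gemm(n: int, bytes_per_elem: int = 2) -> float:
    """C = A @ B for n x n matrices: 2n^3 FLOPs, three n^2 matrices moved once."""
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * bytes_per_elem
    return flops / bytes_moved  # = n/3 for FP16: grows with problem size

print(ai_elementwise_add(1 << 20))  # ~0.083 FLOP/Byte: deep in memory-bound territory
print(ai_square_gemm(4096))         # ~1365 FLOP/Byte: well past the H100 ridge (~295)
```

The key asymmetry: element-wise AI is a constant, so no amount of tuning moves it past the ridge; GEMM AI grows with problem size, which is why only large matmuls become compute-bound.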
Ridge Point
Inflection point where the bottleneck shifts. The critical arithmetic intensity (AI_crit):
AI_crit = Peak Compute / Peak Memory BW
Units: (FLOP/s) ÷ (Byte/s) = FLOP/Byte — the seconds cancel.
H100: 989 TFLOP/s ÷ 3.35 TB/s ≈ 295 FLOP/Byte
H100 has a high ridge (~300), so kernels must be very math-heavy to reach 100% SM utilization.
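The ridge-point arithmetic above as a quick check, plus the classification rule it implies:

```python
# H100 SXM illustrative peaks: 989 TFLOP/s (dense BF16) over 3.35 TB/s (HBM3).
ai_crit = 989e12 / 3.35e12  # FLOP/Byte; the seconds cancel

def bound_by(ai: float) -> str:
    """A kernel left of the ridge is memory-bound; at or right of it, compute-bound."""
    return "compute" if ai >= ai_crit else "memory"

print(round(ai_crit))  # 295
print(bound_by(0.08))  # memory  (element-wise add)
print(bound_by(1365))  # compute (large FP16 GEMM)
```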
Yellow line: Memory bandwidth. Cyan line: Arithmetic bandwidth. They meet at the Ridge Point. Example kernels (Softmax/Add, Dense GEMM, FlashAttention) show where typical workloads sit. If a kernel is far below the lines, discuss occupancy and memory latency stalls.
Two Roofs & Ridge Point
The ridge point is the boundary where the two roofs meet. Its x-coordinate is the minimum arithmetic intensity required to escape the memory bottleneck. Systems whose ridge point sits further to the left are easier to run at peak performance; because memory bandwidth has scaled more slowly than compute, ridge points have moved to the right over time.
Subsystems & NCU
The compute and memory roofs need only be derived once per subsystem—and they vary by subsystem, not just by system. Tensor Cores deliver more FLOP/s than CUDA Cores, so the flat roof sits higher for Tensor Core workloads.
NVIDIA Nsight Compute automatically performs roofline analysis for profiled kernels, so you can see where your kernel sits relative to both roofs without plotting by hand.
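A typical invocation looks like the following. The section-set and binary names are as found in recent Nsight Compute releases, but check `ncu --help` on your install, since sets have changed across versions:

```shell
# Profile all kernels in ./app and collect the roofline chart section.
# "roofline" is a named section set in recent ncu releases; on older
# versions, collect the SpeedOfLight sections instead.
ncu --set roofline -o roofline_report ./app

# Open the report in the GUI; the roofline chart appears per kernel,
# with the kernel plotted against both roofs.
ncu-ui roofline_report.ncu-rep
```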
Caveats & Origin
The roofline model is deceptively simple. System latencies do not appear anywhere in the diagram—only bandwidths and throughputs. It is simple because it is highly opinionated; understanding those opinions and their reasoning is key to using the model well.
The roofline model was introduced by Samuel Williams, Andrew Waterman, and David Patterson in their 2008 paper (CACM). It was proposed in the face of several hardware scaling trends that still shape system design today.
Why Roofline: Historical Context
- Latency lags bandwidth (Patterson, 2004): linear improvement in latency has historically come with quadratic improvement in bandwidth → throughput-oriented designs (like GPUs).
- Memory wall (Wulf & McKee, 1994): compute has scaled much faster than memory/caches/DRAM.
- End of Dennard scaling: clock speed could not keep rising at equal power; transistor count kept rising (Moore’s Law) → solution was hardware specialization (e.g. GPUs, accelerators).
Taken together, these trends suggested that future systems would be throughput-oriented and that memory bandwidth would be the primary performance bottleneck. Applications that want peak performance need high arithmetic intensity for the hardware’s specialized ops—on GPUs, that means high AI for Tensor Cores, i.e. very large matrix multiplications.
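As a concrete closing check on that claim: for a square FP16 GEMM, AI ≈ N/3 FLOP/Byte (2N³ FLOPs over 3·2N² bytes, ignoring caching, as above), so reaching the H100 ridge of ~295 requires N in the high hundreds. A sketch:

```python
import math

AI_CRIT_H100 = 989e12 / 3.35e12  # ~295 FLOP/Byte

def min_square_gemm_n(ai_crit: float, bytes_per_elem: int = 2) -> int:
    """Smallest N where a square GEMM's AI = 2N^3 / (3*N^2*bytes) reaches ai_crit."""
    # AI = 2N / (3 * bytes)  =>  N = ai_crit * 3 * bytes / 2
    return math.ceil(ai_crit * 3 * bytes_per_elem / 2)

print(min_square_gemm_n(AI_CRIT_H100))  # 886: matmuls must be large to be compute-bound
```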