Speed of Light (SOL) Analysis: Visual Example
Conceptual representation of how NVIDIA Nsight Compute displays SOL metrics for three kernel types. Each shows SM SOL (compute ceiling) and Memory SOL (bandwidth ceiling).
1. Compute-Bound
e.g. Large Matrix Multiply
Observation: SM is almost at theoretical peak.
Interview: “Math-saturated. Check Tensor Cores or FP8/BF16 to raise the ceiling.”
2. Memory-Bound
e.g. Element-wise Addition
Observation: Math units starving; memory bus full.
Interview: “HBM-limited. Try kernel fusion or fix non-coalesced access.”
3. Latency / Under-utilized
e.g. Small Grid or High Sync
Observation: Both lights low; hardware mostly idling.
Interview: “Dead zone. Look at occupancy, global sync, tail effects.”
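The decision rule behind the three cases can be sketched as a small classifier over the two SOL percentages. The 60% threshold below is illustrative, not an Nsight Compute default:

```python
def classify_sol(sm_sol: float, mem_sol: float) -> str:
    """Classify a kernel from its SM and Memory SOL percentages (0-100).

    The 60% cutoff is an illustrative threshold, not an NCU default.
    """
    HIGH = 60.0
    if sm_sol >= HIGH and sm_sol >= mem_sol:
        return "compute-bound"       # math units near peak
    if mem_sol >= HIGH:
        return "memory-bound"        # HBM bandwidth saturated
    return "latency/under-utilized"  # both low: occupancy, sync, tail effects

print(classify_sol(92, 40))  # large matmul -> compute-bound
print(classify_sol(15, 88))  # element-wise add -> memory-bound
print(classify_sol(12, 10))  # tiny grid -> latency/under-utilized
```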
The Roofline Model: A Deep Dive
The roofline model is a simplified, visual model of performance used to quickly determine whether a program is bound by memory bandwidth or arithmetic bandwidth. Two hardware-derived “roofs” put a ceiling on possible performance: the compute roof (peak rate of CUDA Cores or Tensor Cores = arithmetic bandwidth) and the memory roof (peak memory throughput = memory bandwidth). See Modal GPU Glossary: Roofline Model.
The model is drawn on a plane with arithmetic intensity (operations per byte) on the x-axis and performance (operations per second) on the y-axis. The compute roof is a horizontal line at the arithmetic bandwidth; the memory roof is a slanted line whose slope equals the memory bandwidth (rise over run = ops/sec ÷ ops/byte = bytes/sec).
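That geometry collapses to a one-line formula: attainable performance is the minimum of the compute roof and AI times memory bandwidth. A minimal sketch, using H100 SXM peak numbers as illustrative constants:

```python
PEAK_COMPUTE = 989e12   # FLOP/s, H100 SXM dense BF16 Tensor Core peak (illustrative)
PEAK_MEM_BW  = 3.35e12  # Byte/s, H100 SXM HBM3 bandwidth (illustrative)

def attainable_flops(ai: float) -> float:
    """Roofline: performance is capped by whichever roof is lower at this AI."""
    return min(PEAK_COMPUTE, ai * PEAK_MEM_BW)

# An element-wise kernel (AI ~ 0.1 FLOP/Byte) sits under the slanted memory roof:
print(attainable_flops(0.1) / 1e12)    # ~0.335 TFLOP/s
# A dense GEMM with AI ~ 500 FLOP/Byte hits the flat compute roof:
print(attainable_flops(500.0) / 1e12)  # 989 TFLOP/s
```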
Memory-Bound Region
Workloads like LayerNorm or Element-wise ops live here. Limited by HBM bandwidth.
Fix: Kernel fusion, better caching.
Compute-Bound Region
Workloads like Large MatMuls live here. Limited by Tensor Core throughput.
Fix: SASS/PTX tuning, lower precision (FP8).
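To see why these workloads land where they do, count FLOPs and bytes directly. A sketch for an FP32 element-wise add (c = a + b) versus a square FP16 GEMM, with byte counts assuming every operand moves through HBM once (the model's usual simplification, ignoring caching):

```python
def ai_elementwise_add(n: int, bytes_per_elem: int = 4) -> float:
    """c[i] = a[i] + b[i]: 1 FLOP per element, 2 reads + 1 write."""
    flops = n
    bytes_moved = 3 * n * bytes_per_elem
    return flops / bytes_moved  # = 1/12 for FP32, independent of n

def ai_square_gemm(n: int, bytes_per_elem: int = 2) -> float:
    """C = A @ B for n x n matrices: 2n^3 FLOPs, three n^2 matrices moved once."""
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * bytes_per_elem
    return flops / bytes_moved  # = n/3 for FP16: grows with problem size

print(ai_elementwise_add(1 << 20))  # ~0.083 FLOP/Byte: deep in memory-bound territory
print(ai_square_gemm(4096))         # ~1365 FLOP/Byte: well past the H100 ridge (~295)
```

The key asymmetry: element-wise AI is a constant, so no amount of tuning moves it past the ridge; GEMM AI grows with problem size, which is why only large matmuls become compute-bound.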
Ridge Point
Inflection point where the bottleneck shifts. The critical arithmetic intensity (AI_crit):
AI_crit = Peak Compute / Peak Memory BW
Units: (FLOP/s) ÷ (Byte/s) = FLOP/Byte — the seconds cancel.
H100: 989 TFLOP/s ÷ 3.35 TB/s ≈ 295 FLOP/Byte
H100 has a high ridge (~300), so kernels must be very math-heavy to reach 100% SM utilization.
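The ridge-point arithmetic above as a quick check, plus the classification rule it implies:

```python
# H100 SXM illustrative peaks: 989 TFLOP/s (dense BF16) over 3.35 TB/s (HBM3).
ai_crit = 989e12 / 3.35e12  # FLOP/Byte; the seconds cancel

def bound_by(ai: float) -> str:
    """A kernel left of the ridge is memory-bound; at or right of it, compute-bound."""
    return "compute" if ai >= ai_crit else "memory"

print(round(ai_crit))  # 295
print(bound_by(0.08))  # memory  (element-wise add)
print(bound_by(1365))  # compute (large FP16 GEMM)
```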
Yellow line: Memory bandwidth. Cyan line: Arithmetic bandwidth. They meet at the Ridge Point. Example kernels (Softmax/Add, Dense GEMM, FlashAttention) show where typical workloads sit. If a kernel is far below the lines, discuss occupancy and memory latency stalls.
Two Roofs & Ridge Point
The ridge point is the boundary where the two roofs meet. Its x-coordinate is the minimum arithmetic intensity required to escape the memory bottleneck. Systems whose ridge point sits further to the left are easier to run at peak performance; because memory bandwidth has scaled more slowly than compute, ridge points have moved to the right over time.
Subsystems & NCU
The compute and memory roofs need only be derived once per subsystem—and they vary by subsystem, not just by system. Tensor Cores deliver more FLOP/s than CUDA Cores, so the flat roof sits higher for Tensor Core workloads.
NVIDIA Nsight Compute automatically performs roofline analysis for profiled kernels, so you can see where your kernel sits relative to both roofs without plotting by hand.
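A typical invocation looks like the following. The section-set and binary names are as found in recent Nsight Compute releases, but check `ncu --help` on your install, since sets have changed across versions:

```shell
# Profile all kernels in ./app and collect the roofline chart section.
# "roofline" is a named section set in recent ncu releases; on older
# versions, collect the SpeedOfLight sections instead.
ncu --set roofline -o roofline_report ./app

# Open the report in the GUI; the roofline chart appears per kernel,
# with the kernel plotted against both roofs.
ncu-ui roofline_report.ncu-rep
```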
Caveats & Origin
The roofline model is deceptively simple. System latencies do not appear anywhere in the diagram—only bandwidths and throughputs. It is simple because it is highly opinionated; understanding those opinions and their reasoning is key to using the model well.
The roofline model was introduced by Samuel Williams, Andrew Waterman, and David Patterson in their 2008 paper (CACM). It was proposed in the face of several hardware scaling trends that still shape system design today.
Why Roofline: Historical Context
- Latency lags bandwidth (Patterson, 2004): linear improvement in latency has historically come with quadratic improvement in bandwidth → throughput-oriented designs (like GPUs).
- Memory wall (Wulf & McKee, 1994): compute has scaled much faster than memory/caches/DRAM.
- End of Dennard scaling: clock speed could not keep rising at equal power; transistor count kept rising (Moore’s Law) → solution was hardware specialization (e.g. GPUs, accelerators).
Taken together, these trends suggested that future systems would be throughput-oriented and that memory bandwidth would be the primary performance bottleneck. Applications that want peak performance need high arithmetic intensity for the hardware’s specialized ops—on GPUs, that means high AI for Tensor Cores, i.e. very large matrix multiplications.
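As a concrete closing check on that claim: for a square FP16 GEMM, AI ≈ N/3 FLOP/Byte (2N³ FLOPs over 3·2N² bytes, ignoring caching, as above), so reaching the H100 ridge of ~295 requires N in the high hundreds. A sketch:

```python
import math

AI_CRIT_H100 = 989e12 / 3.35e12  # ~295 FLOP/Byte

def min_square_gemm_n(ai_crit: float, bytes_per_elem: int = 2) -> int:
    """Smallest N where a square GEMM's AI = 2N^3 / (3*N^2*bytes) reaches ai_crit."""
    # AI = 2N / (3 * bytes)  =>  N = ai_crit * 3 * bytes / 2
    return math.ceil(ai_crit * 3 * bytes_per_elem / 2)

print(min_square_gemm_n(AI_CRIT_H100))  # 886: matmuls must be large to be compute-bound
```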