Overview
When a thread issues a load from global memory (e.g. LDG), the request does not go straight to HBM. It passes through the cache hierarchy. Each level has a size (capacity), latency (cycles to satisfy a hit), and throughput (bandwidth). On a miss, the request goes to the next level until it reaches HBM; data then flows back up into the requesting thread’s register.
The diagram below shows one path: Register (request) → L1 → L2 → HBM, with typical sizes, latencies, and throughput for each level (H100 SXM–style numbers). On Hopper (H100) and Blackwell (B200), an additional level—Distributed Shared Memory (DSMEM)—lets SMs in a thread block cluster access each other’s shared memory without going through L2. Actual cycles vary by GPU and access pattern; use these as a mental model.
Diagram: Sizes, Latencies, Throughput
Request flows down on a miss; data flows back up into the register. Solid arrows: normal load path. Dashed: alternative path (TMA) — Hopper+ bulk copy from HBM/L2 into shared memory, bypassing registers. Numbers are approximate (H100-class GPU).
```mermaid
flowchart LR
    subgraph MAIN["Normal load path"]
        direction LR
        R["Register
        ────────────
        Size: private, ~65K regs/SM total
        Latency: ~1 cycle (operand read)
        Throughput: ~20 TB/s effective"]
        L1N["L1 / Shared
        ────────────
        Size: 128–256 KB per SM
        Latency: ~20–30 cycles (hit)
        Throughput: ~12–19 TB/s"]
        L2N["L2
        ────────────
        Size: 50–96 MB (whole GPU)
        Latency: ~200 cycles (hit)
        Throughput: ~6–8 TB/s
        Line: 128 B"]
        H["HBM (Global)
        ────────────
        Size: 80–192 GB
        Latency: ~400–800 cycles (DRAM)
        Throughput: ~3.35 TB/s (H100)"]
    end
    subgraph ALT["Alternative: TMA (no register path)"]
        direction TB
        TMA["TMA Unit (Hopper+)
        ────────────
        Bulk copy, register bypass"]
        SH["Shared Memory
        (TMA destination)"]
    end
    R -->|"1. Request (miss)"| L1N
    L1N -->|"2. L1 miss"| L2N
    L2N -->|"3. L2 miss"| H
    H -->|"4. Data return"| L2N
    L2N -->|"5. Fill L1"| L1N
    L1N -->|"6. To register"| R
    L2N -.->|"alt: TMA copy"| TMA
    TMA -.->|"alt: to shared"| SH
    style R fill:#76B900,stroke:#76B900,color:#000
    style L1N fill:#22c55e,stroke:#22c55e,color:#000
    style L2N fill:#3b82f6,stroke:#3b82f6,color:#fff
    style H fill:#6366f1,stroke:#6366f1,color:#fff
    style TMA fill:#76B900,stroke:#76B900,color:#000
    style SH fill:#22c55e,stroke:#22c55e,color:#000
```
Alternative (TMA): Data can flow HBM → L2 → TMA unit → Shared Memory without passing through the register file. One thread initiates the bulk copy; see Tensor Memory Accelerator (TMA) for details.
Distributed Shared Memory (Hopper / Blackwell)
On H100 (Hopper) and B200 (Blackwell), there is an additional level of hierarchy: Distributed Shared Memory (DSMEM). Within a Thread Block Cluster, threads can read and write the shared memory of other SMs in the same cluster via a high-speed SM-to-SM interconnect. That path does not go through L2, so latency for cluster-wide shared data is lower than a trip to L2 or HBM. This matters for complex multi-block kernels such as FlashAttention-style attention and producer-consumer patterns. For another alternative path (bulk copy from HBM to shared memory without going through registers), see Alternative Path: TMA below.
```mermaid
flowchart LR
    subgraph CLUSTER["Thread Block Cluster"]
        SM1["SM 1
        Local Shared"]
        SM2["SM 2
        Local Shared"]
        SM3["SM 3
        Local Shared"]
        SM1 <-->|"DSMEM
        SM-to-SM"| SM2
        SM2 <-->|"DSMEM"| SM3
        SM3 <-->|"DSMEM"| SM1
    end
    style SM1 fill:#22c55e,stroke:#22c55e,color:#000
    style SM2 fill:#22c55e,stroke:#22c55e,color:#000
    style SM3 fill:#22c55e,stroke:#22c55e,color:#000
```
SMs in the same cluster can exchange data over DSMEM without hitting L2, reducing latency compared to going through global memory or L2.
“In addition to this hierarchy, on Hopper and Blackwell, I also consider Distributed Shared Memory, which allows SMs to exchange data without hitting the L2, further reducing latency for complex kernels like FlashAttention.”
— Interview flex: shows you know the full hierarchy including DSMEM and its impact on modern attention kernels.
Alternative Path: TMA
On Hopper (H100) and later, the Tensor Memory Accelerator (TMA) provides a different data path: HBM → L2 → Shared Memory via a dedicated TMA unit, bypassing the register file and the normal load pipeline. This path is shown as the dashed “alt: TMA” edges in the diagram above. One thread initiates a bulk copy; data lands directly in shared memory without passing through registers. The path still goes through L2; only registers (and the L1 load path) are bypassed. For full details, NCU metrics, and kernel names, see Tensor Memory Accelerator (TMA).
Request Path Step-by-Step
- Thread issues load. The warp executes a load instruction; the address is in a register. The request is sent to L1 (or the memory pipeline that checks L1).
- L1 lookup. If the line is in L1, data is returned in ~20–30 cycles and written to the destination register. If L1 miss, the request goes to L2.
- L2 lookup. If the line is in L2, it is fetched in ~200 cycles, fills L1, and then data reaches the register. If L2 miss, the request goes to HBM.
- HBM (DRAM) access. The controller fetches a cache line (e.g. 128 bytes) from HBM. Latency is hundreds of cycles (e.g. 400–800); throughput is limited by HBM bandwidth (~3.35 TB/s on H100).
- Data returns. Data flows back: HBM → L2 (line filled) → L1 (line filled) → register. The warp stays resident but stalls while waiting (the scheduler simply stops issuing from it); once the data arrives, the warp becomes eligible again and the destination register receives the value.
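The steps above compose into a simple average-memory-access-time (AMAT) model. The sketch below uses the approximate hit latencies from this page; the hit rates are illustrative assumptions, not measured values.

```python
# AMAT for the L1 -> L2 -> HBM request path.
# Latencies (cycles) are midpoints of the approximate H100-class
# ranges on this page; the hit rates are hypothetical.
L1_LAT, L2_LAT, HBM_LAT = 25, 200, 600   # cycles
l1_hit, l2_hit = 0.80, 0.50              # assumed hit rates

# Every L1 miss pays the L2 lookup; every L2 miss additionally pays HBM.
amat = L1_LAT + (1 - l1_hit) * (L2_LAT + (1 - l2_hit) * HBM_LAT)
print(f"AMAT ≈ {amat:.0f} cycles")  # ≈ 125 with these assumptions
```

Even with an 80% L1 hit rate, the occasional trip to HBM dominates the average, which is why the misses, not the hits, set the performance of memory-bound kernels.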
Because latency is high for L2 and HBM, GPUs hide it by running many warps: when one warp stalls on a load, the scheduler runs another warp. See CUDA Execution & Memory Hierarchy (Warp Scheduler).
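The latency-hiding argument can be quantified with Little's law: bytes in flight = bandwidth × latency. The sketch below estimates how many outstanding 128 B loads are needed to saturate HBM on an H100-class GPU; the 1.5 GHz clock is an assumed round number, not an exact spec.

```python
import math

# Little's law: concurrency (bytes in flight) = bandwidth * latency.
HBM_BW = 3.35e12          # B/s (H100 SXM, from the table on this page)
HBM_LAT_CYCLES = 600      # cycles, midpoint of the ~400-800 range above
CLOCK_HZ = 1.5e9          # assumed SM clock; illustrative round number
LINE = 128                # bytes per memory transaction (cache line)
NUM_SMS = 132             # SM count on H100 SXM

latency_s = HBM_LAT_CYCLES / CLOCK_HZ              # ~400 ns
bytes_in_flight = HBM_BW * latency_s               # ~1.34 MB in flight
lines_in_flight = math.ceil(bytes_in_flight / LINE)
per_sm = bytes_in_flight / LINE / NUM_SMS

print(f"{bytes_in_flight/1e6:.2f} MB in flight "
      f"≈ {lines_in_flight} outstanding 128 B loads "
      f"≈ {per_sm:.0f} per SM")
```

A fully coalesced warp load (32 threads × 4 B) is exactly one 128 B line, so this works out to roughly 80 warp-loads in flight per SM; keeping that many loads outstanding is exactly what a large pool of resident warps buys you.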
Reference Table
Approximate values for a modern datacenter GPU (H100-class). Latencies are in cycles; actual values depend on GPU and workload.
| Level | Size | Latency (hit) | Throughput |
|---|---|---|---|
| Register | ~65K 32-bit regs per SM (shared by threads) | ~1 cycle | ~20 TB/s effective |
| L1 / Shared | 128–256 KB per SM | ~20–30 cycles | ~12–19 TB/s |
| DSMEM (cluster shared) | Shared memory of other SMs in the same cluster (H100/B200) | Lower than L2 (SM-to-SM interconnect) | High; bypasses L2 |
| L2 | 50–96 MB (whole GPU) | ~200 cycles | ~6–8 TB/s |
| HBM (Global) | 80–192 GB | ~400–800+ cycles | ~3.35 TB/s (H100 SXM) |
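The 128 B line size above is also what drives coalescing: the hardware groups a warp's 32 addresses into as few 128 B transactions as possible. A minimal sketch of the transaction count (a simplified model, ignoring sector-level granularity) for unit-stride versus strided 4-byte loads:

```python
def transactions(addresses, line=128):
    """Count the distinct 128 B lines touched by one warp's addresses."""
    return len({addr // line for addr in addresses})

WARP = 32
coalesced = [tid * 4 for tid in range(WARP)]    # contiguous 4 B floats
strided = [tid * 128 for tid in range(WARP)]    # one float per line

print(transactions(coalesced), transactions(strided))  # 1 vs 32
```

Same 128 bytes of useful data either way, but the strided pattern moves 32× more traffic over the ~3.35 TB/s HBM link, which is why access pattern matters as much as the raw bandwidth numbers in this table.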