GPU Memory Hierarchy

Understanding cache levels, access patterns, and bandwidth optimization for CUDA programming

1. GPU Memory Hierarchy

Modern NVIDIA GPUs have a multi-level memory hierarchy. Each level trades off capacity for speed. Understanding this hierarchy is critical for writing high-performance CUDA code.

Level                      Bandwidth            Capacity            Latency
Registers                  ~20 TB/s effective   256 KB per SM       ~1 cycle
L1 Cache + Shared Memory   ~12-19 TB/s          128-256 KB per SM   ~30 cycles
L2 Cache                   ~6-8 TB/s            50-96 MB total      ~200 cycles
HBM (VRAM)                 ~3.35 TB/s (H100)    80-192 GB total     ~400+ cycles

Registers, L1/shared memory, and L2 live on the GPU die; HBM sits off-die on the package.

Bandwidth Comparison (H100 SXM)

Registers: ~20 TB/s | L1/Shared: ~15 TB/s | L2 Cache: ~7 TB/s | HBM: ~3.35 TB/s

Data Path: Global Memory → L2 → Streaming Multiprocessors

Data flows from VRAM (global memory) through the L2 cache to the SMs. A cache miss triggers a high-latency fetch from VRAM into a 128-byte L2 cache line; a cache hit keeps the access low latency. A warp (32 threads) typically fetches 128 bytes at a time. Inside each SM, L1/shared memory is organized into 32 banks: a bank conflict serializes accesses when multiple threads in a warp touch different addresses in the same bank, while accesses to different banks proceed in parallel.

Figure: GPU memory hierarchy. VRAM → L2 (128-byte cache lines; miss/fill/evict) → SMs, each with L1/shared memory (32 banks), an instruction unit, a warp scheduler, and SIMT execution; cache hit vs. miss, bank conflict vs. parallel access.

2. Cache Line Architecture

L2 Cache Organization

Total Size         50-96 MB
Cache Line Size    128 bytes
Floats per Line    32 floats (4 B each)
Associativity      16-32-way set-associative
Sectors per Line   4 × 32 B sectors

Cache Line Layout

A 128-byte cache line is split into four 32-byte sectors:

Sector 0 (32 B) | Sector 1 (32 B) | Sector 2 (32 B) | Sector 3 (32 B)

Sectors can be filled independently, so a miss does not necessarily fetch the entire line.

Memory Layout: Cache Lines with Floats

Cache Line 0 (bytes 0-127):   f0  f1  f2  f3  ... f31   (32 floats)
Cache Line 1 (bytes 128-255): f32 f33 f34 f35 ... f63   (32 floats)
Cache Line 2 (bytes 256-383): f64 f65 f66 f67 ... f95   (32 floats)

3. Access Patterns & Cache Efficiency

Sequential Access (stride=1) - Best Case

Every element in the cache line is used. Maximum cache efficiency!

// Access pattern: f0, f1, f2, f3, ...
for (int i = 0; i < N; i++) {
    sum += data[i];  // stride = 1
}
Cache Efficiency: ~100%
All 32 floats per cache line are used
Cache Line 0: all 32 floats used ✓ (f0, f1, f2, ..., f31)
Measured BW: ~2500 GB/s (L2 cache serving most requests)

Strided Access (stride=32) - 1 Float per Line

Only 1 float used per cache line. Fetching 128 bytes to use 4 bytes!

// Access pattern: f0, f32, f64, f96, ...
for (int i = 0; i < N; i += 32) {
    sum += data[i];  // stride = 32
}
Cache Efficiency: 3.125%
1 of 32 floats used per cache line
Cache Line 0: only f0 used; f1 ... f31 fetched but wasted!
Cache Line 1: only f32 used; f33 ... f63 fetched but wasted
...
Measured BW: ~1800 GB/s (some L2 misses)

Large Stride (stride=1024) - Pure HBM

Stride of 1024 floats = 4KB between accesses. Cache lines are evicted before reuse. This measures raw HBM bandwidth.

// Access pattern: f0, f1024, f2048, ...
// Each access is 4 KB apart!
for (int i = 0; i < N; i += 1024) {
    sum += data[i];  // stride = 1024
}
Cache Efficiency: ~0%
Every access = L2 miss → HBM fetch
Read f0 (cache line 0) → skip 31 lines → read f1024 (cache line 32) → skip 31 lines → read f2048 (cache line 64) → ...

Cache line 0 is evicted before f32 is ever read!
Measured BW: ~800 GB/s (true HBM bandwidth)

Why L2 Can't Help

  • L2 Size: 96 MB, but tensor is 256 MB
  • Access Gap: 4KB between accesses = 32 cache lines skipped
  • Result: By the time loop returns to reuse data, it's been evicted
  • Every read: Goes to HBM (400+ cycle latency)

4. Stride vs Bandwidth Summary

Stride (floats)   Stride (bytes)   Cache Efficiency   Measured BW    What's Happening
1                 4 B              ~100%              ~2500 GB/s     L2 cache serves most requests
32                128 B            3.125%             ~1800 GB/s     1 float per cache line used
128               512 B            ~1%                ~1200 GB/s     L2 thrashing begins
512               2 KB             ~0%                ~900 GB/s      Mostly HBM
1024+             4 KB+            ~0%                ~800 GB/s      Pure HBM (no cache benefit)

5. Practical Implications

✓ When Cache Helps

  • Sequential/coalesced access - stride=1
  • Dense matrix operations - GEMM, convolutions
  • Working set fits in L2 - <96 MB on H100
  • Data reuse - same data accessed multiple times

✗ When Cache Can't Help

  • Sparse matrix operations - random access
  • Embedding lookups - scattered indices
  • Large stride patterns - stride >128
  • Working set exceeds L2 - streaming data

Optimization Strategies

Memory Coalescing

Ensure threads access adjacent memory locations for maximum cache line utilization.

Tiling / Blocking

Process data in chunks that fit in L2/shared memory to maximize reuse.

Prefetching

Use async copies (TMA, cp.async) to hide memory latency.
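As a sketch of the cp.async path (via libcu++'s cuda::memcpy_async / cuda::pipeline; TMA on Hopper uses a separate descriptor-based API not shown here), the pattern below stages a block's tile into shared memory asynchronously so the copy can overlap independent work. This is illustrative and unbenchmarked; kernel and buffer names are made up.

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>

// Sketch: asynchronously stage in[blockIdx.x*256 .. +255] into shared
// memory, then consume it. On Ampere+ the copy lowers to cp.async,
// bypassing registers and overlapping with independent instructions.
__global__ void scale_with_prefetch(const float* in, float* out, int n) {
    __shared__ float tile[256];
    int idx = blockIdx.x * 256 + threadIdx.x;
    auto block = cooperative_groups::this_thread_block();
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 1> state;
    auto pipe = cuda::make_pipeline(block, &state);

    pipe.producer_acquire();
    if (idx < n)
        cuda::memcpy_async(&tile[threadIdx.x], &in[idx], sizeof(float), pipe);
    pipe.producer_commit();
    // ...independent work could run here while the copy is in flight...
    pipe.consumer_wait();
    block.sync();
    if (idx < n) out[idx] = 2.0f * tile[threadIdx.x];
    pipe.consumer_release();
}
```

In real kernels this is typically double-buffered (a 2-stage pipeline) so the next tile's copy overlaps the current tile's compute.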