GPU Memory Hierarchy

Understanding cache levels, access patterns, and bandwidth optimization for CUDA programming

1. GPU Memory Hierarchy

Modern NVIDIA GPUs have a multi-level memory hierarchy. Each level trades off capacity for speed. Understanding this hierarchy is critical for writing high-performance CUDA code.

Level                      Bandwidth            Capacity            Latency
Registers                  ~20 TB/s effective   256 KB per SM       ~1 cycle
L1 Cache + Shared Memory   ~12-19 TB/s          128-256 KB per SM   ~30 cycles
L2 Cache                   ~6-8 TB/s            50-96 MB total      ~200 cycles
HBM (VRAM)                 ~3.35 TB/s (H100)    80-192 GB total     ~400+ cycles

Registers, L1/shared memory, and L2 live on the GPU die; HBM sits off-die on the package.

Bandwidth Comparison (H100 SXM)

Registers: ~20 TB/s | L1/Shared: ~15 TB/s | L2 Cache: ~7 TB/s | HBM: ~3.35 TB/s

Data Path: Global Memory → L2 → Streaming Multiprocessors

Data flows from VRAM (global memory) through the L2 cache to the SMs. A cache miss triggers a high-latency fetch from VRAM into a 128-byte L2 cache line; a cache hit keeps the access low latency. A warp (32 threads) typically fetches 128 bytes at a time. Inside each SM, L1/shared memory is organized into 32 banks: a bank conflict serializes accesses when multiple threads in a warp touch different addresses in the same bank, while accesses to different banks proceed in parallel.

Figure: GPU memory hierarchy. VRAM → L2 (128-byte cache lines; miss/fill/evict) → SMs, each with L1/shared memory (32 banks), an instruction unit, a warp scheduler, and SIMT execution; cache hit vs. miss, bank conflict vs. parallel access.

2. Cache Line Architecture

L2 Cache Organization

Total Size         50-96 MB
Cache Line Size    128 bytes
Floats per Line    32 floats (4 B each)
Associativity      16-32-way set-associative
Sectors per Line   4 × 32 B sectors

Cache Line Layout

A 128-byte cache line is split into four 32-byte sectors:

Sector 0 (32 B) | Sector 1 (32 B) | Sector 2 (32 B) | Sector 3 (32 B)

Sectors can be filled independently, so a miss does not necessarily fetch the entire line.

Memory Layout: Cache Lines with Floats

Cache Line 0 (bytes 0-127):   f0  f1  f2  f3  ... f31   (32 floats)
Cache Line 1 (bytes 128-255): f32 f33 f34 f35 ... f63   (32 floats)
Cache Line 2 (bytes 256-383): f64 f65 f66 f67 ... f95   (32 floats)

3. Access Patterns & Cache Efficiency

Sequential Access (stride=1) - Best Case

Every element in the cache line is used. Maximum cache efficiency!

// Access pattern: f0, f1, f2, f3, ...
for (int i = 0; i < N; i++) {
    sum += data[i];  // stride = 1
}
Cache Efficiency: ~100%
All 32 floats per cache line are used
Cache Line 0: all 32 floats used ✓ (f0, f1, f2, ..., f31)
Measured BW: ~2500 GB/s (L2 cache serving most requests)

Strided Access (stride=32) - 1 Float per Line

Only 1 float used per cache line. Fetching 128 bytes to use 4 bytes!

// Access pattern: f0, f32, f64, f96, ...
for (int i = 0; i < N; i += 32) {
    sum += data[i];  // stride = 32
}
Cache Efficiency: 3.125%
1 of 32 floats used per cache line
Cache Line 0: only f0 used; f1 ... f31 fetched but wasted!
Cache Line 1: only f32 used; f33 ... f63 fetched but wasted
...
Measured BW: ~1800 GB/s (some L2 misses)

Large Stride (stride=1024) - Pure HBM

Stride of 1024 floats = 4KB between accesses. Cache lines are evicted before reuse. This measures raw HBM bandwidth.

// Access pattern: f0, f1024, f2048, ...
// Each access is 4 KB apart!
for (int i = 0; i < N; i += 1024) {
    sum += data[i];  // stride = 1024
}
Cache Efficiency: ~0%
Every access = L2 miss → HBM fetch
Read f0 (cache line 0) → skip 31 lines → read f1024 (cache line 32) → skip 31 lines → read f2048 (cache line 64) → ...

Cache line 0 is evicted before f32 is ever read!
Measured BW: ~800 GB/s (true HBM bandwidth)

Why L2 Can't Help

  • L2 Size: 96 MB, but tensor is 256 MB
  • Access Gap: 4KB between accesses = 32 cache lines skipped
  • Result: By the time loop returns to reuse data, it's been evicted
  • Every read: Goes to HBM (400+ cycle latency)

4. Stride vs Bandwidth Summary

Stride (floats)   Stride (bytes)   Cache Efficiency   Measured BW    What's Happening
1                 4 B              ~100%              ~2500 GB/s     L2 cache serves most requests
32                128 B            3.125%             ~1800 GB/s     1 float per cache line used
128               512 B            ~1%                ~1200 GB/s     L2 thrashing begins
512               2 KB             ~0%                ~900 GB/s      Mostly HBM
1024+             4 KB+            ~0%                ~800 GB/s      Pure HBM (no cache benefit)

5. Practical Implications

✓ When Cache Helps

  • Sequential/coalesced access - stride=1
  • Dense matrix operations - GEMM, convolutions
  • Working set fits in L2 - <96 MB on H100
  • Data reuse - same data accessed multiple times

✗ When Cache Can't Help

  • Sparse matrix operations - random access
  • Embedding lookups - scattered indices
  • Large stride patterns - stride >128
  • Working set exceeds L2 - streaming data

Optimization Strategies

Memory Coalescing

Ensure threads access adjacent memory locations for maximum cache line utilization.

Tiling / Blocking

Process data in chunks that fit in L2/shared memory to maximize reuse.

Prefetching

Use async copies (TMA, cp.async) to hide memory latency.
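As a sketch of the cp.async path (via libcu++'s cuda::memcpy_async / cuda::pipeline; TMA on Hopper uses a separate descriptor-based API not shown here), the pattern below stages a block's tile into shared memory asynchronously so the copy can overlap independent work. This is illustrative and unbenchmarked; kernel and buffer names are made up.

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>

// Sketch: asynchronously stage in[blockIdx.x*256 .. +255] into shared
// memory, then consume it. On Ampere+ the copy lowers to cp.async,
// bypassing registers and overlapping with independent instructions.
__global__ void scale_with_prefetch(const float* in, float* out, int n) {
    __shared__ float tile[256];
    int idx = blockIdx.x * 256 + threadIdx.x;
    auto block = cooperative_groups::this_thread_block();
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 1> state;
    auto pipe = cuda::make_pipeline(block, &state);

    pipe.producer_acquire();
    if (idx < n)
        cuda::memcpy_async(&tile[threadIdx.x], &in[idx], sizeof(float), pipe);
    pipe.producer_commit();
    // ...independent work could run here while the copy is in flight...
    pipe.consumer_wait();
    block.sync();
    if (idx < n) out[idx] = 2.0f * tile[threadIdx.x];
    pipe.consumer_release();
}
```

In real kernels this is typically double-buffered (a 2-stage pipeline) so the next tile's copy overlaps the current tile's compute.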