Understanding cache levels, access patterns, and bandwidth optimization for CUDA programming
Modern NVIDIA GPUs have a multi-level memory hierarchy. Each level trades off capacity for speed. Understanding this hierarchy is critical for writing high-performance CUDA code.
Data flows from VRAM (global memory) through the L2 cache to the SMs. A cache miss triggers a high-latency fetch from VRAM that fills a 128-byte L2 cache line; a cache hit is served at much lower latency. A warp (32 threads) reading consecutive 4-byte values fetches one 128-byte line per access. Inside each SM, L1/shared memory is organized into 32 banks: accesses serialize (a bank conflict) when threads in a warp touch different words in the same bank, and proceed in parallel when threads use different banks.
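These two behaviors can be sketched as kernels. The following is illustrative code, not taken from the text: the first kernel shows a coalesced load (adjacent threads read adjacent floats, so one 128-byte line serves the warp); the second shows the classic padded shared-memory tile that avoids bank conflicts in a transpose. It assumes 32×32 thread blocks and a `width` divisible by 32.

```cuda
// Sketch: a coalesced global load, assuming a 1D grid covering n floats.
__global__ void copy_coalesced(const float* __restrict__ in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // adjacent threads read
    if (i < n) out[i] = in[i];                      // adjacent floats: one 128 B line per warp
}

// Sketch: conflict-free transpose, assuming blockDim = (32, 32).
__global__ void transpose_tile(const float* in, float* out, int width) {
    __shared__ float tile[32][33];                  // +1 padding column: column-wise
                                                    // reads land in different banks
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();
    x = blockIdx.y * 32 + threadIdx.x;              // swapped block indices
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}
```

Without the padding column, a warp reading `tile[threadIdx.x][threadIdx.y]` would hit the same bank 32 times and serialize.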
| L2 Cache Property | Value |
|---|---|
| Total Size | 50-96 MB |
| Cache Line Size | 128 bytes |
| Floats per Line | 32 floats (4 B each) |
| Associativity | 16-32 way set-associative |
| Sectors per Line | 4 × 32 B sectors |
Sequential access (stride 1): every element in the fetched cache line is used, giving maximum cache efficiency.
Strided access (stride 32 floats = 128 bytes): only 1 float is used per cache line. Fetching 128 bytes to use 4 bytes!
Large strides (1024 floats = 4 KB between accesses): cache lines are evicted before any reuse, so this pattern measures raw HBM bandwidth.
| Stride (floats) | Byte Stride | Cache Efficiency | Measured BW | What's Happening |
|---|---|---|---|---|
| 1 | 4 bytes | ~100% | ~2500 GB/s | L2 cache serves most requests |
| 32 | 128 bytes | 3.125% | ~1800 GB/s | 1 float per cache line used |
| 128 | 512 bytes | ~1% | ~1200 GB/s | L2 thrashing begins |
| 512 | 2 KB | ~0% | ~900 GB/s | Mostly HBM |
| 1024+ | 4 KB+ | 0% | ~800 GB/s | Pure HBM (no cache benefit) |
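A microbenchmark behind numbers like these typically looks like the following sketch. The kernel, grid sizing, and timing approach are illustrative assumptions, not the setup that produced the table:

```cuda
// Sketch: each thread reads one float at the given stride. Timing the launch
// (e.g. with cudaEvent_t) and dividing bytes fetched by elapsed time gives
// the kind of GB/s figures shown above.
__global__ void strided_read(const float* __restrict__ in, float* __restrict__ out,
                             size_t n, size_t stride) {
    size_t t = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t i = t * stride;          // stride in floats: 32 => threads land 128 B apart
    if (i < n) out[t] = in[i];      // the write keeps the compiler from eliding the read
}
// Host side (elided): launch n/stride threads, wrap the launch in cudaEvents,
// and report (lines_touched * 128 bytes) / elapsed_seconds.
```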
Ensure threads access adjacent memory locations for maximum cache line utilization.
Process data in chunks that fit in L2/shared memory to maximize reuse.
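The tiling idea can be sketched as a kernel that stages a block-sized chunk in shared memory and then reuses it entirely on-chip. This is an illustrative example (a block reduction) assuming a block size of 256 threads:

```cuda
#define TILE 256  // assumed block size

// Sketch: stage a tile of global data in shared memory, then reuse it many
// times at shared-memory latency instead of re-fetching from L2/HBM.
__global__ void tile_sum(const float* __restrict__ in, float* out, int n) {
    __shared__ float tile[TILE];               // lives on the SM
    int g = blockIdx.x * TILE + threadIdx.x;   // coalesced global load
    tile[threadIdx.x] = (g < n) ? in[g] : 0.0f;
    __syncthreads();                           // tile fully staged before reuse
    for (int s = TILE / 2; s > 0; s >>= 1) {   // log2(TILE) passes, all on-chip
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];  // one partial sum per block
}
```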
Use async copies (TMA, cp.async) to hide memory latency.
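One way to express such an async copy is the cooperative-groups `memcpy_async` API (CUDA 11+), which compiles to `cp.async` on sm_80 and later. A minimal sketch, assuming 256-thread blocks and `n` a multiple of 256:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Sketch: asynchronously stage a block's tile into shared memory. Independent
// work issued between memcpy_async and wait would overlap with the copy.
__global__ void async_stage(const float* __restrict__ in, float* out, int n) {
    __shared__ float buf[256];
    auto block = cg::this_thread_block();
    cg::memcpy_async(block, buf, in + blockIdx.x * 256, sizeof(float) * 256);
    cg::wait(block);                       // copy complete; buf is safe to read
    int i = blockIdx.x * 256 + threadIdx.x;
    if (i < n) out[i] = buf[threadIdx.x] * 2.0f;  // placeholder compute
}
```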