LLM Inference Performance
Compute vs Memory Bottleneck Analysis for Llama 3.3 70B
LLM inference has two distinct phases — prefill (compute-bound, processes all input tokens in parallel) and decode (memory-bandwidth-bound, generates output tokens one at a time). The balance between input and output length determines which phase dominates and where the bottleneck lies. This page visualizes the bottleneck regimes for NVIDIA B200 running Llama 3.3 70B, using data from NVIDIA's published benchmarks.
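The two regimes fall out of a simple roofline argument: compare each phase's arithmetic intensity (FLOPs per byte moved) with the GPU's machine balance. A minimal sketch in Python, where the B200 figures (dense FP8 compute, HBM3e bandwidth) and the FP8 weight format are approximate assumptions, not measured values:

```python
# Roofline sketch: arithmetic intensity of prefill vs. decode compared
# with the GPU's machine balance. Hardware numbers are approximate
# public B200 specs (assumptions); the model is Llama 3.3 70B.

PARAMS = 70e9            # Llama 3.3 70B parameter count
BYTES_PER_PARAM = 1.0    # FP8 weights (assumption)

PEAK_FLOPS = 4.5e15      # ~4.5 PFLOPS dense FP8 (approximate)
MEM_BW = 8.0e12          # ~8 TB/s HBM3e, in bytes/s (approximate)
machine_balance = PEAK_FLOPS / MEM_BW    # FLOPs per byte (~562)

def intensity(tokens_per_weight_load: int) -> float:
    """FLOPs per byte when `tokens_per_weight_load` tokens share one
    pass over the weights (~2 FLOPs per parameter per token)."""
    flops = 2 * PARAMS * tokens_per_weight_load
    bytes_moved = PARAMS * BYTES_PER_PARAM
    return flops / bytes_moved

# Decode at batch 1: one token per weight load -> far below balance,
# i.e. memory-bandwidth-bound.
print(f"decode:  {intensity(1):.0f} FLOPs/byte vs balance {machine_balance:.0f}")
# Prefill of a 2,048-token prompt: all tokens share the weight load ->
# well above balance, i.e. compute-bound.
print(f"prefill: {intensity(2048):.0f} FLOPs/byte vs balance {machine_balance:.0f}")
```

Under these assumptions, decode at batch 1 sits around 2 FLOPs/byte against a balance near 560, while a 2,048-token prefill sits far above it, which is the compute/memory split the charts below visualize.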
Bottleneck Analysis — B200
How does the balance between input tokens (prefill) and output tokens (decode) shape throughput? The charts below reveal two distinct bottleneck regimes on a single B200 GPU running Llama 3.3 70B.
[Chart: Throughput Landscape]
[Chart: Bottleneck Transition]
Key Takeaway: Two Bottleneck Regimes
When input tokens dominate (e.g., 2k/128, 5k/500, 20k/2k), the prefill phase processes all input tokens in parallel via large GEMMs, and the GPU's tensor cores are the bottleneck. Throughput drops sharply (1,339 tok/s at 2,048/128 versus 9,922 tok/s at 128/2,048): the GPU is saturated with math, not waiting on memory.
When output tokens dominate (e.g., 128/2k, 128/4k), the decode phase generates tokens one at a time, and each step must stream the full model weights from memory. Memory bandwidth is the bottleneck, but batching amortizes each weight load across many requests, so throughput is far higher.
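The amortization effect can be made concrete with a toy bandwidth-ceiling model: if every decode step streams the full weights once, the cost of that pass is shared by the whole batch. The FP8 weight size and the ~8 TB/s bandwidth figure below are assumptions:

```python
# Toy model of why batching raises decode throughput: each decode step
# streams the full model weights once, so a bandwidth-limited ceiling
# on tokens/s scales linearly with batch size. Weight size (FP8) and
# bandwidth are assumptions, not measured values.

WEIGHT_BYTES = 70e9 * 1.0    # Llama 3.3 70B in FP8 (assumption)
MEM_BW = 8.0e12              # ~8 TB/s B200 HBM3e (approximate)

def decode_ceiling(batch: int) -> float:
    """Upper bound on decode tok/s: one full weight pass per step
    yields `batch` tokens (ignores KV-cache traffic and compute)."""
    step_time = WEIGHT_BYTES / MEM_BW    # seconds per decode step
    return batch / step_time

for b in (1, 8, 64):
    print(f"batch {b:>2}: ~{decode_ceiling(b):,.0f} tok/s ceiling")
```

At batch 1 the ceiling is only about a hundred tokens per second; each additional request in the batch rides along on the same weight pass, which is why measured throughput on decode-heavy shapes reaches thousands of tok/s.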
NVIDIA Published Data — Llama 3.3 70B
Source: NVIDIA Deep Learning Performance — Max Throughput scenario.
Single-GPU configuration (one B200, so per-GPU throughput equals the total):

| Input Tokens | Output Tokens | Throughput (tok/s) | Throughput / GPU (tok/s) |
|---|---|---|---|
| 128 | 2,048 | 9,922 | 9,922 |
| 128 | 4,096 | 6,831 | 6,831 |
| 500 | 2,000 | 7,762 | 7,762 |
| 1,000 | 1,000 | 7,007 | 7,007 |
| 1,000 | 2,000 | 6,737 | 6,737 |
| 2,048 | 128 | 1,339 | 1,339 |
| 2,048 | 2,048 | 4,783 | 4,783 |
| 5,000 | 500 | 1,459 | 1,459 |
| 20,000 | 2,000 | 665 | 665 |
A second configuration from the same source; per-GPU values are half the totals, indicating a two-GPU setup:

| Input Tokens | Output Tokens | Throughput (tok/s) | Throughput / GPU (tok/s) |
|---|---|---|---|
| 128 | 2,048 | 6,651 | 3,326 |
| 128 | 4,096 | 4,199 | 2,100 |
| 500 | 2,000 | 5,222 | 2,611 |
| 1,000 | 1,000 | 4,205 | 2,103 |
| 1,000 | 2,000 | 4,146 | 2,073 |
| 2,048 | 128 | 762 | 381 |
| 2,048 | 2,048 | 3,082 | 1,541 |
| 5,000 | 500 | 898 | 449 |
| 20,000 | 2,000 | 437 | 219 |
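To make the two regimes visible in the measurements themselves, the single-B200 rows can be re-sorted by decode share (output tokens ÷ total tokens). The values below are copied from the first table:

```python
# The single-B200 rows from the table above, re-sorted by decode share
# (output tokens / total tokens). Decode-heavy shapes cluster at the
# high end of the throughput range, prefill-heavy shapes at the low end.

rows = [  # (input tokens, output tokens, throughput tok/s)
    (128, 2048, 9922), (128, 4096, 6831), (500, 2000, 7762),
    (1000, 1000, 7007), (1000, 2000, 6737), (2048, 128, 1339),
    (2048, 2048, 4783), (5000, 500, 1459), (20000, 2000, 665),
]

for inp, out, tps in sorted(rows, key=lambda r: r[1] / (r[0] + r[1]), reverse=True):
    share = out / (inp + out)
    print(f"{inp:>6}/{out:<5} decode share {share:4.0%}  {tps:>6,} tok/s")
```

The ordering is not perfectly monotone (128/4,096 trails 500/2,000 because the longer generation also grows the KV cache), but the pattern holds: shapes above ~50% decode share all exceed 4,700 tok/s, while prefill-heavy shapes stay below 1,500 tok/s.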