LLM Inference Performance

Compute vs Memory Bottleneck Analysis for Llama 3.3 70B

LLM inference has two distinct phases: prefill (compute-bound, processes all input tokens in parallel) and decode (memory-bandwidth-bound, generates output tokens one at a time). The balance between input and output length determines which phase dominates and where the bottleneck lies. This page visualizes the bottleneck regimes for an NVIDIA B200 running Llama 3.3 70B, with 2x H100 figures for comparison, using data from NVIDIA's published benchmarks.
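The compute-vs-bandwidth split can be seen with a back-of-envelope roofline check. The sketch below is illustrative, not a measurement: the peak-FLOPS and bandwidth figures are assumed round numbers for a B200-class GPU, and the intensity model counts only weight traffic.

```python
# Roofline sketch: a phase is compute-bound when its arithmetic intensity
# (FLOPs per byte moved) exceeds the hardware's ridge point.
# Hardware figures below are rough assumed values, not vendor specs.

PEAK_FLOPS = 9.0e15           # ~9 PFLOPS dense FP4 (assumed)
PEAK_BW = 8.0e12              # ~8 TB/s HBM bandwidth (assumed)
RIDGE = PEAK_FLOPS / PEAK_BW  # FLOPs per byte needed to stay compute-bound

def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 0.5) -> float:
    """FLOPs per byte of weight traffic for one forward pass.

    Each parameter does ~2 FLOPs (multiply + add) per token and is read
    once per pass, so intensity ~= 2 * tokens / bytes_per_param.
    """
    return 2 * tokens_per_pass / bytes_per_param

# Prefill: thousands of tokens per pass -> far above the ridge point.
print("prefill compute-bound:", arithmetic_intensity(2048) > RIDGE)
# Decode: one token per request; even a batch of 32 stays below it.
print("decode compute-bound: ", arithmetic_intensity(32) > RIDGE)
```

The exact ridge point shifts with precision and hardware, but the gap between thousands of tokens per pass (prefill) and a handful (decode) is so large that the qualitative conclusion holds across GPUs.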

Bottleneck Analysis — B200

How does the balance between input tokens (prefill) and output tokens (decode) shape throughput? The charts below reveal two distinct bottleneck regimes on a single B200 GPU running Llama 3.3 70B.

Throughput Landscape

Bottleneck Transition

Key Takeaway: Two Bottleneck Regimes

Compute-Bound (Prefill-Heavy)

When input tokens dominate (e.g., 2k/128, 5k/500, 20k/2k), the prefill phase processes all input tokens in parallel via large GEMMs, and the GPU's tensor cores become the bottleneck. Overall token throughput drops sharply: the GPU is busy doing heavy math, not waiting on memory.
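A rough cost model makes the prefill penalty concrete: a T-token prompt through a P-parameter dense model costs about 2*P*T FLOPs (one multiply-add per weight per token, attention FLOPs ignored). The peak-FLOPS and utilization figures below are assumptions for illustration.

```python
# Estimate pure prefill time for Llama 3.3 70B from the ~2*P*T FLOPs rule.
# PEAK_FLOPS and MFU are assumed values, not measured.

PARAMS = 70e9           # Llama 3.3 70B parameter count
PEAK_FLOPS = 9.0e15     # ~9 PFLOPS dense FP4 (assumed)
MFU = 0.4               # assumed achievable utilization during prefill

def prefill_seconds(input_tokens: int) -> float:
    flops = 2 * PARAMS * input_tokens
    return flops / (PEAK_FLOPS * MFU)

for t in (2_048, 20_000):
    print(f"{t:>6} input tokens -> ~{prefill_seconds(t):.2f} s of prefill compute")
```

Under these assumptions a 20k-token prompt costs roughly ten times the prefill compute of a 2k prompt, which is consistent with the steep throughput drop in the prefill-heavy rows of the table below.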

Memory-BW Bound (Decode-Heavy)

When output tokens dominate (e.g., 128/2k, 128/4k), the decode phase generates tokens one at a time, re-reading the full model weights at every step. Memory bandwidth is the bottleneck, but batching amortizes each weight load across many requests, so aggregate throughput is higher.

Memory capacity also matters: very long total sequences (e.g., 128/4k vs 128/2k) reduce throughput even within the decode-heavy regime, because the larger KV cache limits how many requests can run concurrently (a smaller batch means less amortization).
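Both decode-side effects can be sketched numerically. The model-shape numbers (80 layers, 8 KV heads, head dim 128 for a Llama-70B-class model) and the hardware figures are assumptions for illustration, not exact deployment values.

```python
# Decode sketch: per-step latency floor from weight streaming, and the
# KV-cache ceiling on batch size. All hardware figures are assumed.

BYTES_PER_PARAM = 0.5        # FP4 weights
PARAMS = 70e9
PEAK_BW = 8.0e12             # ~8 TB/s HBM bandwidth (assumed)
HBM_BYTES = 180e9            # assumed usable capacity of a 192 GB GPU

# Every decode step must stream all weights once; the whole batch shares it.
step_s = PARAMS * BYTES_PER_PARAM / PEAK_BW

# KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * bytes.
# Llama-70B-class shape with an FP8 KV cache (both assumptions).
kv_bytes_per_token = 2 * 80 * 8 * 128 * 1

def max_batch(seq_len: int) -> int:
    """Largest concurrent batch that fits after weights are resident."""
    free = HBM_BYTES - PARAMS * BYTES_PER_PARAM
    return int(free // (seq_len * kv_bytes_per_token))

# Longer sequences shrink the feasible batch, so fewer requests share each
# weight load: this is why 128/4k throughput trails 128/2k.
print(f"weight-stream floor per step: ~{step_s*1e3:.1f} ms")
print("max batch at 128/2k vs 128/4k:",
      max_batch(128 + 2_048), "vs", max_batch(128 + 4_096))
```

Halving the feasible batch roughly halves the amortization of each weight load, matching the drop from 9,922 to 6,831 tok/s between the 128/2k and 128/4k rows below in direction, if not in exact magnitude.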

NVIDIA Published Data — Llama 3.3 70B

Source: NVIDIA Deep Learning Performance — Max Throughput scenario.

B200: 1x B200 · TP=1 · PP=1 · FP4 · TensorRT-LLM 1.0

Input     Output    Throughput (tok/s)    Throughput / GPU
128       2,048     9,922                 9,922
128       4,096     6,831                 6,831
500       2,000     7,762                 7,762
1,000     1,000     7,007                 7,007
1,000     2,000     6,737                 6,737
2,048     128       1,339                 1,339
2,048     2,048     4,783                 4,783
5,000     500       1,459                 1,459
20,000    2,000     665                   665
H100 SXM5: 2x H100 · TP=2 · PP=1 · FP8 · TensorRT-LLM 1.0

Input     Output    Throughput (tok/s)    Throughput / GPU
128       2,048     6,651                 3,326
128       4,096     4,199                 2,100
500       2,000     5,222                 2,611
1,000     1,000     4,205                 2,103
1,000     2,000     4,146                 2,073
2,048     128       762                   381
2,048     2,048     3,082                 1,541
5,000     500       898                   449
20,000    2,000     437                   219
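The two tables can be compared directly on a per-GPU basis. The snippet below uses only the published figures above; note that the comparison mixes precisions (B200 at FP4, H100 at FP8), so the ratios reflect both hardware and quantization.

```python
# Per-GPU speedup of B200 (1x, FP4) over H100 (2x, TP=2, FP8), computed
# from the published per-GPU throughput columns above.

b200_per_gpu = {(128, 2048): 9922, (128, 4096): 6831,
                (2048, 128): 1339, (20000, 2000): 665}
h100_per_gpu = {(128, 2048): 3326, (128, 4096): 2100,
                (2048, 128): 381, (20000, 2000): 219}

for shape in b200_per_gpu:
    ratio = b200_per_gpu[shape] / h100_per_gpu[shape]
    print(f"in/out {shape[0]}/{shape[1]}: {ratio:.2f}x per GPU")
```

Across these workloads the per-GPU advantage sits roughly between 3x and 3.5x, and it is largest in the compute-bound 2,048/128 case, where FP4 tensor-core throughput matters most.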