LLM Inference Performance

Compute vs Memory Bottleneck Analysis for Llama 3.3 70B

LLM inference has two distinct phases: prefill (compute-bound, processes all input tokens in parallel) and decode (memory-bandwidth-bound, generates output tokens one at a time). The balance between input and output length determines which phase dominates and where the bottleneck lies. This page visualizes the bottleneck regimes for an NVIDIA B200 running Llama 3.3 70B, with 2x H100 figures for comparison, using data from NVIDIA's published benchmarks.
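The compute-vs-bandwidth split can be seen with a back-of-envelope roofline check. The sketch below is illustrative, not a measurement: the peak-FLOPS and bandwidth figures are assumed round numbers for a B200-class GPU, and the intensity model counts only weight traffic.

```python
# Roofline sketch: a phase is compute-bound when its arithmetic intensity
# (FLOPs per byte moved) exceeds the hardware's ridge point.
# Hardware figures below are rough assumed values, not vendor specs.

PEAK_FLOPS = 9.0e15           # ~9 PFLOPS dense FP4 (assumed)
PEAK_BW = 8.0e12              # ~8 TB/s HBM bandwidth (assumed)
RIDGE = PEAK_FLOPS / PEAK_BW  # FLOPs per byte needed to stay compute-bound

def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 0.5) -> float:
    """FLOPs per byte of weight traffic for one forward pass.

    Each parameter does ~2 FLOPs (multiply + add) per token and is read
    once per pass, so intensity ~= 2 * tokens / bytes_per_param.
    """
    return 2 * tokens_per_pass / bytes_per_param

# Prefill: thousands of tokens per pass -> far above the ridge point.
print("prefill compute-bound:", arithmetic_intensity(2048) > RIDGE)
# Decode: one token per request; even a batch of 32 stays below it.
print("decode compute-bound: ", arithmetic_intensity(32) > RIDGE)
```

The exact ridge point shifts with precision and hardware, but the gap between thousands of tokens per pass (prefill) and a handful (decode) is so large that the qualitative conclusion holds across GPUs.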

Bottleneck Analysis — B200

How does the balance between input tokens (prefill) and output tokens (decode) shape throughput? The charts below reveal two distinct bottleneck regimes on a single B200 GPU running Llama 3.3 70B.

Throughput Landscape

Bottleneck Transition

Key Takeaway: Two Bottleneck Regimes

Compute-Bound (Prefill-Heavy)

When input tokens dominate (e.g., 2k/128, 5k/500, 20k/2k), the prefill phase processes all input tokens in parallel via large GEMMs, and the GPU's tensor cores become the bottleneck. Overall token throughput drops sharply: the GPU is busy doing heavy math, not waiting on memory.
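A rough cost model makes the prefill penalty concrete: a T-token prompt through a P-parameter dense model costs about 2*P*T FLOPs (one multiply-add per weight per token, attention FLOPs ignored). The peak-FLOPS and utilization figures below are assumptions for illustration.

```python
# Estimate pure prefill time for Llama 3.3 70B from the ~2*P*T FLOPs rule.
# PEAK_FLOPS and MFU are assumed values, not measured.

PARAMS = 70e9           # Llama 3.3 70B parameter count
PEAK_FLOPS = 9.0e15     # ~9 PFLOPS dense FP4 (assumed)
MFU = 0.4               # assumed achievable utilization during prefill

def prefill_seconds(input_tokens: int) -> float:
    flops = 2 * PARAMS * input_tokens
    return flops / (PEAK_FLOPS * MFU)

for t in (2_048, 20_000):
    print(f"{t:>6} input tokens -> ~{prefill_seconds(t):.2f} s of prefill compute")
```

Under these assumptions a 20k-token prompt costs roughly ten times the prefill compute of a 2k prompt, which is consistent with the steep throughput drop in the prefill-heavy rows of the table below.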

Memory-BW Bound (Decode-Heavy)

When output tokens dominate (e.g., 128/2k, 128/4k), the decode phase generates tokens one at a time, re-reading the full model weights at every step. Memory bandwidth is the bottleneck, but batching amortizes each weight load across many requests, so aggregate throughput is higher.

Memory capacity also matters: very long total sequences (e.g., 128/4k vs 128/2k) reduce throughput even within the decode-heavy regime, because the larger KV cache limits how many requests can run concurrently (a smaller batch means less amortization).
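Both decode-side effects can be sketched numerically. The model-shape numbers (80 layers, 8 KV heads, head dim 128 for a Llama-70B-class model) and the hardware figures are assumptions for illustration, not exact deployment values.

```python
# Decode sketch: per-step latency floor from weight streaming, and the
# KV-cache ceiling on batch size. All hardware figures are assumed.

BYTES_PER_PARAM = 0.5        # FP4 weights
PARAMS = 70e9
PEAK_BW = 8.0e12             # ~8 TB/s HBM bandwidth (assumed)
HBM_BYTES = 180e9            # assumed usable capacity of a 192 GB GPU

# Every decode step must stream all weights once; the whole batch shares it.
step_s = PARAMS * BYTES_PER_PARAM / PEAK_BW

# KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * bytes.
# Llama-70B-class shape with an FP8 KV cache (both assumptions).
kv_bytes_per_token = 2 * 80 * 8 * 128 * 1

def max_batch(seq_len: int) -> int:
    """Largest concurrent batch that fits after weights are resident."""
    free = HBM_BYTES - PARAMS * BYTES_PER_PARAM
    return int(free // (seq_len * kv_bytes_per_token))

# Longer sequences shrink the feasible batch, so fewer requests share each
# weight load: this is why 128/4k throughput trails 128/2k.
print(f"weight-stream floor per step: ~{step_s*1e3:.1f} ms")
print("max batch at 128/2k vs 128/4k:",
      max_batch(128 + 2_048), "vs", max_batch(128 + 4_096))
```

Halving the feasible batch roughly halves the amortization of each weight load, matching the drop from 9,922 to 6,831 tok/s between the 128/2k and 128/4k rows below in direction, if not in exact magnitude.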

NVIDIA Published Data — Llama 3.3 70B

Source: NVIDIA Deep Learning Performance — Max Throughput scenario.

B200: 1x B200 · TP=1 · PP=1 · FP4 · TensorRT-LLM 1.0

Input     Output    Throughput (tok/s)    Throughput / GPU
128       2,048     9,922                 9,922
128       4,096     6,831                 6,831
500       2,000     7,762                 7,762
1,000     1,000     7,007                 7,007
1,000     2,000     6,737                 6,737
2,048     128       1,339                 1,339
2,048     2,048     4,783                 4,783
5,000     500       1,459                 1,459
20,000    2,000     665                   665
H100 SXM5: 2x H100 · TP=2 · PP=1 · FP8 · TensorRT-LLM 1.0

Input     Output    Throughput (tok/s)    Throughput / GPU
128       2,048     6,651                 3,326
128       4,096     4,199                 2,100
500       2,000     5,222                 2,611
1,000     1,000     4,205                 2,103
1,000     2,000     4,146                 2,073
2,048     128       762                   381
2,048     2,048     3,082                 1,541
5,000     500       898                   449
20,000    2,000     437                   219
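The two tables can be compared directly on a per-GPU basis. The snippet below uses only the published figures above; note that the comparison mixes precisions (B200 at FP4, H100 at FP8), so the ratios reflect both hardware and quantization.

```python
# Per-GPU speedup of B200 (1x, FP4) over H100 (2x, TP=2, FP8), computed
# from the published per-GPU throughput columns above.

b200_per_gpu = {(128, 2048): 9922, (128, 4096): 6831,
                (2048, 128): 1339, (20000, 2000): 665}
h100_per_gpu = {(128, 2048): 3326, (128, 4096): 2100,
                (2048, 128): 381, (20000, 2000): 219}

for shape in b200_per_gpu:
    ratio = b200_per_gpu[shape] / h100_per_gpu[shape]
    print(f"in/out {shape[0]}/{shape[1]}: {ratio:.2f}x per GPU")
```

Across these workloads the per-GPU advantage sits roughly between 3x and 3.5x, and it is largest in the compute-bound 2,048/128 case, where FP4 tensor-core throughput matters most.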