NCCL Performance

Protocols, Algorithms & Tuning for Multi-GPU Communication

Why NCCL?

In data-parallel training every GPU holds a full copy of the model and processes a different mini-batch. After the forward and backward pass each GPU has local gradients—but every GPU needs the same globally-averaged gradients before it can update its parameters. That synchronisation step is an AllReduce, and it is the single hottest communication primitive in distributed training.

NCCL (NVIDIA Collective Communications Library) is what makes this fast. It picks the best algorithm, protocol, and transport for the hardware topology so that gradient synchronisation overlaps with computation instead of blocking it.

```mermaid
flowchart TB
    subgraph GPU0["GPU 0"]
        P0["Parameters"] --> F0["Forward + Backward\n(batch 0)"]
        F0 --> G0["Local Gradients"]
    end
    subgraph GPU1["GPU 1"]
        P1["Parameters"] --> F1["Forward + Backward\n(batch 1)"]
        F1 --> G1["Local Gradients"]
    end
    subgraph GPU2["GPU 2"]
        P2["Parameters"] --> F2["Forward + Backward\n(batch 2)"]
        F2 --> G2["Local Gradients"]
    end
    subgraph GPU3["GPU 3"]
        P3["Parameters"] --> F3["Forward + Backward\n(batch 3)"]
        F3 --> G3["Local Gradients"]
    end
    G0 --> AR["NCCL AllReduce\nSum gradients across GPUs"]
    G1 --> AR
    G2 --> AR
    G3 --> AR
    AR --> U0["Averaged Gradients"]
    AR --> U1["Averaged Gradients"]
    AR --> U2["Averaged Gradients"]
    AR --> U3["Averaged Gradients"]
    U0 -.->|"update"| P0
    U1 -.->|"update"| P1
    U2 -.->|"update"| P2
    U3 -.->|"update"| P3
    style AR fill:#76B900,stroke:#5a8f00,color:#fff,font-weight:bold
    style G0 fill:#555,stroke:#333,color:#fff
    style G1 fill:#555,stroke:#333,color:#fff
    style G2 fill:#555,stroke:#333,color:#fff
    style G3 fill:#555,stroke:#333,color:#fff
    style U0 fill:#555,stroke:#333,color:#fff
    style U1 fill:#555,stroke:#333,color:#fff
    style U2 fill:#555,stroke:#333,color:#fff
    style U3 fill:#555,stroke:#333,color:#fff
    style P0 fill:#444,stroke:#333,color:#fff
    style P1 fill:#444,stroke:#333,color:#fff
    style P2 fill:#444,stroke:#333,color:#fff
    style P3 fill:#444,stroke:#333,color:#fff
```

Data-parallel training: each GPU computes local gradients on its own batch, then NCCL AllReduce sums them so every GPU gets identical averaged gradients for the parameter update.
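To make the diagram's semantics concrete, here is a minimal sketch (plain Python, made-up gradient values) of what the AllReduce step computes: every rank contributes its local gradients, and every rank ends up with the identical element-wise average.

```python
def allreduce_average(local_grads):
    """local_grads: one gradient vector per rank (one list per GPU)."""
    n_ranks = len(local_grads)
    summed = [sum(vals) for vals in zip(*local_grads)]   # the AllReduce sum
    return [[s / n_ranks for s in summed]] * n_ranks     # identical copy on every rank

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # 4 GPUs, 2 params each
result = allreduce_average(grads)
assert all(r == [4.0, 5.0] for r in result)  # every GPU holds the same average
```

In a real framework this is `ncclAllReduce` (or `torch.distributed.all_reduce`) on device buffers; the sketch only pins down the input/output contract.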

The Three Protocols at a Glance

Every NCCL collective ultimately moves data through one of three communication protocols. Understanding their trade-offs is the single most important thing for reasoning about NCCL performance: Simple maximises bandwidth for large transfers, LL minimises latency for small ones, and LL128 tries to give you both—especially over NVLink.

The two tables below are the reference you need. The first shows how each protocol synchronises data, what it costs per hop, and how much bandwidth it can use. The second shows the buffer geometry—how much data each protocol can pipeline through a single channel. Together they explain why NCCL picks different protocols for different message sizes.

Protocol Comparison

| Property | Simple | LL (Low Latency) | LL128 |
|---|---|---|---|
| Design Goal | High bandwidth | Low latency | Low latency + high bandwidth |
| Synchronization | Memory fences (high overhead) | Flag-based | Flag-based |
| Payload Unit | Data chunks | 4 B data + 4 B flag | 120 B data + 8 B flag |
| Bandwidth Utilization | Near peak | 25–50% of peak | ~95% of peak |
| Per-hop Latency | ~6 μs | ~1 μs | ~2 μs |

Channel Buffer Sizes

Each protocol allocates a fixed-size buffer per communication channel, divided into 8 pipeline slots. The buffer geometry determines how much data can be in-flight and directly affects pipelining efficiency.

| Protocol | Total Channel Buffer | Buffer per Slot | Effective Data per Slot |
|---|---|---|---|
| Simple | 4 MiB | 512 KiB | 512 KiB |
| LL | 256 KiB | 32 KiB | 16 KiB |
| LL128 | ~4800 KiB | 600 KiB | 562.5 KiB |

LL wastes 50% of each 8-byte unit on flags—hence its low bandwidth. LL128 wastes only 8B out of 128B (~6%), which is why it recovers ~95% of peak. Simple has zero flag overhead; synchronization cost is in the memory fences instead.
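The overhead figures follow directly from the payload geometry; a quick back-of-the-envelope check:

```python
# Flag overhead per transmission unit, from the protocol table above.
LL_UNIT, LL_FLAG = 8, 4          # 4 B data + 4 B flag per 8-byte atomic
LL128_UNIT, LL128_FLAG = 128, 8  # 120 B data + 8 B flag per 128-byte unit

ll_efficiency = (LL_UNIT - LL_FLAG) / LL_UNIT            # 0.5: half the bytes are flags
ll128_efficiency = (LL128_UNIT - LL128_FLAG) / LL128_UNIT  # 0.9375: ~6% overhead

# Effective data per pipeline slot = buffer-per-slot (KiB) x data fraction,
# reproducing the "Effective Data per Slot" column:
ll_slot = 32 * ll_efficiency       # 16 KiB
ll128_slot = 600 * ll128_efficiency  # 562.5 KiB
```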

Algorithm & Collective Support Matrix

NCCL implements multiple algorithms, each supporting a subset of collectives and protocols. The table below summarises which algorithm–collective combinations are available in NCCL v2.19+.

| Algorithm | Protocols | AllReduce | Broadcast | Reduce | ReduceScatter | AllGather |
|---|---|---|---|---|---|---|
| Ring | Simple / LL / LL128 | ✓ | ✓ | ✓ | ✓ | ✓ |
| Tree | Simple / LL / LL128 | ✓ | | | | |
| NVLS (Intra-node) | Simple only | ✓ | | | ✓ | ✓ |
| NVLS Tree (Multi-node) | Simple only | ✓ | | | | |

NVLS algorithms leverage NVLink Switch (NVSwitch) for intra-node reduction. NVLS Tree extends this with a tree-based fan-out for inter-node communication. CollNet algorithms (not shown) offload reductions to SHARP-enabled network switches.

Protocol Deep Dive

The protocol comparison table above captures the what. Here is the why—how each protocol synchronises data and where the bandwidth/latency trade-off comes from.

Simple Protocol

Designed to maximise bandwidth. Data is divided into large chunks dispatched across communication channels. Uses memory fences to enforce ordering—a receiver must wait until a full chunk has landed before accessing it. This makes it optimal for large transfers but adds significant overhead for small payloads.

LL (Low Latency) Protocol

Replaces memory fences with lightweight flag-based synchronization. Each transmission is 4 bytes of data + 4 bytes of flag, sent together via an 8-byte atomic operation. The intermediate buffer is placed in host memory so the CPU can poll the flag, which prevents GPUDirect RDMA and limits bandwidth to 25–50% of peak. Preferred when latency matters more than throughput.

LL128 Protocol

Extends LL by transmitting 128-byte units (120B data + 8B flag), recovering ~95% of peak bandwidth while keeping flag-based sync. Works best over NVLink, where atomic 128-byte writes are guaranteed. On interconnects that cannot guarantee unsplit 128-byte atomics (e.g. PCIe), NCCL disables LL128 automatically.

Protocol Selection: NCCL's autotuner picks LL/LL128 for small messages and Simple for large ones. Override with NCCL_PROTO=Simple|LL|LL128. In most cases, the autotuner's default is the best choice.
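NCCL reads these variables when the communicator is initialised, so they must be set in each rank's environment before the first collective runs. A minimal sketch (the `LL128` value is only an example override, not a recommendation):

```python
import os

# Force a protocol before NCCL initialises. NCCL_PROTO accepts a
# comma-separated list of allowed protocols; setting a single value
# pins the choice and bypasses the autotuner.
os.environ["NCCL_PROTO"] = "LL128"
os.environ["NCCL_DEBUG"] = "INFO"  # log which protocol/algorithm NCCL picked

# ...then create the process group / NCCL communicator as usual,
# e.g. torch.distributed.init_process_group(backend="nccl").
```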

Communication Channels

NCCL subdivides every collective into communication channels. Each channel is launched as a separate CUDA block on its own SM, operating on a disjoint slice of the input buffer in parallel. This raises aggregate throughput, especially for large payloads, and helps balance traffic across multiple NICs on NVLink platforms.

However, too many channels can cause the per-channel chunk size to fall below the 512 KiB NIC-transport FIFO buffer, leading to partially filled sends that degrade PCIe and network throughput. NCCL heuristically reduces the active channel count for smaller messages.
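The arithmetic behind that heuristic is simple: divide the message across channels and compare the per-channel share against the FIFO size. A sketch with hypothetical channel counts (the 512 KiB threshold is from the text above):

```python
FIFO_BYTES = 512 * 1024  # NIC-transport FIFO buffer size

def underfills_fifo(message_bytes, n_channels):
    """True if each channel's share of the message can't fill one FIFO send."""
    return message_bytes // n_channels < FIFO_BYTES

assert not underfills_fifo(64 * 1024 * 1024, 16)  # 4 MiB/channel: full sends
assert underfills_fifo(4 * 1024 * 1024, 16)       # 256 KiB/channel: underfilled
```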

Tuning note: Environment variables like NCCL_NTHREADS were once used to influence channel behaviour, but are now discouraged in recent NCCL versions and may cause incorrect behaviour if set.

Data Transfer Methods

NCCL uses different transport mechanisms depending on whether communication is intra-node or inter-node.

Intra-Node Transports

  • P2P (NVLink): Preferred path. GPUDirect Peer-to-Peer over NVLink gives highest bandwidth and lowest latency.
  • P2P (PCIe): Fallback when NVLink is unavailable. Still avoids host-memory staging via GPUDirect.
  • P2P_DIRECT: Optimisation for same-process ranks—bypasses IPC handles and intermediate FIFO copies by using direct GPU memory pointers.
  • SHM (Shared Memory): Used when P2P over PCIe is suboptimal (e.g. inter-socket PCIe traffic). Routes via system memory.

Inter-Node Transports

  • IB Verbs (InfiniBand / RoCE): High-performance RDMA transport. Data staged through intermediate buffers; proxy thread manages DMA/RDMA operations.
  • GPUDirect RDMA: When NIC and GPU share a PCIe switch, the intermediate buffer lives in GPU memory—NIC accesses it directly, bypassing CPU and host memory entirely.
  • Socket (TCP): Fallback when RDMA is unavailable. Data copies through CPU-pinned host memory, incurring extra PCIe round-trips.
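The lists above imply a preference order. The sketch below is an illustrative simplification of that order (it is not NCCL's actual topology-detection code, and the parameter names are invented):

```python
def pick_transport(intra_node, nvlink, pcie_p2p_ok, rdma, gdr_ok):
    """Toy priority order for the transports listed above."""
    if intra_node:
        if nvlink:
            return "P2P (NVLink)"      # fastest intra-node path
        if pcie_p2p_ok:
            return "P2P (PCIe)"        # GPUDirect P2P without NVLink
        return "SHM"                   # stage through system memory
    if rdma:
        # GPUDirect RDMA needs the NIC and GPU close on the PCIe topology
        return "GPUDirect RDMA" if gdr_ok else "IB Verbs"
    return "Socket (TCP)"              # last-resort fallback
```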

Collective Algorithms

NCCL breaks each collective into low-level primitives (send, recv, recvReduceSend, recvCopySend, recvReduceCopySend) and distributes them across parallel channels. Algorithms fall into two execution patterns:

Non-Pipelined

Each GPU must finish all steps in one iteration before starting the next.

  • Ring AllReduce (2k−1 steps)
  • Ring AllGather (k−1 steps)
  • Ring ReduceScatter (k−1 steps)

Pipelined

Consecutive loop iterations can overlap, enabling higher throughput.

  • Tree AllReduce (Reduce + Broadcast phases)
  • Ring Broadcast
  • Ring Reduce

Ring AllReduce

Combines a ReduceScatter phase with an AllGather phase over 2k−1 steps (k = number of GPUs). In the first k steps, each GPU receives a segment, reduces it with local data, and forwards the result. In the remaining k−1 steps, fully reduced segments are propagated around the ring via recvCopySend.

| Step | Primitive |
|---|---|
| 0 | send |
| 1 … k−2 | recvReduceSend |
| k−1 | recvReduceCopySend |
| k … 2k−3 | recvCopySend |
| 2k−2 | recv |
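The two phases can be checked with a toy simulation: k ranks, one scalar segment per rank, with simultaneous sends modelled by snapshotting each step. This verifies the semantics only; it is not NCCL's implementation.

```python
def ring_allreduce(bufs):
    """bufs: k lists of length k (one segment per rank). Returns summed copies."""
    k = len(bufs)
    bufs = [list(b) for b in bufs]  # don't mutate the caller's data
    # Reduce-scatter (k-1 steps): at step s, rank r forwards segment (r-s) % k
    # to rank r+1, which reduces it into its local copy (recvReduceSend).
    for s in range(k - 1):
        sent = [bufs[r][(r - s) % k] for r in range(k)]  # simultaneous sends
        for r in range(k):
            bufs[r][(r - 1 - s) % k] += sent[(r - 1) % k]
    # Allgather (k-1 steps): fully reduced segments circulate via copy
    # (recvCopySend), until every rank holds every summed segment.
    for s in range(k - 1):
        sent = [bufs[r][(r + 1 - s) % k] for r in range(k)]
        for r in range(k):
            bufs[r][(r - s) % k] = sent[(r - 1) % k]
    return bufs

out = ring_allreduce([[1, 2, 3, 4], [5, 6, 7, 8],
                      [9, 10, 11, 12], [13, 14, 15, 16]])
assert all(buf == [28, 32, 36, 40] for buf in out)  # element-wise sums
```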

Tree AllReduce

Uses a double binary tree topology. The algorithm proceeds in two phases within each loop iteration:

  • Reduce: partial sums flow from the leaves up the tree; the root ends up with the fully reduced result.
  • Broadcast: the root propagates the reduced result back down the tree to every rank.

These two phases can run concurrently by partitioning SMs into two groups—one for the bandwidth-intensive reduction, another for the broadcast—enabling overlap and better SM utilization.
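A minimal sketch of the two phases on a single array-encoded binary tree (NCCL actually uses two complementary trees so every rank does useful work in both; this only illustrates the reduce-then-broadcast flow):

```python
def tree_allreduce(values):
    """One scalar per rank; rank 0 is the root of an array-encoded binary tree."""
    k = len(values)
    vals = list(values)
    # Reduce phase: children push partial sums toward the root (rank 0)
    for r in range(k - 1, 0, -1):
        vals[(r - 1) // 2] += vals[r]
    # Broadcast phase: the root's total flows back down the tree
    for r in range(1, k):
        vals[r] = vals[(r - 1) // 2]
    return vals

assert tree_allreduce([1, 2, 3, 4]) == [10, 10, 10, 10]
```

Each phase takes O(log k) hops along the tree depth, which is why Tree beats Ring's O(k) step count at small message sizes.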

When to use which? Ring excels for large messages (bandwidth-optimal), while Tree performs best for smaller messages (lower latency). NCCL's autotuner selects the algorithm based on message size and topology. Override with NCCL_ALGO=Ring|Tree.

Performance Insights (Theoretical)

Benchmarking on NVIDIA Grace Hopper (GH200) nodes (150 GB/s intra-node, 25 GB/s Slingshot inter-node) confirms the protocol trade-offs described above.

NCCL Auto-Selection Logic

When no NCCL_ALGO or NCCL_PROTO overrides are set, NCCL picks based on message size:

| Message Size | Strategy | Algorithm | Protocol |
|---|---|---|---|
| Small (<128 KB) | Minimise latency | Tree | LL / LL128 |
| Medium (128 KB – 10 MB) | Balance | Tree / Ring | LL128 / Simple |
| Large (>10 MB) | Maximise bandwidth | Ring / NVLS | Simple |
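The table reduces to a simple lookup. The thresholds below are the article's rough figures, not NCCL's exact tuning model (which also weighs topology, rank count, and measured bus bandwidths):

```python
def autoselect(message_bytes):
    """Illustrative size-based selection mirroring the table above."""
    KB, MB = 1024, 1024 * 1024
    if message_bytes < 128 * KB:
        return ("Tree", "LL/LL128")           # minimise latency
    if message_bytes <= 10 * MB:
        return ("Tree/Ring", "LL128/Simple")  # balance
    return ("Ring/NVLS", "Simple")            # maximise bandwidth

assert autoselect(64 * 1024) == ("Tree", "LL/LL128")
assert autoselect(100 * 1024 * 1024) == ("Ring/NVLS", "Simple")
```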

Real-World Benchmarks: H100 AllReduce

Test Configuration

  • GPUs: 8× H100-80GB per node
  • Intra-node: NVLink
  • Inter-node: GPUDirect-TCPX (no InfiniBand)
  • Message sizes: 1 MB – 1 GB
  • Iterations: 100 warmup, 10,000 timed per message size
  • Date: 2026-03-31
| Setting | Default | NVLSTree Override |
|---|---|---|
| NCCL_ALGO | Ring | Ring;allreduce:NVLSTree |
| NCCL_PROTO | Simple | LL128,LL,Simple;allreduce:Simple |
| NCCL_IB_DISABLE | 1 | 1 |

2-Node Results (16 GPUs): Default vs NVLSTree

| Size | Default Avg | Default BW | NVLSTree Avg | NVLSTree BW | Winner |
|---|---|---|---|---|---|
| 1 MB | 2.90 ms | 5.4 GB/s | 0.86 ms | 18.3 GB/s | NVLSTree (3.4×) |
| 2 MB | 2.95 ms | 10.7 GB/s | 0.85 ms | 36.9 GB/s | NVLSTree (3.5×) |
| 4 MB | 2.93 ms | 21.5 GB/s | 0.87 ms | 72.4 GB/s | NVLSTree (3.4×) |
| 8 MB | 3.02 ms | 41.7 GB/s | 1.33 ms | 94.6 GB/s | NVLSTree (2.3×) |
| 16 MB | 3.17 ms | 79.4 GB/s | 2.23 ms | 113.0 GB/s | NVLSTree (1.4×) |
| 32 MB | 3.31 ms | 152.1 GB/s | 2.31 ms | 217.8 GB/s | NVLSTree (1.4×) |
| 64 MB | 3.65 ms | 276.1 GB/s | 3.95 ms | 254.8 GB/s | Crossover |
| 128 MB | 4.89 ms | 411.4 GB/s | 8.22 ms | 245.0 GB/s | Default (1.7×) |
| 256 MB | 7.15 ms | 562.9 GB/s | 16.66 ms | 241.8 GB/s | Default (2.3×) |
| 512 MB | 14.21 ms | 566.9 GB/s | 33.31 ms | 241.7 GB/s | Default (2.3×) |
| 1 GB | 28.18 ms | 571.6 GB/s | | | Default |

8-Node Results (64 GPUs): Default vs NVLSTree

| Size | Default Avg | Default BW | NVLSTree Avg | NVLSTree BW | Winner |
|---|---|---|---|---|---|
| 1 MB | 13.35 ms | 4.95 GB/s | 2.30 ms | 28.8 GB/s | NVLSTree (5.8×) |
| 2 MB | 13.44 ms | 9.83 GB/s | 2.31 ms | 57.3 GB/s | NVLSTree (5.8×) |
| 4 MB | 13.33 ms | 19.8 GB/s | 2.48 ms | 106.8 GB/s | NVLSTree (5.4×) |
| 8 MB | 13.14 ms | 40.2 GB/s | 3.64 ms | 145.2 GB/s | NVLSTree (3.6×) |
| 16 MB | 13.42 ms | 78.8 GB/s | 6.11 ms | 172.9 GB/s | NVLSTree (2.2×) |
| 32 MB | 13.86 ms | 152.6 GB/s | 10.98 ms | 192.5 GB/s | NVLSTree (1.3×) |
| 64 MB | 14.13 ms | 299.3 GB/s | 19.90 ms | 212.5 GB/s | Crossover |
| 128 MB | 14.52 ms | 582.3 GB/s | 38.00 ms | 222.5 GB/s | Default (2.6×) |
| 256 MB | 16.29 ms | 1038.4 GB/s | 73.89 ms | 228.9 GB/s | Default (4.5×) |

Scaling: 2-Node vs 8-Node (Default Settings)

| Size | 2-Node Avg | 2-Node BW | 8-Node Avg | 8-Node BW | Latency | Bandwidth |
|---|---|---|---|---|---|---|
| 1 MB | 2.90 ms | 5.4 GB/s | 13.35 ms | 4.95 GB/s | 4.6× worse | ~same |
| 4 MB | 2.93 ms | 21.5 GB/s | 13.33 ms | 19.8 GB/s | 4.6× worse | ~same |
| 16 MB | 3.17 ms | 79.4 GB/s | 13.42 ms | 78.8 GB/s | 4.2× worse | ~same |
| 64 MB | 3.65 ms | 276.1 GB/s | 14.13 ms | 299.3 GB/s | 3.9× worse | 1.1× better |
| 128 MB | 4.89 ms | 411.4 GB/s | 14.52 ms | 582.3 GB/s | 3.0× worse | 1.4× better |
| 256 MB | 7.15 ms | 562.9 GB/s | 16.29 ms | 1038.4 GB/s | 2.3× worse | 1.8× better |
| 512 MB | 14.21 ms | 566.9 GB/s | 22.64 ms | 1493.8 GB/s | 1.6× worse | 2.6× better |
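The ratio columns are derived from the raw per-row numbers; a quick consistency check on two rows:

```python
# Raw numbers from the scaling table: (2-node latency ms, 2-node BW,
#                                      8-node latency ms, 8-node BW)
rows = {
    "256 MB": (7.15, 562.9, 16.29, 1038.4),
    "512 MB": (14.21, 566.9, 22.64, 1493.8),
}
ratios = {
    size: (round(l8 / l2, 1), round(b8 / b2, 1))  # (latency worse, BW better)
    for size, (l2, b2, l8, b8) in rows.items()
}
# 256 MB -> (2.3, 1.8); 512 MB -> (1.6, 2.6), matching the table's last columns
```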

Key Takeaways

  • Default (Ring/Simple) wins for large messages (≥64 MB): up to 2.3× higher bandwidth on 2 nodes, up to 4.5× on 8 nodes. Peak ~571 GB/s (2N) and ~1494 GB/s (8N).
  • NVLSTree wins for small messages (≤32 MB): up to 3.5× lower latency on 2 nodes, up to 5.8× on 8 nodes via NVLS hardware intra-node reduction.
  • Crossover point: ~64 MB on both 2-node and 8-node setups.
  • Scaling: latency grows ~4× from 2→8 nodes for small messages, but aggregate bandwidth improves for large messages (2.6× better at 512 MB).
  • NVLSTree advantage grows with node count—Ring latency scales linearly with GPU count, while NVLSTree scales logarithmically.

Key Environment Variables

| Variable | Purpose |
|---|---|
| NCCL_DEBUG=INFO | Verbose logging of GPU communication topology, algorithm selection, and transport details. |
| NCCL_PROTO=Simple\|LL\|LL128 | Force a specific communication protocol (overrides autotuner). |
| NCCL_ALGO=Ring\|Tree | Force a specific collective algorithm. |
| NCCL_CROSS_NIC=0\|1\|2 | Control whether NCCL routes intra-node traffic through NICs across sockets. |
| NCCL_NET_GDR_LEVEL | Control GPUDirect RDMA usage level based on GPU-NIC topology distance. |
| NCCL_IB_HCA | Specify which InfiniBand HCA(s) to use. |
| NCCL_SOCKET_IFNAME | Specify network interface for socket-based communication. |