Protocols, Algorithms & Tuning for Multi-GPU Communication
In data-parallel training every GPU holds a full copy of the model and processes a different mini-batch. After the forward and backward pass each GPU has local gradients—but every GPU needs the same globally-averaged gradients before it can update its parameters. That synchronisation step is an AllReduce, and it is the single hottest communication primitive in distributed training.
NCCL (NVIDIA Collective Communications Library) is what makes this fast. It picks the best algorithm, protocol, and transport for the hardware topology so that gradient synchronisation overlaps with computation instead of blocking it.
Data-parallel training: each GPU computes local gradients on its own batch, then NCCL AllReduce sums them so every GPU gets identical averaged gradients for the parameter update.
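The averaging step can be illustrated in plain Python (a sketch of the semantics only, no NCCL involved): each "GPU" contributes its local gradients, an all-reduce sums them element-wise, and every replica divides by the world size.

```python
# Illustrative sketch (plain Python, no NCCL): data-parallel gradient
# averaging. Each "GPU" holds local gradients; an all-reduce sums them
# element-wise, then every replica divides by the world size.

def allreduce_mean(local_grads):
    """Sum gradients across replicas and average, so every replica
    ends up with identical values for the parameter update."""
    world_size = len(local_grads)
    summed = [sum(vals) for vals in zip(*local_grads)]
    averaged = [s / world_size for s in summed]
    return [list(averaged) for _ in range(world_size)]  # one copy per GPU

# Two "GPUs", each with gradients from its own mini-batch.
grads = [[1.0, 2.0], [3.0, 6.0]]
result = allreduce_mean(grads)
print(result)  # every replica sees [2.0, 4.0]
```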
Every NCCL collective ultimately moves data through one of three communication protocols. Understanding their trade-offs is the single most important thing for reasoning about NCCL performance: Simple maximises bandwidth for large transfers, LL minimises latency for small ones, and LL128 tries to give you both—especially over NVLink.
The two tables below are the reference you need. The first shows how each protocol synchronises data, what it costs per hop, and how much bandwidth it can use. The second shows the buffer geometry—how much data each protocol can pipeline through a single channel. Together they explain why NCCL picks different protocols for different message sizes.
| Property | Simple | LL (Low Latency) | LL128 |
|---|---|---|---|
| Design Goal | High bandwidth | Low latency | Low latency + high bandwidth |
| Synchronization | Memory fences (high overhead) | Flag-based | Flag-based |
| Payload Unit | Data chunks | 4B data + 4B flag | 120B data + 8B flag |
| Bandwidth Utilization | Near peak | 25–50% of peak | ~95% of peak |
| Per-hop Latency | ~6 μs | ~1 μs | ~2 μs |
Each protocol allocates a fixed-size buffer per communication channel, divided into 8 pipeline slots. The buffer geometry determines how much data can be in-flight and directly affects pipelining efficiency.
| Protocol | Total Channel Buffer | Buffer per Slot | Effective Data per Slot |
|---|---|---|---|
| Simple | 4 MiB | 512 KiB | 512 KiB |
| LL | 256 KiB | 32 KiB | 16 KiB |
| LL128 | ~4800 KiB | 600 KiB | 562.5 KiB |
LL wastes 50% of each 8-byte unit on flags—hence its low bandwidth. LL128 wastes only 8B out of 128B (~6%), which is why it recovers ~95% of peak. Simple has zero flag overhead; synchronization cost is in the memory fences instead.
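The numbers in the two tables above are consistent with simple arithmetic, which is worth checking once:

```python
# Back-of-envelope check of the flag overhead and buffer-geometry
# figures above (pure arithmetic, no NCCL involved).

KiB = 1024

# Payload fraction per transmission unit: data / (data + flag).
ll_fraction = 4 / (4 + 4)          # 0.5   -> the "50% wasted on flags" figure
ll128_fraction = 120 / (120 + 8)   # 0.9375 -> the "~95% of peak" figure

# Buffer geometry: total channel buffer / 8 slots, scaled by payload fraction.
ll_effective = (256 * KiB / 8) * ll_fraction          # 16 KiB of real data
ll128_effective = (4800 * KiB / 8) * ll128_fraction   # 562.5 KiB of real data

print(ll_fraction, ll128_fraction)
print(ll_effective / KiB, ll128_effective / KiB)  # 16.0, 562.5
```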
NCCL implements multiple algorithms, each supporting a subset of collectives and protocols. The table below summarises which algorithm–collective combinations are available in NCCL v2.19+.
| Algorithm | Protocols | AllReduce | Broadcast | Reduce | ReduceScatter | AllGather |
|---|---|---|---|---|---|---|
| Ring | Simple / LL / LL128 | ✓ | ✓ | ✓ | ✓ | ✓ |
| Tree | Simple / LL / LL128 | ✓ | ✓ | ✓ | ✓ | ✓ |
| NVLS (Intra-node) | Simple only | ✓ | ✗ | ✗ | ✓ | ✓ |
| NVLS Tree (Multi-node) | Simple only | ✓ | ✗ | ✗ | ✗ | ✗ |
NVLS algorithms leverage NVLink Switch (NVSwitch) for intra-node reduction. NVLS Tree extends this with a tree-based fan-out for inter-node communication. CollNet algorithms (not shown) offload reductions to SHARP-enabled network switches.
The protocol comparison table above captures the what. Here is the why—how each protocol synchronises data and where the bandwidth/latency trade-off comes from.
Simple: designed to maximise bandwidth. Data is divided into large chunks dispatched across communication channels. It uses memory fences to enforce ordering—a receiver must wait until a full chunk has landed before accessing it. This makes it optimal for large transfers but adds significant overhead for small payloads.
LL (Low Latency): replaces memory fences with lightweight flag-based synchronization. Each transmission is 4 bytes of data + 4 bytes of flag, sent together as an 8-byte atomic operation. The intermediate buffer is placed in host memory so the CPU can poll the flag, which prevents GPUDirect RDMA and limits bandwidth to 25–50% of peak. Preferred when latency matters more than throughput.
LL128: extends LL by transmitting 128-byte units (120B data + 8B flag), recovering ~95% of peak bandwidth while keeping flag-based sync. It works best over NVLink, where atomic 128-byte writes are guaranteed. On interconnects that cannot guarantee unsplit 128-byte atomics (e.g. PCIe), NCCL disables LL128 automatically.
Protocol Selection: NCCL's autotuner picks LL/LL128 for small messages and Simple for large ones. Override with NCCL_PROTO=Simple|LL|LL128. In most cases, the autotuner's default is the best choice.
NCCL subdivides every collective into communication channels. Each channel is launched as a separate CUDA block on its own SM, operating on a disjoint slice of the input buffer in parallel. This raises aggregate throughput, especially for large payloads, and helps balance traffic across multiple NICs on NVLink platforms.
However, too many channels can cause the per-channel chunk size to fall below the 512 KiB NIC-transport FIFO buffer, leading to partially filled sends that degrade PCIe and network throughput. NCCL heuristically reduces the active channel count for smaller messages.
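The trade-off can be sketched as a toy heuristic (illustrative only—NCCL's real channel-count logic is more involved): per-channel chunk size is roughly the message size divided by the channel count, and chunks smaller than the NIC FIFO lead to partially filled sends.

```python
# Hypothetical sketch of the channel-count trade-off described above.
# Per-channel chunk = message_size / n_channels; chunks below the
# 512 KiB NIC-transport FIFO cause partially filled sends. This is an
# illustration, not NCCL's actual heuristic.

FIFO_BYTES = 512 * 1024

def effective_channels(message_bytes, max_channels):
    """Reduce the channel count until each channel's chunk fills the FIFO."""
    n = max_channels
    while n > 1 and message_bytes // n < FIFO_BYTES:
        n -= 1
    return n

print(effective_channels(64 * 1024 * 1024, 16))  # 64 MiB keeps all 16 channels
print(effective_channels(1 * 1024 * 1024, 16))   # 1 MiB drops to 2 channels
```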
Tuning note: Environment variables like NCCL_NTHREADS were once used to influence channel behaviour, but are now discouraged in recent NCCL versions and may cause incorrect behaviour if set.
NCCL uses different transport mechanisms depending on whether communication is intra-node or inter-node.
NCCL breaks each collective into low-level primitives (send, recv, recvReduceSend, recvCopySend, recvReduceCopySend) and distributes them across parallel channels.
Algorithms fall into two execution patterns:
- Non-pipelined: each GPU must finish all steps in one loop iteration before starting the next.
- Pipelined: consecutive loop iterations can overlap, enabling higher throughput.
Ring AllReduce combines a ReduceScatter phase with an AllGather phase over 2k−1 steps (k = number of GPUs).
In the first k steps, each GPU receives a segment, reduces it with local data, and forwards the result.
In the remaining k−1 steps, fully reduced segments are propagated around the ring via recvCopySend.
| Step | Primitive |
|---|---|
| 0 | send |
| 1 … k−2 | recvReduceSend |
| k−1 | recvReduceCopySend |
| k … 2k−3 | recvCopySend |
| 2k−2 | recv |
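The schedule in the table above can be simulated in plain Python (an illustrative sketch, not NCCL code): k simulated GPUs, buffers split into k segments, with each round's sends snapshotted to mimic simultaneous transfers around the ring.

```python
# Plain-Python simulation of the ring AllReduce schedule above:
# k GPUs, buffers split into k segments, 2(k-1) communication rounds
# (the 2k-1 primitive steps include the initial send and final recv).

def ring_allreduce(buffers):
    k = len(buffers)
    segs = [list(b) for b in buffers]  # segs[g][s]: GPU g's copy of segment s
    # ReduceScatter: in round r, GPU g sends segment (g - r) mod k to g+1,
    # which adds it to its own copy (recvReduceSend).
    for r in range(k - 1):
        sent = [segs[g][(g - r) % k] for g in range(k)]  # snapshot: sends are simultaneous
        for g in range(k):
            src = (g - 1) % k
            segs[g][(src - r) % k] += sent[src]
    # GPU g now owns the fully reduced segment (g + 1) mod k.
    # AllGather: in round r, GPU g forwards segment (g + 1 - r) mod k (recvCopySend).
    for r in range(k - 1):
        sent = [segs[g][(g + 1 - r) % k] for g in range(k)]
        for g in range(k):
            src = (g - 1) % k
            segs[g][(src + 1 - r) % k] = sent[src]
    return segs

res = ring_allreduce([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
print(res)  # every GPU ends with [6, 6, 6]
```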
Tree AllReduce uses a double binary tree topology. The algorithm proceeds in two phases within each loop iteration:
- Reduce (up the tree): leaves send upward; middle nodes recvReduceSend; the root performs recvReduceCopySend.
- Broadcast (down the tree): the root and middle nodes recvCopySend downward; leaves recv.

These two phases can run concurrently by partitioning SMs into two groups—one for the bandwidth-intensive reduction, another for the broadcast—enabling overlap and better SM utilization.
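The two phases can be sketched with a single binary tree (a simplification—NCCL uses a double tree so every GPU is an inner node in one of the two trees):

```python
# Sketch of the two tree phases over a single binary tree (NCCL's
# actual algorithm uses a double binary tree; this is illustrative).

def tree_allreduce(values):
    k = len(values)
    vals = list(values)
    # Reduce phase: node i feeds its parent (i-1)//2, bottom-up
    # (leaves send; inner nodes recvReduceSend; root recvReduceCopySend).
    for i in range(k - 1, 0, -1):
        vals[(i - 1) // 2] += vals[i]
    total = vals[0]  # root now holds the full sum
    # Broadcast phase: the result flows back down
    # (recvCopySend at inner nodes; leaves recv).
    return [total] * k

print(tree_allreduce([1, 2, 3, 4]))  # [10, 10, 10, 10]
```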
When to use which? Ring excels for large messages (bandwidth-optimal), while Tree performs best for smaller messages (lower latency). NCCL's autotuner selects the algorithm based on message size and topology. Override with NCCL_ALGO=Ring|Tree.
Benchmarking on NVIDIA Grace Hopper (GH200) nodes (150 GB/s intra-node, 25 GB/s Slingshot inter-node) confirms the protocol trade-offs described above.
When no NCCL_ALGO or NCCL_PROTO overrides are set, NCCL picks based on message size:
| Message Size | Strategy | Algorithm | Protocol |
|---|---|---|---|
| Small (<128 KB) | Minimise latency | Tree | LL / LL128 |
| Medium (128 KB – 10 MB) | Balance | Tree / Ring | LL128 / Simple |
| Large (>10 MB) | Maximise bandwidth | Ring / NVLS | Simple |
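The table maps directly onto a small lookup function (a hypothetical mirror of the cutoffs above—NCCL's real autotuner uses measured latency/bandwidth models per topology, not fixed thresholds):

```python
# Hypothetical mirror of the selection table above. NCCL's autotuner
# models latency and bandwidth per topology; these fixed cutoffs are
# only illustrative.

def pick_strategy(message_bytes):
    KB, MB = 1024, 1024 * 1024
    if message_bytes < 128 * KB:
        return ("Tree", "LL/LL128")              # minimise latency
    if message_bytes < 10 * MB:
        return ("Tree or Ring", "LL128/Simple")  # balance
    return ("Ring or NVLS", "Simple")            # maximise bandwidth

print(pick_strategy(64 * 1024))          # ('Tree', 'LL/LL128')
print(pick_strategy(256 * 1024 * 1024))  # ('Ring or NVLS', 'Simple')
```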
| Setting | Default | NVLSTree Override |
|---|---|---|
| NCCL_ALGO | Ring | Ring;allreduce:NVLSTree |
| NCCL_PROTO | Simple | LL128,LL,Simple;allreduce:Simple |
| NCCL_IB_DISABLE | 1 | 1 |
AllReduce across 2 GH200 nodes:
| Size | Default Avg | Default BW | NVLSTree Avg | NVLSTree BW | Winner |
|---|---|---|---|---|---|
| 1 MB | 2.90 ms | 5.4 GB/s | 0.86 ms | 18.3 GB/s | NVLSTree (3.4×) |
| 2 MB | 2.95 ms | 10.7 GB/s | 0.85 ms | 36.9 GB/s | NVLSTree (3.5×) |
| 4 MB | 2.93 ms | 21.5 GB/s | 0.87 ms | 72.4 GB/s | NVLSTree (3.4×) |
| 8 MB | 3.02 ms | 41.7 GB/s | 1.33 ms | 94.6 GB/s | NVLSTree (2.3×) |
| 16 MB | 3.17 ms | 79.4 GB/s | 2.23 ms | 113.0 GB/s | NVLSTree (1.4×) |
| 32 MB | 3.31 ms | 152.1 GB/s | 2.31 ms | 217.8 GB/s | NVLSTree (1.4×) |
| 64 MB | 3.65 ms | 276.1 GB/s | 3.95 ms | 254.8 GB/s | Crossover |
| 128 MB | 4.89 ms | 411.4 GB/s | 8.22 ms | 245.0 GB/s | Default (1.7×) |
| 256 MB | 7.15 ms | 562.9 GB/s | 16.66 ms | 241.8 GB/s | Default (2.3×) |
| 512 MB | 14.21 ms | 566.9 GB/s | 33.31 ms | 241.7 GB/s | Default (2.3×) |
| 1 GB | 28.18 ms | 571.6 GB/s | — | — | Default |
AllReduce across 8 GH200 nodes:
| Size | Default Avg | Default BW | NVLSTree Avg | NVLSTree BW | Winner |
|---|---|---|---|---|---|
| 1 MB | 13.35 ms | 4.95 GB/s | 2.30 ms | 28.8 GB/s | NVLSTree (5.8×) |
| 2 MB | 13.44 ms | 9.83 GB/s | 2.31 ms | 57.3 GB/s | NVLSTree (5.8×) |
| 4 MB | 13.33 ms | 19.8 GB/s | 2.48 ms | 106.8 GB/s | NVLSTree (5.4×) |
| 8 MB | 13.14 ms | 40.2 GB/s | 3.64 ms | 145.2 GB/s | NVLSTree (3.6×) |
| 16 MB | 13.42 ms | 78.8 GB/s | 6.11 ms | 172.9 GB/s | NVLSTree (2.2×) |
| 32 MB | 13.86 ms | 152.6 GB/s | 10.98 ms | 192.5 GB/s | NVLSTree (1.3×) |
| 64 MB | 14.13 ms | 299.3 GB/s | 19.90 ms | 212.5 GB/s | Crossover |
| 128 MB | 14.52 ms | 582.3 GB/s | 38.00 ms | 222.5 GB/s | Default (2.6×) |
| 256 MB | 16.29 ms | 1038.4 GB/s | 73.89 ms | 228.9 GB/s | Default (4.5×) |
Scaling from 2 to 8 nodes (default settings):
| Size | 2-Node Avg | 2-Node BW | 8-Node Avg | 8-Node BW | Latency (8 vs 2) | Bandwidth (8 vs 2) |
|---|---|---|---|---|---|---|
| 1 MB | 2.90 ms | 5.4 GB/s | 13.35 ms | 4.95 GB/s | 4.6× worse | ~same |
| 4 MB | 2.93 ms | 21.5 GB/s | 13.33 ms | 19.8 GB/s | 4.6× worse | ~same |
| 16 MB | 3.17 ms | 79.4 GB/s | 13.42 ms | 78.8 GB/s | 4.2× worse | ~same |
| 64 MB | 3.65 ms | 276.1 GB/s | 14.13 ms | 299.3 GB/s | 3.9× worse | 1.1× better |
| 128 MB | 4.89 ms | 411.4 GB/s | 14.52 ms | 582.3 GB/s | 3.0× worse | 1.4× better |
| 256 MB | 7.15 ms | 562.9 GB/s | 16.29 ms | 1038.4 GB/s | 2.3× worse | 1.8× better |
| 512 MB | 14.21 ms | 566.9 GB/s | 22.64 ms | 1493.8 GB/s | 1.6× worse | 2.6× better |
Key environment variables for debugging and tuning NCCL:
| Variable | Purpose |
|---|---|
| NCCL_DEBUG=INFO | Verbose logging of GPU communication topology, algorithm selection, and transport details. |
| NCCL_PROTO=Simple|LL|LL128 | Force a specific communication protocol (overrides autotuner). |
| NCCL_ALGO=Ring|Tree | Force a specific collective algorithm. |
| NCCL_CROSS_NIC=0|1|2 | Control whether NCCL routes intra-node traffic through NICs across sockets. |
| NCCL_NET_GDR_LEVEL | Control GPUDirect RDMA usage level based on GPU-NIC topology distance. |
| NCCL_IB_HCA | Specify which InfiniBand HCA(s) to use. |
| NCCL_SOCKET_IFNAME | Specify network interface for socket-based communication. |
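These variables can be set from the shell or programmatically. NCCL reads them at communicator initialisation, so they must be in place before the first collective runs (e.g. before torch.distributed.init_process_group in a PyTorch job):

```python
# Applying the variables above from Python. NCCL reads the environment
# at communicator init, so set these before the first collective.
import os

os.environ["NCCL_DEBUG"] = "INFO"    # log topology and algorithm selection
os.environ["NCCL_ALGO"] = "Ring"     # force the Ring algorithm
os.environ["NCCL_PROTO"] = "Simple"  # force the Simple protocol

print({k: v for k, v in os.environ.items() if k.startswith("NCCL_")})
```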