NCCL Performance

Protocols, Algorithms & Tuning for Multi-GPU Communication

Why NCCL?

In data-parallel training every GPU holds a full copy of the model and processes a different mini-batch. After the forward and backward pass each GPU has local gradients—but every GPU needs the same globally-averaged gradients before it can update its parameters. That synchronisation step is an AllReduce, and it is the single hottest communication primitive in distributed training.

NCCL (NVIDIA Collective Communications Library) is what makes this fast. It picks the best algorithm, protocol, and transport for the hardware topology so that gradient synchronisation overlaps with computation instead of blocking it.

```mermaid
flowchart TB
    subgraph GPU0["GPU 0"]
        P0["Parameters"] --> F0["Forward + Backward\n(batch 0)"]
        F0 --> G0["Local Gradients"]
    end
    subgraph GPU1["GPU 1"]
        P1["Parameters"] --> F1["Forward + Backward\n(batch 1)"]
        F1 --> G1["Local Gradients"]
    end
    subgraph GPU2["GPU 2"]
        P2["Parameters"] --> F2["Forward + Backward\n(batch 2)"]
        F2 --> G2["Local Gradients"]
    end
    subgraph GPU3["GPU 3"]
        P3["Parameters"] --> F3["Forward + Backward\n(batch 3)"]
        F3 --> G3["Local Gradients"]
    end
    G0 --> AR["NCCL AllReduce\nSum gradients across GPUs"]
    G1 --> AR
    G2 --> AR
    G3 --> AR
    AR --> U0["Averaged Gradients"]
    AR --> U1["Averaged Gradients"]
    AR --> U2["Averaged Gradients"]
    AR --> U3["Averaged Gradients"]
    U0 -.->|"update"| P0
    U1 -.->|"update"| P1
    U2 -.->|"update"| P2
    U3 -.->|"update"| P3
    style AR fill:#76B900,stroke:#5a8f00,color:#fff,font-weight:bold
    style G0 fill:#555,stroke:#333,color:#fff
    style G1 fill:#555,stroke:#333,color:#fff
    style G2 fill:#555,stroke:#333,color:#fff
    style G3 fill:#555,stroke:#333,color:#fff
    style U0 fill:#555,stroke:#333,color:#fff
    style U1 fill:#555,stroke:#333,color:#fff
    style U2 fill:#555,stroke:#333,color:#fff
    style U3 fill:#555,stroke:#333,color:#fff
    style P0 fill:#444,stroke:#333,color:#fff
    style P1 fill:#444,stroke:#333,color:#fff
    style P2 fill:#444,stroke:#333,color:#fff
    style P3 fill:#444,stroke:#333,color:#fff
```

Data-parallel training: each GPU computes local gradients on its own batch, then NCCL AllReduce sums them so every GPU gets identical averaged gradients for the parameter update.
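To make the diagram's semantics concrete, here is a minimal sketch (plain Python, made-up gradient values) of what the AllReduce step computes: every rank contributes its local gradients, and every rank ends up with the identical element-wise average.

```python
def allreduce_average(local_grads):
    """local_grads: one gradient vector per rank (one list per GPU)."""
    n_ranks = len(local_grads)
    summed = [sum(vals) for vals in zip(*local_grads)]   # the AllReduce sum
    return [[s / n_ranks for s in summed]] * n_ranks     # identical copy on every rank

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # 4 GPUs, 2 params each
result = allreduce_average(grads)
assert all(r == [4.0, 5.0] for r in result)  # every GPU holds the same average
```

In a real framework this is `ncclAllReduce` (or `torch.distributed.all_reduce`) on device buffers; the sketch only pins down the input/output contract.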

The Three Protocols at a Glance

Every NCCL collective ultimately moves data through one of three communication protocols. Understanding their trade-offs is the single most important thing for reasoning about NCCL performance: Simple maximises bandwidth for large transfers, LL minimises latency for small ones, and LL128 tries to give you both—especially over NVLink.

The two tables below are the reference you need. The first shows how each protocol synchronises data, what it costs per hop, and how much bandwidth it can use. The second shows the buffer geometry—how much data each protocol can pipeline through a single channel. Together they explain why NCCL picks different protocols for different message sizes.

Protocol Comparison

| Property | Simple | LL (Low Latency) | LL128 |
|---|---|---|---|
| Design Goal | High bandwidth | Low latency | Low latency + high bandwidth |
| Synchronization | Memory fences (high overhead) | Flag-based | Flag-based |
| Payload Unit | Data chunks | 4 B data + 4 B flag | 120 B data + 8 B flag |
| Bandwidth Utilization | Near peak | 25–50% of peak | ~95% of peak |
| Per-hop Latency | ~6 μs | ~1 μs | ~2 μs |

Channel Buffer Sizes

Each protocol allocates a fixed-size buffer per communication channel, divided into 8 pipeline slots. The buffer geometry determines how much data can be in-flight and directly affects pipelining efficiency.

| Protocol | Total Channel Buffer | Buffer per Slot | Effective Data per Slot |
|---|---|---|---|
| Simple | 4 MiB | 512 KiB | 512 KiB |
| LL | 256 KiB | 32 KiB | 16 KiB |
| LL128 | ~4800 KiB | 600 KiB | 562.5 KiB |

LL wastes 50% of each 8-byte unit on flags—hence its low bandwidth. LL128 wastes only 8B out of 128B (~6%), which is why it recovers ~95% of peak. Simple has zero flag overhead; synchronization cost is in the memory fences instead.
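The overhead figures follow directly from the payload geometry; a quick back-of-the-envelope check:

```python
# Flag overhead per transmission unit, from the protocol table above.
LL_UNIT, LL_FLAG = 8, 4          # 4 B data + 4 B flag per 8-byte atomic
LL128_UNIT, LL128_FLAG = 128, 8  # 120 B data + 8 B flag per 128-byte unit

ll_efficiency = (LL_UNIT - LL_FLAG) / LL_UNIT            # 0.5: half the bytes are flags
ll128_efficiency = (LL128_UNIT - LL128_FLAG) / LL128_UNIT  # 0.9375: ~6% overhead

# Effective data per pipeline slot = buffer-per-slot (KiB) x data fraction,
# reproducing the "Effective Data per Slot" column:
ll_slot = 32 * ll_efficiency       # 16 KiB
ll128_slot = 600 * ll128_efficiency  # 562.5 KiB
```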

Algorithm & Collective Support Matrix

NCCL implements multiple algorithms, each supporting a subset of collectives and protocols. The table below summarises which algorithm–collective combinations are available in NCCL v2.19+.

| Algorithm | Protocols | AllReduce | Broadcast | Reduce | ReduceScatter | AllGather |
|---|---|---|---|---|---|---|
| Ring | Simple / LL / LL128 | ✓ | ✓ | ✓ | ✓ | ✓ |
| Tree | Simple / LL / LL128 | ✓ | | | | |
| NVLS (Intra-node) | Simple only | ✓ | | | ✓ | ✓ |
| NVLS Tree (Multi-node) | Simple only | ✓ | | | | |

NVLS algorithms leverage NVLink Switch (NVSwitch) for intra-node reduction. NVLS Tree extends this with a tree-based fan-out for inter-node communication. CollNet algorithms (not shown) offload reductions to SHARP-enabled network switches.

Protocol Deep Dive

The protocol comparison table above captures the what. Here is the why—how each protocol synchronises data and where the bandwidth/latency trade-off comes from.

Simple Protocol

Designed to maximise bandwidth. Data is divided into large chunks dispatched across communication channels. Uses memory fences to enforce ordering—a receiver must wait until a full chunk has landed before accessing it. This makes it optimal for large transfers but adds significant overhead for small payloads.

LL (Low Latency) Protocol

Replaces memory fences with lightweight flag-based synchronization. Each transmission is 4 bytes of data + 4 bytes of flag, sent together via an 8-byte atomic operation. The intermediate buffer is placed in host memory so the CPU can poll the flag, which prevents GPUDirect RDMA and limits bandwidth to 25–50% of peak. Preferred when latency matters more than throughput.

LL128 Protocol

Extends LL by transmitting 128-byte units (120B data + 8B flag), recovering ~95% of peak bandwidth while keeping flag-based sync. Works best over NVLink, where atomic 128-byte writes are guaranteed. On interconnects that cannot guarantee unsplit 128-byte atomics (e.g. PCIe), NCCL disables LL128 automatically.

Protocol Selection: NCCL's autotuner picks LL/LL128 for small messages and Simple for large ones. Override with NCCL_PROTO=Simple|LL|LL128. In most cases, the autotuner's default is the best choice.
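NCCL reads these variables when the communicator is initialised, so they must be set in each rank's environment before the first collective runs. A minimal sketch (the `LL128` value is only an example override, not a recommendation):

```python
import os

# Force a protocol before NCCL initialises. NCCL_PROTO accepts a
# comma-separated list of allowed protocols; setting a single value
# pins the choice and bypasses the autotuner.
os.environ["NCCL_PROTO"] = "LL128"
os.environ["NCCL_DEBUG"] = "INFO"  # log which protocol/algorithm NCCL picked

# ...then create the process group / NCCL communicator as usual,
# e.g. torch.distributed.init_process_group(backend="nccl").
```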

Communication Channels

NCCL subdivides every collective into communication channels. Each channel is launched as a separate CUDA block on its own SM, operating on a disjoint slice of the input buffer in parallel. This raises aggregate throughput, especially for large payloads, and helps balance traffic across multiple NICs on NVLink platforms.

However, too many channels can cause the per-channel chunk size to fall below the 512 KiB NIC-transport FIFO buffer, leading to partially filled sends that degrade PCIe and network throughput. NCCL heuristically reduces the active channel count for smaller messages.
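The arithmetic behind that heuristic is simple: divide the message across channels and compare the per-channel share against the FIFO size. A sketch with hypothetical channel counts (the 512 KiB threshold is from the text above):

```python
FIFO_BYTES = 512 * 1024  # NIC-transport FIFO buffer size

def underfills_fifo(message_bytes, n_channels):
    """True if each channel's share of the message can't fill one FIFO send."""
    return message_bytes // n_channels < FIFO_BYTES

assert not underfills_fifo(64 * 1024 * 1024, 16)  # 4 MiB/channel: full sends
assert underfills_fifo(4 * 1024 * 1024, 16)       # 256 KiB/channel: underfilled
```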

Tuning note: Environment variables like NCCL_NTHREADS were once used to influence channel behaviour, but are now discouraged in recent NCCL versions and may cause incorrect behaviour if set.

Data Transfer Methods

NCCL uses different transport mechanisms depending on whether communication is intra-node or inter-node.

Intra-Node Transports

  • P2P (NVLink): Preferred path. GPUDirect Peer-to-Peer over NVLink gives highest bandwidth and lowest latency.
  • P2P (PCIe): Fallback when NVLink is unavailable. Still avoids host-memory staging via GPUDirect.
  • P2P_DIRECT: Optimisation for same-process ranks—bypasses IPC handles and intermediate FIFO copies by using direct GPU memory pointers.
  • SHM (Shared Memory): Used when P2P over PCIe is suboptimal (e.g. inter-socket PCIe traffic). Routes via system memory.

Inter-Node Transports

  • IB Verbs (InfiniBand / RoCE): High-performance RDMA transport. Data staged through intermediate buffers; proxy thread manages DMA/RDMA operations.
  • GPUDirect RDMA: When NIC and GPU share a PCIe switch, the intermediate buffer lives in GPU memory—NIC accesses it directly, bypassing CPU and host memory entirely.
  • Socket (TCP): Fallback when RDMA is unavailable. Data copies through CPU-pinned host memory, incurring extra PCIe round-trips.
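The lists above imply a preference order. The sketch below is an illustrative simplification of that order (it is not NCCL's actual topology-detection code, and the parameter names are invented):

```python
def pick_transport(intra_node, nvlink, pcie_p2p_ok, rdma, gdr_ok):
    """Toy priority order for the transports listed above."""
    if intra_node:
        if nvlink:
            return "P2P (NVLink)"      # fastest intra-node path
        if pcie_p2p_ok:
            return "P2P (PCIe)"        # GPUDirect P2P without NVLink
        return "SHM"                   # stage through system memory
    if rdma:
        # GPUDirect RDMA needs the NIC and GPU close on the PCIe topology
        return "GPUDirect RDMA" if gdr_ok else "IB Verbs"
    return "Socket (TCP)"              # last-resort fallback
```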

Collective Algorithms

NCCL breaks each collective into low-level primitives (send, recv, recvReduceSend, recvCopySend, recvReduceCopySend) and distributes them across parallel channels. Algorithms fall into two execution patterns:

Non-Pipelined

Each GPU must finish all steps in one iteration before starting the next.

  • Ring AllReduce (2k−1 steps)
  • Ring AllGather (k−1 steps)
  • Ring ReduceScatter (k−1 steps)

Pipelined

Consecutive loop iterations can overlap, enabling higher throughput.

  • Tree AllReduce (Reduce + Broadcast phases)
  • Ring Broadcast
  • Ring Reduce

Ring AllReduce

Combines a ReduceScatter phase with an AllGather phase over 2k−1 steps (k = number of GPUs). In the first k steps, each GPU receives a segment, reduces it with local data, and forwards the result. In the remaining k−1 steps, fully reduced segments are propagated around the ring via recvCopySend.

| Step | Primitive |
|---|---|
| 0 | send |
| 1 … k−2 | recvReduceSend |
| k−1 | recvReduceCopySend |
| k … 2k−3 | recvCopySend |
| 2k−2 | recv |
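The two phases can be checked with a toy simulation: k ranks, one scalar segment per rank, with simultaneous sends modelled by snapshotting each step. This verifies the semantics only; it is not NCCL's implementation.

```python
def ring_allreduce(bufs):
    """bufs: k lists of length k (one segment per rank). Returns summed copies."""
    k = len(bufs)
    bufs = [list(b) for b in bufs]  # don't mutate the caller's data
    # Reduce-scatter (k-1 steps): at step s, rank r forwards segment (r-s) % k
    # to rank r+1, which reduces it into its local copy (recvReduceSend).
    for s in range(k - 1):
        sent = [bufs[r][(r - s) % k] for r in range(k)]  # simultaneous sends
        for r in range(k):
            bufs[r][(r - 1 - s) % k] += sent[(r - 1) % k]
    # Allgather (k-1 steps): fully reduced segments circulate via copy
    # (recvCopySend), until every rank holds every summed segment.
    for s in range(k - 1):
        sent = [bufs[r][(r + 1 - s) % k] for r in range(k)]
        for r in range(k):
            bufs[r][(r - s) % k] = sent[(r - 1) % k]
    return bufs

out = ring_allreduce([[1, 2, 3, 4], [5, 6, 7, 8],
                      [9, 10, 11, 12], [13, 14, 15, 16]])
assert all(buf == [28, 32, 36, 40] for buf in out)  # element-wise sums
```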

Tree AllReduce

Uses a double binary tree topology. The algorithm proceeds in two phases within each loop iteration:

  • Reduce: partial sums flow from the leaves up the tree; the root ends up with the fully reduced result.
  • Broadcast: the root propagates the reduced result back down the tree to every rank.

These two phases can run concurrently by partitioning SMs into two groups—one for the bandwidth-intensive reduction, another for the broadcast—enabling overlap and better SM utilization.
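A minimal sketch of the two phases on a single array-encoded binary tree (NCCL actually uses two complementary trees so every rank does useful work in both; this only illustrates the reduce-then-broadcast flow):

```python
def tree_allreduce(values):
    """One scalar per rank; rank 0 is the root of an array-encoded binary tree."""
    k = len(values)
    vals = list(values)
    # Reduce phase: children push partial sums toward the root (rank 0)
    for r in range(k - 1, 0, -1):
        vals[(r - 1) // 2] += vals[r]
    # Broadcast phase: the root's total flows back down the tree
    for r in range(1, k):
        vals[r] = vals[(r - 1) // 2]
    return vals

assert tree_allreduce([1, 2, 3, 4]) == [10, 10, 10, 10]
```

Each phase takes O(log k) hops along the tree depth, which is why Tree beats Ring's O(k) step count at small message sizes.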

When to use which? Ring excels for large messages (bandwidth-optimal), while Tree performs best for smaller messages (lower latency). NCCL's autotuner selects the algorithm based on message size and topology. Override with NCCL_ALGO=Ring|Tree.

Performance Insights (Theoretical)

Benchmarking on NVIDIA Grace Hopper (GH200) nodes (150 GB/s intra-node, 25 GB/s Slingshot inter-node) confirms the protocol trade-offs described above.

NCCL Auto-Selection Logic

When no NCCL_ALGO or NCCL_PROTO overrides are set, NCCL picks based on message size:

| Message Size | Strategy | Algorithm | Protocol |
|---|---|---|---|
| Small (<128 KB) | Minimise latency | Tree | LL / LL128 |
| Medium (128 KB – 10 MB) | Balance | Tree / Ring | LL128 / Simple |
| Large (>10 MB) | Maximise bandwidth | Ring / NVLS | Simple |
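The table reduces to a simple lookup. The thresholds below are the article's rough figures, not NCCL's exact tuning model (which also weighs topology, rank count, and measured bus bandwidths):

```python
def autoselect(message_bytes):
    """Illustrative size-based selection mirroring the table above."""
    KB, MB = 1024, 1024 * 1024
    if message_bytes < 128 * KB:
        return ("Tree", "LL/LL128")           # minimise latency
    if message_bytes <= 10 * MB:
        return ("Tree/Ring", "LL128/Simple")  # balance
    return ("Ring/NVLS", "Simple")            # maximise bandwidth

assert autoselect(64 * 1024) == ("Tree", "LL/LL128")
assert autoselect(100 * 1024 * 1024) == ("Ring/NVLS", "Simple")
```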

Real-World Benchmarks: H100 AllReduce

Test Configuration

  • GPUs: 8× H100-80GB per node
  • Intra-node: NVLink
  • Inter-node: GPUDirect-TCPX (no InfiniBand)
  • Message sizes: 1 MB – 1 GB
  • Iterations: 100 warmup, 10,000 timed per message size
  • Date: 2026-03-31
| Setting | Default | NVLSTree Override |
|---|---|---|
| NCCL_ALGO | Ring | Ring;allreduce:NVLSTree |
| NCCL_PROTO | Simple | LL128,LL,Simple;allreduce:Simple |
| NCCL_IB_DISABLE | 1 | 1 |

2-Node Results (16 GPUs): Default vs NVLSTree

| Size | Default Avg | Default BW | NVLSTree Avg | NVLSTree BW | Winner |
|---|---|---|---|---|---|
| 1 MB | 2.90 ms | 5.4 GB/s | 0.86 ms | 18.3 GB/s | NVLSTree (3.4×) |
| 2 MB | 2.95 ms | 10.7 GB/s | 0.85 ms | 36.9 GB/s | NVLSTree (3.5×) |
| 4 MB | 2.93 ms | 21.5 GB/s | 0.87 ms | 72.4 GB/s | NVLSTree (3.4×) |
| 8 MB | 3.02 ms | 41.7 GB/s | 1.33 ms | 94.6 GB/s | NVLSTree (2.3×) |
| 16 MB | 3.17 ms | 79.4 GB/s | 2.23 ms | 113.0 GB/s | NVLSTree (1.4×) |
| 32 MB | 3.31 ms | 152.1 GB/s | 2.31 ms | 217.8 GB/s | NVLSTree (1.4×) |
| 64 MB | 3.65 ms | 276.1 GB/s | 3.95 ms | 254.8 GB/s | Crossover |
| 128 MB | 4.89 ms | 411.4 GB/s | 8.22 ms | 245.0 GB/s | Default (1.7×) |
| 256 MB | 7.15 ms | 562.9 GB/s | 16.66 ms | 241.8 GB/s | Default (2.3×) |
| 512 MB | 14.21 ms | 566.9 GB/s | 33.31 ms | 241.7 GB/s | Default (2.3×) |
| 1 GB | 28.18 ms | 571.6 GB/s | | | Default |

8-Node Results (64 GPUs): Default vs NVLSTree

| Size | Default Avg | Default BW | NVLSTree Avg | NVLSTree BW | Winner |
|---|---|---|---|---|---|
| 1 MB | 13.35 ms | 4.95 GB/s | 2.30 ms | 28.8 GB/s | NVLSTree (5.8×) |
| 2 MB | 13.44 ms | 9.83 GB/s | 2.31 ms | 57.3 GB/s | NVLSTree (5.8×) |
| 4 MB | 13.33 ms | 19.8 GB/s | 2.48 ms | 106.8 GB/s | NVLSTree (5.4×) |
| 8 MB | 13.14 ms | 40.2 GB/s | 3.64 ms | 145.2 GB/s | NVLSTree (3.6×) |
| 16 MB | 13.42 ms | 78.8 GB/s | 6.11 ms | 172.9 GB/s | NVLSTree (2.2×) |
| 32 MB | 13.86 ms | 152.6 GB/s | 10.98 ms | 192.5 GB/s | NVLSTree (1.3×) |
| 64 MB | 14.13 ms | 299.3 GB/s | 19.90 ms | 212.5 GB/s | Crossover |
| 128 MB | 14.52 ms | 582.3 GB/s | 38.00 ms | 222.5 GB/s | Default (2.6×) |
| 256 MB | 16.29 ms | 1038.4 GB/s | 73.89 ms | 228.9 GB/s | Default (4.5×) |

Scaling: 2-Node vs 8-Node (Default Settings)

| Size | 2-Node Avg | 2-Node BW | 8-Node Avg | 8-Node BW | Latency | Bandwidth |
|---|---|---|---|---|---|---|
| 1 MB | 2.90 ms | 5.4 GB/s | 13.35 ms | 4.95 GB/s | 4.6× worse | ~same |
| 4 MB | 2.93 ms | 21.5 GB/s | 13.33 ms | 19.8 GB/s | 4.6× worse | ~same |
| 16 MB | 3.17 ms | 79.4 GB/s | 13.42 ms | 78.8 GB/s | 4.2× worse | ~same |
| 64 MB | 3.65 ms | 276.1 GB/s | 14.13 ms | 299.3 GB/s | 3.9× worse | 1.1× better |
| 128 MB | 4.89 ms | 411.4 GB/s | 14.52 ms | 582.3 GB/s | 3.0× worse | 1.4× better |
| 256 MB | 7.15 ms | 562.9 GB/s | 16.29 ms | 1038.4 GB/s | 2.3× worse | 1.8× better |
| 512 MB | 14.21 ms | 566.9 GB/s | 22.64 ms | 1493.8 GB/s | 1.6× worse | 2.6× better |
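The ratio columns are derived from the raw per-row numbers; a quick consistency check on two rows:

```python
# Raw numbers from the scaling table: (2-node latency ms, 2-node BW,
#                                      8-node latency ms, 8-node BW)
rows = {
    "256 MB": (7.15, 562.9, 16.29, 1038.4),
    "512 MB": (14.21, 566.9, 22.64, 1493.8),
}
ratios = {
    size: (round(l8 / l2, 1), round(b8 / b2, 1))  # (latency worse, BW better)
    for size, (l2, b2, l8, b8) in rows.items()
}
# 256 MB -> (2.3, 1.8); 512 MB -> (1.6, 2.6), matching the table's last columns
```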

Key Takeaways

  • Default (Ring/Simple) wins for large messages (≥64 MB): up to 2.3× higher bandwidth on 2 nodes, up to 4.5× on 8 nodes. Peak ~571 GB/s (2N) and ~1494 GB/s (8N).
  • NVLSTree wins for small messages (≤32 MB): up to 3.5× lower latency on 2 nodes, up to 5.8× on 8 nodes via NVLS hardware intra-node reduction.
  • Crossover point: ~64 MB on both 2-node and 8-node setups.
  • Scaling: latency grows ~4× from 2→8 nodes for small messages, but aggregate bandwidth improves for large messages (2.6× better at 512 MB).
  • NVLSTree advantage grows with node count—Ring latency scales linearly with GPU count, while NVLSTree scales logarithmically.

Key Environment Variables

| Variable | Purpose |
|---|---|
| NCCL_DEBUG=INFO | Verbose logging of GPU communication topology, algorithm selection, and transport details. |
| NCCL_PROTO=Simple\|LL\|LL128 | Force a specific communication protocol (overrides autotuner). |
| NCCL_ALGO=Ring\|Tree | Force a specific collective algorithm. |
| NCCL_CROSS_NIC=0\|1\|2 | Control whether NCCL routes intra-node traffic through NICs across sockets. |
| NCCL_NET_GDR_LEVEL | Control GPUDirect RDMA usage level based on GPU-NIC topology distance. |
| NCCL_IB_HCA | Specify which InfiniBand HCA(s) to use. |
| NCCL_SOCKET_IFNAME | Specify network interface for socket-based communication. |