Advanced Optimization & Profiling Techniques

LLM Training on NVIDIA Grace Hopper

AMP, CPU offloading, ZeRO, CUDA Graphs, KV cache offload, selective profiling

AMP / Mixed Precision (FP16/BF16)

Automatic Mixed Precision (AMP) keeps most of training in FP16 or BF16, using FP32 only where it matters numerically (e.g. master weights and numerically sensitive ops such as reductions and softmax). This reduces memory use and increases throughput while keeping training stable.

  • Nearly halves GPU memory for activations and gradients vs FP32
  • Leverages Tensor Cores (FP16/BF16) for faster matrix math
  • FP16 training is typically paired with a GradScaler, which scales the loss before backprop so small gradients do not underflow in FP16 (usually unnecessary with BF16)
  • BF16 is often preferred for LLM training: same exponent range as FP32, fewer precision issues than FP16

In practice: Frameworks like PyTorch (torch.cuda.amp), Lightning (precision="bf16-mixed"), and NeMo apply AMP automatically once enabled.
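A minimal, device-agnostic sketch of such a loop (the model, data, and shapes are placeholders; the GradScaler would only be enabled for an FP16-on-CUDA run, and is a no-op on the BF16 path shown here):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.bfloat16               # BF16 needs no loss scaling
use_scaler = device == "cuda" and amp_dtype == torch.float16

model = torch.nn.Linear(16, 4).to(device)             # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_scaler)

x = torch.randn(8, 16, device=device)                 # placeholder batch
y = torch.randn(8, 4, device=device)

for _ in range(3):
    opt.zero_grad(set_to_none=True)
    # Forward pass runs mostly in the reduced-precision dtype.
    with torch.autocast(device_type=device, dtype=amp_dtype):
        loss = torch.nn.functional.mse_loss(model(x), y)
    # scale() / step() / update() are no-ops when the scaler is disabled.
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
print(float(loss))
```

For an FP16 run, swapping `amp_dtype` to `torch.float16` enables the scaler, and the same loop then performs loss scaling automatically.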

CPU Offloading

CPU offloading moves part of the workload—e.g. optimizer states, gradients, or inactive parameters—from GPU memory to CPU RAM. The GPU stays less memory-bound and can fit larger models or larger batches.

  • Extension of FSDP / DeepSpeed: inactive parameters, gradients, or optimizer states are offloaded to CPU
  • On Grace Hopper, NVLink-C2C (~900 GB/s) makes CPU–GPU transfers much faster than PCIe, so offloading is more effective
  • SuperOffload is built for Grace Hopper: full fine-tuning of 20B+ models on a single GH200, up to ~4× the throughput of the earlier ZeRO-Offload, with GPU utilization raised from ~50% to >80%
  • Integrates with DeepSpeed ZeRO Stage 3 and Hugging Face Transformers without model code changes

Trade-off: Extra CPU–GPU transfer latency. On systems with slow PCIe, offload can slow training; on Grace Hopper, the fast link makes it a win.
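As one concrete configuration, CPU offloading in DeepSpeed is enabled through the `zero_optimization` section of its JSON config. The fragment below (written as a Python dict) uses DeepSpeed's documented keys; the batch size and precision settings are placeholder choices:

```python
import json

# Illustrative DeepSpeed config: ZeRO Stage 3 with optimizer states and
# parameters offloaded to CPU RAM. Values are placeholders, not tuned.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}
print(json.dumps(ds_config, indent=2))
```

`pin_memory` keeps the CPU-side buffers page-locked, which speeds up the CPU–GPU transfers that offloading introduces.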

ZeRO (Zero Redundancy Optimizer)

ZeRO (DeepSpeed) partitions optimizer states, gradients, and optionally parameters across GPUs so that each device only holds a fraction. This reduces per-GPU memory and enables larger models or larger batch sizes when using multiple GPUs.

  • Stage 1: Partition optimizer states across ranks
  • Stage 2: Partition gradients as well
  • Stage 3: Partition parameters; each GPU only stores a slice of the model
  • Often combined with ZeRO-Offload: offload optimizer states (and optionally more) to CPU to further reduce GPU memory
  • When a model still exceeds what ZeRO alone can fit, multi-GPU setups combine TP (Tensor Parallelism), PP (Pipeline Parallelism), and ZeRO
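The savings per stage can be estimated with the standard accounting for mixed-precision Adam: 2 bytes/param for FP16 weights, 2 for FP16 gradients, and 12 for FP32 optimizer state (master copy + momentum + variance), ~16 bytes/param in total. A small calculator, assuming this breakdown:

```python
def zero_bytes_per_param(stage: int, n_gpus: int) -> float:
    """Approximate per-GPU bytes/parameter for mixed-precision Adam
    under ZeRO. Stage 1 partitions optimizer state, Stage 2 adds
    gradients, Stage 3 adds parameters."""
    params, grads, opt_state = 2.0, 2.0, 12.0
    if stage >= 1:
        opt_state /= n_gpus
    if stage >= 2:
        grads /= n_gpus
    if stage >= 3:
        params /= n_gpus
    return params + grads + opt_state

# Example: a 7B-parameter model on 8 GPUs.
n_params = 7e9
for stage in (0, 1, 2, 3):
    gb = zero_bytes_per_param(stage, 8) * n_params / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For 7B parameters on 8 GPUs this gives roughly 112 GB per GPU without ZeRO, falling to about 14 GB at Stage 3 (model-state memory only; activations come on top).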

CUDA Graphs

CUDA Graphs record a sequence of kernel launches and replay them with a single call from the CPU. This cuts kernel launch overhead and reduces CPU–GPU synchronization, which helps when the CPU is too slow to keep the GPU busy (the “CPU bottleneck”).

  • Like a “tape recorder”: record the exact sequence of kernels, then replay it repeatedly
  • Works best with fixed shapes and fixed memory addresses; dynamic control flow or changing tensor sizes limit usability
  • Does not fuse kernels—it only launches existing kernels more efficiently
  • Use when kernels are already optimized but launch latency is the bottleneck; for general PyTorch, torch.compile (which can use CUDA Graphs under the hood with mode="reduce-overhead") is often the easier choice
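The capture/replay pattern can be sketched with PyTorch's `torch.cuda.graph` helper (guarded so it is skipped without a GPU; the model and shapes are placeholders). Note that replay reuses fixed memory addresses, so new inputs must be copied into the captured tensors rather than reallocated:

```python
import torch

captured = False
if torch.cuda.is_available():
    model = torch.nn.Linear(256, 256).cuda()   # placeholder model

    # Static input buffer: the graph will always read this address.
    static_in = torch.zeros(32, 256, device="cuda")

    # Warm-up on a side stream before capture, as PyTorch requires.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    # Record the whole forward pass once...
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = model(static_in)

    # ...then replay it: copy new data in, launch everything in one call.
    static_in.copy_(torch.randn(32, 256, device="cuda"))
    g.replay()                                  # result lands in static_out
    captured = True
```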

KV Cache Offload

In autoregressive LLM inference (and some training setups), the KV cache holds key/value states for all previous tokens. For long sequences or large models, it can exceed GPU memory. KV cache offloading moves part of the cache to CPU memory and fetches it as needed.

  • Uses Unified Memory or explicit CPU–GPU memory sharing: GPU can access a shared address space; pages are migrated or copied on demand
  • Reduces GPU OOMs and allows running very large models (e.g. Llama 3 70B) by spilling KV cache (and optionally other tensors) to CPU
  • On Grace Hopper, NVLink-C2C makes CPU–GPU transfer fast enough for offloaded KV cache to be practical
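The mechanics can be illustrated with a schematic, framework-agnostic cache that keeps only the most recent tokens' key/value entries in the fast tier and spills older ones to a CPU-side store; real systems do this via Unified Memory paging or bulk tensor copies rather than per-entry Python objects:

```python
from collections import OrderedDict

class SpillingKVCache:
    """Toy KV cache: the newest `gpu_capacity` entries stay in the
    fast ("GPU") tier; older entries are spilled to the slow ("CPU")
    tier and fetched on demand."""

    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.gpu: OrderedDict[int, tuple] = OrderedDict()  # pos -> (K, V)
        self.cpu: dict = {}

    def append(self, pos: int, kv: tuple) -> None:
        self.gpu[pos] = kv
        while len(self.gpu) > self.gpu_capacity:
            old_pos, old_kv = self.gpu.popitem(last=False)  # evict oldest
            self.cpu[old_pos] = old_kv                       # spill to CPU

    def get(self, pos: int) -> tuple:
        if pos in self.gpu:
            return self.gpu[pos]
        return self.cpu[pos]   # on-demand fetch from the slow tier

cache = SpillingKVCache(gpu_capacity=4)
for t in range(10):                       # decode 10 tokens
    cache.append(t, (f"K{t}", f"V{t}"))
print(len(cache.gpu), len(cache.cpu))     # 4 resident, 6 spilled
```

The attention computation still sees every token's K/V; only the storage tier differs, which is why the CPU–GPU link bandwidth dominates the cost.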

Selective Profiling

Profiling every iteration of a long LLM training run produces huge traces and can change timing. Selective profiling limits capture to a chosen range of steps so you get actionable data without oversized files or excessive overhead.

  • Use environment variables (e.g. TLLM_PROFILE_START_STOP in TensorRT-LLM) or profiler APIs to start/stop capture only for specific iterations
  • CUDA Profiler API and Nsight Systems support toggling profiling on/off in code so you can target a warm-up phase plus a few training steps
  • Keeps profile size manageable and focuses the timeline on the region of interest
  • Combine with NVTX markers and PyTorch Profiler for a clear view of which ops and kernels run in the profiled window

Workflow: Run with profiling disabled for warm-up, enable for a short range (e.g. 2–5 iterations), then inspect the trace in Nsight Systems or Chrome tracing.
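With PyTorch Profiler, this warm-up-then-capture workflow maps directly onto `torch.profiler.schedule`. The sketch below (CPU-only so it runs anywhere; `train_step` is a placeholder workload) skips 8 steps, warms up the profiler for 2, then records 3 active steps:

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

def train_step():
    a = torch.randn(256, 256)   # placeholder for a real training step
    return a @ a

traces = []
prof_schedule = schedule(wait=8, warmup=2, active=3, repeat=1)

with profile(
    activities=[ProfilerActivity.CPU],
    schedule=prof_schedule,
    on_trace_ready=lambda p: traces.append(p),  # fires once per cycle
) as prof:
    for step in range(15):
        train_step()
        prof.step()   # advance the profiler schedule each iteration
print(len(traces))
```

On a GPU run you would add `ProfilerActivity.CUDA` and export the trace (e.g. `export_chrome_trace`) inside `on_trace_ready` for inspection in the Chrome trace viewer or Nsight Systems.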

References

This page summarizes techniques described in the following NVIDIA technical blogs: