Advanced Optimization & Profiling Techniques

LLM Training on NVIDIA Grace Hopper

AMP, CPU offloading, ZeRO, CUDA Graphs, KV cache offload, selective profiling

AMP / Mixed Precision (FP16/BF16)

Automatic Mixed Precision (AMP) keeps most of training in FP16 or BF16, using FP32 only where it matters numerically (e.g. master weights and numerically sensitive ops such as reductions and softmax). This reduces memory use and increases throughput while keeping training stable.

  • Nearly halves GPU memory for activations and gradients vs FP32
  • Leverages Tensor Cores (FP16/BF16) for faster matrix math
  • FP16 training is typically paired with a GradScaler, which scales the loss before backprop so small gradients do not underflow in FP16 (usually unnecessary with BF16)
  • BF16 is often preferred for LLM training: same exponent range as FP32, fewer precision issues than FP16

In practice: Frameworks like PyTorch (torch.cuda.amp), Lightning (precision="bf16-mixed"), and NeMo apply AMP automatically once enabled.
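A minimal, device-agnostic sketch of such a loop (the model, data, and shapes are placeholders; the GradScaler would only be enabled for an FP16-on-CUDA run, and is a no-op on the BF16 path shown here):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.bfloat16               # BF16 needs no loss scaling
use_scaler = device == "cuda" and amp_dtype == torch.float16

model = torch.nn.Linear(16, 4).to(device)             # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_scaler)

x = torch.randn(8, 16, device=device)                 # placeholder batch
y = torch.randn(8, 4, device=device)

for _ in range(3):
    opt.zero_grad(set_to_none=True)
    # Forward pass runs mostly in the reduced-precision dtype.
    with torch.autocast(device_type=device, dtype=amp_dtype):
        loss = torch.nn.functional.mse_loss(model(x), y)
    # scale() / step() / update() are no-ops when the scaler is disabled.
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
print(float(loss))
```

For an FP16 run, swapping `amp_dtype` to `torch.float16` enables the scaler, and the same loop then performs loss scaling automatically.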

CPU Offloading

CPU offloading moves part of the workload—e.g. optimizer states, gradients, or inactive parameters—from GPU memory to CPU RAM. The GPU stays less memory-bound and can fit larger models or larger batches.

  • Extension of FSDP / DeepSpeed: inactive parameters, gradients, or optimizer states are offloaded to CPU
  • On Grace Hopper, NVLink-C2C (~900 GB/s) makes CPU–GPU transfers much faster than PCIe, so offloading is more effective
  • SuperOffload is built for Grace Hopper: full fine-tuning of 20B+ models on a single GH200, up to ~4× the throughput of the earlier ZeRO-Offload, with GPU utilization raised from ~50% to >80%
  • Integrates with DeepSpeed ZeRO Stage 3 and Hugging Face Transformers without model code changes

Trade-off: Extra CPU–GPU transfer latency. On systems with slow PCIe, offload can slow training; on Grace Hopper, the fast link makes it a win.
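As one concrete configuration, CPU offloading in DeepSpeed is enabled through the `zero_optimization` section of its JSON config. The fragment below (written as a Python dict) uses DeepSpeed's documented keys; the batch size and precision settings are placeholder choices:

```python
import json

# Illustrative DeepSpeed config: ZeRO Stage 3 with optimizer states and
# parameters offloaded to CPU RAM. Values are placeholders, not tuned.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}
print(json.dumps(ds_config, indent=2))
```

`pin_memory` keeps the CPU-side buffers page-locked, which speeds up the CPU–GPU transfers that offloading introduces.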

ZeRO (Zero Redundancy Optimizer)

ZeRO (DeepSpeed) partitions optimizer states, gradients, and optionally parameters across GPUs so that each device only holds a fraction. This reduces per-GPU memory and enables larger models or larger batch sizes when using multiple GPUs.

  • Stage 1: Partition optimizer states across ranks
  • Stage 2: Partition gradients as well
  • Stage 3: Partition parameters; each GPU only stores a slice of the model
  • Often combined with ZeRO-Offload: offload optimizer states (and optionally more) to CPU to further reduce GPU memory
  • When a model still exceeds what ZeRO alone can fit, multi-GPU setups combine TP (Tensor Parallelism), PP (Pipeline Parallelism), and ZeRO
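The savings per stage can be estimated with the standard accounting for mixed-precision Adam: 2 bytes/param for FP16 weights, 2 for FP16 gradients, and 12 for FP32 optimizer state (master copy + momentum + variance), ~16 bytes/param in total. A small calculator, assuming this breakdown:

```python
def zero_bytes_per_param(stage: int, n_gpus: int) -> float:
    """Approximate per-GPU bytes/parameter for mixed-precision Adam
    under ZeRO. Stage 1 partitions optimizer state, Stage 2 adds
    gradients, Stage 3 adds parameters."""
    params, grads, opt_state = 2.0, 2.0, 12.0
    if stage >= 1:
        opt_state /= n_gpus
    if stage >= 2:
        grads /= n_gpus
    if stage >= 3:
        params /= n_gpus
    return params + grads + opt_state

# Example: a 7B-parameter model on 8 GPUs.
n_params = 7e9
for stage in (0, 1, 2, 3):
    gb = zero_bytes_per_param(stage, 8) * n_params / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For 7B parameters on 8 GPUs this gives roughly 112 GB per GPU without ZeRO, falling to about 14 GB at Stage 3 (model-state memory only; activations come on top).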

CUDA Graphs

CUDA Graphs record a sequence of kernel launches and replay them with a single call from the CPU. This cuts kernel launch overhead and reduces CPU–GPU synchronization, which helps when the CPU is too slow to keep the GPU busy (the “CPU bottleneck”).

  • Like a “tape recorder”: record the exact sequence of kernels, then replay it repeatedly
  • Works best with fixed shapes and fixed memory addresses; dynamic control flow or changing tensor sizes limit usability
  • Does not fuse kernels—it only launches existing kernels more efficiently
  • Use when kernels are already optimized but launch latency is the bottleneck; for general PyTorch, torch.compile (which can use CUDA Graphs under the hood with mode="reduce-overhead") is often the easier choice
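The capture/replay pattern can be sketched with PyTorch's `torch.cuda.graph` helper (guarded so it is skipped without a GPU; the model and shapes are placeholders). Note that replay reuses fixed memory addresses, so new inputs must be copied into the captured tensors rather than reallocated:

```python
import torch

captured = False
if torch.cuda.is_available():
    model = torch.nn.Linear(256, 256).cuda()   # placeholder model

    # Static input buffer: the graph will always read this address.
    static_in = torch.zeros(32, 256, device="cuda")

    # Warm-up on a side stream before capture, as PyTorch requires.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    # Record the whole forward pass once...
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = model(static_in)

    # ...then replay it: copy new data in, launch everything in one call.
    static_in.copy_(torch.randn(32, 256, device="cuda"))
    g.replay()                                  # result lands in static_out
    captured = True
```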

KV Cache Offload

In autoregressive LLM inference (and some training setups), the KV cache holds key/value states for all previous tokens. For long sequences or large models, it can exceed GPU memory. KV cache offloading moves part of the cache to CPU memory and fetches it as needed.

  • Uses Unified Memory or explicit CPU–GPU memory sharing: GPU can access a shared address space; pages are migrated or copied on demand
  • Reduces GPU OOMs and allows running very large models (e.g. Llama 3 70B) by spilling KV cache (and optionally other tensors) to CPU
  • On Grace Hopper, NVLink-C2C makes CPU–GPU transfer fast enough for offloaded KV cache to be practical
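The mechanics can be illustrated with a schematic, framework-agnostic cache that keeps only the most recent tokens' key/value entries in the fast tier and spills older ones to a CPU-side store; real systems do this via Unified Memory paging or bulk tensor copies rather than per-entry Python objects:

```python
from collections import OrderedDict

class SpillingKVCache:
    """Toy KV cache: the newest `gpu_capacity` entries stay in the
    fast ("GPU") tier; older entries are spilled to the slow ("CPU")
    tier and fetched on demand."""

    def __init__(self, gpu_capacity: int):
        self.gpu_capacity = gpu_capacity
        self.gpu: OrderedDict[int, tuple] = OrderedDict()  # pos -> (K, V)
        self.cpu: dict = {}

    def append(self, pos: int, kv: tuple) -> None:
        self.gpu[pos] = kv
        while len(self.gpu) > self.gpu_capacity:
            old_pos, old_kv = self.gpu.popitem(last=False)  # evict oldest
            self.cpu[old_pos] = old_kv                       # spill to CPU

    def get(self, pos: int) -> tuple:
        if pos in self.gpu:
            return self.gpu[pos]
        return self.cpu[pos]   # on-demand fetch from the slow tier

cache = SpillingKVCache(gpu_capacity=4)
for t in range(10):                       # decode 10 tokens
    cache.append(t, (f"K{t}", f"V{t}"))
print(len(cache.gpu), len(cache.cpu))     # 4 resident, 6 spilled
```

The attention computation still sees every token's K/V; only the storage tier differs, which is why the CPU–GPU link bandwidth dominates the cost.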

Selective Profiling

Profiling every iteration of a long LLM training run produces huge traces and can change timing. Selective profiling limits capture to a chosen range of steps so you get actionable data without oversized files or excessive overhead.

  • Use environment variables (e.g. TLLM_PROFILE_START_STOP in TensorRT-LLM) or profiler APIs to start/stop capture only for specific iterations
  • CUDA Profiler API and Nsight Systems support toggling profiling on/off in code so you can target a warm-up phase plus a few training steps
  • Keeps profile size manageable and focuses the timeline on the region of interest
  • Combine with NVTX markers and PyTorch Profiler for a clear view of which ops and kernels run in the profiled window

Workflow: Run with profiling disabled for warm-up, enable for a short range (e.g. 2–5 iterations), then inspect the trace in Nsight Systems or Chrome tracing.
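With PyTorch Profiler, this warm-up-then-capture workflow maps directly onto `torch.profiler.schedule`. The sketch below (CPU-only so it runs anywhere; `train_step` is a placeholder workload) skips 8 steps, warms up the profiler for 2, then records 3 active steps:

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

def train_step():
    a = torch.randn(256, 256)   # placeholder for a real training step
    return a @ a

traces = []
prof_schedule = schedule(wait=8, warmup=2, active=3, repeat=1)

with profile(
    activities=[ProfilerActivity.CPU],
    schedule=prof_schedule,
    on_trace_ready=lambda p: traces.append(p),  # fires once per cycle
) as prof:
    for step in range(15):
        train_step()
        prof.step()   # advance the profiler schedule each iteration
print(len(traces))
```

On a GPU run you would add `ProfilerActivity.CUDA` and export the trace (e.g. `export_chrome_trace`) inside `on_trace_ready` for inspection in the Chrome trace viewer or Nsight Systems.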

References

This page summarizes techniques described in the following NVIDIA technical blogs: