AI Performance Design Guide
A comprehensive collection of system design resources, architectural patterns, and implementation guides.
Performance
Performance analysis, optimization techniques, profiling tools, and fundamental performance principles governing distributed systems.
Model Training and Inference
Machine learning model training workflows, memory management, and inference optimization techniques.
Model Optimization
Quantization, sparsity, and optimization techniques for accelerating LLM and diffusion model inference with NVIDIA TensorRT-LLM.
AI Infrastructure
High-performance computing, GPU architectures, and ML system infrastructure.
-
Comparison of NVIDIA Data Center Chips
HTML
Feature comparison of H100, H200, GH200, B200, and GB200: architecture, FP16 TFLOPS, VRAM, CPU, CPU memory, and CPU-to-GPU link
-
NVIDIA H100 NVLink Architecture
MD
Complete H100 system architecture with NVLink topology and specifications
-
H100 NVLink Topology
MD
Detailed NVLink connection topology for H100 GPU clusters
-
H100 Streaming Multiprocessor Architecture
MD
Internal architecture of H100 streaming multiprocessors
-
CPU vs GPU Architecture
HTML
Comparative analysis of CPU and GPU architectures for ML workloads
-
High-Speed Interconnects
HTML
InfiniBand, NVLink, and other high-speed interconnect technologies
-
InfiniBand Protocol Explorer
HTML
Interactive guide to InfiniBand queue pairs and packet transmission
-
Nikel Network Analysis
HTML
Network infrastructure and performance analysis tools
-
NVMe-oF Shared File Systems
HTML
NVMe over Fabrics for distributed storage systems
-
Blueprint for Modern ML Systems
PDF
Comprehensive blueprint for designing high-performance machine learning systems
-
DeepSeek V3 Technical Paper
PDF
Technical documentation and architecture details for DeepSeek V3 model
-
Fire-Flyer AI-HPC Architecture
PDF
High-performance computing architecture for AI workloads
-
DeepSeek V3 Scaling Insights
PDF
Analysis of scaling challenges and hardware considerations for DeepSeek V3
CUDA Programming
CUDA programming concepts, memory hierarchy, execution models, and development resources.
-
CUDA Execution & Memory Hierarchy
MD
Grid, block, warp, and thread hierarchy, with register allocation and warp scheduling explained
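The launch-configuration arithmetic behind that hierarchy can be sketched in a few lines. This is a minimal illustration, not code from the linked guide; the 256-threads-per-block choice and the problem size are hypothetical, while the 32-thread warp size is fixed on all current NVIDIA GPUs.

```python
import math

WARP_SIZE = 32  # fixed warp width on all current NVIDIA GPUs

def launch_config(n_elements: int, threads_per_block: int = 256) -> dict:
    """Compute grid/block/warp counts for a 1-D elementwise kernel."""
    blocks = math.ceil(n_elements / threads_per_block)
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    return {
        "blocks": blocks,
        "threads_per_block": threads_per_block,
        "warps_per_block": warps_per_block,
        "total_warps": blocks * warps_per_block,
    }

cfg = launch_config(1_000_000)
print(cfg)  # 3907 blocks, 8 warps per block, 31256 warps total
```

One over-full last block is the usual cost of the ceiling division; kernels guard against it with a bounds check on the global thread index.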
-
CUDA Programming Model
HTML
CUDA programming concepts, memory hierarchy, and execution model
-
GPU Memory Hierarchy
HTML
Interactive guide to GPU cache levels, cache lines, access patterns, and bandwidth optimization
-
Life of a Memory Request: HBM to Register
HTML
How a load travels through the memory hierarchy: a Mermaid diagram with sizes, latencies, and throughput figures (H100-oriented)
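The latency and bandwidth figures in such a walkthrough imply, via Little's law, how much data must be in flight to keep HBM saturated. A back-of-the-envelope sketch, using round H100-order-of-magnitude numbers (~3 TB/s HBM bandwidth, ~500 ns load latency) that are illustrative rather than exact specs:

```python
# Little's law: required concurrency = bandwidth * latency.
# Both constants are illustrative, not exact H100 specifications.
HBM_BANDWIDTH = 3.0e12  # bytes/s (~3 TB/s)
HBM_LATENCY = 500e-9    # seconds (~500 ns end-to-end load latency)

bytes_in_flight = HBM_BANDWIDTH * HBM_LATENCY
cache_lines_in_flight = bytes_in_flight / 128  # 128-byte cache line

print(f"{bytes_in_flight:,.0f} bytes in flight")
print(f"{cache_lines_in_flight:,.0f} cache lines in flight")
```

With these numbers, on the order of a megabyte of loads (thousands of cache lines) must be outstanding at once, which is why GPUs rely on massive warp-level parallelism rather than per-thread latency reduction.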
-
Roofline Model (Speed of Light Diagram)
HTML
Mermaid roofline diagram: attainable performance vs. arithmetic intensity, with memory-bound and compute-bound regimes (H100-style)
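The roofline relation itself is one line of arithmetic: attainable throughput is the lower of the compute roof and the memory roof (bandwidth times arithmetic intensity). A minimal sketch, with roughly H100-class FP16 roofs that are illustrative, not exact specs:

```python
def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline model: performance is capped by the lower of the compute
    roof and the memory roof (bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity)

# Illustrative FP16-class roofs, not exact H100 figures.
PEAK = 990_000  # GFLOP/s (~990 dense FP16 TFLOP/s)
BW = 3_350      # GB/s (~3.35 TB/s HBM3)

ridge = PEAK / BW  # intensity (FLOP/byte) where the two roofs meet
for ai in (1, 10, ridge, 1000):
    print(f"AI={ai:.1f} FLOP/B -> {attainable_gflops(ai, PEAK, BW):,.0f} GFLOP/s")
```

Kernels left of the ridge point (here roughly 300 FLOP/byte) are memory bound; kernels right of it are compute bound.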
-
Tensor Memory Accelerator (TMA)
HTML
Data path with/without TMA, NCU metrics, kernel names, and performance benefits (Hopper+)
-
CUDA Performance Optimization Samples
HTML
Detailed explanations of CUDA performance samples: alignedTypes, transpose, UnifiedMemoryPerf, and cudaGraphsPerfScaling
-
CUDA Programming Model Refresher
EXTERNAL
External blog: a comprehensive guide to CUDA programming fundamentals and best practices
PyTorch
PyTorch framework optimization, memory management, and performance tuning for training and inference.
Profiling
GPU profiling tools, trace capture techniques, and performance analysis workflows for containerized ML workloads.
-
Nsight Systems & CUPTI: Installation and Container Mounting
HTML
How to download, install, and mount custom nsys and CUPTI versions into containers for GPU profiling
-
Nsight Compute Metrics Reference Guide
HTML
Complete reference for NCU source-level metrics: instruction execution, warp stall statistics, memory access patterns, and register dependencies
-
GPU Performance Bottlenecks: Diagnosis & Remedies
HTML
Memory-bound vs. compute-bound vs. latency-bound vs. underutilized GPU: profiler indicators and detailed optimization strategies
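The triage that entry describes can be sketched as a small classifier over profiler-style rates. Everything here is a hypothetical helper with illustrative peak numbers and an arbitrary 10% utilization floor, not the guide's actual method: it compares a kernel's arithmetic intensity to the machine balance (ridge point) and flags kernels far below both roofs as latency bound or underutilized.

```python
def classify(flops_per_s: float, bytes_per_s: float,
             peak_flops: float, peak_bw: float,
             floor: float = 0.10) -> str:
    """Coarse bottleneck triage from measured rates (hypothetical helper)."""
    compute_frac = flops_per_s / peak_flops
    memory_frac = bytes_per_s / peak_bw
    if max(compute_frac, memory_frac) < floor:
        # Neither roof is close to saturated: stalls, not throughput, dominate.
        return "latency bound / underutilized"
    intensity = flops_per_s / bytes_per_s   # FLOP per byte of DRAM traffic
    ridge = peak_flops / peak_bw            # machine balance
    return "compute bound" if intensity >= ridge else "memory bound"

PEAK_FLOPS = 990e12  # FLOP/s, illustrative
PEAK_BW = 3.35e12    # bytes/s, illustrative

print(classify(900e12, 1.0e12, PEAK_FLOPS, PEAK_BW))  # compute bound
print(classify(50e12, 3.0e12, PEAK_FLOPS, PEAK_BW))   # memory bound
print(classify(10e12, 0.1e12, PEAK_FLOPS, PEAK_BW))   # latency bound / underutilized
```

In practice the inputs would come from profiler counters (e.g. Nsight Compute's achieved FLOP and DRAM throughput metrics), and the latency-bound case would be confirmed via warp stall statistics.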
-
Advanced Optimization & Profiling Techniques for LLM Training
HTML
AMP/mixed precision, CPU offloading, ZeRO, CUDA Graphs, KV cache offload, selective profiling (NVIDIA Grace Hopper)