AI Performance Design Guide
A comprehensive collection of system design resources, architectural patterns, and implementation guides.
Performance
Performance analysis, optimization techniques, profiling tools, and fundamental performance principles governing distributed systems.
Model Training and Inference
Machine learning model training workflows, memory management, and inference optimization techniques.
Model Optimization
Quantization, sparsity, and optimization techniques for accelerating LLM and diffusion model inference with NVIDIA TensorRT-LLM.
AI Infrastructure
High-performance computing, GPU architectures, and ML system infrastructure.
-
Comparison of NVIDIA Data Center Chips
HTML
Feature comparison of H100, H200, GH200, B200, and GB200: architecture, FP16 TFLOPS, VRAM, CPU, CPU memory, and CPU-to-GPU link
-
NVIDIA H100 NVLink Architecture
MD
Complete H100 system architecture with NVLink topology and specifications
-
H100 NVLink Topology
MD
Detailed NVLink connection topology for H100 GPU clusters
-
H100 Streaming Multiprocessor Architecture
MD
Internal architecture of H100 streaming multiprocessors
-
CPU vs GPU Architecture
HTML
Comparative analysis of CPU and GPU architectures for ML workloads
-
High-Speed Interconnects
HTML
InfiniBand, NVLink, and other high-speed interconnect technologies
-
InfiniBand Protocol Explorer
HTML
Interactive guide to InfiniBand queue pairs and packet transmission
-
Nikel Network Analysis
HTML
Network infrastructure and performance analysis tools
-
NVMe-oF Shared File Systems
HTML
NVMe over Fabrics for distributed storage systems
-
Blueprint for Modern ML Systems
PDF
Comprehensive blueprint for designing high-performance machine learning systems
-
DeepSeek V3 Technical Paper
PDF
Technical documentation and architecture details for DeepSeek V3 model
-
Fire-Flyer AI-HPC Architecture
PDF
High-performance computing architecture for AI workloads
-
DeepSeek V3 Scaling Insights
PDF
Analysis of scaling challenges and hardware considerations for DeepSeek V3
CUDA Programming
CUDA programming concepts, memory hierarchy, execution models, and development resources.
-
CUDA Execution & Memory Hierarchy
MD
Grid, block, warp, and thread hierarchy, with register allocation and warp scheduling explained
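The launch-configuration arithmetic behind that hierarchy can be sketched in a few lines. This is a minimal illustration, not code from the linked guide; the 256-threads-per-block choice and the problem size are hypothetical, while the 32-thread warp size is fixed on all current NVIDIA GPUs.

```python
import math

WARP_SIZE = 32  # fixed warp width on all current NVIDIA GPUs

def launch_config(n_elements: int, threads_per_block: int = 256) -> dict:
    """Compute grid/block/warp counts for a 1-D elementwise kernel."""
    blocks = math.ceil(n_elements / threads_per_block)
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    return {
        "blocks": blocks,
        "threads_per_block": threads_per_block,
        "warps_per_block": warps_per_block,
        "total_warps": blocks * warps_per_block,
    }

cfg = launch_config(1_000_000)
print(cfg)  # 3907 blocks, 8 warps per block, 31256 warps total
```

One over-full last block is the usual cost of the ceiling division; kernels guard against it with a bounds check on the global thread index.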
-
CUDA Programming Model
HTML
CUDA programming concepts, memory hierarchy, and execution model
-
GPU Memory Hierarchy
HTML
Interactive guide to GPU cache levels, cache lines, access patterns, and bandwidth optimization
-
Life of a Memory Request: HBM to Register
HTML
How a load travels through the memory hierarchy: a Mermaid diagram with sizes, latencies, and throughput figures (H100-oriented)
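The latency and bandwidth figures in such a walkthrough imply, via Little's law, how much data must be in flight to keep HBM saturated. A back-of-the-envelope sketch, using round H100-order-of-magnitude numbers (~3 TB/s HBM bandwidth, ~500 ns load latency) that are illustrative rather than exact specs:

```python
# Little's law: required concurrency = bandwidth * latency.
# Both constants are illustrative, not exact H100 specifications.
HBM_BANDWIDTH = 3.0e12  # bytes/s (~3 TB/s)
HBM_LATENCY = 500e-9    # seconds (~500 ns end-to-end load latency)

bytes_in_flight = HBM_BANDWIDTH * HBM_LATENCY
cache_lines_in_flight = bytes_in_flight / 128  # 128-byte cache line

print(f"{bytes_in_flight:,.0f} bytes in flight")
print(f"{cache_lines_in_flight:,.0f} cache lines in flight")
```

With these numbers, on the order of a megabyte of loads (thousands of cache lines) must be outstanding at once, which is why GPUs rely on massive warp-level parallelism rather than per-thread latency reduction.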
-
Roofline Model (Speed of Light Diagram)
HTML
Mermaid roofline diagram: attainable performance vs. arithmetic intensity, with memory-bound and compute-bound regimes (H100-style)
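The roofline relation itself is one line of arithmetic: attainable throughput is the lower of the compute roof and the memory roof (bandwidth times arithmetic intensity). A minimal sketch, with roughly H100-class FP16 roofs that are illustrative, not exact specs:

```python
def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline model: performance is capped by the lower of the compute
    roof and the memory roof (bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity)

# Illustrative FP16-class roofs, not exact H100 figures.
PEAK = 990_000  # GFLOP/s (~990 dense FP16 TFLOP/s)
BW = 3_350      # GB/s (~3.35 TB/s HBM3)

ridge = PEAK / BW  # intensity (FLOP/byte) where the two roofs meet
for ai in (1, 10, ridge, 1000):
    print(f"AI={ai:.1f} FLOP/B -> {attainable_gflops(ai, PEAK, BW):,.0f} GFLOP/s")
```

Kernels left of the ridge point (here roughly 300 FLOP/byte) are memory bound; kernels right of it are compute bound.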
-
Tensor Memory Accelerator (TMA)
HTML
Data path with/without TMA, NCU metrics, kernel names, and performance benefits (Hopper+)
-
CUDA Performance Optimization Samples
HTML
Detailed explanations of CUDA performance samples: alignedTypes, transpose, UnifiedMemoryPerf, and cudaGraphsPerfScaling
-
CUDA Programming Model Refresher
EXTERNAL
External blog: a comprehensive guide to CUDA programming fundamentals and best practices
PyTorch
PyTorch framework optimization, memory management, and performance tuning for training and inference.
Profiling
GPU profiling tools, trace capture techniques, and performance analysis workflows for containerized ML workloads.
-
Nsight Systems & CUPTI: Installation and Container Mounting
HTML
How to download, install, and mount custom nsys and CUPTI versions into containers for GPU profiling
-
Nsight Compute Metrics Reference Guide
HTML
Complete reference for NCU source-level metrics: instruction execution, warp stall statistics, memory access patterns, and register dependencies
-
GPU Performance Bottlenecks: Diagnosis & Remedies
HTML
Memory-bound vs. compute-bound vs. latency-bound vs. underutilized GPU: profiler indicators and detailed optimization strategies
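The triage that entry describes can be sketched as a small classifier over profiler-style rates. Everything here is a hypothetical helper with illustrative peak numbers and an arbitrary 10% utilization floor, not the guide's actual method: it compares a kernel's arithmetic intensity to the machine balance (ridge point) and flags kernels far below both roofs as latency bound or underutilized.

```python
def classify(flops_per_s: float, bytes_per_s: float,
             peak_flops: float, peak_bw: float,
             floor: float = 0.10) -> str:
    """Coarse bottleneck triage from measured rates (hypothetical helper)."""
    compute_frac = flops_per_s / peak_flops
    memory_frac = bytes_per_s / peak_bw
    if max(compute_frac, memory_frac) < floor:
        # Neither roof is close to saturated: stalls, not throughput, dominate.
        return "latency bound / underutilized"
    intensity = flops_per_s / bytes_per_s   # FLOP per byte of DRAM traffic
    ridge = peak_flops / peak_bw            # machine balance
    return "compute bound" if intensity >= ridge else "memory bound"

PEAK_FLOPS = 990e12  # FLOP/s, illustrative
PEAK_BW = 3.35e12    # bytes/s, illustrative

print(classify(900e12, 1.0e12, PEAK_FLOPS, PEAK_BW))  # compute bound
print(classify(50e12, 3.0e12, PEAK_FLOPS, PEAK_BW))   # memory bound
print(classify(10e12, 0.1e12, PEAK_FLOPS, PEAK_BW))   # latency bound / underutilized
```

In practice the inputs would come from profiler counters (e.g. Nsight Compute's achieved FLOP and DRAM throughput metrics), and the latency-bound case would be confirmed via warp stall statistics.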
-
Advanced Optimization & Profiling Techniques for LLM Training
HTML
AMP/mixed precision, CPU offloading, ZeRO, CUDA Graphs, KV cache offload, selective profiling (NVIDIA Grace Hopper)