Which Tool When?
An interactive guide to the synergy between NVIDIA Nsight Systems and the PyTorch Profiler.
Executive Summary
The "cooperation" between NVIDIA Nsight Systems (`nsys`) and the PyTorch Profiler isn't a direct integration. They operate at different levels: `nsys` gives a system-wide hardware view, while the PyTorch Profiler provides a framework-centric software view. True synergy comes from using **NVTX ranges**, which are semantic markers that connect high-level application logic (e.g., "forward pass") to low-level system events in the `nsys` timeline. The optimal strategy is a two-stage approach: first, use the **PyTorch Profiler** to find expensive operators in your model. Then, use **`nsys`** (ideally with the modern `--pytorch` flag) to do a deep-dive system analysis and understand *why* those operators are slow (e.g., CPU waits, data stalls, or GPU underutilization).
A Tale of Two Profilers
Nsight Systems and the PyTorch Profiler are designed to answer different questions at different levels of abstraction. This section helps you understand their unique strengths and when to use each one.
NVIDIA Nsight Systems (`nsys`)
The **system-wide detective**. It provides a holistic timeline of interactions between your application, the OS, and the hardware (CPUs/GPUs).
- ✓ Primary Use: Identifying system-level bottlenecks like I/O waits, CPU-GPU synchronization stalls, or scheduling latency.
- ✓ Answers: "Why is my GPU idle?" or "Is my data loading pipeline the bottleneck?"
- ✓ Level of Detail: Low-level (CUDA kernels, API calls, driver events, OS threads).
PyTorch Profiler
The **framework-specific analyst**. It's built into PyTorch and gives performance insights in the context of your model's code.
- ✓ Primary Use: Optimizing model architecture and identifying expensive PyTorch operators.
- ✓ Answers: "Which part of my model is slow?" or "How much memory does this layer use?"
- ✓ Level of Detail: High-level (PyTorch operators, `nn.Module` layers, Python stack traces).
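As a sketch of this first stage, a minimal `torch.profiler` run that ranks operators by time. The model and sizes are arbitrary; add `ProfilerActivity.CUDA` to the activities list when running on a GPU:

```python
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)
x = torch.randn(32, 128)

# CPU-only here so the sketch runs anywhere; add ProfilerActivity.CUDA on a GPU.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    y = model(x)

# Rank operators by total CPU time to find the most expensive ones.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```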
Key PyTorch Threads in Nsight Systems
When you open an nsys timeline for a PyTorch training job, two threads tell most of the story. Understanding their roles is essential for diagnosing performance issues.
pt_main_thread
The orchestrator. This is your Python main thread — it drives the entire training loop: data loading, kernel launches, optimizer steps, and communication between system components.
What to look for
- ● Green bars = active GPU work initiated by this thread. High utilization is good.
- ● Gray gaps = GPU idle periods. These reveal stalls caused by: slow data loading on CPU, CPU-GPU synchronization issues, or insufficient compute/communication overlap.
- ● Orange bars = CUDA memory transfers (`cudaMemcpy`). Frequent transfers indicate excessive CPU↔GPU data movement — a common bottleneck.
Simplified Timeline
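A common source of those orange `cudaMemcpy` bars is building tensors on the CPU and then copying them over, one tensor at a time. A minimal illustration (the tensor contents and sizes are arbitrary):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Anti-pattern: construct on the CPU, then copy -- each .to(device) call
# shows up as a separate cudaMemcpy in the nsys timeline.
mask_slow = torch.ones(1024, 1024).to(device)

# Better: allocate directly on the target device -- no host-to-device copy at all.
mask_fast = torch.ones(1024, 1024, device=device)
```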
pt_autograd_0
The gradient engine. This thread runs PyTorch's autograd — automatic differentiation for computing gradients during backpropagation. It launches the backward-pass kernels asynchronously from the main thread.
What to look for
- ● Green sections = active gradient computation. The autograd engine is doing useful work.
- ● Brown dotted sections = thread preemption or context switching. The thread is paused, possibly due to resource contention or OS scheduling.
- ● `pthread_cond_wait` blocks = the CPU thread is blocked, waiting for GPU kernels to finish. Frequent/long waits are a synchronization bottleneck.
Simplified Timeline
Optimization Strategies Based on Thread Analysis
Overlap Computation & Communication
Reduce synchronization points by performing gradient all-reduce concurrently with backward-pass computation. DDP does this bucketed overlap by default; its `gradient_as_bucket_view=True` option further cuts gradient-copy overhead.
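A hedged sketch of the DDP setup: single-process `gloo` on CPU so it runs anywhere, whereas a real job would be launched via `torchrun` with NCCL and one process per GPU. The address, port, and model dimensions are arbitrary:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal single-process group so the example is self-contained.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29512")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(
    torch.nn.Linear(128, 64),
    gradient_as_bucket_view=True,  # gradients alias the all-reduce buckets (no extra copy)
)

# DDP overlaps the bucketed all-reduce with this backward pass automatically.
out = model(torch.randn(32, 128))
out.sum().backward()
```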
Reduce pthread_cond_wait
Frequent waits in `pt_autograd_0` mean the CPU is blocked on the GPU. Check for implicit synchronizations such as `.item()`, `print(tensor)`, or per-step loss logging, each of which forces a device sync.
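One way to avoid the per-step sync is to accumulate the loss on-device and call `.item()` only at the logging interval. The loss values and interval below are placeholders:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

running_loss = torch.zeros(1, device=device)
for step in range(100):
    loss = torch.rand(1, device=device)  # stand-in for a real training loss
    # Calling loss.item() here would force a host-device sync every iteration;
    # accumulating on-device defers the sync to the logging interval.
    running_loss += loss.detach()
    if (step + 1) % 50 == 0:
        print(f"step {step}: avg loss {(running_loss / 50).item():.4f}")  # one sync per 50 steps
        running_loss.zero_()
```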
Minimize GPU Idle in Main Thread
Gray gaps on the GPU row that align with data loading on `pt_main_thread` mean the DataLoader is starving the GPU. Increase `num_workers` or use `pin_memory=True`.
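A typical DataLoader configuration along these lines. The dataset, batch size, and worker count are illustrative; tune `num_workers` toward your CPU core count:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=2,            # parallel CPU-side loading keeps the GPU fed
    pin_memory=True,          # page-locked host buffers enable async H2D copies
    persistent_workers=True,  # keep workers alive across epochs
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for xb, yb in loader:
    # non_blocking only overlaps the copy with compute when the source is pinned
    xb = xb.to(device, non_blocking=True)
    break
```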
Source: Profiling LLM Training Workflows on NVIDIA Grace Hopper — NVIDIA Technical Blog (May 2025)
Interactive Profiling Workflow
Don't guess at what to do. Start from your performance goal: each goal below maps to a recommended tool, a command, and what to look for in the analysis.
Strategic Recommendations
Use this decision framework to guide your performance engineering efforts.
If your goal is to find the most expensive PyTorch operator...
Then start with: The PyTorch Profiler and its TensorBoard view. It will directly point to the operators consuming the most CPU or CUDA time.
If your goal is to understand *why* an operator is slow...
Then use: Nsight Systems with the `--pytorch=autograd-nvtx` flag. This gives you the system-level context to diagnose the root cause (e.g., I/O bound, sync bound).
If your goal is to measure a specific code block...
Then use: Nsight Systems with `--capture-range=cudaProfilerApi` after bracketing your code with `cudaProfilerStart()` and `cudaProfilerStop()` calls.
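In PyTorch, this bracketing can be done from Python via `torch.cuda.profiler.start()`/`stop()`, which wrap `cudaProfilerStart()`/`cudaProfilerStop()`. The `profiled_step` name is illustrative, and the guard keeps the sketch runnable without a GPU:

```python
import torch

# Run under:
#   nsys profile --capture-range=cudaProfilerApi --capture-range-end=stop python train.py
# so nsys records only the region between start() and stop().
def profiled_step(model, batch):
    if torch.cuda.is_available():
        torch.cuda.profiler.start()   # cudaProfilerStart(): nsys begins recording
    out = model(batch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()      # ensure the launched kernels are captured
        torch.cuda.profiler.stop()    # cudaProfilerStop(): nsys stops recording
    return out
```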
If your goal is to debug a suspected CPU-bound issue...
Then use: Nsight Systems with CPU sampling (`-s cpu`). Be aware this adds significant overhead and is for diagnostics, not benchmarking.