Which Tool When?
An interactive guide to the synergy between NVIDIA Nsight Systems and the PyTorch Profiler.
Executive Summary
The "cooperation" between NVIDIA Nsight Systems (`nsys`) and the PyTorch Profiler isn't a direct integration. They operate at different levels: `nsys` gives a system-wide hardware view, while the PyTorch Profiler provides a framework-centric software view. True synergy comes from using **NVTX ranges**, which are semantic markers that connect high-level application logic (e.g., "forward pass") to low-level system events in the `nsys` timeline. The optimal strategy is a two-stage approach: first, use the **PyTorch Profiler** to find expensive operators in your model. Then, use **`nsys`** (ideally with the modern `--pytorch` flag) to do a deep-dive system analysis and understand *why* those operators are slow (e.g., CPU waits, data stalls, or GPU underutilization).
A Tale of Two Profilers
Nsight Systems and the PyTorch Profiler are designed to answer different questions at different levels of abstraction. This section helps you understand their unique strengths and when to use each one.
NVIDIA Nsight Systems (`nsys`)
The **system-wide detective**. It provides a holistic timeline of interactions between your application, the OS, and the hardware (CPUs/GPUs).
- ✓ Primary Use: Identifying system-level bottlenecks like I/O waits, CPU-GPU synchronization stalls, or scheduling latency.
- ✓ Answers: "Why is my GPU idle?" or "Is my data loading pipeline the bottleneck?"
- ✓ Level of Detail: Low-level (CUDA kernels, API calls, driver events, OS threads).
PyTorch Profiler
The **framework-specific analyst**. It's built into PyTorch and gives performance insights in the context of your model's code.
- ✓ Primary Use: Optimizing model architecture and identifying expensive PyTorch operators.
- ✓ Answers: "Which part of my model is slow?" or "How much memory does this layer use?"
- ✓ Level of Detail: High-level (PyTorch operators, `nn.Module` layers, Python stack traces).
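As a sketch of this first stage, a minimal `torch.profiler` run that ranks operators by time. The model and sizes are arbitrary; add `ProfilerActivity.CUDA` to the activities list when running on a GPU:

```python
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)
x = torch.randn(32, 128)

# CPU-only here so the sketch runs anywhere; add ProfilerActivity.CUDA on a GPU.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    y = model(x)

# Rank operators by total CPU time to find the most expensive ones.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```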
Key PyTorch Threads in Nsight Systems
When you open an nsys timeline for a PyTorch training job, two threads tell most of the story. Understanding their roles is essential for diagnosing performance issues.
pt_main_thread
The orchestrator. This is your Python main thread — it drives the entire training loop: data loading, kernel launches, optimizer steps, and communication between system components.
What to look for
- ● Green bars = active GPU work initiated by this thread. High utilization is good.
- ● Gray gaps = GPU idle periods. These reveal stalls caused by: slow data loading on CPU, CPU-GPU synchronization issues, or insufficient compute/communication overlap.
- ● Orange bars = CUDA memory transfers (`cudaMemcpy`). Frequent transfers indicate excessive CPU↔GPU data movement — a common bottleneck.
Simplified Timeline
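A common source of those orange `cudaMemcpy` bars is building tensors on the CPU and then copying them over, one tensor at a time. A minimal illustration (the tensor contents and sizes are arbitrary):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Anti-pattern: construct on the CPU, then copy -- each .to(device) call
# shows up as a separate cudaMemcpy in the nsys timeline.
mask_slow = torch.ones(1024, 1024).to(device)

# Better: allocate directly on the target device -- no host-to-device copy at all.
mask_fast = torch.ones(1024, 1024, device=device)
```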
pt_autograd_0
The gradient engine. This thread runs PyTorch's autograd — automatic differentiation for computing gradients during backpropagation. It launches the backward-pass kernels asynchronously from the main thread.
What to look for
- ● Green sections = active gradient computation. The autograd engine is doing useful work.
- ● Brown dotted sections = thread preemption or context switching. The thread is paused, possibly due to resource contention or OS scheduling.
- ● `pthread_cond_wait` blocks = the CPU thread is blocked, waiting for GPU kernels to finish. Frequent/long waits are a synchronization bottleneck.
Simplified Timeline
Optimization Strategies Based on Thread Analysis
Overlap Computation & Communication
Reduce synchronization points by performing gradient all-reduce concurrently with backward-pass computation. DDP does this bucketed overlap by default; its `gradient_as_bucket_view=True` option further cuts gradient-copy overhead.
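A hedged sketch of the DDP setup: single-process `gloo` on CPU so it runs anywhere, whereas a real job would be launched via `torchrun` with NCCL and one process per GPU. The address, port, and model dimensions are arbitrary:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal single-process group so the example is self-contained.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29512")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(
    torch.nn.Linear(128, 64),
    gradient_as_bucket_view=True,  # gradients alias the all-reduce buckets (no extra copy)
)

# DDP overlaps the bucketed all-reduce with this backward pass automatically.
out = model(torch.randn(32, 128))
out.sum().backward()
```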
Reduce pthread_cond_wait
Frequent waits in `pt_autograd_0` mean the CPU is blocked on the GPU. Check for implicit synchronizations such as `.item()`, `print(tensor)`, or per-step loss logging, each of which forces a device sync.
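One way to avoid the per-step sync is to accumulate the loss on-device and call `.item()` only at the logging interval. The loss values and interval below are placeholders:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

running_loss = torch.zeros(1, device=device)
for step in range(100):
    loss = torch.rand(1, device=device)  # stand-in for a real training loss
    # Calling loss.item() here would force a host-device sync every iteration;
    # accumulating on-device defers the sync to the logging interval.
    running_loss += loss.detach()
    if (step + 1) % 50 == 0:
        print(f"step {step}: avg loss {(running_loss / 50).item():.4f}")  # one sync per 50 steps
        running_loss.zero_()
```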
Minimize GPU Idle in Main Thread
Gray gaps on the GPU row that align with data loading on `pt_main_thread` mean the DataLoader is starving the GPU. Increase `num_workers` or use `pin_memory=True`.
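A typical DataLoader configuration along these lines. The dataset, batch size, and worker count are illustrative; tune `num_workers` toward your CPU core count:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=2,            # parallel CPU-side loading keeps the GPU fed
    pin_memory=True,          # page-locked host buffers enable async H2D copies
    persistent_workers=True,  # keep workers alive across epochs
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for xb, yb in loader:
    # non_blocking only overlaps the copy with compute when the source is pinned
    xb = xb.to(device, non_blocking=True)
    break
```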
Source: Profiling LLM Training Workflows on NVIDIA Grace Hopper — NVIDIA Technical Blog (May 2025)
Interactive Profiling Workflow
Don't guess at what to do. Start from your performance goal: each goal below maps to a recommended tool, a command, and what to look for in the analysis.
Strategic Recommendations
Use this decision framework to guide your performance engineering efforts.
If your goal is to find the most expensive PyTorch operator...
Then start with: The PyTorch Profiler and its TensorBoard view. It will directly point to the operators consuming the most CPU or CUDA time.
If your goal is to understand *why* an operator is slow...
Then use: Nsight Systems with the `--pytorch=autograd-nvtx` flag. This gives you the system-level context to diagnose the root cause (e.g., I/O bound, sync bound).
If your goal is to measure a specific code block...
Then use: Nsight Systems with `--capture-range=cudaProfilerApi` after bracketing your code with `cudaProfilerStart()` and `cudaProfilerStop()` calls.
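In PyTorch, this bracketing can be done from Python via `torch.cuda.profiler.start()`/`stop()`, which wrap `cudaProfilerStart()`/`cudaProfilerStop()`. The `profiled_step` name is illustrative, and the guard keeps the sketch runnable without a GPU:

```python
import torch

# Run under:
#   nsys profile --capture-range=cudaProfilerApi --capture-range-end=stop python train.py
# so nsys records only the region between start() and stop().
def profiled_step(model, batch):
    if torch.cuda.is_available():
        torch.cuda.profiler.start()   # cudaProfilerStart(): nsys begins recording
    out = model(batch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()      # ensure the launched kernels are captured
        torch.cuda.profiler.stop()    # cudaProfilerStop(): nsys stops recording
    return out
```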
If your goal is to debug a suspected CPU-bound issue...
Then use: Nsight Systems with CPU sampling (`-s cpu`). Be aware this adds significant overhead and is for diagnostics, not benchmarking.