Nsight Compute Overview
Nsight Compute (NCU) is NVIDIA's GPU kernel profiler that provides detailed performance reports showing how kernels utilize GPU resources. This guide covers the metrics available on the Source Page and Details Page, with a focus on what each metric means and how to use it.
Details Page
High-level kernel metrics, speed-of-light analysis, and guided recommendations
Source Page
Per-line metrics correlated between high-level code and SASS assembly
Guided Analysis
Automatic detection of performance issues with speedup estimates
Source Page Features
Supported Languages
High-Level Languages
- CUDA-C/C++
- Fortran
- Python (with CUDA)
Low-Level Assembly
- PTX - Intermediate representation
- SASS - GPU assembly (always available; the level at which metrics are collected)
Source Correlation
Click any line in CUDA-C to see corresponding SASS instructions (and vice versa). Uses Line Table Information to correlate languages.
Visual Features
- Heatmap/Minimap - Color gradient showing metric intensity across the code
- Navigation buttons - Jump to the highest/next value quickly
- Per-file statistics - Collapse the view to see aggregated metrics per source file
- Inline function breakdown - Split metrics by call site for inlined functions
Instructions Executed Metrics (Counter-Based)
These metrics tell you what code is executed and how efficiently (in terms of active threads).
Instructions Executed (Warps)
inst_executed
How many warps executed this instruction. For a kernel with 101 blocks of 32 threads (1 warp each), expect 101 warps per instruction (if no branching).
Instructions Executed (Threads)
inst_executed_threads
Total thread count executing this instruction. With 101 warps of 32 threads: 101 × 32 = 3,232 threads expected.
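As a quick sanity check, the expected values of both counters can be computed by hand. The launch configuration below is the hypothetical 101-block example from above, not output from a real report:

```python
# Expected per-instruction counts for a launch of 101 blocks x 32 threads
# (1 warp per block), assuming no branching or early exit.
blocks = 101
threads_per_block = 32
warps_per_block = threads_per_block // 32

inst_executed = blocks * warps_per_block      # warps that execute the instruction
inst_executed_threads = inst_executed * 32    # threads that execute it

print(inst_executed)           # 101
print(inst_executed_threads)   # 3232
```

If the reported inst_executed is higher than this estimate, suspect loops or re-execution; if inst_executed_threads is lower than 32× the warp count, suspect divergence or predication.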
Predicated On Executed
inst_executed_pred_on
Thread count whose predicate mask was on when the instruction executed. Predication implements conditional execution (if statements) without branching.
Average Threads Executed per Warp
avg_threads_executed
Average active threads when executing. 32 = full warp, lower = divergence/predication. Key efficiency indicator!
Example: if (threadIdx.x < 24) → average = 24 for instructions inside the if-block
Average Predicated Threads per Warp
avg_pred_threads_executed
Same as above but accounting for predication masks. Shows true active participation.
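The divergence example above can be reproduced arithmetically. This is a sketch of how the average is formed for one fully populated warp, using the hypothetical threadIdx.x < 24 guard:

```python
# Average active threads per warp for an instruction guarded by
# `if (threadIdx.x < 24)` in a 32-thread warp.
warp_size = 32
active = sum(1 for lane in range(warp_size) if lane < 24)

# Inside the if-block only 24 lanes participate, so avg_threads_executed
# is reported as 24 instead of the full 32.
print(active)  # 24
print(active / warp_size)  # 0.75 -> 25% of lanes idle on every execution
```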
⚠️ Compiler Optimizations Can Surprise You
Some conditions may not show reduced thread counts because the compiler used FSEL (floating-point select)
instead of predicated instructions. All 32 threads execute FSEL, just with different operands selected.
Always check generated SASS to understand the metrics.
Warp Stall Statistics (PC Sampling)
Collected via statistical sampling (not exact counts). Shows why warps couldn't make forward progress.
Memory-Related Stalls
MIO Throttle
The Memory Input/Output pipeline is overfilled: too many memory instructions have been issued and the pipeline can't accept more. Common when a kernel is heavily memory-bound.
Long Scoreboard
Waiting for a memory dependency to resolve. The instruction needs data from a previous load that hasn't completed yet. The most common memory stall.
float x = data[i]; followed by float y = x * 2; the multiply waits for the load
LG Throttle
Local/Global memory throttle. Similar to MIO Throttle, but specifically for local/global memory operations.
Synchronization Stalls
Barrier
Waiting at __syncthreads() for the other threads in the block to arrive.
Scheduling States
Not Selected
The warp was ready to execute but wasn't chosen by the scheduler this cycle. Normal behavior; it doesn't indicate a problem.
Selected
The warp actually issued an instruction this cycle. This is the "good" state: the warp made forward progress.
PC Sampling: How It Works
SM Sampling Algorithm
Every N cycles: randomly pick 1 warp scheduler per SM, then randomly pick 1 active warp → record PC + stall reason
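The selection step can be sketched as a tiny simulation. Everything here is illustrative: the scheduler count, the warp state table, and the stall-reason strings are made-up stand-ins, not NCU internals:

```python
import random

def sample_once(schedulers_per_sm, active_warps):
    """One PC sample on one SM: pick a random warp scheduler, then a
    random active warp on it, and record that warp's (PC, stall reason).
    `active_warps` maps scheduler id -> list of (pc, stall_reason)."""
    sched = random.randrange(schedulers_per_sm)
    warps = active_warps[sched]
    if not warps:
        return None  # no active warp on the chosen scheduler this cycle
    return random.choice(warps)

# Hypothetical state: scheduler 0 has two resident warps, scheduler 1 has one.
state = {0: [(0x90, "long_scoreboard"), (0xA0, "selected")],
         1: [(0xB0, "barrier")]}
print(sample_once(2, state))
```

Because the pick is random, hot instructions accumulate samples in proportion to how long warps sit at them, which is what makes the per-line stall counts statistically meaningful.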
Sampling Metrics
gpu__time_duration
device__attribute_num_sm
smsp__pcsamp_interval
smsp__pcsamp_sample_count
Estimating Sample Count
Useful for configuring sample interval when you want a specific sample density.
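A back-of-the-envelope estimate ties the metrics above together: samples scale with kernel duration, SM count, and the sampling interval. The helper below is a sketch; the clock rate, SM count, and interval are hypothetical inputs, and the real smsp__pcsamp_interval is chosen by NCU unless overridden:

```python
def estimate_sample_count(duration_ns, clock_ghz, num_sms, interval_cycles):
    """Rough estimate of smsp__pcsamp_sample_count: each SM takes one
    sample every `interval_cycles` cycles for the kernel's duration."""
    cycles = duration_ns * clock_ghz            # ns * (cycles per ns)
    return int(cycles / interval_cycles * num_sms)

# e.g. a 1 ms kernel at 1.5 GHz on 80 SMs, sampling every 2048 cycles
print(estimate_sample_count(1_000_000, 1.5, 80, 2048))
```

If the estimate is too low for a meaningful per-line breakdown, decrease the interval (or run the kernel longer); if it is huge, a larger interval reduces collection overhead.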
Memory Metrics
Excessive Sectors in L2
l2_excessive_sectors
Memory accesses that fetched more L2 cache sectors than needed. Indicates non-coalesced access patterns.
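The "excessive" count is the gap between sectors actually fetched and sectors truly needed. The sketch below counts distinct 32-byte sectors touched by one warp of 4-byte loads; the stride values are hypothetical examples:

```python
SECTOR = 32  # bytes per cache sector

def sectors_touched(byte_addresses):
    """Distinct 32-byte sectors fetched for one warp-wide access."""
    return len({addr // SECTOR for addr in byte_addresses})

warp = range(32)
coalesced = [lane * 4 for lane in warp]       # unit-stride float loads
strided   = [lane * 4 * 8 for lane in warp]   # stride-8 float loads

ideal = 32 * 4 // SECTOR                      # 4 sectors of payload needed
print(sectors_touched(coalesced) - ideal)     # 0 excessive sectors
print(sectors_touched(strided) - ideal)       # 28 excessive sectors
```

The strided warp drags in 32 sectors to use 128 bytes of data, which is exactly the pattern this metric flags.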
Excessive Sectors in L1
l1_excessive_sectors
Same as above but for L1 cache. High values indicate inefficient memory access patterns.
Excessive Memory Transactions
memory_excessive_transactions
Total count of "wasted" memory transactions. Quick way to find inefficient load/store instructions.
Shared Memory Bank Conflicts
shared_bank_conflicts
Access pattern causes multiple threads to hit same memory bank. Serializes memory accesses.
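The serialization cost can be estimated from the index pattern alone. This sketch maps each lane's shared-memory word index to one of 32 four-byte banks (the layout on current NVIDIA GPUs) and reports the worst contention; the strides are illustrative:

```python
BANKS = 32  # 4-byte banks

def max_conflict_way(word_indices):
    """Worst-case serialization for one warp's shared-memory access:
    the number of lanes mapping to the most-contended bank."""
    counts = {}
    for idx in word_indices:
        bank = idx % BANKS
        counts[bank] = counts.get(bank, 0) + 1
    return max(counts.values())

warp = range(32)
print(max_conflict_way([lane for lane in warp]))       # 1: conflict-free
print(max_conflict_way([lane * 2 for lane in warp]))   # 2-way conflict
print(max_conflict_way([lane * 32 for lane in warp]))  # 32-way conflict
```

A common fix for the stride-32 case (e.g. column access of a shared tile) is padding the row length to 33 words so consecutive rows start in different banks.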
Memory Instruction Types
The Source Page shows instruction types in the SASS column:
LDG - Global Load
STG - Global Store
LDS - Shared Load
STS - Shared Store
💡 Normalize by Execution Count
An instruction executed 1000× will have more excessive memory transactions than one executed 10×. Divide excessive transactions by execution count to find the worst access patterns.
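That normalization is a one-liner over exported per-instruction counters. The rows below are made-up values standing in for a Source Page export:

```python
# Hypothetical per-instruction counters from a Source Page export.
rows = [
    {"sass": "LDG.E R2, [R4]", "inst_executed": 1000, "excessive": 2000},
    {"sass": "LDG.E R6, [R8]", "inst_executed": 10,   "excessive": 280},
]

# Excessive transactions per execution: the second load is far worse
# per access even though its absolute count is lower.
for row in rows:
    row["per_exec"] = row["excessive"] / row["inst_executed"]

worst = max(rows, key=lambda r: r["per_exec"])
print(worst["sass"], worst["per_exec"])  # LDG.E R6, [R8] 28.0
```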
Register & Dependency Tracking
Live Registers
registers_live
Number of registers in use at each instruction. Uses max aggregation. High register pressure can limit occupancy.
Register Dependencies
The Source Page shows register read/write dependencies:
- Output registers - Which registers this instruction writes to
- Input registers - Which registers this instruction reads from
- Click to follow the dependency chain and find stall sources
Predicate Dependencies
Track predicate registers (P0, P1, etc.) used for conditional execution. Click to see where predicates are set and consumed.
🔗 Finding Dependency Chains
When you see Long Scoreboard stalls, use register dependencies to trace back:
1. Find instruction with high stall count
2. Check its input registers
3. Find instruction that writes to that register
4. That's likely your slow instruction (memory load, etc.)
Details Page: Aggregate Metrics (Hardware Counters)
These are kernel-wide aggregates from hardware counters (not sampling).
Speed of Light Throughput
Compute (SM) Throughput: percentage of peak compute throughput achieved
Memory Throughput: percentage of peak memory bandwidth achieved
The higher of the two indicates which resource is the bottleneck.
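A rough first-pass reading of the two numbers can be written down explicitly. The thresholds below are illustrative rules of thumb, not values NCU itself uses:

```python
def classify(compute_sol_pct, memory_sol_pct, high=80.0, low=20.0):
    """First-pass read of the Speed of Light pair (thresholds are
    illustrative): which resource, if any, is saturated?"""
    if compute_sol_pct >= high and compute_sol_pct >= memory_sol_pct:
        return "compute-bound"
    if memory_sol_pct >= high:
        return "memory-bound"
    if compute_sol_pct < low and memory_sol_pct < low:
        return "latency-bound"  # neither resource saturated: look at stalls
    return "mixed"

print(classify(30.0, 85.0))  # memory-bound
print(classify(12.0, 15.0))  # latency-bound
```

The latency-bound case is the one where the warp stall breakdown on the Source Page becomes the primary tool.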
Pipeline Utilization
Issue Slot Utilization
Percentage of cycles where warp schedulers issued instructions. Target: close to 100%
Warp Stall Reasons (Aggregated)
Same stall reasons as Source Page, but aggregated over entire kernel. Collected via hardware counters (not sampling).
Key difference: Hardware counters count every stall for every active warp, every cycle. Sum of all stall reasons = total active warp-cycles. Ratios between reasons match the sampled data.
Key Insights for Using NCU Metrics
1. Stalls Appear on Dependent Instructions
A slow load doesn't show stalls on itself—stalls appear on the instruction that uses the loaded value. A barrier shows stalls on the instruction after the barrier, not on the barrier itself.
2. Zero Stalls ≠ Perfect Kernel
The goal isn't zero stalls—it's high issue slot utilization. Some stalls are hidden by having enough warps. Only focus on stalls if your kernel is latency-limited (check Details Page first).
3. Check Generated Assembly
Click on source lines to see what assembly was actually generated. Loop unrolling, predication, and compiler optimizations can produce very different code than you expect. Metrics only make sense when you understand the generated code.
4. Use Per-File Statistics for Large Codebases
Collapse the Source Page to get per-file aggregates. Quickly identify which source files consume the most cycles or have the worst memory access patterns before diving into individual lines.
5. Inline Function Breakdown
When helper functions are inlined, all metrics aggregate to a single line. Use the inline function breakdown table to split metrics by call site and find which call is actually expensive.
Enabling Line Table Information
Required to correlate CUDA-C source with SASS assembly. Does NOT affect code optimization.
Compiler Flags
nvcc --generate-line-info ...
Enabled by default
Pass via JIT options
NCU Collection
Enable the Source Counters section (part of the "full" metric set):
ncu --set full -o report ./my_app
Inline Function Support
For inline function breakdown to work, compile with CUDA 11.8+ toolkit (when using NVCC).
Quick Reference: Metric Categories
| Category | Source | Use For | Key Metrics |
|---|---|---|---|
| Instruction | Counter | Execution count, thread efficiency | inst_executed, avg_threads_executed |
| Warp Stalls | Sampling | Find latency bottlenecks | Long Scoreboard, MIO Throttle, Barrier |
| Memory | Counter | Find inefficient access patterns | excessive_sectors, bank_conflicts |
| Registers | Counter | Dependency chains, pressure | registers_live, dependencies |