Nsight Compute Metrics

Complete Reference Guide for CUDA Kernel Profiling

Source-level metrics, warp stall analysis, and memory profiling

Nsight Compute Overview

Nsight Compute (NCU) is NVIDIA's GPU kernel profiler that provides detailed performance reports showing how kernels utilize GPU resources. This guide covers the metrics available on the Source Page and Details Page, with focus on understanding what each metric means and how to use them.

📊 Details Page

High-level kernel metrics, speed-of-light analysis, and guided recommendations

📝 Source Page

Per-line metrics correlated between high-level code and SASS assembly

🔍 Guided Analysis

Automatic detection of performance issues with speedup estimates

Source Page Features

Supported Languages

High-Level Languages
  • CUDA-C/C++
  • Fortran
  • Python (with CUDA)
Low-Level Assembly
  • PTX - Intermediate representation
  • SASS - GPU assembly (always available; this is the level at which metrics are collected)

Source Correlation

Click any line in CUDA-C to see corresponding SASS instructions (and vice versa). Uses Line Table Information to correlate languages.

High-Level → Low-Level
Click CUDA-C line to see all generated SASS instructions
Low-Level → High-Level
Click SASS instruction to see originating source line

Visual Features

  • Heatmap/Minimap - Color gradient showing metric intensity across code
  • Navigation buttons - Jump to highest/next value quickly
  • Per-file statistics - Collapse view to see aggregated metrics per source file
  • Inline function breakdown - Split metrics by call site for inlined functions

Instruction Executed Metrics (Counter-Based)

These metrics tell you what code is executed and how efficiently (in terms of active threads).

Instructions Executed (Warps)

inst_executed

How many warps executed this instruction. For a kernel with 101 blocks of 32 threads (1 warp each), expect 101 warps per instruction (if no branching).

Instructions Executed (Threads)

inst_executed_threads

Total thread count executing this instruction. With 101 warps of 32 threads: 101 × 32 = 3,232 threads expected.

Predicated On Executed

inst_executed_pred_on

Number of threads that executed this instruction with their predicate evaluating to true. Predication implements conditional execution (if statements) without branching.

Average Threads Executed per Warp

avg_threads_executed

Average active threads when executing. 32 = full warp, lower = divergence/predication. Key efficiency indicator!

Example:
If condition if (threadIdx.x < 24) → average = 24 for instructions inside the if-block
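
The example above can be sketched numerically from per-warp active masks. This is a minimal model assuming 32-thread warps and no other source of divergence; the function name is illustrative, not an NCU API:

```python
# Sketch: derive avg_threads_executed from per-warp active lane masks.
def avg_threads_executed(active_masks):
    """Average number of set bits across the active masks of all executing warps."""
    return sum(bin(mask).count("1") for mask in active_masks) / len(active_masks)

# Inside `if (threadIdx.x < 24)`, only lanes 0..23 are active in every warp.
inside_if_mask = (1 << 24) - 1                        # bits 0..23 set
print(avg_threads_executed([inside_if_mask] * 101))   # 24.0
```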

Average Predicated Threads per Warp

avg_pred_threads_executed

Same as above but accounting for predication masks. Shows true active participation.

⚠️ Compiler Optimizations Can Surprise You

Some conditions may not show reduced thread counts because the compiler used FSEL (floating-point select) instead of predicated instructions. All 32 threads execute FSEL, just with different operands selected. Always check generated SASS to understand the metrics.

Warp Stall Statistics (PC Sampling)

Collected via statistical sampling (not exact counts). Shows why warps couldn't make forward progress.

Memory-Related Stalls

🔴 MIO Throttle

Memory Input/Output pipeline is overfilled. Too many memory instructions issued, pipeline can't accept more. Common when kernel is heavily memory-bound.

🔴 Long Scoreboard

Waiting for memory dependency to resolve. The instruction needs data from a previous load that hasn't completed yet. Most common memory stall.

Example: float x = data[i]; followed by float y = x * 2; — the multiply waits for load
🔴 LG Throttle

Local/Global memory throttle. Similar to MIO Throttle, but specifically for local and global memory operations.

Synchronization Stalls

🟡 Barrier

Waiting at __syncthreads() for other threads in the block to arrive.

Important: Stall appears on the instruction after the barrier, not on the barrier itself!

Scheduling States

🟢 Not Selected

Warp was ready to execute but wasn't chosen by the scheduler this cycle. Normal behavior—doesn't indicate a problem.

🟢 Selected

Warp that actually issued an instruction this cycle. This is the "good" state—warp made forward progress.

PC Sampling: How It Works

SM Sampling Algorithm

[Diagram: one SM with four warp schedulers (Warp Scheduler 0-3), Warp Scheduler 1 selected; legend: active warp, unused slot, sampled warp]

Every N cycles: randomly pick 1 warp scheduler per SM, then randomly pick 1 active warp → record PC + stall reason
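
The selection loop described above can be sketched as a toy simulation. All structures and numbers here are made up for illustration; real sampling happens in hardware:

```python
import random

# Toy model of per-SM PC sampling: pick one scheduler that has active warps,
# then one active warp on it, and record its (pc, stall_reason).
def sample_sm(schedulers, rng):
    """schedulers: one list of (pc, stall_reason) entries per warp scheduler."""
    non_empty = [s for s in schedulers if s]      # schedulers with active warps
    if not non_empty:
        return None                               # nothing to sample this round
    return rng.choice(rng.choice(non_empty))      # random scheduler, then random warp

rng = random.Random(0)
schedulers = [
    [(0x100, "Long Scoreboard"), (0x104, "Selected")],  # scheduler 0
    [(0x200, "Barrier")],                               # scheduler 1
    [],                                                 # scheduler 2: no active warps
    [(0x300, "MIO Throttle")],                          # scheduler 3
]
print(sample_sm(schedulers, rng))  # one randomly sampled (pc, stall_reason)
```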

Sampling Metrics

  • Kernel Duration - gpu__time_duration
  • Number of SMs - device__attribute_num_sm
  • Sample Interval (cycles) - smsp__pcsamp_interval
  • Total Samples Collected - smsp__pcsamp_sample_count

Estimating Sample Count

sample_count ≈ (kernel_duration / sample_interval) × num_SMs

Useful for configuring sample interval when you want a specific sample density.
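
The estimate above as a small helper, with illustrative numbers (a 1M-cycle kernel, one sample every 2048 cycles, 108 SMs); the function name is ours, not NCU's:

```python
def estimated_sample_count(duration_cycles, sample_interval_cycles, num_sms):
    """sample_count ≈ (kernel_duration / sample_interval) × num_SMs"""
    return (duration_cycles / sample_interval_cycles) * num_sms

print(estimated_sample_count(1_000_000, 2048, 108))  # 52734.375
```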

Memory Metrics

Excessive Sectors in L2

l2_excessive_sectors

Memory accesses that fetched more L2 cache sectors than needed. Indicates non-coalesced access patterns.
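
The effect of a non-coalesced pattern can be sketched by counting the 32-byte sectors a warp's byte addresses touch. This is a simplification (real accounting also depends on access size and alignment), and the names are illustrative:

```python
SECTOR_BYTES = 32  # L2 sector size

def sectors_touched(byte_addresses):
    """Number of distinct 32-byte sectors covered by a warp's addresses."""
    return len({addr // SECTOR_BYTES for addr in byte_addresses})

coalesced = [4 * lane for lane in range(32)]        # float data[threadIdx.x]
strided   = [4 * 8 * lane for lane in range(32)]    # float data[8 * threadIdx.x]

print(sectors_touched(coalesced))  # 4 sectors: the minimum for 128 bytes
print(sectors_touched(strided))    # 32 sectors: 28 of them are "excessive"
```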

Excessive Sectors in L1

l1_excessive_sectors

Same as above but for L1 cache. High values indicate inefficient memory access patterns.

Excessive Memory Transactions

memory_excessive_transactions

Total count of "wasted" memory transactions. Quick way to find inefficient load/store instructions.

Shared Memory Bank Conflicts

shared_bank_conflicts

Access pattern causes multiple threads to hit same memory bank. Serializes memory accesses.
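
A rough way to reason about conflict degree, assuming the usual 32 banks of 4-byte words (a sketch, not NCU's exact accounting):

```python
NUM_BANKS = 32  # 4-byte words are distributed round-robin across 32 banks

def conflict_degree(word_indices):
    """Max number of threads hitting the same bank (1 = conflict-free)."""
    banks = [idx % NUM_BANKS for idx in word_indices]
    return max(banks.count(b) for b in set(banks))

row_access    = [lane for lane in range(32)]        # shared[threadIdx.x]
column_access = [32 * lane for lane in range(32)]   # shared[32 * threadIdx.x]

print(conflict_degree(row_access))     # 1: conflict-free
print(conflict_degree(column_access))  # 32: fully serialized, 32-way conflict
```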

Memory Instruction Types

The Source Page shows instruction types in the SASS column:

LDG - Global Load
STG - Global Store
LDS - Shared Load
STS - Shared Store

💡 Normalize by Execution Count

An instruction executed 1000× will have more excessive memory transactions than one executed 10×. Divide excessive transactions by execution count to find the worst access patterns.
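
This normalization can be sketched as a sort over per-execution ratios, so hot-but-efficient instructions don't drown out genuinely bad access patterns. The line numbers and counts below are made up for illustration:

```python
# Rank load/store lines by excessive transactions *per execution*.
lines = [
    {"line": 42, "excessive": 8000, "executed": 1000},  # hot, but only 8 per exec
    {"line": 57, "excessive": 900,  "executed": 10},    # cold, but 90 per exec
]
for entry in lines:
    entry["per_exec"] = entry["excessive"] / entry["executed"]

worst_first = sorted(lines, key=lambda e: e["per_exec"], reverse=True)
print([(e["line"], e["per_exec"]) for e in worst_first])  # [(57, 90.0), (42, 8.0)]
```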

Register & Dependency Tracking

Live Registers

registers_live

Number of registers in use at each instruction. Uses max aggregation. High register pressure can limit occupancy.

Register Dependencies

The Source Page shows register read/write dependencies:

  • Output registers - Which registers this instruction writes to
  • Input registers - Which registers this instruction reads from
  • Click to follow the dependency chain and find stall sources

Predicate Dependencies

Track predicate registers (P0, P1, etc.) used for conditional execution. Click to see where predicates are set and consumed.

🔗 Finding Dependency Chains

When you see Long Scoreboard stalls, use register dependencies to trace back:

1. Find instruction with high stall count
2. Check its input registers
3. Find instruction that writes to that register
4. That's likely your slow instruction (memory load, etc.)
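
The trace-back above amounts to a search for the most recent writer of a stalled instruction's input register. The instruction list below is a made-up toy, not real SASS:

```python
# Each entry: program counter, opcode text, registers written and read.
trace = [
    {"pc": 0x10, "op": "LDG R4, [R2]",    "writes": {"R4"}, "reads": {"R2"}},
    {"pc": 0x18, "op": "FADD R6, R5, R5", "writes": {"R6"}, "reads": {"R5"}},
    {"pc": 0x20, "op": "FMUL R7, R4, R4", "writes": {"R7"}, "reads": {"R4"}},  # stalls here
]

def find_producer(trace, stalled_pc, reg):
    """Last instruction before stalled_pc that writes `reg`."""
    producers = [i for i in trace if i["pc"] < stalled_pc and reg in i["writes"]]
    return producers[-1] if producers else None

print(find_producer(trace, 0x20, "R4")["op"])  # LDG R4, [R2]: the slow load
```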

Details Page: Aggregate Metrics (Hardware Counters)

These are kernel-wide aggregates from hardware counters (not sampling).

Speed of Light Throughput

Compute (%)

Percentage of peak compute throughput achieved

Memory (%)

Percentage of peak memory bandwidth achieved

The higher of the two values indicates which resource is the bottleneck.

Pipeline Utilization

LSU (Load/Store Unit) Memory operation pipeline
FMA (Fused Multiply-Add) Math pipeline
Tensor Core Matrix operation pipeline

Issue Slot Utilization

Percentage of cycles where warp schedulers issued instructions. Target: close to 100%

Example: "Issue every 5 cycles" = ~20% utilization = room for optimization
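
The arithmetic behind that example, as a one-line helper (the function name is ours):

```python
def issue_slot_utilization(cycles_between_issues):
    """Percentage of cycles in which a scheduler issued an instruction."""
    return 100.0 / cycles_between_issues

print(issue_slot_utilization(5))  # 20.0 (%), i.e. room for optimization
```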

Warp Stall Reasons (Aggregated)

Same stall reasons as Source Page, but aggregated over entire kernel. Collected via hardware counters (not sampling).

Key difference: Hardware counters count every stall for every active warp, every cycle. Sum of all stall reasons = total active warp-cycles. Ratios between reasons match the sampled data.

Key Insights for Using NCU Metrics

1. Stalls Appear on Dependent Instructions

A slow load doesn't show stalls on itself—stalls appear on the instruction that uses the loaded value. A barrier shows stalls on the instruction after the barrier, not on the barrier itself.

2. Zero Stalls ≠ Perfect Kernel

The goal isn't zero stalls—it's high issue slot utilization. Some stalls are hidden by having enough warps. Only focus on stalls if your kernel is latency-limited (check Details Page first).

3. Check Generated Assembly

Click on source lines to see what assembly was actually generated. Loop unrolling, predication, and compiler optimizations can produce very different code than you expect. Metrics only make sense when you understand the generated code.

4. Use Per-File Statistics for Large Codebases

Collapse the Source Page to get per-file aggregates. Quickly identify which source files consume the most cycles or have the worst memory access patterns before diving into individual lines.

5. Inline Function Breakdown

When helper functions are inlined, all metrics aggregate to a single line. Use the inline function breakdown table to split metrics by call site and find which call is actually expensive.

Enabling Line Table Information

Required to correlate CUDA-C source with SASS assembly. Does NOT affect code optimization.

Compiler Flags

NVCC
nvcc --generate-line-info ...
HPC SDK
Enabled by default
Python/Numba/Triton
Pass via jit options

NCU Collection

Enable Source Counter section (part of "full" metric set):

ncu --set full -o report ./my_app

Inline Function Support

For inline function breakdown to work, compile with CUDA 11.8+ toolkit (when using NVCC).

Quick Reference: Metric Categories

Category    | Source   | Use For                            | Key Metrics
Instruction | Counter  | Execution count, thread efficiency | inst_executed, avg_threads
Warp Stalls | Sampling | Find latency bottlenecks           | Long Scoreboard, MIO Throttle, Barrier
Memory      | Memory   | Find inefficient access patterns   | excessive_sectors, bank_conflicts
Registers   | Counter  | Dependency chains, pressure        | registers_live, dependencies