Tensor Memory Accelerator (TMA)

Data path with and without TMA, what to measure in NCU, and expected performance benefits (Hopper+)

Overview

Tensor Memory Accelerator (TMA) is a dedicated hardware unit on NVIDIA Hopper (and later) GPUs that performs asynchronous bulk copies between global memory (HBM) and shared memory. A single thread can initiate a large tensor transfer; the TMA unit performs the copy in the background while warps continue with compute or other work.

Without TMA, warps move data using regular load/store instructions (or cp.async on Ampere), tying up the SM’s memory pipeline and often stalling on memory latency. With TMA, bulk copies are offloaded to the TMA unit, improving overlap and effective bandwidth. PTX uses cp.async.bulk.tensor; synchronization relies on mbarrier and fence.proxy.async.

Register bypass: TMA moves data directly between global memory and shared memory—it does not use the register file as an intermediate. Older approaches required many threads and registers to hold data during loads/stores. TMA avoids that entirely for bulk transfers, which reduces register pressure and frees registers for compute.

Data Path Without TMA

Warps issue loads/stores (or small async copies) through the SM’s memory pipeline. Data flows from global memory through L2 into L1/shared. Warps often stall on memory until data arrives.

```mermaid
flowchart TB
    subgraph SM["SM (Streaming Multiprocessor)"]
        W["Warp(s)"]
        MP["Memory Pipeline\n(loads/stores)"]
        W -->|"Issue load/store"| MP
    end
    GM["Global Memory (HBM)"]
    L2["L2 Cache"]
    L2 --> GM
    MP -->|"Request"| L2
    L2 -->|"Data (high latency if miss)"| MP
    MP -->|"Data ready"| W
    W -->|"Stall until data"| W
    style W fill:#1e293b,stroke:#64748b
    style MP fill:#475569,stroke:#64748b
    style L2 fill:#3b82f6,stroke:#64748b
    style GM fill:#6366f1,stroke:#64748b
```

Warp issues load → waits on memory pipeline → L2/global → data returns → warp resumes. Limited overlap; memory-bound kernels show high warp stalls on memory.
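This register-file round trip is easy to see in code. In a conventional tiled copy, every thread loads its own elements, and each value passes through that thread's registers on its way into shared memory. A minimal sketch (the kernel name and TILE size are illustrative, not from any particular library):

```cuda
// Baseline (no TMA): every thread loads its own elements, so each value
// travels global -> L2 -> register file -> shared memory.
#define TILE 128

__global__ void tile_copy_baseline(const float* __restrict__ gmem, int lda)
{
    __shared__ float smem[TILE][TILE];

    // Each thread strides over the tile; the temporary `v` lives in a
    // register, which is exactly the intermediate that TMA eliminates.
    for (int i = threadIdx.y; i < TILE; i += blockDim.y) {
        for (int j = threadIdx.x; j < TILE; j += blockDim.x) {
            float v = gmem[i * lda + j];   // load: global -> register
            smem[i][j] = v;                // store: register -> shared
        }
    }
    __syncthreads();  // all warps wait here; no copy/compute overlap
    // ... compute on smem ...
}
```

Every warp participates in the copy, and the loop's loads occupy the SM's memory pipeline and register file for the duration of the transfer.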

Data Path With TMA

One thread (or a few) issues a TMA copy (bulk tensor). A dedicated TMA unit performs the transfer asynchronously. Warps can do compute or issue more TMA copies while previous copies are in flight. Synchronization is explicit (mbarrier/fence).

```mermaid
flowchart TB
    subgraph SM["SM (Streaming Multiprocessor)"]
        W["Warp(s)"]
        TMA["TMA Unit\n(bulk copy engine)"]
        COMPUTE["Compute / more TMA ops"]
        W -->|"Issue TMA copy"| TMA
        W -->|"Continue work"| COMPUTE
        TMA -->|"Async bulk copy"| SHARED
        TMA -.->|"Overlaps with"| COMPUTE
    end
    GM["Global Memory (HBM)"]
    L2["L2 Cache"]
    SHARED["Shared Memory"]
    GM --> L2
    TMA -->|"Read/write via L2"| L2
    L2 --> TMA
    TMA --> SHARED
    SHARED --> COMPUTE
    style W fill:#1e293b,stroke:#94a3b8
    style TMA fill:#76B900,stroke:#76B900,color:#000
    style COMPUTE fill:#22c55e,stroke:#22c55e,color:#000
    style SHARED fill:#3b82f6,stroke:#64748b
    style L2 fill:#3b82f6,stroke:#64748b
    style GM fill:#6366f1,stroke:#64748b
```

TMA unit handles bulk transfer; warps overlap copy with compute. Data path is global ↔ L2 ↔ shared with no register file in between (register bypass). Better utilization and higher effective memory throughput when copy and compute are balanced.
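The same tile copy with TMA follows the pattern from the CUDA C++ Programming Guide: one elected thread issues the bulk tensor copy, and the block waits on an mbarrier that tracks transaction bytes. This is a sketch, not a drop-in implementation — the helpers live in the `cuda::device::experimental` namespace of libcu++ (subject to change), the tensor map `tmap` is assumed to have been built on the host, and the tile coordinates (0, 0) are illustrative:

```cuda
#include <cuda/barrier>   // libcu++ mbarrier wrapper + TMA helpers (sm_90+)
using barrier = cuda::barrier<cuda::thread_scope_block>;
namespace cde = cuda::device::experimental;

__global__ void tile_copy_tma(const __grid_constant__ CUtensorMap tmap)
{
    __shared__ alignas(128) float smem[128][128];

    #pragma nv_diag_suppress static_var_with_dynamic_init
    __shared__ barrier bar;

    if (threadIdx.x == 0) {
        init(&bar, blockDim.x);
        // Make the barrier init visible to the async proxy used by TMA.
        cde::fence_proxy_async_shared_cta();
    }
    __syncthreads();

    barrier::arrival_token token;
    if (threadIdx.x == 0) {
        // A single thread kicks off the whole 128x128 tile copy;
        // no other thread (and no register) touches the data in flight.
        cde::cp_async_bulk_tensor_2d_global_to_shared(
            &smem, &tmap, /*x=*/0, /*y=*/0, bar);
        // Tell the barrier how many bytes the TMA unit will deposit.
        token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(smem));
    } else {
        token = bar.arrive();  // everyone else arrives with no tx bytes
    }
    // Warps could do independent compute here while the copy is in flight.
    bar.wait(std::move(token));  // complete: smem holds the tile
    // ... compute on smem ...
}
```

Under the hood the copy compiles to `cp.async.bulk.tensor` PTX, and the fence to `fence.proxy.async` — the instructions named in the Overview.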

Clarifications

  • Global = HBM: In GPU terms, “global memory” is device DRAM—on modern GPUs that’s HBM (High Bandwidth Memory / VRAM). So the TMA path is HBM ↔ L2 ↔ shared.
  • What we’re saving (L1 / registers): We’re not thrashing L1 or the register file. Without TMA, many small loads go through the SM’s memory pipeline into L1 and registers, which can thrash L1 and burn registers. TMA does one bulk move directly into shared memory, bypassing registers and avoiding that thrashing pattern.
  • L2 is unchanged: Data still goes through L2. All the usual L2 performance issues remain: limited capacity, finite bandwidth, and cache misses (L2 miss → HBM latency). TMA doesn’t remove or bypass L2; it only changes how the copy is initiated (bulk, async) and avoids registers and the L1-thrashing load pattern.

What to Expect in Metrics When TMA Is Present

When kernels use TMA effectively, you typically see:

| Category | Metric / Observation |
| --- | --- |
| Memory throughput | Higher achieved DRAM throughput (bytes/sec) and better L2 bandwidth utilization; bulk transfers are more efficient than many small loads. |
| Warp stall reasons | Lower fraction of cycles stalled on memory throttle or memory dependency (data not ready). More cycles in "issued" or "executing" when TMA hides latency. |
| SM utilization | Issue slot utilization (scheduler busy) can be higher; copy and compute overlap keeps warps active instead of waiting on loads. |
| Instructions | Presence of TMA / bulk-copy related SASS (e.g. CP_ASYNC_BULK_*, or proxy instructions). Fewer generic LDG/STG for the same logical data movement. |
| Occupancy | TMA reduces register and thread usage for data movement (one thread can initiate a large copy), so kernels may sustain good occupancy with less register pressure. |

Kernel Names Where TMA Often Appears

TMA is used in libraries and frameworks that target Hopper (H100, H200) and later. Typical kernel name patterns or sources:

  • CUTLASS 3.x / CuTe: GEMM and related kernels (e.g. gemm_*, cutlass_*, *_tma). CUTLASS uses TMA for bulk tile loads/stores.
  • cuBLAS / cuBLASLt (Hopper): GEMM and batched GEMM kernels; names often include blas, gemm, or internal identifiers. TMA is used in optimized GEMM paths.
  • FlashAttention / fused attention: Attention kernels that move large blocks of Q/K/V; names may include flash_attn, attention, fmha.
  • Triton (experimental TMA): Autotuned kernels (often long hashed names) that enable TMA for certain ops (e.g. matmul, attention).
  • Custom Hopper kernels: Any kernel using cp.async.bulk.tensor (PTX) or equivalent APIs (e.g. CuTe, libcudacxx TMA helpers) will show TMA in the SASS/NCU view.

Filter in NCU by kernel name to focus on GEMM, attention, or your custom kernels; then inspect the Source/SASS view and memory metrics for TMA usage.

How to See TMA in Nsight Compute (NCU)

  • Source / SASS view: Look for instructions related to bulk async copy (e.g. CP_ASYNC_BULK_TENSOR, BULK_*, or TMA descriptor setup). NCU’s Source page correlates high-level code with SASS; TMA shows up as few instructions initiating large transfers.
  • Memory section: Check “Memory Workload Analysis” and throughput metrics. Kernels using TMA often show high memory throughput with relatively fewer explicit load/store instructions in the listing.
  • Warp Stall Statistics: Compare memory-related stall reasons (NCU surfaces memory dependency as "Long Scoreboard" and memory throttling as "LG Throttle" / "MIO Throttle"). With TMA, these can be lower (more overlap); without TMA, memory-bound kernels often show high stalls here.
  • Roofline / speed-of-light: NCU’s roofline view shows achieved vs theoretical memory bandwidth. TMA can help get closer to the memory roofline when the kernel is memory-bound.
  • NVIDIA docs: For exact metric names and TMA-specific counters on your GPU (e.g. H100), check the latest Nsight Compute documentation and the “Metrics Reference” for your architecture.

Performance Benefits When TMA Is Present

  • Higher effective memory bandwidth: Bulk transfers reduce per-byte overhead and allow the memory system to run at higher utilization.
  • Copy–compute overlap: Warps do useful work while TMA moves data, reducing time spent stalled on memory.
  • Register bypass / less register pressure: TMA transfers data directly between global and shared memory without using registers. One thread can initiate a large copy instead of many threads each doing small loads through the register file; this frees registers and occupancy for compute.
  • Reported speedups: On Hopper, well-tuned GEMM kernels using TMA (e.g. CUTLASS 3 FP8) have shown on the order of 1.4×–2.2× over non-TMA baselines (e.g. cuBLAS) for small-to-medium problem sizes, where memory and overlap matter most.
  • Multicast: TMA can multicast from global to shared memory of multiple SMs in a cluster, reducing total traffic and improving efficiency in producer–consumer patterns.
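The descriptor that drives all of these copies is set up once on the host. A minimal sketch using the CUDA driver API's `cuTensorMapEncodeTiled` (CUDA 12+); the tensor dimensions, tile size, and the absence of swizzling are illustrative choices, and error checking is omitted:

```cuda
#include <cuda.h>    // driver API: CUtensorMap, cuTensorMapEncodeTiled
#include <cstdint>

// Encode a TMA descriptor for a dense row-major rows x cols float matrix,
// read in 128x128 tiles. Real kernels typically also pick a swizzle mode
// and L2 promotion hint here.
CUtensorMap make_tensor_map(void* gmem, uint64_t rows, uint64_t cols)
{
    CUtensorMap tmap{};
    uint64_t global_dim[2]    = {cols, rows};            // fastest-moving dim first
    uint64_t global_stride[1] = {cols * sizeof(float)};  // row pitch in bytes
                                                         // (rank - 1 entries)
    uint32_t box_dim[2]       = {128, 128};              // shared-memory tile shape
    uint32_t elem_stride[2]   = {1, 1};                  // dense (no gaps) tile

    cuTensorMapEncodeTiled(
        &tmap,
        CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
        /*tensorRank=*/2, gmem,
        global_dim, global_stride, box_dim, elem_stride,
        CU_TENSOR_MAP_INTERLEAVE_NONE,
        CU_TENSOR_MAP_SWIZZLE_NONE,
        CU_TENSOR_MAP_L2_PROMOTION_NONE,
        CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
    return tmap;
}
// The map is then passed by value to the kernel (e.g. as a __grid_constant__
// parameter) and referenced by every TMA copy the kernel issues.
```

Because the descriptor captures layout, tile shape, and out-of-bounds behavior up front, the device-side copy needs only a shared-memory address, tile coordinates, and a barrier — which is what lets a single thread launch the whole transfer.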