CUDA Performance Optimization Samples Explorer

Memory Alignment & Structure Padding

Understanding the access speed gap between aligned and misaligned data structures

Key Insight

This sample demonstrates that memory alignment can affect throughput by 2-10x. GPUs achieve peak memory bandwidth only when memory accesses are properly aligned and coalesced.

What is Memory Alignment?

Memory alignment refers to placing data at memory addresses that are multiples of the data's size or a specific boundary (like 128 bytes for optimal GPU coalescing).

✓ Aligned access: Address is a multiple of data size
✗ Misaligned access: Address is not a multiple of data size
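The aligned/misaligned distinction above comes down to a simple address test. The following is a minimal host-side sketch, not part of the sample itself; the helper names `is_aligned` and `align_up` are hypothetical, and both assume the boundary is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when `ptr` sits on a `boundary`-byte boundary.
// Assumes `boundary` is a power of two (4, 8, 16, 128, ...).
inline bool is_aligned(const void* ptr, std::size_t boundary) {
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(ptr) & (boundary - 1)) == 0;
}

// Hypothetical helper: rounds `offset` up to the next multiple of `boundary`.
// Assumes `boundary` is a power of two.
inline std::size_t align_up(std::size_t offset, std::size_t boundary) {
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (offset + boundary - 1) & ~(boundary - 1);
}
```

A check such as `is_aligned(ptr, 16)` can guard a cast to `float4*`, and `align_up` can compute padded offsets when several arrays are packed into one allocation.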

Memory Transaction Visualization

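Aligned (1 transaction): the requested bytes fall inside one memory segment, so a single transaction serves the access.

Misaligned (2+ transactions): the requested bytes straddle a segment boundary, so two or more transactions are needed.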

Code Examples: Aligned vs Misaligned Structures

Aligned Structure (Good)

// Properly aligned structure
struct __align__(16) AlignedStruct {
    float4 data;  // 16 bytes, naturally aligned
};

// Or using built-in aligned types
float4* aligned_array;
cudaMalloc(&aligned_array, N * sizeof(float4));

Misaligned Structure (Bad)

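// Poorly ordered structure: the compiler keeps every member aligned,
// but only by inserting padding between members
struct MisalignedStruct {
    char c;       // 1 byte at offset 0, followed by 3 bytes of padding
    float f;      // 4 bytes at offset 4
    char c2;      // 1 byte at offset 8, followed by 7 bytes of padding
    double d;     // 8 bytes at offset 16
};
// Members total 14 bytes, but sizeof = 24 with padding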
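The layout the compiler actually produces can be checked at compile time. The sketch below is an illustration only: `PaddedStruct` and `ReorderedStruct` are hypothetical names, and the asserted numbers assume a typical 64-bit ABI in which `float` is 4-byte aligned and `double` is 8-byte aligned.

```cpp
#include <cstddef>  // offsetof

// Same members and order as MisalignedStruct.
struct PaddedStruct {
    char   c;
    float  f;
    char   c2;
    double d;
};

// Same members, ordered from largest to smallest alignment.
struct ReorderedStruct {
    double d;
    float  f;
    char   c;
    char   c2;
};

// Assumed 64-bit ABI: the compiler pads rather than misplacing members.
static_assert(offsetof(PaddedStruct, f) == 4,  "f is padded up to offset 4");
static_assert(offsetof(PaddedStruct, c2) == 8, "c2 follows f");
static_assert(offsetof(PaddedStruct, d) == 16, "d is padded up to offset 16");
static_assert(sizeof(PaddedStruct) == 24,      "10 of 24 bytes are padding");
static_assert(sizeof(ReorderedStruct) == 16,   "2 of 16 bytes are padding");
```

Reordering keeps the same 14 bytes of payload but shrinks each element from 24 to 16 bytes, so an array of them moves a third less data.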

Performance Impact

Per-Element Copy Throughput Comparison

Aligned float4 (16-byte aligned) ~800 GB/s

Aligned float2 (8-byte aligned) ~700 GB/s

Misaligned 12-byte struct ~400 GB/s

Misaligned 9-byte struct ~150 GB/s

* Representative values on H100 GPU. Actual performance varies by hardware.
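A per-element copy comparison of this kind could be measured roughly as follows. This is a sketch under stated assumptions, not the sample's actual benchmark: `Packed12`, `copy_elements`, `measure_copy_gbps`, and `report_copy_throughput` are hypothetical names, and a single timed launch is used where a real benchmark would average many.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical 12-byte element: three floats with only 4-byte alignment,
// so it cannot be moved with a single 8- or 16-byte vector access.
struct Packed12 { float x, y, z; };

// One thread copies one element of type T.
template <typename T>
__global__ void copy_elements(const T* in, T* out, size_t n) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Times one copy of n elements and returns effective GB/s (bytes read plus
// bytes written), or a negative value if a CUDA call fails.
template <typename T>
double measure_copy_gbps(size_t n) {
    T* in = nullptr;
    T* out = nullptr;
    if (cudaMalloc(&in, n * sizeof(T)) != cudaSuccess) return -1.0;
    if (cudaMalloc(&out, n * sizeof(T)) != cudaSuccess) { cudaFree(in); return -1.0; }
    cudaMemset(in, 0, n * sizeof(T));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const unsigned block = 256;
    const unsigned grid = static_cast<unsigned>((n + block - 1) / block);
    copy_elements<T><<<grid, block>>>(in, out, n);  // warm-up launch
    cudaEventRecord(start);
    copy_elements<T><<<grid, block>>>(in, out, n);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    const bool ok = (cudaGetLastError() == cudaSuccess) && ms > 0.0f;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);

    // bytes / (ms * 1e-3 s) / 1e9 simplifies to bytes / (ms * 1e6).
    const double bytes = 2.0 * static_cast<double>(n) * sizeof(T);
    return ok ? bytes / (static_cast<double>(ms) * 1.0e6) : -1.0;
}

// Prints the two measurements side by side.
void report_copy_throughput(size_t n) {
    std::printf("float4   : %.1f GB/s\n", measure_copy_gbps<float4>(n));
    std::printf("Packed12 : %.1f GB/s\n", measure_copy_gbps<Packed12>(n));
}
```

Calling `report_copy_throughput(1 << 24)` from `main` prints one figure per element type; the absolute numbers depend on the GPU and will not necessarily match the representative values listed above.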

Best Practices

• Use __align__(N) specifier for custom alignment
• Prefer built-in vector types: float2, float4, int4
• Ensure structure sizes are multiples of the largest member's alignment
• Order structure members from largest to smallest alignment
• Use cudaMallocPitch() for 2D arrays to ensure row alignment
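The last practice can be illustrated with a small pitched allocation. This is a minimal sketch with most error handling omitted; the kernel `scale_rows` and the driver `scale_pitched` are hypothetical names, and the host buffer is assumed to be densely packed (`width * height` floats).

```cuda
#include <cuda_runtime.h>

// Scales every element of a pitched 2D float array in place.
__global__ void scale_rows(float* base, size_t pitch, int width, int height,
                           float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // The pitch is in bytes, so rows are stepped through a char pointer.
        float* row = reinterpret_cast<float*>(
            reinterpret_cast<char*>(base) + y * pitch);
        row[x] *= factor;
    }
}

// Hypothetical driver: copies a packed host array into pitched device
// memory, scales it, and copies it back. Returns the pitch in bytes chosen
// by the runtime, or 0 on failure.
size_t scale_pitched(float* host, int width, int height, float factor) {
    float* dev = nullptr;
    size_t pitch = 0;
    const size_t row_bytes = width * sizeof(float);
    if (cudaMallocPitch(&dev, &pitch, row_bytes, height) != cudaSuccess) return 0;

    // cudaMemcpy2D translates between the packed host rows and padded device rows.
    cudaMemcpy2D(dev, pitch, host, row_bytes, row_bytes, height,
                 cudaMemcpyHostToDevice);

    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scale_rows<<<grid, block>>>(dev, pitch, width, height, factor);

    cudaMemcpy2D(host, row_bytes, dev, pitch, row_bytes, height,
                 cudaMemcpyDeviceToHost);
    const bool ok = (cudaGetLastError() == cudaSuccess);
    cudaFree(dev);
    return ok ? pitch : 0;
}
```

The returned pitch is at least `width * sizeof(float)` and is typically larger, because the runtime pads each row so that every row starts on an aligned address.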

Unified Memory Performance Comparison

Comparing memory types using matrix multiplication as benchmark

What This Sample Tests

This sample compares Unified Memory (with and without hints) against zero-copy buffers, pageable memory, and page-locked memory for both synchronous and asynchronous transfers.

Memory Types Explained

Pageable Memory

Standard CPU memory allocated with malloc()

• Can be swapped to disk by OS
• Requires staging buffer for GPU transfer
• Slowest transfer performance

Page-Locked (Pinned) Memory

Allocated with cudaMallocHost()

• Cannot be swapped - always in RAM
• Direct DMA transfers to GPU
• Enables async transfers with streams
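The points above can be sketched as a round trip through pinned memory on a single stream. This is an illustration under stated assumptions rather than the sample's code: `add_one` and `pinned_async_roundtrip` are hypothetical names, and some error checks are omitted for brevity.

```cuda
#include <cuda_runtime.h>

__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical driver: uploads, processes, and downloads n floats on one
// stream using pinned host memory. Returns true if the results are correct.
bool pinned_async_roundtrip(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* host = nullptr;
    float* dev = nullptr;
    if (cudaMallocHost(&host, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) { cudaFreeHost(host); return false; }

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    // All three operations are queued on the same stream, so they run in
    // order on the GPU while the CPU thread is free to continue.
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    add_one<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream);

    // The host must wait for the stream before reading the results.
    bool ok = (cudaStreamSynchronize(stream) == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) {
        ok = (host[i] == static_cast<float>(i) + 1.0f);
    }

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}
```

With pageable memory the same `cudaMemcpyAsync` calls would still work, but the copies generally cannot overlap with host work or with other streams in the same way, which is why pinned memory is the usual choice for streaming pipelines.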

Zero-Copy Memory

Pinned memory mapped to GPU: cudaHostAlloc(..., cudaHostAllocMapped)

• GPU accesses host memory directly
• No explicit copy needed
• Good for sparse access patterns
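A minimal zero-copy sketch follows, assuming a 64-bit platform where mapped pinned memory is available without extra device flags. The names `increment` and `zero_copy_increment` are hypothetical.

```cuda
#include <cuda_runtime.h>

__global__ void increment(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical driver: the kernel reads and writes host memory directly,
// with no cudaMemcpy. Returns true if every element was incremented.
bool zero_copy_increment(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    // Pinned host allocation that is also mapped into the GPU address space.
    float* host = nullptr;
    if (cudaHostAlloc(reinterpret_cast<void**>(&host), bytes,
                      cudaHostAllocMapped) != cudaSuccess) {
        return false;
    }
    for (int i = 0; i < n; ++i) host[i] = 0.0f;

    // Device-side view of the same physical memory.
    float* device_view = nullptr;
    if (cudaHostGetDevicePointer(reinterpret_cast<void**>(&device_view), host,
                                 0) != cudaSuccess) {
        cudaFreeHost(host);
        return false;
    }

    increment<<<(n + 255) / 256, 256>>>(device_view, n);

    // The kernel's writes are only safe to read after synchronization.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (host[i] == 1.0f);

    cudaFreeHost(host);
    return ok;
}
```

On a discrete GPU every access in the kernel crosses the host interconnect, so this pattern tends to pay off only when each byte is touched once or only a small part of the buffer is touched at all.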

Unified Memory

Allocated with cudaMallocManaged()

• Single pointer for CPU and GPU
• Automatic data migration on access
• Can use hints for optimization
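The single-pointer model can be sketched as below, without any hints. This is an illustration only; `double_values` and `managed_double_check` are hypothetical names.

```cuda
#include <cuda_runtime.h>

__global__ void double_values(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical driver: one managed pointer is written by the CPU, updated
// by the GPU, and read back by the CPU. Returns true if the values match.
bool managed_double_check(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* data = nullptr;
    if (cudaMallocManaged(&data, bytes) != cudaSuccess) return false;

    // CPU initializes through the same pointer the kernel will use.
    for (int i = 0; i < n; ++i) data[i] = static_cast<float>(i);

    // Pages migrate to the GPU as the kernel touches them.
    double_values<<<(n + 255) / 256, 256>>>(data, n);

    // The CPU must not touch managed data until the kernel has finished.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) {
        ok = (data[i] == 2.0f * static_cast<float>(i));
    }

    cudaFree(data);
    return ok;
}
```

No explicit copy appears anywhere in the function; the cost is that migration happens on demand, which is the overhead the hints in the following section are meant to reduce.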

Unified Memory Performance Hints

cudaMemPrefetchAsync()

// Prefetch data to GPU before kernel
cudaMemPrefetchAsync(ptr, size, deviceId, stream);
myKernel<<<...>>>(...);

// Prefetch back to CPU
cudaMemPrefetchAsync(ptr, size, cudaCpuDeviceId, stream);

Initiates migration before data is needed

cudaMemAdvise()

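// Hint: data is mostly read and rarely written, so the runtime
// may keep read-only copies close to each processor that reads it
cudaMemAdvise(ptr, size, 
    cudaMemAdviseSetReadMostly, deviceId);

// Hint: prefer to keep the physical pages resident on this device
cudaMemAdvise(ptr, size,
    cudaMemAdviseSetPreferredLocation, deviceId);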

Gives runtime hints about access patterns

Memory Type Comparison

Memory Type	Transfer	Async Support	Best Use Case
Pageable	Slowest	No	Simple prototypes
Page-Locked	Fast	Yes	Streaming, overlapping
Zero-Copy	No copy, direct	N/A	Sparse access, integrated GPUs
Unified Memory	Automatic	With hints	Ease of use, complex data
Unified + Hints	Near optimal	Yes	Production code

Relative Performance (Matrix Multiplication)

Page-Locked + Async Transfer 100% (baseline)

Unified Memory + Prefetch Hints ~95%

Unified Memory (no hints) ~70%

Pageable + Sync Transfer ~50%

Zero-Copy (discrete GPU) ~30%

* Performance varies by GPU, system, and workload characteristics

Recommendations

• Development: Start with Unified Memory for simplicity
• Production: Add prefetch hints for near-optimal performance
• Streaming workloads: Use page-locked memory with async transfers
• Integrated GPUs: Zero-copy can be optimal (shared physical memory)
• Avoid: Pageable memory for performance-critical code
