Memory Alignment & Structure Padding
Understanding the huge access speed gap between aligned and misaligned data structures
Key Insight
This sample demonstrates that memory alignment can affect throughput by 2-10x. GPUs achieve peak memory bandwidth only when memory accesses are properly aligned and coalesced.
What is Memory Alignment?
Memory alignment refers to placing data at memory addresses that are multiples of the data's size or a specific boundary (like 128 bytes for optimal GPU coalescing).
- ✓ Aligned access: Address is a multiple of data size
- ✗ Misaligned access: Address is not a multiple of data size
Memory Transaction Visualization
Aligned (1 transaction):
0
4
8
12
Misaligned (2+ transactions):
0
3
7
11
Code Examples: Aligned vs Misaligned Structures
Aligned Structure (Good)
// Properly aligned structure
struct __align__(16) AlignedStruct {
float4 data; // 16 bytes, naturally aligned
};
// Or using built-in aligned types
float4* aligned_array;
cudaMalloc(&aligned_array, N * sizeof(float4));
Misaligned Structure (Bad)
// Misaligned structure - padding issues
struct MisalignedStruct {
char c; // 1 byte
float f; // 4 bytes at offset 1 (bad!)
char c2; // 1 byte
double d; // 8 bytes at offset 6 (bad!)
};
// Total: 15 bytes, but sizeof = 24 with padding
Performance Impact
Per-Element Copy Throughput Comparison
Aligned float4 (16-byte aligned)
~800 GB/s
Aligned float2 (8-byte aligned)
~700 GB/s
Misaligned 12-byte struct
~400 GB/s
Misaligned 9-byte struct
~150 GB/s
* Representative values on H100 GPU. Actual performance varies by hardware.
Best Practices
- • Use
__align__(N)specifier for custom alignment - • Prefer built-in vector types:
float2,float4,int4 - • Ensure structure sizes are multiples of largest member's alignment
- • Order structure members from largest to smallest alignment
- • Use
cudaMallocPitch()for 2D arrays to ensure row alignment