PyTorch Memory Optimization: Huge Pages & CUDA Allocator

What are Huge Pages?

Linux Huge Pages use larger memory pages (2MB or 1GB instead of 4KB) to reduce TLB (Translation Lookaside Buffer) misses. This can significantly improve memory access performance for large allocations typical in deep learning.

CPU Request

→

TLB Lookup

→

Page Table

→

Physical Memory

1. Check and Configure Huge Pages

First, check your current huge pages configuration:

# View current huge pages configuration
cat /proc/meminfo | grep -i huge

# Check available huge page sizes
ls /sys/kernel/mm/hugepages/

# Allocate 1024 huge pages of 2MB each (2GB total)
sudo sysctl -w vm.nr_hugepages=1024

# Make it persistent across reboots
echo "vm.nr_hugepages=1024" | sudo tee -a /etc/sysctl.conf

2. Transparent Huge Pages (THP)

The easiest approach — THP handles huge pages automatically:

# Enable THP (if not already enabled)
echo "always" | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# Check current THP status
cat /sys/kernel/mm/transparent_hugepage/enabled

# Options: [always] madvise never

3. Explicit Huge Pages via mmap

Allocate PyTorch tensors on huge pages using memory-mapped files:

import torch
import ctypes
import os

def allocate_huge_pages_tensor(shape, dtype=torch.float32):
    """Allocate a PyTorch tensor backed by huge pages."""
    
    # Calculate size in bytes
    numel = 1
    for s in shape:
        numel *= s
    dtype_size = torch.tensor([], dtype=dtype).element_size()
    size_bytes = numel * dtype_size
    
    # Round up to huge page size (2MB)
    HUGE_PAGE_SIZE = 2 * 1024 * 1024
    aligned_size = ((size_bytes + HUGE_PAGE_SIZE - 1) // HUGE_PAGE_SIZE) * HUGE_PAGE_SIZE
    
    # mmap flags
    MAP_HUGETLB = 0x40000
    MAP_ANONYMOUS = 0x20
    MAP_PRIVATE = 0x02
    PROT_READ = 0x1
    PROT_WRITE = 0x2
    
    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int, 
                          ctypes.c_int, ctypes.c_int, ctypes.c_long]
    libc.mmap.restype = ctypes.c_void_p
    
    ptr = libc.mmap(None, aligned_size, 
                    PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, 
                    -1, 0)
    
    if ptr == ctypes.c_void_p(-1).value:
        raise OSError(f"mmap failed: {os.strerror(ctypes.get_errno())}")
    
    # Create tensor from pointer
    tensor = torch.frombuffer(
        (ctypes.c_char * size_bytes).from_address(ptr),
        dtype=dtype
    ).reshape(shape)
    
    return tensor, ptr, aligned_size

# Usage
tensor, ptr, size = allocate_huge_pages_tensor((1024, 1024, 1024), torch.float32)

4. NUMA-Aware Huge Pages

For multi-socket systems, bind memory to specific NUMA nodes:

# Allocate huge pages on specific NUMA node
echo 512 | sudo tee /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 512 | sudo tee /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

# Run with numactl
numactl --cpunodebind=0 --membind=0 python train.py

Why jemalloc?

jemalloc is a high-performance memory allocator that can significantly reduce memory fragmentation and improve allocation speed. It has built-in support for Transparent Huge Pages (THP) and provides better multi-threaded performance than glibc's default allocator.

Installation

# Ubuntu/Debian
sudo apt-get install libjemalloc2 libjemalloc-dev

# CentOS/RHEL
sudo yum install jemalloc jemalloc-devel

# Verify installation
ls -la /usr/lib/x86_64-linux-gnu/libjemalloc.so*

Running PyTorch with jemalloc

Use LD_PRELOAD to replace the default allocator:

# Basic usage
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 python train.py

# With optimized configuration for ML workloads
MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000,thp:always" \
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
python train.py

MALLOC_CONF Options Explained

Option	Value	Purpose
`oversize_threshold`	1	Allocations ≥1 byte bypass tcache for large allocs
`background_thread`	true	Enable background thread for memory management
`metadata_thp`	auto	Use THP for jemalloc's internal metadata
`dirty_decay_ms`	9000000000	Delay returning dirty pages to OS (keeps memory hot)
`muzzy_decay_ms`	9000000000	Delay returning muzzy pages to OS
`thp`	always	Always use THP when available

Complete Training Script Example

#!/bin/bash
# train_optimized.sh

# Enable Transparent Huge Pages
echo "always" | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# Set jemalloc configuration
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000,thp:always"

# Preload jemalloc
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

# PyTorch optimizations
export OMP_NUM_THREADS=$(nproc)
export MKL_NUM_THREADS=$(nproc)

# Run training
python train.py "$@"

PYTORCH_CUDA_ALLOC_CONF

This environment variable controls PyTorch's CUDA caching memory allocator. Instead of calling slow cudaMalloc/cudaFree directly, PyTorch caches freed GPU memory for reuse. This variable lets you tune the allocator's behavior.

PyTorch Tensor

→

Caching Allocator

→

Memory Pool

→

CUDA Memory

Configuration Options

Option	Default	Purpose
`max_split_size_mb`	Unlimited	Max size (MB) of memory block that can be split
`expandable_segments`	False	Use expandable segments to reduce fragmentation
`garbage_collection_threshold`	0.0	Trigger GC when this fraction of memory is cached
`roundup_power2_divisions`	0	How to round up allocation sizes
`backend`	native	Allocator backend (native, cudaMallocAsync)

Reduce Fragmentation

OOM errors even when nvidia-smi shows free memory

export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"

Prevent Large Block Splitting

Preserve large blocks for future allocations

export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"

Enable Garbage Collection

Release cached memory when threshold reached

export PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.8"

Use CUDA Async Allocator

CUDA 11.4+ native async allocator

export PYTORCH_CUDA_ALLOC_CONF="backend:cudaMallocAsync"

Combined Configuration Examples

# For OOM with "free memory" available (fragmentation)
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"

# For large model training
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512,expandable_segments:True"

# For inference servers (minimize memory footprint)
export PYTORCH_CUDA_ALLOC_CONF="garbage_collection_threshold:0.7,expandable_segments:True"

# All options combined
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:256,expandable_segments:True,garbage_collection_threshold:0.9"

Debugging Memory Issues

Record memory allocation snapshots for analysis:

import torch

# Record memory history for debugging
torch.cuda.memory._record_memory_history(max_entries=100000)

# ... run your code ...

# Save snapshot for analysis
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")

# Visualize at: https://pytorch.org/memory_viz

⚠️ Important Notes

GPU memory only — This is for CUDA memory, not CPU memory
Not related to Huge Pages — CPU memory uses system allocator
Per-process setting — Must be set before PyTorch imports CUDA
Order matters — Set environment variable before import torch

Performance Impact Summary

Optimization	Best For	Typical Speedup	Memory Type
THP (Transparent)	General use, easy setup	5-15%	CPU
Explicit 2MB pages	Large batch training	10-20%	CPU
1GB pages	Very large models (LLMs)	15-25%	CPU
jemalloc + THP	Production deployments	10-20%	CPU
expandable_segments	Variable batch sizes	Prevents OOM	GPU

✅ When to Use

Large batch training
Large model inference (LLMs, ViTs)
Multi-GPU training with CPU memory bottleneck
Data loading with large prefetch buffers
Production inference servers

❌ May Not Help

Small models that fit in L3 cache
GPU-bound workloads
Systems with limited RAM
Development/debugging (adds complexity)

Complete Production Setup

#!/bin/bash
# production_train.sh - Optimized PyTorch training script

# === CPU Memory Optimizations ===

# Enable Transparent Huge Pages
echo "always" | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# Configure jemalloc
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:9000000000,muzzy_decay_ms:9000000000,thp:always"
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

# === GPU Memory Optimizations ===

# CUDA allocator configuration
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512,expandable_segments:True,garbage_collection_threshold:0.9"

# === CPU Threading ===

export OMP_NUM_THREADS=$(nproc)
export MKL_NUM_THREADS=$(nproc)
export NUMEXPR_NUM_THREADS=$(nproc)

# === NUMA Binding (multi-socket systems) ===

# Uncomment for NUMA-aware execution
# numactl --cpunodebind=0 --membind=0 python train.py "$@"

# Run training
python train.py "$@"

Monitoring Commands

# Monitor huge page usage in real-time
watch -n 1 'cat /proc/meminfo | grep -i huge'

# Check per-process huge page usage
grep -i huge /proc/$(pgrep -f "python train.py")/smaps | head -20

# Monitor GPU memory with PyTorch
python -c "import torch; print(torch.cuda.memory_summary())"

# Check jemalloc stats (requires jemalloc with stats enabled)
MALLOC_CONF="stats_print:true" python -c "import torch; x = torch.randn(1000,1000)"