Low Memory Mode¶
Smart-Diffusion provides multiple memory optimization strategies to run diffusion models on GPUs with limited VRAM.
Overview¶
Low memory mode uses a multi-level approach to reduce VRAM usage through model offloading, VAE tiling, and mixed precision.
Memory Levels¶
Smart-Diffusion implements four memory levels (0-3):
Level 0: Default (No Optimization)¶
VRAM: Baseline (highest)
Speed: Fastest
Configuration: Default
All components stay on GPU:

- Text encoder (T5)
- DiT models
- VAE decoder
- All activations
Use Case: When you have abundant VRAM
Level 1: VAE Tiling¶
VRAM: Memory reduction expected
Speed: Performance testing in progress
Optimization: VAE processes video in tiles
What Changes:

- VAE decoder operates on tiles instead of the full video
- Reduces peak memory during decoding
- Minimal quality impact
Use Case: Slightly constrained VRAM
Level 2: CPU Offloading (Text Encoder)¶
VRAM: Memory reduction expected
Speed: Performance testing in progress
Optimization: T5 encoder on CPU
What Changes:

- Text encoder moved to CPU
- Embeddings transferred to GPU after encoding
- One-time cost at the start
Use Case: Limited VRAM
Level 3: Aggressive Offloading¶
VRAM: Maximum memory reduction
Speed: Performance testing in progress
Optimization: DiT models on CPU, moved to GPU per-layer
What Changes:

- DiT model parameters kept on CPU
- Layers moved to GPU one at a time
- Significant CPU-GPU transfer overhead
Use Case: Very limited VRAM
Memory Comparison¶
Memory and performance benchmarking in progress for all optimization levels.
| Level | VRAM (14B) | Speed | Components on GPU |
|---|---|---|---|
| 0 | Baseline | 1.0x | All |
| 1 | To be tested | To be tested | All (VAE tiled) |
| 2 | To be tested | To be tested | DiT + VAE |
| 3 | To be tested | To be tested | Active layer only |
Configuration¶
Via Command Line¶
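The memory level is set with the same dotted override key used throughout these docs. A minimal single-GPU sketch, assuming the `test_generate.py` entry point used in the multi-GPU example below:

```shell
# Set the memory level via the Hydra-style override key
python test_generate.py infer.diffusion.low_mem_level=2
```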
Via Python API¶
```python
from hydra import compose, initialize

initialize(config_path="config", version_base=None)
args = compose(config_name="wan")

# Set memory level
args.infer.diffusion.low_mem_level = 2

# Initialize with low memory
chitu_init(args)
```
Via Config File¶
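The exact config file layout is not documented here; assuming it mirrors the dotted keys above, a YAML fragment might look like:

```yaml
infer:
  diffusion:
    low_mem_level: 2
```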
Memory Optimization Strategies¶
1. Model Offloading¶
Move unused components to CPU:
```python
# Automatic based on low_mem_level
if low_mem_level >= 2:
    text_encoder.to("cpu")
if low_mem_level >= 3:
    for model in model_pool:
        model.to("cpu")
```
2. VAE Tiling¶
Process video in spatial tiles:
```python
def decode_vae_tiled(latent, tile_size=8):
    """Decode VAE in tiles to reduce peak memory."""
    B, C, T, H, W = latent.shape
    # Split the latent into spatial tiles
    tiles = []
    for h in range(0, H, tile_size):
        for w in range(0, W, tile_size):
            tiles.append(latent[:, :, :, h:h + tile_size, w:w + tile_size])
    # Decode each tile independently (only one tile is resident at a time)
    decoded_tiles = [vae.decode(tile) for tile in tiles]
    # Stitch the decoded tiles back into full frames
    video = merge_tiles(decoded_tiles, (H, W))
    return video
```
3. Gradient Checkpointing¶
Recompute activations instead of storing:
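The implementation details are not shown here; a minimal sketch of the idea using PyTorch's `torch.utils.checkpoint` (the toy `layer` below is illustrative, standing in for a DiT layer):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Toy block standing in for a transformer layer
layer = torch.nn.Sequential(
    torch.nn.Linear(16, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 16),
)

x = torch.randn(4, 16, requires_grad=True)

# Without checkpointing: every intermediate activation is kept for backward
y_plain = layer(x)

# With checkpointing: activations are discarded after forward and
# recomputed during backward, trading compute for memory
y_ckpt = checkpoint(layer, x, use_reentrant=False)

assert torch.allclose(y_plain, y_ckpt)  # same result, lower peak memory
```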
4. Mixed Precision¶
Use FP16/BF16 instead of FP32:
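A sketch of the mechanism with PyTorch autocast; on GPU this would be `device_type="cuda"` with `torch.float16`, but bfloat16 on CPU keeps the example runnable anywhere:

```python
import torch

x = torch.randn(8, 8)
w = torch.randn(8, 8)

# Inside an autocast region, eligible ops (e.g. matmul) run in the
# lower-precision dtype, roughly halving activation memory vs FP32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = x @ w

print(y.dtype)  # bfloat16 inside the autocast region
```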
5. Activation Offloading¶
Move activations to CPU between layers:
```python
# Only for level 3: thread the activation through the layers,
# parking it on the CPU between them
if low_mem_level >= 3:
    hidden = input
    for layer in model.layers:
        hidden = layer(hidden.to("cuda"))  # move activation in
        hidden = hidden.to("cpu")          # move it back out
```
Advanced Configuration¶
Custom Tile Size¶
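A sketch, assuming tile size is controlled through the `tile_size` parameter of the `decode_vae_tiled` helper shown above; smaller tiles lower peak VRAM during decoding at the cost of more per-tile overhead:

```python
# Smaller tiles -> lower peak VRAM, more per-tile overhead
video = decode_vae_tiled(latent, tile_size=4)  # default is 8
```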
Offloading Strategy¶
```python
# Fine-tune what to offload
args.infer.offload_text_encoder = True
args.infer.offload_vae = False  # Keep VAE on GPU
args.infer.offload_dit = False
```
Combining with Other Optimizations¶
Low Memory + SageAttention¶
Combine memory level with quantized attention:
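A sketch of the combination; the attention key below is hypothetical, so check the Attention Backends page for the actual name:

```python
args.infer.diffusion.low_mem_level = 2
args.infer.attn_type = "sage"  # hypothetical key; see Attention Backends
```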
VRAM: Performance testing in progress
Speed: Performance testing in progress
Low Memory + FlexCache¶
Enable caching to offset speed loss:
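A sketch of the combination; the cache key below is hypothetical, so check the FlexCache page for the actual name:

```python
args.infer.diffusion.low_mem_level = 2
args.infer.cache = "flex"  # hypothetical key; see FlexCache
```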
VRAM: Performance testing in progress
Speed: Performance testing in progress
Low Memory + Context Parallelism¶
Split across GPUs for even lower per-GPU memory:
```shell
torchrun --nproc_per_node=2 test_generate.py \
    infer.diffusion.low_mem_level=2 \
    infer.diffusion.cp_size=2
```
VRAM per GPU: Performance testing in progress
Speed: Performance testing in progress
Monitoring Memory Usage¶
Real-Time Monitoring¶
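Standard GPU tooling is enough here; for example, refreshing `nvidia-smi` once a second during generation:

```shell
# Refresh GPU memory and utilization every second
watch -n 1 nvidia-smi
```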
Python Monitoring¶
```python
import torch

def print_memory_stats():
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    max_allocated = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Allocated: {allocated:.2f} GB")
    print(f"Reserved: {reserved:.2f} GB")
    print(f"Peak: {max_allocated:.2f} GB")

# Call during generation
print_memory_stats()
```
Memory Profiling¶
```python
import torch.cuda.memory

# Start recording allocator events
torch.cuda.memory._record_memory_history()

# Run generation
chitu_generate()

# Export the snapshot (viewable at pytorch.org/memory_viz)
torch.cuda.memory._dump_snapshot("memory_profile.pickle")
```
Troubleshooting¶
Still Running Out of Memory¶
Solutions:
1. Increase low_mem_level
2. Reduce resolution or num_frames
3. Use SageAttention
4. Enable context parallelism
5. Close other GPU processes
Slow Generation¶
Solutions:
1. Lower low_mem_level if you have VRAM
2. Enable FlexCache
3. Use SageAttention
4. Check that CPU-GPU transfers are not the bottleneck
```python
import torch

# Profile to find the bottleneck
with torch.profiler.profile() as prof:
    chitu_generate()
print(prof.key_averages().table(sort_by="cuda_time_total"))
```
CPU Bottleneck (Level 3)¶
Symptoms: High CPU usage, GPU underutilized
Solutions:
1. Use a faster CPU
2. Enable pinned memory
3. Increase batch size (if supported)
4. Use an NVMe SSD for faster disk I/O
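Pinned (page-locked) host memory allows asynchronous host-to-device copies, which can hide part of the level-3 transfer overhead. A sketch in plain PyTorch (the helper name is ours, and it falls back to CPU so it runs anywhere):

```python
import torch

def to_gpu_pinned(t: torch.Tensor) -> torch.Tensor:
    """Copy a CPU tensor to the GPU via pinned memory when CUDA is available."""
    if not torch.cuda.is_available():
        return t  # CPU-only fallback so the sketch runs anywhere
    # Pinned staging memory enables asynchronous (non_blocking) copies
    return t.pin_memory().to("cuda", non_blocking=True)

params = torch.randn(1024)
on_device = to_gpu_pinned(params)
```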
Best Practices¶
1. Start Conservative¶
Begin with level 2 and adjust:
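Using the same key as in the Python API section above:

```python
# Start at level 2, then move down (more speed) or up (less VRAM) as needed
args.infer.diffusion.low_mem_level = 2
```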
2. Profile First¶
Measure actual VRAM usage before optimizing:
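For example, run one generation at your current level and record the peak with PyTorch's allocator stats (the helper name is ours; it returns 0.0 on CPU-only machines so the sketch runs anywhere):

```python
import torch

def peak_vram_gb() -> float:
    """Peak VRAM allocated by this process, in GB (0.0 without CUDA)."""
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.max_memory_allocated() / 1024**3

# Run one generation first, then inspect the peak
baseline = peak_vram_gb()
print(f"Peak VRAM: {baseline:.2f} GB")
```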
3. Balance Memory and Speed¶
Find the optimal level for your use case:

- Interactive: Level 1-2
- Batch processing: Level 2-3
- Production: Level 1 with monitoring
4. Combine Techniques¶
Use multiple optimizations together:

- Low memory + quantized attention
- Low memory + caching
- Low memory + parallelism
5. Monitor in Production¶
Track memory usage over time:
```python
import logging
import torch

# Log peak memory per request, then reset the counter
max_mem = torch.cuda.max_memory_allocated() / 1024**3
logging.info(f"Peak memory: {max_mem:.2f} GB")
torch.cuda.reset_peak_memory_stats()
```
Hardware Recommendations¶
Performance benchmarking in progress. Hardware recommendations will be provided once comprehensive testing is completed.
By VRAM Size¶
| VRAM | Recommended Level | Max Resolution |
|---|---|---|
| 16GB | Level 3 + Sage | To be tested |
| 24GB | Level 2 | To be tested |
| 40GB | Level 1 | To be tested |
| 80GB | Level 0 | To be tested |
By Use Case¶
| Use Case | Configuration |
|---|---|
| Development/Testing | Level 2, low resolution |
| Production (latency-sensitive) | Level 1, SageAttention |
| Production (throughput-focused) | Level 2, batch processing |
| Research/Experimentation | Level 0, full precision |
See Also¶
- FlexCache - Feature caching
- Attention Backends - Quantized attention
- Multi-GPU Setup - Context parallelism
- Performance Tuning