Performance Tuning

Optimize Smart-Diffusion for maximum performance.

Quick Wins

1. Use SageAttention

infer.attn_type=sage

Speedup: Performance testing in progress
Quality loss: Minimal

2. Enable FlexCache

user_params = DiffusionUserParams(
    prompt="...",
    flexcache='teacache'
)

Speedup: Performance testing in progress
Quality loss: Minimal

3. Reduce Inference Steps

num_inference_steps=30  # Instead of 50

Speedup: Performance testing in progress
Quality loss: Slight
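Until measured numbers are available, a back-of-the-envelope bound can help set expectations. The sketch below assumes runtime is exactly proportional to the number of denoising steps and ignores fixed costs such as text encoding and VAE decoding, so the real gain will be somewhat smaller. The function name `ideal_step_speedup` is illustrative and not part of Smart-Diffusion.

```python
def ideal_step_speedup(baseline_steps: int, reduced_steps: int) -> float:
    """Upper-bound speedup if runtime were exactly proportional to step count."""
    if baseline_steps <= 0 or reduced_steps <= 0:
        raise ValueError("step counts must be positive")
    return baseline_steps / reduced_steps


# Going from 50 to 30 steps: 50 / 30, so about 1.67x at best.
print(f"{ideal_step_speedup(50, 30):.2f}x")
```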

GPU Utilization

Check Utilization

nvidia-smi dmon -s u

Target: Not yet established (GPU utilization benchmarks are in progress)

If utilization is low:

1. Increase batch size (future feature)
2. Use context parallelism
3. Check for CPU bottlenecks

Memory Optimization

Reduce Memory Usage

Strategy 1: Low memory mode

infer.diffusion.low_mem_level=2

Strategy 2: Lower resolution

height=480, width=848, num_frames=61

Strategy 3: SageAttention

infer.attn_type=sage
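To get a feel for how much Strategy 2 shrinks the workload, the sketch below compares the number of pixel-frames (height x width x frames) between a 720x1280, 121-frame request and the reduced settings shown above. This is only a proxy: it assumes activation memory roughly tracks pixel-frame count, while model weights do not shrink at all and attention cost can grow faster than linearly. The function name is illustrative.

```python
def pixel_frames(height: int, width: int, num_frames: int) -> int:
    """Total number of pixels across all frames of a video."""
    return height * width * num_frames


large = pixel_frames(720, 1280, 121)
small = pixel_frames(480, 848, 61)

# Ratio of workload sizes; roughly 4.5x fewer pixel-frames at the lower setting.
print(f"{large / small:.2f}x fewer pixel-frames")
```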

Benchmarking

Measure Performance

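import time

# Assumes DiffusionTaskPool and chitu_generate are already imported and that
# at least one task has been added to the pool.
start = time.perf_counter()  # monotonic clock, suited to measuring intervals
while not DiffusionTaskPool.all_finished():
    chitu_generate()
elapsed = time.perf_counter() - start

print(f"Generation took {elapsed:.2f} seconds")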
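A single timed run often includes one-off costs such as kernel compilation or cache warm-up. The sketch below is one way to separate those out: it runs a callable a few times untimed, then reports mean, minimum, and maximum over repeated timed runs. The `benchmark` helper and its parameters are hypothetical and not part of Smart-Diffusion; `run_once` stands in for whatever submits a task and drives the generation loop above.

```python
import statistics
import time
from typing import Callable


def benchmark(run_once: Callable[[], None], warmup: int = 1, repeats: int = 3) -> dict[str, float]:
    """Time `run_once` over `repeats` runs after `warmup` untimed runs."""
    for _ in range(warmup):
        run_once()  # discard warm-up runs (compilation, cache fills)

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)

    return {
        "mean": statistics.mean(timings),
        "min": min(timings),
        "max": max(timings),
    }


# Stand-in workload so the example runs anywhere; replace with a real generation call.
stats = benchmark(lambda: time.sleep(0.01), warmup=1, repeats=3)
print(f"mean {stats['mean']:.3f}s, min {stats['min']:.3f}s, max {stats['max']:.3f}s")
```

Reporting the minimum alongside the mean makes it easier to spot runs that were disturbed by other processes on the machine.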

Expected Performance

Performance benchmarking is in progress. Results will be published once comprehensive testing is completed across different hardware configurations.

Model	Resolution	Frames	Steps	A100 (40GB)	H100 (80GB)
1.3B	480x848	81	50	To be tested	To be tested
14B	480x848	81	50	To be tested	To be tested
14B	720x1280	121	50	To be tested	To be tested

Speedups from the optimizations described on this page will be added once they have been benchmarked.

Multi-GPU Scaling

Context Parallelism Efficiency

Multi-GPU scaling benchmarks in progress.

GPUs	Speedup	Efficiency
1	1.0x	100%
2	To be tested	To be tested
4	To be tested	To be tested
8	To be tested	To be tested
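The Speedup and Efficiency columns follow the usual definitions: speedup is single-GPU time divided by multi-GPU time, and efficiency is speedup divided by the number of GPUs (which is why one GPU is 1.0x and 100% by definition). The sketch below shows the arithmetic for filling in the table from your own measurements. The function name is illustrative, and the timings in the example are placeholders, not measured results.

```python
def scaling_metrics(single_gpu_seconds: float, multi_gpu_seconds: float, num_gpus: int) -> tuple[float, float]:
    """Return (speedup, efficiency) for a run on `num_gpus` GPUs."""
    if multi_gpu_seconds <= 0 or num_gpus <= 0:
        raise ValueError("time and GPU count must be positive")
    speedup = single_gpu_seconds / multi_gpu_seconds
    efficiency = speedup / num_gpus
    return speedup, efficiency


# Placeholder timings purely to show the arithmetic, NOT benchmark data.
speedup, efficiency = scaling_metrics(100.0, 60.0, num_gpus=2)
print(f"{speedup:.2f}x, {efficiency:.0%}")  # prints "1.67x, 83%"
```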

CFG Parallelism

CFG parallelism performance testing in progress.

GPUs	Speedup
2	To be tested

Profiling

Enable Debug Mode

export CHITU_DEBUG=1
python test_generate.py

With debug mode enabled, the run prints detailed timing information.