Architecture Overview
This document provides an overview of Smart-Diffusion's architecture and design principles.
System Architecture
Smart-Diffusion follows a modular, pipeline-based architecture optimized for high-performance diffusion inference:
graph TB
subgraph "User Layer"
A[User Request] --> B[DiffusionUserParams]
end
subgraph "Task Management"
B --> C[DiffusionTask]
C --> D[DiffusionTaskPool]
end
subgraph "Scheduling Layer"
D --> E[DiffusionScheduler]
E --> F{Task Ready?}
F -->|Yes| G[Select Task]
F -->|No| E
end
subgraph "Execution Layer"
G --> H[Generator]
H --> I[Text Encoding]
H --> J[Iterative Denoising]
H --> K[VAE Decoding]
end
subgraph "Backend Layer"
I --> L[T5 Encoder]
J --> M[DiT Model]
K --> N[VAE]
L --> O[DiffusionBackend]
M --> O
N --> O
end
subgraph "Acceleration Layer"
O --> P[Attention Backend]
O --> Q[FlexCache Manager]
O --> R[Distributed Groups]
P --> S[FlashAttn/Sage/Sparge]
end
K --> T[Output Video]
Core Components
1. Task Management System
Purpose: Manage user requests and track generation progress.
Key Classes:
- DiffusionUserParams: User-facing parameters for generation
- DiffusionTask: Internal task representation with buffers
- DiffusionTaskPool: Global task pool manager
Features:
- Task serialization for distributed execution
- Progress tracking
- Buffer management for intermediate states
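The relationship between the three classes can be sketched as follows. This is a minimal illustration, not the project's implementation: the field names (`num_inference_steps`, `guidance_scale`, `buffers`), the `TaskStatus` enum, and the pool methods are all assumptions chosen for readability.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List


class TaskStatus(Enum):  # hypothetical status values
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"


@dataclass
class DiffusionUserParams:
    # Field names are illustrative assumptions, not the real schema.
    prompt: str
    num_inference_steps: int = 50
    guidance_scale: float = 5.0


@dataclass
class DiffusionTask:
    task_id: str
    params: DiffusionUserParams
    status: TaskStatus = TaskStatus.PENDING
    current_step: int = 0
    buffers: Dict[str, Any] = field(default_factory=dict)  # intermediate states

    def progress(self) -> float:
        """Fraction of denoising steps completed so far."""
        return self.current_step / self.params.num_inference_steps


class DiffusionTaskPool:
    """Pool keyed by task ID; plain dicts preserve insertion order."""

    def __init__(self) -> None:
        self._tasks: Dict[str, DiffusionTask] = {}

    def add(self, task: DiffusionTask) -> None:
        self._tasks[task.task_id] = task

    def pending_ids(self) -> List[str]:
        return [tid for tid, t in self._tasks.items() if t.status is TaskStatus.PENDING]


pool = DiffusionTaskPool()
pool.add(DiffusionTask("task-0", DiffusionUserParams(prompt="a cat surfing")))
pool.add(DiffusionTask("task-1", DiffusionUserParams(prompt="a dog skiing")))
```

Keeping the user-facing parameters separate from the task object means the buffers and status can change during generation without touching what the user submitted.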
2. Scheduler
Purpose: Select and order tasks for execution.
Key Class: DiffusionScheduler
Strategy: FIFO (First-In-First-Out)
Responsibilities:
- Task selection
- Resource allocation
- Fairness enforcement
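A FIFO policy can be sketched with a double-ended queue. The class below is a hypothetical stand-in for `DiffusionScheduler`; its method names (`submit`, `select`) are assumptions and it omits resource allocation entirely.

```python
from collections import deque
from typing import Deque, Optional


class FifoScheduler:
    """Hypothetical stand-in for DiffusionScheduler: oldest task first."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()

    def submit(self, task_id: str) -> None:
        self._queue.append(task_id)

    def select(self) -> Optional[str]:
        """Return the oldest waiting task ID, or None if nothing is ready."""
        return self._queue.popleft() if self._queue else None


sched = FifoScheduler()
for tid in ("task-a", "task-b", "task-c"):
    sched.submit(tid)
order = [sched.select() for _ in range(4)]  # the fourth call finds the queue empty
```

Serving tasks strictly in arrival order is what gives FIFO its fairness property: no request can be overtaken by a later one.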
3. Generator
Purpose: Execute the generation pipeline.
Key Class: Generator
Pipeline Stages:
1. Text Encoding: Convert the prompt to embeddings using T5
2. Denoising: Iteratively denoise the latent through the DiT model
3. VAE Decoding: Convert the latent to pixel space
Features:
- Multi-stage pipeline management
- Distributed communication handling
- Memory-efficient processing
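The three stages compose as a simple sequential pipeline. The sketch below uses toy callables in place of the T5 encoder, the DiT model, and the VAE. The `generate` function and its parameters are assumptions for illustration, not the `Generator` API.

```python
from typing import Callable, List

Latent = List[float]


def generate(
    prompt: str,
    num_steps: int,
    encode_text: Callable[[str], List[float]],
    denoise_step: Callable[[Latent, List[float], int], Latent],
    decode_vae: Callable[[Latent], List[float]],
    initial_latent: Latent,
) -> List[float]:
    """Run the three stages in order: encode, iterate, decode."""
    embeddings = encode_text(prompt)               # stage 1: text encoding
    latent = initial_latent
    for step in range(num_steps):                  # stage 2: iterative denoising
        latent = denoise_step(latent, embeddings, step)
    return decode_vae(latent)                      # stage 3: VAE decoding


# Toy stand-ins so the sketch runs without any model weights.
frames = generate(
    prompt="hello",
    num_steps=3,
    encode_text=lambda p: [float(len(p))],
    denoise_step=lambda z, emb, t: [0.5 * v for v in z],
    decode_vae=lambda z: [2.0 * v for v in z],
    initial_latent=[8.0, 4.0],
)
```

Only the denoising stage runs more than once per request, which is why it dominates the cost of generation.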
4. Backend
Purpose: Manage model loading, parallelism, and resources.
Key Class: DiffusionBackend
Responsibilities:
- Model checkpoint loading
- Distributed group initialization
- Memory management
- Component coordination
Components Managed:
- Text encoder (T5)
- DiT models (single or multi-stage)
- VAE decoder
- FlexCache manager
- Attention backend
Design Patterns
1. Static Singleton Pattern
The DiffusionBackend uses static class attributes to maintain global state:
class DiffusionBackend:
    model_pool = []  # Shared across all instances
    scheduler = None
    generator = None
    # ...
Rationale: Simplifies distributed coordination and resource sharing.
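How class-level state behaves can be seen in a miniature example. The `Backend` class and its `build` method below are hypothetical and only mirror the attribute names shown above.

```python
class Backend:
    """Hypothetical miniature of the class-attribute pattern."""

    model_pool: list = []   # one object shared by the class and all instances
    scheduler = None

    @classmethod
    def build(cls, models, scheduler) -> None:
        # Assign on the class so every caller sees the same state.
        cls.model_pool = list(models)
        cls.scheduler = scheduler


Backend.build(models=["dit-stage-0", "dit-stage-1"], scheduler="fifo")
a, b = Backend(), Backend()
shared = a.model_pool is b.model_pool  # both names resolve to the class attribute
```

One caveat of this pattern in Python: assigning `a.scheduler = ...` on an instance creates an instance attribute that shadows the class attribute instead of updating the shared state, so writes should go through the class.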
2. Factory Pattern
Model creation uses factory methods:
@staticmethod
def _build_model_architecture(args, attn_backend, rope_impl):
    model_type = ModelType(args.type)
    model_cls = get_model_class(model_type)
    return model_cls(...)
Rationale: Flexible model selection and instantiation.
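One way such a factory could be backed is a registry that maps enum members to classes. Everything below is an assumption made for illustration: the enum values, the `register` decorator, and the `ToySmall` class are hypothetical, and the real `get_model_class` may work differently.

```python
from enum import Enum
from typing import Callable, Dict, Type


class ModelType(Enum):
    # Member values are placeholders, not the project's real model names.
    TOY_SMALL = "toy-small"
    TOY_LARGE = "toy-large"


_REGISTRY: Dict[ModelType, Type] = {}


def register(model_type: ModelType) -> Callable[[Type], Type]:
    """Class decorator that records a model class under its enum member."""
    def wrap(cls: Type) -> Type:
        _REGISTRY[model_type] = cls
        return cls
    return wrap


def get_model_class(model_type: ModelType) -> Type:
    try:
        return _REGISTRY[model_type]
    except KeyError:
        raise ValueError(f"no model registered for {model_type}") from None


@register(ModelType.TOY_SMALL)
class ToySmall:
    def __init__(self, attn_backend: str) -> None:
        self.attn_backend = attn_backend


def build_model(type_name: str, attn_backend: str):
    model_cls = get_model_class(ModelType(type_name))  # str -> enum -> class
    return model_cls(attn_backend=attn_backend)


model = build_model("toy-small", attn_backend="flash_attn")
```

With a registry, adding a model requires a new enum member and a decorated class, while the factory function itself stays unchanged.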
3. Strategy Pattern
Attention backends use the strategy pattern:
class DiffusionAttnBackend:
    def __init__(self, attn_type):
        if attn_type == "flash_attn":
            self.impl = FlashAttention()
        elif attn_type == "sage":
            self.impl = SageAttention()
        # ...
Rationale: Easy swapping of attention implementations.
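The same dispatch can also be written as a lookup table, which avoids a growing `if`/`elif` chain. This is a sketch under stated assumptions: the `AttnBackend` class and the toy implementations are hypothetical and do not compute real attention. They only tag their output so the selected strategy is visible.

```python
from typing import Callable, Dict, List, Tuple

AttnFn = Callable[[List[float]], Tuple[str, List[float]]]


def _flash_impl(x: List[float]) -> Tuple[str, List[float]]:
    return ("flash_attn", list(x))  # placeholder: a real backend computes attention


def _sage_impl(x: List[float]) -> Tuple[str, List[float]]:
    return ("sage", list(x))        # placeholder with the same call signature


# Keys mirror the attn_type strings above; the functions are toys.
_ATTN_IMPLS: Dict[str, AttnFn] = {
    "flash_attn": _flash_impl,
    "sage": _sage_impl,
}


class AttnBackend:
    def __init__(self, attn_type: str) -> None:
        if attn_type not in _ATTN_IMPLS:
            raise ValueError(f"unknown attn_type: {attn_type!r}")
        self.impl = _ATTN_IMPLS[attn_type]

    def __call__(self, x: List[float]) -> Tuple[str, List[float]]:
        return self.impl(x)  # callers never branch on the backend type


out_flash = AttnBackend("flash_attn")([1.0, 2.0])
out_sage = AttnBackend("sage")([1.0, 2.0])
```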
Data Flow
Generation Request Flow
sequenceDiagram
participant User
participant Task as DiffusionTask
participant Pool as TaskPool
participant Sched as Scheduler
participant Gen as Generator
participant Backend as DiffusionBackend
User->>Task: Create with params
Task->>Pool: Add to pool
loop Until all finished
Sched->>Pool: Get pending tasks
Pool-->>Sched: Return task IDs
Sched->>Gen: Pass selected task
Gen->>Backend: Request text encoding
Backend-->>Gen: Return embeddings
loop Denoising steps
Gen->>Backend: Request DiT forward
Backend-->>Gen: Return denoised latent
end
Gen->>Backend: Request VAE decoding
Backend-->>Gen: Return video frames
Gen->>Task: Update buffer & status
end
Task-->>User: Save video to disk
Memory Management
Memory Hierarchy
Smart-Diffusion manages memory across multiple levels:
┌─────────────────────────────────────┐
│ GPU VRAM (Fastest) │
│ - Active DiT model │
│ - Activations │
│ - KV cache (if enabled) │
└─────────────────────────────────────┘
↕ (Offload)
┌─────────────────────────────────────┐
│ CPU RAM (Fast) │
│ - Text encoder (low_mem_level≥2) │
│ - Inactive DiT models (≥3) │
│ - VAE (optional) │
└─────────────────────────────────────┘
↕ (Swap)
┌─────────────────────────────────────┐
│ Disk (Slow) │
│ - Model checkpoints │
│ - Output videos │
└─────────────────────────────────────┘
Memory Optimization Strategies
- Model Offloading: Move unused models to CPU
- VAE Tiling: Process video in tiles to reduce peak memory
- Gradient Checkpointing: Recompute activations during backward pass
- Mixed Precision: Use FP16/BF16 where possible
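To make the tiling idea concrete, the sketch below computes overlapping tile boundaries along one dimension. It illustrates the general technique only. The function name, the tile size, and the overlap are assumptions, and how Smart-Diffusion actually partitions and blends tiles is not specified here.

```python
from typing import List, Tuple


def tile_ranges(length: int, tile: int, overlap: int) -> List[Tuple[int, int]]:
    """Cover [0, length) with windows of size `tile` overlapping by at least `overlap`."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    if length <= tile:
        return [(0, length)]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # final tile is flush with the right edge
    return [(s, s + tile) for s in starts]


ranges = tile_ranges(length=100, tile=32, overlap=8)
```

Peak memory then depends on the tile size rather than on the full frame size, at the cost of recomputing the overlapping regions.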
Parallelism Strategy
Smart-Diffusion supports multiple parallelism dimensions:
1. Context Parallelism (CP)
Split the sequence dimension across GPUs.
Benefits:
- Handle longer sequences
- Linear memory scaling
Communication: All-to-all for attention
2. CFG Parallelism
Run the positive and negative prompt branches on separate GPUs.
Benefits:
- Up to 2x speedup for CFG
- No extra memory overhead
Communication: All-gather for combining predictions
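The combination step after the all-gather is the standard classifier-free guidance formula, `uncond + scale * (cond - uncond)`. The sketch below simulates two ranks in a single process. `fake_dit` is a toy stand-in for the model, and the list `gathered` stands in for the result of the all-gather.

```python
from typing import List


def combine_cfg(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def fake_dit(latent: List[float], is_positive: bool) -> List[float]:
    # Toy model: the positive branch shifts the latent, the negative one does not.
    return [v + 1.0 for v in latent] if is_positive else list(latent)


latent = [0.0, 2.0]
# Each "rank" runs one branch; in a real setup these run on separate GPUs.
per_rank = [fake_dit(latent, is_positive=(rank == 0)) for rank in range(2)]
gathered = per_rank  # stand-in for the all-gather that gives every rank both outputs
guided = combine_cfg(cond=gathered[0], uncond=gathered[1], scale=4.0)
```

Because each rank runs exactly one of the two forward passes, the per-step model cost is halved, while the gather adds one collective per step.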
3. Data Parallelism (Future)
Process multiple requests in parallel.
Benefits:
- Higher throughput
- Better resource utilization
Attention Mechanisms
Backend Selection
Smart-Diffusion supports multiple attention implementations:
| Backend | Precision | Speed | Memory |
|---|---|---|---|
| FlashAttention | FP16/BF16 | Baseline | Baseline |
| SageAttention | INT8 | To be tested | To be tested |
| SpargeAttention | INT8 + Sparse | To be tested | To be tested |
Attention Flow with CP
graph LR
A[Input Sequence] --> B[Split by CP]
B --> C[Local Attention]
C --> D[All-to-All Comm]
D --> E[Merge Results]
E --> F[Output Sequence]
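The data movement behind the all-to-all step can be simulated in one process. The sketch below shows one common layout exchange, in which sequence-sharded data becomes head-sharded data. Whether Smart-Diffusion orders its steps this way is an assumption, and the helpers `split_sequence` and `all_to_all` are hypothetical.

```python
from typing import Any, List


def split_sequence(tokens: List[int], world: int) -> List[List[int]]:
    """Contiguous, equal-sized sequence shards, one per rank."""
    if len(tokens) % world:
        raise ValueError("sequence length must be divisible by the CP size")
    n = len(tokens) // world
    return [tokens[r * n:(r + 1) * n] for r in range(world)]


def all_to_all(send: List[List[Any]]) -> List[List[Any]]:
    """Simulated all-to-all: rank `dst` receives send[src][dst] from every src."""
    world = len(send)
    return [[send[src][dst] for src in range(world)] for dst in range(world)]


world = 2                                        # CP size; one head per rank here
shards = split_sequence(list(range(4)), world)   # [[0, 1], [2, 3]]
# send[src][dst]: the (token, head) pairs rank `src` ships to rank `dst`.
send = [[[(t, dst) for t in shards[src]] for dst in range(world)]
        for src in range(world)]
recv = all_to_all(send)
# Flatten: each rank now holds the full sequence for a single head.
per_rank = [[pair for chunk in recv[r] for pair in chunk] for r in range(world)]
restored = all_to_all(recv)                      # a second exchange undoes the first
```

The exchange is a transpose over (source rank, destination rank), which is why applying it twice restores the original sequence-sharded layout.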
FlexCache System
FlexCache enables feature reuse across denoising steps.
Architecture
FlexCacheManager
├── Strategy (TeaCache / PAB)
├── Cache Buffer (GPU/CPU)
└── Indexer (which layers to cache)
Cache Decision Flow
graph TD
A[Denoising Step] --> B{Check Strategy}
B -->|TeaCache| C{Temporal Distance}
B -->|PAB| D{Pyramid Level}
C -->|Close| E[Reuse Cache]
C -->|Far| F[Recompute]
D -->|High| E
D -->|Low| F
E --> G[Update Cache]
F --> G
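A distance-based reuse decision can be sketched as an accumulated-change threshold. This is a deliberately simplified illustration loosely modeled on the "reuse when close, recompute when far" branch above. It is not the TeaCache algorithm as implemented in FlexCache, and the class name, the relative L1 metric, and the threshold value are all assumptions.

```python
from typing import List, Optional


class ThresholdCache:
    """Reuse the cached output while accumulated input drift stays small."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.accumulated = 0.0
        self.prev_input: Optional[List[float]] = None
        self.cached_output: Optional[List[float]] = None

    def should_reuse(self, x: List[float]) -> bool:
        if self.prev_input is None or self.cached_output is None:
            self.prev_input = list(x)
            return False  # nothing cached yet
        num = sum(abs(a - b) for a, b in zip(x, self.prev_input))
        den = sum(abs(b) for b in self.prev_input) or 1.0
        self.accumulated += num / den   # relative L1 change since the last step
        self.prev_input = list(x)
        if self.accumulated < self.threshold:
            return True
        self.accumulated = 0.0          # drift too large: recompute and reset
        return False

    def update(self, output: List[float]) -> None:
        self.cached_output = list(output)


cache = ThresholdCache(threshold=0.1)
decisions = []
for x in ([1.0], [1.01], [1.02], [2.0]):
    reuse = cache.should_reuse(x)
    decisions.append(reuse)
    if not reuse:
        cache.update([10.0 * v for v in x])  # toy "expensive" computation
```

Accumulating the change, rather than comparing only adjacent steps, prevents many small drifts from silently adding up to a stale cache.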
Configuration Taxonomy
Smart-Diffusion uses a three-level configuration system:
1. Model Parameters (Static)
Location: chitu_core/config/models/<model>.yaml
Content: Architecture-specific parameters (layers, heads, hidden size)
Cannot be changed after checkpoint creation.
2. User Parameters (Dynamic)
Location: DiffusionUserParams
Content: Per-request parameters (prompt, steps, CFG scale)
Can be changed for each generation request.
3. System Parameters (Semi-static)
Location: Launch arguments
Content: Parallelism, operators, memory mode
Cannot be changed after initialization (requires restart).
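One way to make the three mutability levels explicit in code is to freeze the static and semi-static levels and leave only the per-request level mutable. The classes and field names below are hypothetical and chosen to echo the descriptions above. They are not the project's configuration schema.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class ModelParams:        # static: fixed by the checkpoint
    num_layers: int
    num_heads: int
    hidden_size: int


@dataclass(frozen=True)
class SystemParams:       # semi-static: fixed at launch
    cp_size: int = 1
    attn_type: str = "flash_attn"
    low_mem_level: int = 0


@dataclass
class UserParams:         # dynamic: one instance per request
    prompt: str
    num_inference_steps: int = 50
    guidance_scale: float = 5.0


system = SystemParams(cp_size=2)
request = UserParams(prompt="a red kite")
request.guidance_scale = 7.0          # allowed: per-request value

try:
    system.cp_size = 4                # rejected: would require a restart
    changed = True
except FrozenInstanceError:
    changed = False
```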
Extension Points
Smart-Diffusion is designed for extensibility:
Adding New Models
- Create a model class in chitu_core/models/
- Register it in the ModelType enum
- Add a configuration in config/models/
Adding New Attention Backends
- Implement the attention interface
- Register it in DiffusionAttnBackend
- Add the type to the configuration
Adding New Cache Strategies
- Implement the strategy class
- Register it in FlexCacheManager
- Add a user parameter option
Performance Characteristics
Bottleneck Analysis
For typical text-to-video workloads:
Attention: ~50-80% of total time
Linear layers: ~10-20%
VAE decoding: ~5-10%
Communication: ~5-10%
Others: <5%
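These fractions bound how much any single optimization can help. Under Amdahl's law, speeding up a component that takes fraction p of the time by a factor s gives an overall speedup of 1 / ((1 - p) + p / s). The sketch below applies this to the attention share quoted above. The 2x attention speedup is an arbitrary example value, not a measured result.

```python
def overall_speedup(fraction: float, component_speedup: float) -> float:
    """Amdahl's law: whole-pipeline speedup when one component gets faster."""
    if not 0.0 <= fraction <= 1.0 or component_speedup <= 0.0:
        raise ValueError("fraction must be in [0, 1] and component_speedup > 0")
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


low = overall_speedup(fraction=0.5, component_speedup=2.0)   # attention at 50% of time
high = overall_speedup(fraction=0.8, component_speedup=2.0)  # attention at 80% of time
```

With these example inputs, a 2x faster attention kernel yields roughly 1.33x end to end when attention is 50% of the time and roughly 1.67x when it is 80%.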
Scaling Behavior
Context Parallelism:
- Near-linear speedup up to 8 GPUs
- Communication overhead increases beyond 8 GPUs
CFG Parallelism:
- Up to 2x speedup with 2 GPUs
- No benefit beyond 2 GPUs
Memory Scaling:
- O(n) with sequence length (n = num_frames)
- O(1) with batch size (single request at a time)