Parallelism Strategy Decision Guide¶
Use this guide to choose the right distributed training strategy for your workload on Derecho.
Decision Flowchart¶
Single GPU Training
│
┌─────────────────────┴─────────────────────┐
▼ ▼
Wall 1: Fits in memory, Doesn't fit in memory
but too slow? │
│ │
▼ ▼
Data Parallelism (DDP) What is too large?
(Replicate Model, │
Split Batch) ┌─────────────────────┴─────────────────────┐
│ ▼ ▼
│ Wall 3: Model Weights Wall 2: Spatial/Input
│ & Optimizer State Dimensions (Activations)
│ │ │
│ ┌───────────┼───────────┐ ┌───────────┴───────────┐
│ ▼ ▼ ▼ ▼ ▼
│ FSDP TP PP Domain Parallel Sequence Parallel
│ (Shard All) (Split Wts) (Split Depth) (Halo Exchange) (Ulysses/Ring)
│ │ │ │ │ │
└─────────┴───────────┴───────────┴───────────────────┴───────────────────────┘
│
▼
Hybrid Parallelism
(e.g., FSDP + TP + Sequence Parallel simultaneously)
Strategy Comparison Table¶
| Strategy | What's Split | Communication | Memory Savings | Best For |
|---|---|---|---|---|
| DDP | Data (batches) | Gradient all-reduce | None (full replica) | Most workloads |
| FSDP | Params + grads + optimizer | All-gather + reduce-scatter | High (params) | Large models |
| TP | Weight matrices | All-reduce on activations | Medium (weights) | Wide layers |
| PP | Model layers | Send/recv between stages | High (layers) | Very deep models |
| SP | Sequence dimension | All-gather + reduce-scatter | Medium (activations) | Long sequences |
| Domain | Spatial data | Halo exchange (P2P) | High (spatial) | High-res spatial data |
Quick Reference by Use Case¶
"I want to train faster on more GPUs"¶
→ DDP (01_data_parallel_ddp/)
Your model fits on 1 GPU, you just want to process more data per step. Each GPU gets a different batch, gradients are averaged. Linear speedup.
"My model doesn't fit on 1 GPU"¶
→ FSDP (02_fully_sharded_fsdp/)
FSDP shards parameters, gradients, and optimizer states across GPUs. A 10B parameter model that needs 80 GB can train on 4× 40 GB A100s.
"I have very large linear layers (e.g., LLM)"¶
→ TP (03_tensor_parallel_tp/)
Split large weight matrices across GPUs. Good for transformers where individual layers (attention, FFN) are too large.
"My model has 100+ layers"¶
→ PP (04_pipeline_parallel_pp/)
Assign different layers to different GPUs. Use micro-batching to keep GPUs busy. Often combined with TP for very large models.
"Long sequences are blowing up my memory"¶
→ SP (05_sequence_parallel_sp/)
Split the sequence dimension for LayerNorm/Dropout. Saves activation memory. Usually combined with TP (SP is an extension of TP).
"My training already uses TP + FSDP"¶
→ Hybrid (06_hybrid_parallelism/)
TP within nodes (fast PCIe), FSDP across nodes. The standard approach for training large language models at scale.
"My input images/grids are too large for 1 GPU --> Most SciML workflows"¶
→ Domain Parallel (07_domain_parallel_shardtensor/)
Split spatial dimensions across GPUs. Critical for weather/climate models and CFD workflows with high-resolution grids.
Derecho-Specific Recommendations¶
Derecho: 82 nodes × 4 A100 (40 GB) — PCIe (no NVLink)
Recommended configurations:
─────────────────────────────────────────────────────
GPUs Nodes Strategy Notes
─────────────────────────────────────────────────────
4 1 DDP Start here
4 1 FSDP If model > 40 GB
4 1 TP (tp=4) If layers > 40 GB
8 2 DDP Scale data parallel
8 2 TP=4 + FSDP=2 TP within node, FSDP across
16 4 TP=4 + FSDP=4 Standard LLM training
32 8 TP=4 + FSDP=8 Large-scale training
─────────────────────────────────────────────────────