Chapter 11: Choosing a Strategy

You now know seven distributed training strategies. This chapter helps you pick the right one — or the right combination — for your workload.

Decision Flowchart

                     Does your model fit on 1 GPU?
                     ┌───────────┴───────────┐
                    YES                      NO
                     │                        │
               ┌─────┴──────┐          What doesn't fit?
               │  Use DDP   │          ┌──────┬──────────┐
               │  (Ch 4)    │        Params  Activations  Input
               │            │          │         │          │
               │  Fastest,  │          │         │     Is it spatial
               │  simplest  │          │         │     (grid/image)?
               └────────────┘          │         │     ┌────┴────┐
                                       │         │    YES       NO
                                       │         │     │         │
                                  Is it deep    Use   Domain    Use SP
                                  (many layers)? FSDP  Parallel  (Ch 8)
                                  ┌────┴────┐  (Ch 5)  (Ch 10)
                                 YES       NO
                                  │         │
                               Use PP    Use TP
                               (Ch 7)   (Ch 6)
                                  │         │
                            ┌─────┴─────────┴──────┐
                            │  Still doesn't fit?   │
                            │  Combine strategies:  │
                            │                       │
                            │  TP + FSDP (Ch 9)     │
                            │  TP + PP + FSDP       │
                            │  TP + SP (Ch 8)       │
                            └───────────────────────┘
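
The flowchart above can also be read as a small decision function. The sketch below transcribes its branches as drawn. The function name, argument names, and returned labels are illustrative choices, not part of any library.

```python
def choose_strategy(fits_on_one_gpu, bottleneck=None, deep=False, spatial=False):
    """Return the flowchart's first recommendation as a short label.

    bottleneck: "params", "activations", or "input". It is only consulted
    when the model does not fit on one GPU.
    """
    if fits_on_one_gpu:
        return "DDP (Ch 4)"
    if bottleneck == "params":
        # Deep models (many layers) go to PP, wide ones to TP.
        return "PP (Ch 7)" if deep else "TP (Ch 6)"
    if bottleneck == "activations":
        # As drawn in the chart. FSDP does not shard activations (see
        # Common Mistakes), so pair it with SP or Domain Parallel when
        # activations dominate.
        return "FSDP (Ch 5)"
    if bottleneck == "input":
        return "Domain Parallel (Ch 10)" if spatial else "SP (Ch 8)"
    raise ValueError(f"unknown bottleneck: {bottleneck!r}")


print(choose_strategy(True))
print(choose_strategy(False, bottleneck="params", deep=True))
print(choose_strategy(False, bottleneck="input", spatial=True))
```

If the first recommendation still does not fit, the chart's last box applies: combine strategies (TP + FSDP, TP + PP + FSDP, or TP + SP).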

Quick Reference: "I Want To..."

Goal                                   Strategy          Chapter
Train faster on more GPUs              DDP               4
Fit a model that's 1-3× too large      FSDP              5
Fit a model with huge linear layers    TP                6
Fit a model with 100+ layers           PP                7
Handle very long sequences             SP                8
Train a 7B+ parameter LLM              TP + FSDP         9
Process high-resolution spatial data   Domain Parallel   10

Concrete Scenarios

Scenario 1: Fine-tuning a vision model

Model: ResNet-50 (25M params, ~400 MB with optimizer)
Data: ImageNet (1.2M images)
Goal: Train faster

Recommendation: DDP on 4-16 GPUs. The model easily fits on one GPU, and DDP scales close to linearly here because the gradient all-reduce overlaps with the backward pass. Start with 1 node (4 GPUs) and scale to 4 nodes if you need more speed.

Scenario 2: Pre-training a 7B LLM from scratch

Model: LLaMA-7B architecture (7B params, ~112 GB with Adam FP32)
Data: Large text corpus
Goal: Train the model

Recommendation: TP=4 + FSDP=N, where N is the number of nodes. Use TP within each node (4 GPUs) to split the large attention and FFN layers. Use FSDP across nodes to shard the parameters, gradients, and optimizer state. With BF16 mixed precision and 4 nodes (16 GPUs), this fits comfortably.
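
The memory figures in these scenarios can be checked with back-of-the-envelope arithmetic. The sketch below assumes 16 bytes per parameter (4 for FP32 weights, 4 for gradients, 8 for the two Adam moments). That breakdown is an assumption, chosen because it reproduces the ~112 GB and ~400 MB figures quoted in this chapter. It ignores activations, buffers, and fragmentation, and it treats sharding as perfectly even.

```python
# Assumed breakdown: 4 B weights + 4 B gradients + 8 B Adam moments.
BYTES_PER_PARAM_ADAM_FP32 = 16


def training_state_gb(n_params, bytes_per_param=BYTES_PER_PARAM_ADAM_FP32):
    """Weights + gradients + optimizer state in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def per_gpu_state_gb(n_params, n_gpus, bytes_per_param=BYTES_PER_PARAM_ADAM_FP32):
    """Per-GPU share if that state is sharded evenly across all GPUs."""
    return training_state_gb(n_params, bytes_per_param) / n_gpus


total = training_state_gb(7e9)
per_gpu = per_gpu_state_gb(7e9, n_gpus=16)
print(f"7B model state: {total:.0f} GB total, {per_gpu:.0f} GB per GPU on 16 GPUs")
print(f"ResNet-50 state: {training_state_gb(25e6) * 1000:.0f} MB")
```

Under these assumptions the sharded state takes a small part of a 40 GB A100, and the remainder is what is available for activations. Mixed-precision setups keep extra copies of the weights, so treat the result as a rough estimate.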

Scenario 3: Global weather prediction at 0.25 degrees

Model: U-Net variant (50M params, fits on 1 GPU)
Data: 1440×720 grid, 100+ channels
Goal: Fit the forward pass in memory

Recommendation: Domain Parallel (4 GPUs). Split the grid into 4 tiles and use halo exchange for convolution boundaries. The model is small but the input is huge, which is exactly the case domain parallelism addresses. Add FSDP if the model grows.
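
To see why the halo matters, here is a minimal single-process sketch of the idea in one dimension. It is a toy analogue, not the Chapter 10 implementation: the "tiles" are slices of one Python list rather than tensors on separate GPUs, the 3-point average stands in for a convolution, and the names `stencil` and `tiled_stencil` are hypothetical.

```python
def stencil(values):
    """3-point average on interior points; the two endpoints are copied."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = (values[i - 1] + values[i] + values[i + 1]) / 3
    return out


def tiled_stencil(values, n_tiles, halo=1):
    """Apply `stencil` tile by tile, padding each tile with neighbor cells."""
    n = len(values)
    assert n % n_tiles == 0, "sketch assumes the grid divides evenly"
    size = n // n_tiles
    result = []
    for t in range(n_tiles):
        start, stop = t * size, (t + 1) * size
        lo = max(start - halo, 0)  # cells "received" from the left neighbor
        hi = min(stop + halo, n)   # cells "received" from the right neighbor
        local = stencil(values[lo:hi])
        # Drop the halo cells and keep only the cells this tile owns.
        result.extend(local[start - lo:start - lo + size])
    return result


grid = [float(i * i % 7) for i in range(16)]
# With a halo of 1, the tiled result matches the single-device result.
print(tiled_stencil(grid, n_tiles=4) == stencil(grid))
```

Without the halo cells, the first and last point of each tile would be computed from incomplete neighborhoods and the tiled result would differ at every tile boundary.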

Scenario 4: Long-context document understanding

Model: Transformer (1B params)
Data: 128K token documents
Goal: Handle long sequences without running out of memory

Recommendation: SP + TP. Use Megatron-SP or Ulysses to split the sequence dimension, and combine it with TP to split the attention heads. With 4 GPUs, each GPU handles 32K tokens instead of 128K.
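
The per-GPU token count follows directly from an even split of the sequence dimension. The helper below is a hypothetical illustration of that bookkeeping. It assumes "128K" means 128 × 1024 tokens and that the sequence length divides evenly by the number of GPUs.

```python
def sequence_shard(seq_len, world_size, rank):
    """Return the [start, stop) token range owned by `rank`."""
    if seq_len % world_size != 0:
        raise ValueError("sketch assumes an even split; pad the sequence otherwise")
    chunk = seq_len // world_size
    return rank * chunk, (rank + 1) * chunk


SEQ_LEN = 128 * 1024  # "128K" read as 131,072 tokens (assumption)

for rank in range(4):
    start, stop = sequence_shard(SEQ_LEN, world_size=4, rank=rank)
    print(f"rank {rank}: tokens [{start}, {stop}), {stop - start} per GPU")
```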

Scenario 5: Very deep diffusion model

Model: 200-layer diffusion model (each layer small, total ~5B params)
Data: High-resolution images
Goal: Train the deep model

Recommendation: PP + FSDP. Pipeline parallelism splits the 200 layers across stages (e.g., 4 stages of 50 layers, one stage per GPU). FSDP shards the parameters and optimizer state across additional GPUs. Use 16+ micro-batches (at least 4× the stage count) to keep pipeline bubbles small.

Derecho Configuration Reference

NCAR Derecho: 82 nodes × 4 A100 (40 GB HBM2), NVLink (600 GB/s), Slingshot 11

 GPUs    Nodes    Strategy              Config
─────────────────────────────────────────────────────
  4       1       DDP                   Start here for small models
  4       1       FSDP                  Model 40-160 GB
  4       1       TP (degree=4)         Individual layers > 40 GB
  8       2       DDP                   More data throughput
  8       2       TP=4 + FSDP=2         Model 100-300 GB
 16       4       TP=4 + FSDP=4         7B-13B models
 32       8       TP=4 + FSDP=8         13B-30B models
 80      20       TP=4 + FSDP=20        70B models
328      82       TP=4 + FSDP=82        Full cluster
─────────────────────────────────────────────────────

Key constraint: TP degree ≤ 4 (one node, 4 GPUs with NVLink).
Always keep TP within a single node.
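
One way to check the "TP within a single node" rule is to write out which global ranks end up in each TP group. The sketch below is illustrative only: it assumes ranks are numbered node by node (ranks 0-3 on node 0, 4-7 on node 1, and so on), which you should verify against your launcher, and the helper names are hypothetical.

```python
GPUS_PER_NODE = 4  # Derecho GPU nodes, per the reference above


def mesh_coords(rank, tp_degree=GPUS_PER_NODE):
    """Map a global rank to (fsdp_index, tp_index)."""
    if tp_degree < 1 or GPUS_PER_NODE % tp_degree != 0:
        raise ValueError("TP degree must divide the GPUs on one node")
    return rank // tp_degree, rank % tp_degree


def tp_group(rank, tp_degree=GPUS_PER_NODE):
    """Global ranks that share a TP group with `rank`."""
    first = (rank // tp_degree) * tp_degree
    return list(range(first, first + tp_degree))


def node_of(rank):
    """Node index of a rank under node-by-node numbering (assumed)."""
    return rank // GPUS_PER_NODE


# 16 GPUs on 4 nodes: TP=4 + FSDP=4, as in the table.
for rank in (0, 5, 15):
    print(rank, mesh_coords(rank), tp_group(rank), "node", node_of(rank))
```

With this layout every TP group sits on one node, so TP traffic stays on NVLink, and ranks with the same `tp_index` on different nodes form the FSDP groups.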

Common Mistakes

Over-engineering for small models

If your model fits on one GPU, use DDP. Don't add FSDP "just in case": it adds communication overhead to save memory you don't need.

TP across nodes

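On Derecho, each node has 4 GPUs connected via NVLink (600 GB/s). TP communicates inside every layer on both the forward and backward pass, so running it over the slower inter-node Slingshot network stalls each layer on communication. Keep TP=4 (one node) to stay on NVLink, and use FSDP for cross-node sharding over Slingshot.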

Too few micro-batches with PP

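Pipeline bubble overhead = (stages - 1) / micro-batches, measured relative to useful compute time. With 4 stages and 4 micro-batches, the bubble adds 75% on top of useful compute, so about 43% of each step (3 of 7 time slots) is idle. Use at least 4× as many micro-batches as stages: 16 micro-batches with 4 stages cuts the overhead to about 19%.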
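
The bubble arithmetic is easy to tabulate. The sketch below assumes a simple schedule in which every stage takes the same time per micro-batch. It reports both the overhead relative to useful compute, (stages - 1) / micro-batches, and the idle share of the whole step, (stages - 1) / (micro-batches + stages - 1). The function names are illustrative.

```python
def bubble_overhead(stages, micro_batches):
    """Idle time relative to useful compute time."""
    return (stages - 1) / micro_batches


def bubble_fraction(stages, micro_batches):
    """Idle share of the whole step, bubble included."""
    return (stages - 1) / (micro_batches + stages - 1)


STAGES = 4
for m in (4, 16, 64):
    print(
        f"micro-batches={m:3d}  "
        f"overhead={bubble_overhead(STAGES, m):.1%}  "
        f"idle share={bubble_fraction(STAGES, m):.1%}"
    )
```

The two numbers converge as the micro-batch count grows, which is why the "at least 4× stages" rule of thumb works under either definition.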

Ignoring activation memory

FSDP shards parameters and optimizer state but not activations. For very long sequences or high-resolution spatial data, you still need SP or domain parallelism for activation memory.

What's Next?

You know what to use. Chapter 12 covers the how — PBS job scripts, NCCL tuning, and debugging on Derecho.

Next: Chapter 12 — HPC Operations

See also: Strategy Decision Guide — the quick-reference version of this chapter.