Troubleshooting Distributed PyTorch on Derecho¶
Common errors and their solutions.
NCCL Errors¶
NCCL Timeout / Hang at Startup¶
Error:
or the job hangs indefinitely atdist.init_process_group().
Cause: NCCL cannot establish communication between GPUs. Usually a network interface issue on Derecho.
Solution:
# Always set on Derecho — this is the #1 fix
export NCCL_SOCKET_IFNAME=hsn
export NCCL_IB_DISABLE=1
# If still hanging, increase timeout
export NCCL_TIMEOUT=600000 # 10 minutes (default is 5 min)
# Debug: see what NCCL is doing
export NCCL_DEBUG=INFO
NCCL Error: unhandled system error¶
Error:
Cause: Slingshot network issue or misconfigured NCCL.
Solution:
export NCCL_SHM_DISABLE=1
export NCCL_CROSS_NIC=1
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_CXI_OPTIMIZED_MRS=false
CUDA Errors¶
CUDA Out of Memory (OOM)¶
Error:
torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate X MiB (GPU 0; 40.00 GiB total capacity; ...)
Cause: Model + activations + optimizer states exceed GPU memory (40 GB on A100).
Solutions: 1. Reduce batch size — simplest fix 2. Use FSDP — shards model across GPUs 3. Enable mixed precision — halves activation memory
model = FSDP(model, mixed_precision=MixedPrecision(
param_dtype=torch.bfloat16,
reduce_dtype=torch.bfloat16,
))
for micro_step in range(accum_steps):
loss = model(data) / accum_steps
loss.backward()
optimizer.step()
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
apply_activation_checkpointing,
)
apply_activation_checkpointing(model)
CUDA Error: device-side assert triggered¶
Error:
Cause: Usually an out-of-bounds index (e.g., label >= num_classes, token ID >= vocab_size).
Solution:
Check that: - Labels are in range[0, num_classes)
- Token IDs are in range [0, vocab_size)
- Tensor dimensions match what the model expects
Distributed / Rank Errors¶
Rank Mismatch / Wrong World Size¶
Error:
Cause: Mismatch between the number of processes launched and what the script expects.
Solution: Check your launch command:
# This launches 4 processes (single node):
mpiexec -n 4 --ppn 4 --cpu-bind none python train.py
torchrun --standalone --nproc_per_node=4 train.py
# This launches 8 processes (2 nodes x 4 GPUs):
mpiexec -n 8 --ppn 4 --cpu-bind none python train.py
Verify in the script:
"Can't find environment variables for local rank"¶
Error:
Cause: Script uses only one method for rank detection but was launched with a different launcher.
Solution: Use utils/distributed.py which handles all launchers:
Address Already in Use¶
Error:
Cause: A previous job is still using the MASTER_PORT.
Solution:
PBS / Job Scheduling Errors¶
Job Stuck in Queue¶
Cause: Not enough GPU nodes available.
Solution:
# Check queue status
qstat -Q main
# Check your allocation
sacctmgr show associations user=$USER
# Try a shorter walltime (easier to schedule)
#PBS -l walltime=00:30:00
"module: command not found" in PBS Script¶
Cause: PBS doesn't source your shell profile by default.
Solution: Add at the top of your PBS script:
source /etc/profile.d/modules.sh # or wherever modules.sh lives
module purge
module load nvhpc cuda cray-mpich conda
scripts/ already includes these module loads — copy
any one as a starting template.
Wrong Conda Environment¶
Cause: PBS jobs don't inherit your shell's conda activation.
Solution:
module load conda
conda activate ${CONDA_ENV:-pytorch-derecho}
# Verify:
which python
python -c "import torch; print(torch.__version__)"
Performance Issues¶
Slow Training (Low GPU Utilization)¶
Diagnosis:
# Add to your training loop:
print(f"GPU util: {torch.cuda.utilization()}%")
print(f"Memory: {torch.cuda.memory_allocated()/1e9:.1f} GB / "
f"{torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")
Common causes:
1. Data loading bottleneck — increase num_workers in DataLoader
2. Small batch size — GPU is underutilized
3. Excessive communication — reduce TP degree or use hybrid parallelism
4. CPU-GPU transfer — use pin_memory=True in DataLoader
Communication Bottleneck¶
Diagnosis:
# Profile communication vs compute
from torch.profiler import profile, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
train_step()
print(prof.key_averages().table(sort_by="cuda_time_total"))
Solutions: 1. Use TP within nodes, FSDP across nodes (minimize cross-node communication) 2. Increase batch size to amortize communication overhead 3. Use gradient accumulation 4. Enable NCCL OFI plugin for Slingshot