Chapter 12: HPC Operations¶
This chapter covers the practical side of running distributed PyTorch on an HPC cluster, with Derecho-specific details throughout for PBSPro job scripts, launch methods, NCCL tuning, and debugging.
Topics include:
- PBS job scripts
- Launch methods: torchrun, mpiexec + python, mpiexec + torchrun
- NCCL tuning for Slingshot
- Debugging distributed jobs
PBS Basics¶
PBS Pro (Portable Batch System) is Derecho's job scheduler.
You write a shell script with #PBS directives and submit it with qsub.
Minimal GPU Job Script¶
#!/bin/bash
#PBS -A YOUR_PROJECT # allocation code
#PBS -N my_training # job name
#PBS -l walltime=01:00:00 # max runtime
#PBS -l select=1:ncpus=64:ngpus=4:mem=480GB # 1 node, 4 GPUs
#PBS -q main # GPU queue
#PBS -j oe # merge stdout + stderr
# --- Environment ---
module purge
module load nvhpc cuda cray-mpich conda
conda activate pytorch-derecho
cd $PBS_O_WORKDIR
# --- Launch ---
torchrun --standalone --nproc_per_node=4 train.py
Multi-Node Job¶
and update to usempiexec instead:
NNODES=$(cat $PBS_NODEFILE | sort -u | wc -l)
NGPUS=$((NNODES * 4))
mpiexec -n $NGPUS --ppn 4 --cpu-bind none python train.py
Key PBS Directives¶
| Directive | Meaning |
|---|---|
#PBS -A PROJ |
Charge hours to this project |
#PBS -l select=N |
Number of nodes |
#PBS -l ncpus=64 |
CPU cores per node (Derecho: always 64) |
#PBS -l ngpus=4 |
GPUs per node (Derecho: always 4) |
#PBS -l mem=480GB |
RAM per node (max ~480 GB usable) |
#PBS -l walltime=HH:MM:SS |
Maximum wall clock time |
#PBS -q main |
GPU queue |
#PBS -j oe |
Combine stdout and stderr into one file |
PBS Commands¶
qsub script.sh # submit a job
qstat -u $USER # check your jobs
qdel JOB_ID # cancel a job
qstat -f JOB_ID # full job details
Three Launch Methods¶
1. torchrun (single node)¶
Best for single-node jobs. Handles rank assignment and master addr/port automatically:
- Sets
LOCAL_RANK,RANK,WORLD_SIZEenvironment variables - Your script reads these to initialize the process group
2. mpiexec + Python (multi-node -- recommended).¶
Uses MPI to launch one process per GPU. Your script must detect ranks
from MPI environment variables or use mpi4py:
NNODES=$(cat $PBS_NODEFILE | sort -u | wc -l)
NGPUS=$((NNODES * 4))
mpiexec -n $NGPUS --ppn 4 --cpu-bind none python train.py
--ppn 4: 4 processes per node--cpu-bind none: don't pin to specific CPU cores- Your script detects
OMPI_COMM_WORLD_RANKor usesmpi4py
3. mpiexec + torchrun (multi-node)¶
Combines MPI for node discovery with torchrun for per-node process management. Best for multi-node jobs:
NNODES=$(cat $PBS_NODEFILE | sort -u | wc -l)
HEAD_NODE=$(head -1 $PBS_NODEFILE)
HEAD_ADDR=$(hostname -i)
mpiexec -n $NNODES --ppn 1 \
torchrun \
--nnodes=$NNODES \
--nproc_per_node=4 \
--rdzv_backend=c10d \
--rdzv_endpoint=$HEAD_ADDR:29500 \
train.py
- MPI launches one torchrun per node
- torchrun spawns 4 workers per node
- Rendezvous via c10d on the head node
Comparison¶
| Method | Nodes | Rank Detection | Complexity |
|---|---|---|---|
torchrun --standalone |
1 | Automatic | Simple |
mpiexec + python |
1+ | Manual (env vars / mpi4py) | Medium |
mpiexec + torchrun |
1+ | Error-prone | Medium |
NCCL Tuning for Slingshot¶
NCCL (NVIDIA Collective Communications Library) handles GPU-to-GPU communication. On Derecho's Slingshot network, you need specific settings to ensure it uses the correct transport and interfaces.
The default NCCL configuration may not work correctly on Slingshot, leading to hangs or errors. You must set the following environment variables:
export NCCL_SOCKET_IFNAME=hsn # use Slingshot interfaces
export NCCL_IB_DISABLE=1 # no InfiniBand on Derecho
Without these, NCCL will try to use loopback or non-existent InfiniBand and either hang or crash.
AWS_OFI (OpenFabrics Interfaces) is the underlying transport for Slingshot. You can further optimize NCCL with OFI-specific settings:
### Recommended
export NCCL_SHM_DISABLE=1 # disable shared memory (stability)
export NCCL_CROSS_NIC=1 # enable cross-NIC communication
export CUDA_VISIBLE_DEVICES=0,1,2,3 # expose all 4 GPUs
Libfabric / CXI (Slingshot transport)¶
export FI_CXI_DISABLE_HOST_REGISTER=1 # prevent CUDA deadlocks
export FI_CXI_DEFAULT_CQ_SIZE=131072 # larger completion queue for big jobs
export FI_MR_CACHE_MONITOR=userfaultfd # memory registration cache
All together¶
Every PBS script in scripts/ includes these. Here's the full block:
# NCCL settings for Derecho
export NCCL_SOCKET_IFNAME=hsn
export NCCL_IB_DISABLE=1
export NCCL_SHM_DISABLE=1
export NCCL_CROSS_NIC=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Libfabric / CXI for Slingshot
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_MR_CACHE_MONITOR=userfaultfd
For the full NCCL reference including OFI transport and advanced tuning,
see nccl_tuning.md.
Debugging Distributed Jobs¶
Step 1: Check the basics¶
Most distributed failures come from a few common issues:
| Symptom | Likely Cause | Fix |
|---|---|---|
| Hangs at init | Wrong network interface | Set NCCL_SOCKET_IFNAME=hsn |
| Hangs at init | Firewall / port blocked | Try different MASTER_PORT |
NCCL error: unhandled system error |
Missing NCCL config | Add all NCCL exports |
CUDA error: invalid device ordinal |
Wrong GPU count | Check CUDA_VISIBLE_DEVICES |
RuntimeError: address already in use |
Port conflict | Change port or wait for old job to end |
| Incorrect results | Ranks out of sync | Check set_epoch() on sampler |
| OOM on rank 0 only | Data download on rank 0 | Use barrier after download |
Step 2: Enable debug logging¶
# NCCL debug output (shows connection setup)
export NCCL_DEBUG=INFO
# Very verbose (for deep debugging)
export NCCL_DEBUG=TRACE
# PyTorch distributed debug (shows collective operations)
export TORCH_DISTRIBUTED_DEBUG=DETAIL
Step 3: Test communication in isolation¶
Before debugging your training script, verify that GPUs can talk at all:
# All-reduce test
mpiexec -n 4 --ppn 4 --cpu-bind none python tests/all_reduce_test.py
# Point-to-point test
mpiexec -n 4 --ppn 4 --cpu-bind none python tests/send_recv_test.py
# Full benchmark
mpiexec -n 4 --ppn 4 --cpu-bind none python tests/torch_comm_bench.py
If these fail, the problem is in the environment, not your code.
Step 4: Check environment¶
# Verify GPU visibility
python -c "import torch; print(torch.cuda.device_count(), 'GPUs')"
# Verify NCCL version
python -c "import torch; print('NCCL:', torch.cuda.nccl.version())"
# Run the environment check script
python tests/check_environment.py
Common Debugging Patterns¶
Hang during training (not init): Usually a rank mismatch — one rank takes a different code path (e.g., different batch count due to uneven data). Ensure all ranks execute the same number of collective operations.
OOM during training:
- Check batch size (effective = per_gpu × world_size)
- Try mixed precision (--use-amp)
- Switch from DDP to FSDP
- Reduce micro-batch size for PP
Slow performance:
- Check NCCL transport: NCCL_DEBUG=INFO should show ofi or socket
- Verify GPU utilization: nvidia-smi during training
- Profile with torch.profiler (see utils/profiling.py)
Putting It All Together¶
Here's a complete multi-node PBS template that incorporates everything:
#!/bin/bash
#PBS -A YOUR_PROJECT
#PBS -N distributed_training
#PBS -l walltime=02:00:00
#PBS -l select=2:ncpus=64:ngpus=4:mem=480GB
#PBS -q main
#PBS -j oe
# --- Environment ---
module purge
module load nvhpc cuda cray-mpich conda
conda activate pytorch-derecho
# --- NCCL ---
export NCCL_SOCKET_IFNAME=hsn
export NCCL_IB_DISABLE=1
export NCCL_SHM_DISABLE=1
export NCCL_CROSS_NIC=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_MR_CACHE_MONITOR=userfaultfd
# --- Launch ---
cd $PBS_O_WORKDIR
NNODES=$(cat $PBS_NODEFILE | sort -u | wc -l)
NGPUS=$((NNODES * 4))
mpiexec -n $NGPUS --ppn 4 --cpu-bind none \
python your_training_script.py \
--batch-size 64 \
--epochs 10
Copy any .sh file from scripts/ as a starting point — they all
follow this pattern.
Deep-dive references:
- derecho_guide.md — full hardware specs, PBS reference, launch patterns
- nccl_tuning.md — OFI transport, advanced NCCL settings
- troubleshooting.md — comprehensive error catalog with solutions
This concludes the guide. You now have the conceptual foundations and practical tools to run distributed PyTorch training — from a single GPU to hundreds. Return to the table of contents to review any chapter, or explore the scripts in each strategy directory for working code.