Skip to content

Chapter 12: HPC Operations

This chapter covers the practical side of running distributed PyTorch on an HPC cluster, with Derecho-specific details throughout for PBSPro job scripts, launch methods, NCCL tuning, and debugging.

Topics include: - PBS job scripts - Launch methods: torchrun, mpiexec + python, mpiexec + torchrun - NCCL tuning for Slingshot - Debugging distributed jobs

PBS Basics

PBS Pro (Portable Batch System) is Derecho's job scheduler.

You write a shell script with #PBS directives and submit it with qsub.

Minimal GPU Job Script

#!/bin/bash
#PBS -A YOUR_PROJECT                           # allocation code
#PBS -N my_training                            # job name
#PBS -l walltime=01:00:00                      # max runtime
#PBS -l select=1:ncpus=64:ngpus=4:mem=480GB    # 1 node, 4 GPUs
#PBS -q main                                   # GPU queue
#PBS -j oe                                     # merge stdout + stderr

# --- Environment ---
module purge
module load nvhpc cuda cray-mpich conda
conda activate pytorch-derecho

cd $PBS_O_WORKDIR

# --- Launch ---
torchrun --standalone --nproc_per_node=4 train.py

Multi-Node Job

#PBS -l select=2:ncpus=64:ngpus=4:mem=480GB   # 2 nodes = 8 GPUs
and update to use mpiexec instead:

NNODES=$(cat $PBS_NODEFILE | sort -u | wc -l)
NGPUS=$((NNODES * 4))

mpiexec -n $NGPUS --ppn 4 --cpu-bind none python train.py

Key PBS Directives

Directive Meaning
#PBS -A PROJ Charge hours to this project
#PBS -l select=N Number of nodes
#PBS -l ncpus=64 CPU cores per node (Derecho: always 64)
#PBS -l ngpus=4 GPUs per node (Derecho: always 4)
#PBS -l mem=480GB RAM per node (max ~480 GB usable)
#PBS -l walltime=HH:MM:SS Maximum wall clock time
#PBS -q main GPU queue
#PBS -j oe Combine stdout and stderr into one file

PBS Commands

qsub script.sh           # submit a job
qstat -u $USER           # check your jobs
qdel JOB_ID              # cancel a job
qstat -f JOB_ID          # full job details

Three Launch Methods

1. torchrun (single node)

Best for single-node jobs. Handles rank assignment and master addr/port automatically:

torchrun --standalone --nproc_per_node=4 train.py
  • Sets LOCAL_RANK, RANK, WORLD_SIZE environment variables
  • Your script reads these to initialize the process group

Uses MPI to launch one process per GPU. Your script must detect ranks from MPI environment variables or use mpi4py:

NNODES=$(cat $PBS_NODEFILE | sort -u | wc -l)
NGPUS=$((NNODES * 4))

mpiexec -n $NGPUS --ppn 4 --cpu-bind none python train.py
  • --ppn 4: 4 processes per node
  • --cpu-bind none: don't pin to specific CPU cores
  • Your script detects OMPI_COMM_WORLD_RANK or uses mpi4py

3. mpiexec + torchrun (multi-node)

Combines MPI for node discovery with torchrun for per-node process management. Best for multi-node jobs:

NNODES=$(cat $PBS_NODEFILE | sort -u | wc -l)
HEAD_NODE=$(head -1 $PBS_NODEFILE)
HEAD_ADDR=$(hostname -i)

mpiexec -n $NNODES --ppn 1 \
    torchrun \
    --nnodes=$NNODES \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$HEAD_ADDR:29500 \
    train.py
  • MPI launches one torchrun per node
  • torchrun spawns 4 workers per node
  • Rendezvous via c10d on the head node

Comparison

Method Nodes Rank Detection Complexity
torchrun --standalone 1 Automatic Simple
mpiexec + python 1+ Manual (env vars / mpi4py) Medium
mpiexec + torchrun 1+ Error-prone Medium

NCCL Tuning for Slingshot

NCCL (NVIDIA Collective Communications Library) handles GPU-to-GPU communication. On Derecho's Slingshot network, you need specific settings to ensure it uses the correct transport and interfaces.

The default NCCL configuration may not work correctly on Slingshot, leading to hangs or errors. You must set the following environment variables:

export NCCL_SOCKET_IFNAME=hsn          # use Slingshot interfaces
export NCCL_IB_DISABLE=1               # no InfiniBand on Derecho

Without these, NCCL will try to use loopback or non-existent InfiniBand and either hang or crash.

AWS_OFI (OpenFabrics Interfaces) is the underlying transport for Slingshot. You can further optimize NCCL with OFI-specific settings:

### Recommended


export NCCL_SHM_DISABLE=1              # disable shared memory (stability)
export NCCL_CROSS_NIC=1                # enable cross-NIC communication
export CUDA_VISIBLE_DEVICES=0,1,2,3    # expose all 4 GPUs

Libfabric / CXI (Slingshot transport)

export FI_CXI_DISABLE_HOST_REGISTER=1  # prevent CUDA deadlocks
export FI_CXI_DEFAULT_CQ_SIZE=131072   # larger completion queue for big jobs
export FI_MR_CACHE_MONITOR=userfaultfd # memory registration cache

All together

Every PBS script in scripts/ includes these. Here's the full block:

# NCCL settings for Derecho
export NCCL_SOCKET_IFNAME=hsn
export NCCL_IB_DISABLE=1
export NCCL_SHM_DISABLE=1
export NCCL_CROSS_NIC=1
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Libfabric / CXI for Slingshot
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_MR_CACHE_MONITOR=userfaultfd

For the full NCCL reference including OFI transport and advanced tuning, see nccl_tuning.md.

Debugging Distributed Jobs

Step 1: Check the basics

Most distributed failures come from a few common issues:

Symptom Likely Cause Fix
Hangs at init Wrong network interface Set NCCL_SOCKET_IFNAME=hsn
Hangs at init Firewall / port blocked Try different MASTER_PORT
NCCL error: unhandled system error Missing NCCL config Add all NCCL exports
CUDA error: invalid device ordinal Wrong GPU count Check CUDA_VISIBLE_DEVICES
RuntimeError: address already in use Port conflict Change port or wait for old job to end
Incorrect results Ranks out of sync Check set_epoch() on sampler
OOM on rank 0 only Data download on rank 0 Use barrier after download

Step 2: Enable debug logging

# NCCL debug output (shows connection setup)
export NCCL_DEBUG=INFO

# Very verbose (for deep debugging)
export NCCL_DEBUG=TRACE

# PyTorch distributed debug (shows collective operations)
export TORCH_DISTRIBUTED_DEBUG=DETAIL

Step 3: Test communication in isolation

Before debugging your training script, verify that GPUs can talk at all:

# All-reduce test
mpiexec -n 4 --ppn 4 --cpu-bind none python tests/all_reduce_test.py

# Point-to-point test
mpiexec -n 4 --ppn 4 --cpu-bind none python tests/send_recv_test.py

# Full benchmark
mpiexec -n 4 --ppn 4 --cpu-bind none python tests/torch_comm_bench.py

If these fail, the problem is in the environment, not your code.

Step 4: Check environment

# Verify GPU visibility
python -c "import torch; print(torch.cuda.device_count(), 'GPUs')"

# Verify NCCL version
python -c "import torch; print('NCCL:', torch.cuda.nccl.version())"

# Run the environment check script
python tests/check_environment.py

Common Debugging Patterns

Hang during training (not init): Usually a rank mismatch — one rank takes a different code path (e.g., different batch count due to uneven data). Ensure all ranks execute the same number of collective operations.

OOM during training: - Check batch size (effective = per_gpu × world_size) - Try mixed precision (--use-amp) - Switch from DDP to FSDP - Reduce micro-batch size for PP

Slow performance: - Check NCCL transport: NCCL_DEBUG=INFO should show ofi or socket - Verify GPU utilization: nvidia-smi during training - Profile with torch.profiler (see utils/profiling.py)

Putting It All Together

Here's a complete multi-node PBS template that incorporates everything:

#!/bin/bash
#PBS -A YOUR_PROJECT
#PBS -N distributed_training
#PBS -l walltime=02:00:00
#PBS -l select=2:ncpus=64:ngpus=4:mem=480GB
#PBS -q main
#PBS -j oe

# --- Environment ---
module purge
module load nvhpc cuda cray-mpich conda
conda activate pytorch-derecho

# --- NCCL ---
export NCCL_SOCKET_IFNAME=hsn
export NCCL_IB_DISABLE=1
export NCCL_SHM_DISABLE=1
export NCCL_CROSS_NIC=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
export FI_CXI_DISABLE_HOST_REGISTER=1
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_MR_CACHE_MONITOR=userfaultfd

# --- Launch ---
cd $PBS_O_WORKDIR

NNODES=$(cat $PBS_NODEFILE | sort -u | wc -l)
NGPUS=$((NNODES * 4))

mpiexec -n $NGPUS --ppn 4 --cpu-bind none \
    python your_training_script.py \
    --batch-size 64 \
    --epochs 10

Copy any .sh file from scripts/ as a starting point — they all follow this pattern.


Deep-dive references: - derecho_guide.md — full hardware specs, PBS reference, launch patterns - nccl_tuning.md — OFI transport, advanced NCCL settings - troubleshooting.md — comprehensive error catalog with solutions


This concludes the guide. You now have the conceptual foundations and practical tools to run distributed PyTorch training — from a single GPU to hundreds. Return to the table of contents to review any chapter, or explore the scripts in each strategy directory for working code.