InfiniBand setup and benchmarking

Goal

By the end of this tutorial you will have:

Built a virtual cluster with InfiniBand
Verified IB link status and measured bandwidth
Run a multi-node NCCL communication test

Step 1: Build the virtual cluster

Under Compute > Virtual Clusters, click Create Virtual Cluster and select InfiniBand as the fabric.
Under Compute > Virtual Machines, click Create VM to create the member VMs. Pick an image that includes OFED (the InfiniBand driver stack), and choose the same GPU instance type as the cluster. The cluster needs at least two VMs to start.
Attach the VMs you created to the cluster. VMs can only be attached in the Stopped state, so either don't start them after creation, or stop them first if you already did.
Click Start cluster on the virtual cluster's detail page.
Start each VM individually.

Step 2: Verify InfiniBand

After SSHing into a VM:

# HCA info
ibstat
ibv_devinfo

# IB port state (should be Active)
ibstat | grep -A5 "Port 1"

# List host nodes visible on the fabric (may require root)
ibhosts
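
The port check above can also be scripted. The following is a minimal sketch, assuming the usual `ibstat` text layout (a `CA '<name>'` line, then `Port N:` blocks containing `State:` and `Physical state:` lines). The function name `check_ib_ports` is hypothetical, and the output format is our own choice.

```shell
#!/usr/bin/env bash
# Sketch: flag any InfiniBand port that is not Active/LinkUp.
# Reads ibstat-style text on stdin, so it also works on saved output.
# check_ib_ports is a hypothetical helper name, not part of OFED.
check_ib_ports() {
  awk -v q="'" '
    # "CA <quoted name>" starts a new adapter; strip the quotes.
    $1 == "CA" && NF == 2             { ca = $2; gsub(q, "", ca) }
    # "Port 1:" starts a port block ("Port GUID:" must not match).
    $1 == "Port" && $2 ~ /^[0-9]+:$/  { port = $2; sub(/:$/, "", port) }
    $1 == "State:"                    { state = $2 }
    $1 == "Physical" && $2 == "state:" {
      phys = $3
      status = (state == "Active" && phys == "LinkUp") ? "OK" : "FAIL"
      if (status == "FAIL") bad = 1
      printf "%s port %s: %s/%s %s\n", ca, port, state, phys, status
    }
    # Exit non-zero if any port failed the check.
    END { exit bad + 0 }
  '
}

# Usage on a VM:
#   ibstat | check_ib_ports || echo "at least one IB port is not up"
```

The non-zero exit status makes the check easy to use as a gate before the benchmarks in the later steps.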

Step 3: Bandwidth benchmark

Start the server on node A first, then run the client on node B:

# Node A (server)
ib_write_bw -d mlx5_0

# Node B (client): use node A's IB IP
ib_write_bw -d mlx5_0 <NODE_A_IB_IP>

A healthy run should show bandwidth close to the cluster's IB fabric spec. If the measurement is significantly below spec, check cables, drivers, and cluster configuration.
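
To make "close to spec" concrete, it can help to express the measurement as a fraction of the nominal link rate. The sketch below assumes both numbers are in Gb/s: `ibstat` reports the port `Rate` in Gb/s, while `ib_write_bw` reports MB/sec by default and Gb/s when run with `--report_gbits`. The helper name `bw_efficiency` is hypothetical, and what counts as an acceptable percentage depends on your fabric, so no threshold is assumed here.

```shell
#!/usr/bin/env bash
# bw_efficiency MEASURED_GBPS LINK_RATE_GBPS
# Prints measured bandwidth as a percentage of the nominal link rate.
# Hypothetical helper; both arguments are assumed to be in Gb/s.
bw_efficiency() {
  awk -v m="$1" -v r="$2" 'BEGIN {
    if (r + 0 <= 0) {
      print "link rate must be a positive number" > "/dev/stderr"
      exit 2
    }
    printf "%.1f%%\n", 100 * m / r
  }'
}

# Usage: average from `ib_write_bw --report_gbits`, rate from `ibstat`
#   bw_efficiency 190 200
```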

Step 4: NCCL All-Reduce test

# Install nccl-tests
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi

# 2-node All-Reduce test
mpirun -np 2 \
  -H <NODE_A_IP>:1,<NODE_B_IP>:1 \
  -x NCCL_IB_DISABLE=0 \
  -x NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 512M -e 4G -f 2 -g 1
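
Because the run sets NCCL_DEBUG=INFO, the log can be used to confirm that NCCL actually selected InfiniBand rather than falling back to TCP sockets. The sketch below is an assumption-laden filter: it looks for the strings `Using network IB`, `Using network Socket`, `via NET/IB`, and `via NET/Socket`, which NCCL commonly prints at INFO level but whose exact wording can vary between NCCL versions. The function name `nccl_transport` is hypothetical.

```shell
#!/usr/bin/env bash
# nccl_transport: read NCCL_DEBUG=INFO output on stdin and print which
# network transport the log mentions: ib, socket, mixed, or unknown.
# Hypothetical helper; the matched log strings are assumptions that may
# differ between NCCL versions.
nccl_transport() {
  awk '
    /Using network IB/     || /via NET\/IB/     { ib = 1 }
    /Using network Socket/ || /via NET\/Socket/ { sock = 1 }
    END {
      if (ib && sock) print "mixed"
      else if (ib)    print "ib"
      else if (sock)  print "socket"
      else            print "unknown"
    }
  '
}

# Usage: save the mpirun output, then inspect it
#   mpirun ... ./build/all_reduce_perf ... 2>&1 | tee nccl.log
#   nccl_transport < nccl.log
```

A result of `socket` on an InfiniBand cluster suggests rechecking Step 2 before trusting the benchmark numbers.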

Step 5: torchrun distributed training test

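Run the same command on both nodes, changing only --node_rank. Set --nproc_per_node to the number of GPUs on each VM (8 in this example).

# Node A (master)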
torchrun \
  --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=0 \
  --master_addr=<NODE_A_IP> \
  --master_port=29500 \
  train_distributed.py

# Node B (worker)
torchrun \
  --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=1 \
  --master_addr=<NODE_A_IP> \
  --master_port=29500 \
  train_distributed.py
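
Since the two commands differ only in --node_rank, one way to avoid copy-paste mistakes is to generate them from a single helper. This is a minimal sketch: `torchrun_cmd` is a hypothetical function, the defaults (2 nodes, port 29500, 8 GPUs per node, `train_distributed.py`) are taken from the commands above, and the helper only prints the command rather than running it.

```shell
#!/usr/bin/env bash
# torchrun_cmd NODE_RANK MASTER_ADDR [GPUS_PER_NODE]
# Prints the torchrun command line for one node; nothing is executed.
# Hypothetical helper; defaults mirror the two-node example in this step.
torchrun_cmd() {
  local node_rank="$1" master_addr="$2" gpus="${3:-8}"
  printf 'torchrun --nproc_per_node=%s --nnodes=2 --node_rank=%s --master_addr=%s --master_port=29500 train_distributed.py\n' \
    "$gpus" "$node_rank" "$master_addr"
}

# Usage: print the command for each node, review it, then run it there
#   torchrun_cmd 0 <NODE_A_IP>     # on node A
#   torchrun_cmd 1 <NODE_A_IP>     # on node B
```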

Troubleshooting

Symptom	Cause	Fix
`ibstat` not found	OFED not in the image	Recreate the VM with an OFED-enabled image
IB Port State is Down	Cluster not started	Start the virtual cluster and restart the VMs
NCCL timeouts	Firewall blocking traffic	Allow all intra-cluster traffic
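
The first row of the table can be caught early with a quick preflight check that the commands used in this tutorial are on the PATH. This is a minimal sketch; `missing_tools` is a hypothetical helper, and the tool list in the usage line is simply the set of commands that appear in the steps above.

```shell
#!/usr/bin/env bash
# missing_tools TOOL...
# Prints, one per line, each named command that is not found on PATH.
# Hypothetical helper; prints nothing when every tool is available.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage on each VM:
#   missing_tools ibstat ibv_devinfo ibhosts ib_write_bw mpirun torchrun
```

If `ibstat` or `ib_write_bw` shows up in the output, the image most likely lacks OFED or the perftest package.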

Next steps

Virtual cluster: build an InfiniBand-based HPC environment
Parallel file system: shared storage for multi-node training
