InfiniBand setup and benchmarking
Goal
By the end of this tutorial you will have:
- Built a virtual cluster with InfiniBand
- Verified IB link status and measured bandwidth
- Run a multi-node NCCL communication test
Step 1: Build the virtual cluster
- Under Compute > Virtual Clusters, click Create Virtual Cluster to create a cluster. Pick InfiniBand as the fabric.
- Under Compute > Virtual Machines, click Create VM to create the member VMs. Pick an image that includes OFED (InfiniBand drivers) and the same GPU instance type as the cluster. The cluster needs at least two VMs to start.
- Attach the VMs you created to the cluster. VMs can only be attached in the Stopped state, so either don't start them after creation, or stop them first if you already did.
- Click Start cluster on the virtual cluster's detail page.
- Start each VM individually.
Step 2: Verify InfiniBand
After SSHing into a VM:
# HCA info
ibstat
ibv_devinfo
# IB port state (should be Active)
ibstat | grep -A5 "Port 1"
# List connected nodes
ibhosts
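To make the "should be Active" check scriptable, the port state can be parsed from the `ibstat` output. The following is a minimal sketch, not part of OFED: `check_ib_ports` is a hypothetical helper name, and it assumes the usual `ibstat` layout in which each port prints a `State:` line and a `Physical state:` line.

```shell
#!/usr/bin/env bash
# check_ib_ports (hypothetical helper): reads `ibstat` output on stdin.
# Prints a one-line summary and returns 0 only if at least one port is
# listed and every port reports "State: Active" and
# "Physical state: LinkUp".
check_ib_ports() {
    awk '
        /^[ \t]*State:/          { ports++; if ($2 == "Active") active++ }
        /^[ \t]*Physical state:/ { if ($3 == "LinkUp") linkup++ }
        END {
            printf "ports=%d active=%d linkup=%d\n", ports, active, linkup
            exit !(ports > 0 && active == ports && linkup == ports)
        }
    '
}

# Usage on a VM with OFED installed:
#   ibstat | check_ib_ports && echo "IB ready" || echo "IB not ready"
```

Running the same check on every member VM before benchmarking helps separate a port that never came up from a genuine performance problem.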
Step 3: Bandwidth benchmark
Start the server on node A first, then run the client on node B:
# Node A (server)
ib_write_bw -d mlx5_0
# Node B (client): use node A's IB IP
ib_write_bw -d mlx5_0 <NODE_A_IB_IP>
A healthy run should show bandwidth close to the cluster's IB fabric spec. If the measurement is significantly below spec, check cables, drivers, and cluster configuration.
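One way to make "close to spec" concrete is to express the measured average as a percentage of the fabric's rated speed. The sketch below makes several assumptions: the helper names `pct_of_spec` and `bw_healthy` are hypothetical, the 90% default threshold is an arbitrary choice rather than a documented limit, and both numbers are given in Gb/s (perftest's `--report_gbits` option prints results in Gb/s instead of MB/s).

```shell
#!/usr/bin/env bash
# pct_of_spec MEASURED_GBPS SPEC_GBPS
# Prints the measured bandwidth as a truncated integer percentage of spec.
pct_of_spec() {
    awk -v m="$1" -v s="$2" 'BEGIN { printf "%d\n", m * 100 / s }'
}

# bw_healthy MEASURED_GBPS SPEC_GBPS [MIN_PCT]
# Returns 0 if the measurement reaches MIN_PCT percent of spec.
# MIN_PCT defaults to 90, which is an assumed threshold.
bw_healthy() {
    [ "$(pct_of_spec "$1" "$2")" -ge "${3:-90}" ]
}

# Usage, with 185 as an example reading and 200 as an example spec:
#   bw_healthy 185 200 && echo "bandwidth OK" || echo "investigate"
```

Take the spec value from your cluster's fabric type; the 200 in the usage comment is only a placeholder.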
Step 4: NCCL All-Reduce test
# Build nccl-tests with MPI support (adjust MPI_HOME if Open MPI is installed elsewhere)
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
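# 2-node All-Reduce test: one rank per node, one GPU per rank (-g 1)
# NCCL_IB_DISABLE=0 keeps the IB transport enabled; NCCL_DEBUG=INFO logs which transport is used
mpirun -np 2 \
    -H <NODE_A_IP>:1,<NODE_B_IP>:1 \
    -x NCCL_IB_DISABLE=0 \
    -x NCCL_DEBUG=INFO \
    ./build/all_reduce_perf -b 512M -e 4G -f 2 -g 1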
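With `NCCL_DEBUG=INFO` set, the log can show whether inter-node traffic went over InfiniBand or fell back to TCP sockets. The sketch below assumes the per-channel log lines end in `via NET/IB/<n>` or `via NET/Socket/<n>`; the exact wording can differ between NCCL versions, so treat the patterns as a starting point. `nccl_transport` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# nccl_transport (hypothetical helper): reads NCCL_DEBUG=INFO output on
# stdin and prints one of:
#   socket   at least one channel used the TCP socket transport
#   ib       channels used the InfiniBand transport and none used sockets
#   unknown  no recognizable inter-node channel lines were found
nccl_transport() {
    awk '
        /via NET\/Socket/ { sock = 1 }
        /via NET\/IB/     { ib = 1 }
        END {
            if (sock)    print "socket"
            else if (ib) print "ib"
            else         print "unknown"
        }
    '
}

# Usage: save the test output, then classify it.
#   mpirun ... ./build/all_reduce_perf ... 2>&1 | tee nccl.log
#   nccl_transport < nccl.log
```

A result of `socket` combined with low reported bandwidth points at the IB setup from Step 2 rather than at NCCL itself.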
Step 5: torchrun distributed training test
# Node A (master, node_rank 0); --nproc_per_node should match the number of GPUs per VM
torchrun \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=<NODE_A_IP> \
    --master_port=29500 \
    train_distributed.py
# Node B (worker, node_rank 1); --master_addr still points at node A
torchrun \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr=<NODE_A_IP> \
    --master_port=29500 \
    train_distributed.py
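The two launch commands differ only in `--node_rank`, so a small wrapper can reduce copy-paste mistakes when the cluster grows. This is a sketch under assumptions: `torchrun_cmd` is a hypothetical helper, and its defaults (2 nodes, 8 processes per node, port 29500) simply mirror the values used above.

```shell
#!/usr/bin/env bash
# torchrun_cmd NODE_RANK MASTER_ADDR [NNODES] [NPROC_PER_NODE] [MASTER_PORT]
# Hypothetical helper: prints the torchrun command for one node so it can
# be reviewed before it is executed.
torchrun_cmd() {
    printf 'torchrun --nproc_per_node=%s --nnodes=%s --node_rank=%s --master_addr=%s --master_port=%s train_distributed.py\n' \
        "${4:-8}" "${3:-2}" "$1" "$2" "${5:-29500}"
}

# Usage: print the command for node B, then run it once it looks right.
#   torchrun_cmd 1 <NODE_A_IP>
#   eval "$(torchrun_cmd 1 <NODE_A_IP>)"
```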
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| ibstat not found | OFED not in the image | Recreate the VM with an OFED-enabled image |
| IB Port State is Down | Cluster not started | Start the virtual cluster and restart the VMs |
| NCCL timeouts | Firewall blocking traffic | Allow all intra-cluster traffic |
Next steps
- Virtual cluster: build an InfiniBand-based HPC environment
- Parallel file system: shared storage for multi-node training