
InfiniBand setup and benchmarking

Goal

By the end of this tutorial you will have:

  • Built a virtual cluster with InfiniBand
  • Verified IB link status and measured bandwidth
  • Run a multi-node NCCL communication test

Step 1: Build the virtual cluster

  1. Under Compute > Virtual Clusters, click Create Virtual Cluster and pick InfiniBand as the fabric.
  2. Under Compute > Virtual Machines, click Create VM to create the member VMs. Pick an image that includes OFED (the InfiniBand drivers), and use the same GPU instance type as the cluster. The cluster needs at least two VMs to start.
  3. Attach the VMs you created to the cluster. VMs can only be attached in the Stopped state, so either don't start them after creation, or stop them first if you already did.
  4. Click Start cluster on the virtual cluster's detail page.
  5. Start each VM individually.

Step 2: Verify InfiniBand

After SSHing into a VM:

# HCA info
ibstat
ibv_devinfo

# IB port state (should be Active)
ibstat | grep -A5 "Port 1"

# List connected nodes
ibhosts
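
The port-state check above can also be scripted. The following is a minimal sketch, not part of OFED: ib_port_healthy is a hypothetical helper name, and it assumes the usual ibstat text layout in which each "Port N:" block contains a "State:" line and a "Physical state:" line. It reads ibstat output on standard input and succeeds if at least one port is both Active and LinkUp.

```shell
# Hypothetical helper (not shipped with OFED): read `ibstat` output on stdin
# and return 0 if at least one port reports State: Active and
# Physical state: LinkUp. Assumes the standard ibstat text layout.
ib_port_healthy() {
  awk '
    /^[ \t]*Port [0-9]+:/        { state = ""; phys = "" }  # a new port block starts
    /^[ \t]*State:/              { state = $2 }             # e.g. "State: Active"
    /^[ \t]*Physical state:/     { phys = $3 }              # e.g. "Physical state: LinkUp"
    state == "Active" && phys == "LinkUp" { ok = 1 }
    END { exit (ok ? 0 : 1) }
  '
}
```

One way to use it on a VM is: ibstat | ib_port_healthy && echo "IB link is up". A non-zero exit status means no port was found in the Active/LinkUp state.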

Step 3: Bandwidth benchmark

Start the server on node A first, then run the client on node B:

# Node A (server)
ib_write_bw -d mlx5_0

# Node B (client): use node A's IB IP
ib_write_bw -d mlx5_0 <NODE_A_IB_IP>

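A healthy run reports bandwidth close to the link rate of the cluster's IB fabric. ib_write_bw reports bandwidth in megabytes per second by default; pass --report_gbits to both commands to get Gb/sec, the unit in which link rates are usually quoted. If the measurement is significantly below spec, check cables, drivers, and cluster configuration.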
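
To make "significantly below spec" concrete, the comparison can be written as a small check. This is a sketch under stated assumptions: bw_meets_spec is a hypothetical helper, and the default threshold of 90% of the link rate is an illustrative choice, not a value from this tutorial or from the perftest tools. Both numbers must be in the same unit.

```shell
# Hypothetical helper: bw_meets_spec MEASURED SPEC [MIN_FRACTION]
# Returns 0 if MEASURED >= SPEC * MIN_FRACTION, 1 otherwise.
# MIN_FRACTION defaults to 0.9, which is an assumed threshold; tune it
# to whatever your fabric normally achieves.
bw_meets_spec() {
  awk -v m="$1" -v s="$2" -v f="${3:-0.9}" \
    'BEGIN { if (m + 0 >= s * f) exit 0; exit 1 }'
}
```

For example, with the average bandwidth read from the client's output and the fabric's link rate, both in Gb/sec: bw_meets_spec <MEASURED_GBPS> <LINK_RATE_GBPS> || echo "bandwidth below threshold".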


Step 4: NCCL All-Reduce test

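# Build nccl-tests with MPI support
# (MPI_HOME below is the Debian/Ubuntu OpenMPI path; adjust it for your image)
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi

# 2-node All-Reduce test: one MPI process per node, one GPU per process (-g 1)
# Message sizes sweep from 512 MB to 4 GB, doubling at each step (-b, -e, -f 2)
mpirun -np 2 \
    -H <NODE_A_IP>:1,<NODE_B_IP>:1 \
    -x NCCL_IB_DISABLE=0 \
    -x NCCL_DEBUG=INFO \
    ./build/all_reduce_perf -b 512M -e 4G -f 2 -g 1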
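
With NCCL_DEBUG=INFO the output is long, so it can help to pull out the summary figure. The sketch below assumes that all_reduce_perf ends its report with a line of the form "# Avg bus bandwidth : <value>"; nccl_avg_busbw is a hypothetical helper name, and nccl.log is a placeholder file name.

```shell
# Hypothetical helper: read all_reduce_perf output on stdin and print the
# value from the "# Avg bus bandwidth : <value>" summary line.
# Assumes that summary line is present in the output.
nccl_avg_busbw() {
  awk -F':' '/Avg bus bandwidth/ {
    gsub(/[ \t]/, "", $2)   # strip padding around the number
    print $2
  }'
}
```

One way to use it is to append "| tee nccl.log" to the mpirun command and then run: nccl_avg_busbw < nccl.log.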

Step 5: torchrun distributed training test

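# Set --nproc_per_node to the number of GPUs in each VM (8 in this example).
# train_distributed.py stands for your own training script.

# Node A (master)
torchrun \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=<NODE_A_IP> \
    --master_port=29500 \
    train_distributed.py

# Node B (worker): identical except for --node_rank
torchrun \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr=<NODE_A_IP> \
    --master_port=29500 \
    train_distributed.py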
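
Because the two commands differ only in --node_rank, they can be generated from one function so the nodes cannot drift apart. This is a sketch only: torchrun_cmd and the environment variable names NPROC_PER_NODE, NNODES, MASTER_PORT, and TRAIN_SCRIPT are hypothetical names chosen for this example, with defaults copied from the commands above.

```shell
# Hypothetical helper: torchrun_cmd NODE_RANK MASTER_ADDR
# Prints the torchrun command line for one node. Optional overrides are read
# from NPROC_PER_NODE, NNODES, MASTER_PORT and TRAIN_SCRIPT (names made up
# for this sketch); the defaults match the tutorial's commands.
torchrun_cmd() {
  printf 'torchrun --nproc_per_node=%s --nnodes=%s --node_rank=%s --master_addr=%s --master_port=%s %s\n' \
    "${NPROC_PER_NODE:-8}" "${NNODES:-2}" "$1" "$2" \
    "${MASTER_PORT:-29500}" "${TRAIN_SCRIPT:-train_distributed.py}"
}
```

Calling torchrun_cmd 0 <NODE_A_IP> on node A and torchrun_cmd 1 <NODE_A_IP> on node B prints the command to run on each node; review the output before executing it.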

Troubleshooting

Symptom               | Cause                     | Fix
ibstat not found      | OFED not in the image     | Recreate the VM with an OFED-enabled image
IB Port State is Down | Cluster not started       | Start the virtual cluster and restart the VMs
NCCL timeouts         | Firewall blocking traffic | Allow all intra-cluster traffic

Next steps