Virtual cluster
Overview
A virtual cluster is an HPC environment in which multiple VMs are connected over a dedicated high-speed network fabric. It is used to relieve GPU-to-GPU communication bottlenecks in large-scale distributed training.
The workflow has three steps: create the cluster, attach separately created VMs to it, then start the cluster.
- Resource.VirtualCluster.CREATE permission (attaching a VM also requires Resource.VirtualMachine.UPDATE)
- VMs to be attached must be in the same zone and have the same instance type as the cluster
- A cluster needs at least two VMs to start
Fabric type
A virtual cluster supports two network fabrics.
| Fabric | Description |
|---|---|
| InfiniBand | A dedicated high-speed network using the InfiniBand protocol. Suited to workloads that need the lowest latency and the highest bandwidth |
| Ethernet (RoCEv2) | A dedicated high-speed network using RoCEv2 (RDMA over Converged Ethernet v2). Prioritizes compatibility with Ethernet-based environments |
Which one to use depends on workload requirements and which instance types are available.
Step 1: Create the cluster
- Go to Compute > Virtual Clusters.
- Click the Create Virtual Cluster button in the top right.
- Configure the following:
| Field | Description |
|---|---|
| Cluster name | An identifying name |
| Zone | The zone the cluster will run in |
| Instance type | The instance type for cluster members |
| Fabric type | InfiniBand or Ethernet (RoCEv2) |
- Click Create.
The newly created cluster starts out empty. Member VMs are created separately and then attached.
Step 2: Attach VMs
VMs added to the cluster must be in the same zone and have the same instance type as the cluster.
- In Compute > Virtual Machines, create a VM to use as a member, or open an existing one.
- In the VM edit screen, under Cluster, set Action to Attach to cluster.
- Pick the target cluster and save.
To detach, choose Detach from cluster in the same screen.
A VM can only be attached when it is in the Stopped state (the portal shows: "Only stopped VMs with the same instance type can be attached"). Spot and disaster recovery (DR) VMs cannot be attached to a cluster.
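The attach rules above can be collected into a single pre-flight check before opening the portal. This is a minimal sketch, not a platform tool: the function name `can_attach`, the argument order, the VM kind labels (`on-demand`, `reserved`, `spot`, `dr`), and the sample zone and instance type names are all assumptions made for illustration.

```shell
# Hypothetical pre-flight check; none of these names come from the platform.
# Usage: can_attach VM_ZONE VM_TYPE VM_STATE VM_KIND CLUSTER_ZONE CLUSTER_TYPE
# VM_KIND is one of: on-demand, reserved, spot, dr (labels assumed for this sketch).
can_attach() {
  vm_zone=$1 vm_type=$2 vm_state=$3 vm_kind=$4 cl_zone=$5 cl_type=$6

  # The VM and the cluster must share a zone.
  if [ "$vm_zone" != "$cl_zone" ]; then
    echo "zone mismatch"; return 1
  fi
  # The VM and the cluster must share an instance type.
  if [ "$vm_type" != "$cl_type" ]; then
    echo "instance type mismatch"; return 1
  fi
  # Only stopped VMs can be attached.
  if [ "$vm_state" != "Stopped" ]; then
    echo "VM must be Stopped"; return 1
  fi
  # Spot and DR VMs are excluded.
  case $vm_kind in
    spot|dr) echo "spot and DR VMs cannot be attached"; return 1 ;;
  esac
  echo "ok"
}

# Example with made-up zone and instance type names.
can_attach zone-a gpu-8x Stopped on-demand zone-a gpu-8x   # prints "ok"
```

The function reports only the first rule that fails, which keeps the output to one line per VM when it is run over a list of candidates.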
Step 3: Start the cluster
Once all member VMs are attached, start the cluster.
A cluster with fewer than two attached members cannot be started; the portal will show "At least 2 VMs are required".
- Pick the target cluster in Compute > Virtual Clusters.
- On the detail page, click Start cluster.
- Once the status changes to Running, start the individual member VMs.
To stop the cluster, first shut down all running member VMs, then click Stop cluster.
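Step 3 makes member VM startup wait on the cluster reaching Running. If that wait is ever scripted rather than watched in the portal, it could look roughly like the polling loop below. This page describes only the portal, so the status source is an assumption: `wait_for_state` takes any command that prints the current state, and `fake_status` is a stand-in used purely for demonstration.

```shell
# wait_for_state TARGET MAX_TRIES INTERVAL_SECONDS STATUS_COMMAND [ARGS...]
# STATUS_COMMAND is whatever prints the cluster's current state on stdout.
# The platform's real CLI or API is not shown here, so it is a placeholder.
wait_for_state() {
  target=$1 max_tries=$2 interval=$3
  shift 3
  state=""
  try=1
  while [ "$try" -le "$max_tries" ]; do
    state=$("$@")
    if [ "$state" = "$target" ]; then
      return 0
    fi
    try=$((try + 1))
    sleep "$interval"
  done
  echo "timed out; last state: $state" >&2
  return 1
}

# Stand-in status source: reports Allocated twice, then Running.
# A counter file is used because $(...) runs the command in a subshell.
counter_file=$(mktemp)
trap 'rm -f "$counter_file"' EXIT
echo 0 > "$counter_file"
fake_status() {
  n=$(cat "$counter_file")
  echo $((n + 1)) > "$counter_file"
  if [ "$n" -lt 2 ]; then echo "Allocated"; else echo "Running"; fi
}

if wait_for_state Running 5 0 fake_status; then
  echo "cluster is Running; member VMs can be started now"
fi
```

Bounding the number of tries avoids waiting forever on a cluster that never leaves Preparing or Allocated / Pending.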
Cluster states
| State | Meaning |
|---|---|
| Preparing | The cluster is being provisioned |
| Unallocated | The cluster is created but no resources are allocated |
| Allocated / Pending | Resources are allocated but not yet running |
| Running | The cluster is operating |
| Stopped / Deleted | The cluster has been stopped or deleted |
Cluster detail page
| Tab | Contents |
|---|---|
| Info | General info, fabric type, instance type, zone, creation date |
| Attached VMs | List of member VMs and the state of each |
| Metrics | Cluster-level GPU, CPU, network, and memory usage |
Using InfiniBand
After the cluster starts, member VMs are automatically connected over the fabric. MPI-based distributed training uses this for GPU-to-GPU communication.
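One way to confirm from inside a member VM that the fabric link is up is to count active ports in `ibstat` output. The sketch below assumes the usual `ibstat` text layout (`State: Active`, `Link layer: ...` lines). The helper names `count_active_ports`, `link_layers`, and `sample_ibstat` are hypothetical, and the sample text is illustrative rather than captured from a real cluster.

```shell
# Count ports whose logical state is Active. Reads ibstat-style text on stdin.
# "Physical state: LinkUp" lines are not counted because their first field
# is "Physical", not "State:".
count_active_ports() {
  awk '$1 == "State:" && $2 == "Active" { n++ } END { print n + 0 }'
}

# List the distinct link layers reported. RoCE-capable adapters typically
# report Ethernet here.
link_layers() {
  awk '$1 == "Link" && $2 == "layer:" { print $3 }' | sort -u
}

# Illustrative sample only, trimmed to the fields used above.
sample_ibstat() {
  cat <<'EOF'
CA 'mlx5_0'
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Link layer: InfiniBand
CA 'mlx5_1'
    Port 1:
        State: Down
        Physical state: Disabled
        Rate: 10
        Link layer: InfiniBand
EOF
}

sample_ibstat | count_active_ports   # prints 1
sample_ibstat | link_layers          # prints InfiniBand
```

On a member VM the same helpers would read live output, for example `ibstat | count_active_ports`. A count of 0 suggests the cluster is not Running yet or the VM is not attached.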
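# Check InfiniBand devices and port state
ibstat
# Multi-node PyTorch distributed training (torchrun) example: 4 nodes with 8 GPUs each.
# Run on every node, setting --node_rank to that node's index (0 on the master, then 1 to 3).
torchrun --nproc_per_node=8 --nnodes=4 \
  --node_rank=0 \
  --master_addr=<master-VM-IP> \
  --master_port=29500 \
  train.py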
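The torchrun command has to be issued once per node, and only `--node_rank` changes between nodes. As a rough sketch, the helper below prints the command each node would run. `print_launch_commands` is a hypothetical name and `10.0.0.10` is a placeholder address; the torchrun flags are the same ones used in the example above.

```shell
# print_launch_commands NNODES GPUS_PER_NODE MASTER_ADDR MASTER_PORT
# Prints one torchrun command per node. It only prints; it does not run anything.
print_launch_commands() {
  nnodes=$1 gpus_per_node=$2 master_addr=$3 master_port=$4
  rank=0
  while [ "$rank" -lt "$nnodes" ]; do
    echo "node $rank: torchrun --nproc_per_node=$gpus_per_node --nnodes=$nnodes --node_rank=$rank --master_addr=$master_addr --master_port=$master_port train.py"
    rank=$((rank + 1))
  done
}

# Four nodes with eight GPUs each; the address is a placeholder for the master VM's IP.
print_launch_commands 4 8 10.0.0.10 29500
```

Every node points at the same master address and port; the node whose rank is 0 is the one that should own that address.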
See InfiniBand setup and benchmark for detailed configuration and benchmarks.
Use cases
- Large-scale LLM pretraining: distributed training over tens to hundreds of GPUs
- Multi-node fine-tuning: faster gradient synchronization
- HPC scientific computing: MPI-based parallel simulation
Because spot and DR VMs cannot be attached to a cluster, create member VMs on on-demand or reserved capacity.
Next steps
- InfiniBand setup and benchmark: how to validate performance
- Parallel file system: shared storage for multi-node training
- GPU ML training environment: from CUDA/PyTorch install to training