
Virtual cluster

Overview

A virtual cluster is an HPC environment in which multiple VMs are connected over a dedicated high-speed network fabric. It is used to relieve GPU-to-GPU communication bottlenecks in large-scale distributed training.

The flow is: create the cluster first, then attach VMs you've created separately to it.

Prerequisites
  • Resource.VirtualCluster.CREATE permission (attaching a VM also requires Resource.VirtualMachine.UPDATE)
  • VMs to be attached must be in the same zone and have the same instance type as the cluster
  • A cluster needs at least two VMs to start

Fabric type

A virtual cluster supports two network fabrics.

| Fabric | Description |
|---|---|
| InfiniBand | A dedicated high-speed network on the InfiniBand protocol. For workloads that need the lowest latency and the highest bandwidth |
| Ethernet (RoCEv2) | A dedicated high-speed network using RoCEv2. Prioritizes compatibility with Ethernet-based environments |

Which one to use depends on workload requirements and which instance types are available.


Step 1: Create the cluster

  1. Go to Compute > Virtual Clusters.
  2. Click the Create Virtual Cluster button in the top right.
  3. Configure the following:

| Field | Description |
|---|---|
| Cluster name | An identifying name |
| Zone | The zone the cluster will run in |
| Instance type | The instance type for cluster members |
| Fabric type | InfiniBand or Ethernet (RoCEv2) |

  4. Click Create.

The newly created cluster starts out empty. Member VMs are created separately and then attached.


Step 2: Attach VMs

VMs added to the cluster must be in the same zone and have the same instance type as the cluster.

  1. In Compute > Virtual Machines, create a VM to use as a member, or open an existing one.
  2. In the VM edit screen, under Cluster, set Action to Attach to cluster.
  3. Pick the target cluster and save.

To detach, choose Detach from cluster in the same screen.

Eligibility for attachment

A VM can only be attached when it is in the Stopped state (the portal shows: "Only stopped VMs with the same instance type can be attached"). Spot and disaster recovery (DR) VMs cannot be attached to a cluster.
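
The attachment rules above can be sketched as a small shell check. This is only an illustration: `can_attach`, its argument order, and the value strings (`Stopped`, `spot`, `dr`, and the zone and instance type names in the usage line) are assumptions, not part of the portal or any official CLI.

```shell
# Hypothetical helper: returns 0 if a VM may be attached, 1 otherwise.
# Arguments: vm_state vm_zone vm_type vm_capacity cluster_zone cluster_type
# All names and value strings here are assumptions for illustration.
can_attach() {
  vm_state=$1; vm_zone=$2; vm_type=$3; vm_capacity=$4
  cluster_zone=$5; cluster_type=$6

  # Only stopped VMs can be attached.
  [ "$vm_state" = "Stopped" ] || return 1
  # Zone and instance type must match the cluster.
  [ "$vm_zone" = "$cluster_zone" ] || return 1
  [ "$vm_type" = "$cluster_type" ] || return 1
  # Spot and disaster recovery (DR) VMs are excluded.
  case "$vm_capacity" in
    spot|dr) return 1 ;;
  esac
  return 0
}
```

For example, `can_attach Stopped zone-a gpu-8x on-demand zone-a gpu-8x` succeeds, while the same call with `Running` or `spot` fails.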


Step 3: Start the cluster

Once all member VMs are attached, start the cluster.

Starting requires at least 2 VMs

A cluster with fewer than two attached members cannot be started; the portal will show "At least 2 VMs are required".

  1. Pick the target cluster in Compute > Virtual Clusters.
  2. On the detail page, click Start cluster.
  3. Once the status changes to Running, start the individual member VMs.

To stop the cluster, first shut down all running member VMs, then click Stop cluster.
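
The start and stop preconditions in this step reduce to two simple checks. The sketch below uses hypothetical function names and assumes the counts are obtained elsewhere (for example, read off the Attached VMs tab).

```shell
# Hypothetical helpers; the counts are supplied by the caller.

# A cluster can start only with at least 2 attached VMs.
can_start_cluster() {
  attached_count=$1
  [ "$attached_count" -ge 2 ]
}

# A cluster can stop only when no member VM is still running.
can_stop_cluster() {
  running_vm_count=$1
  [ "$running_vm_count" -eq 0 ]
}
```

For example, `can_start_cluster 1` fails, which corresponds to the portal message "At least 2 VMs are required".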


Cluster states

| State | Meaning |
|---|---|
| Preparing | Provisioning |
| Unallocated | The cluster is created but no resources are allocated |
| Allocated / Pending | Resources are allocated but not yet running |
| Running | The cluster is operating |
| Stopped / Deleted | Stopped or deleted |
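
Since member VMs are started only once the cluster is Running, a script that gates on cluster state might look like the sketch below. The function name is hypothetical, and the state strings are assumed to match the labels in the table exactly.

```shell
# Hypothetical helper: returns 0 if member VMs may be started in this
# cluster state, 1 if not, and 2 for a state not listed in the table.
members_can_start() {
  case "$1" in
    Running) return 0 ;;
    Preparing|Unallocated|Allocated|Pending|Stopped|Deleted) return 1 ;;
    *) echo "unknown cluster state: $1" >&2; return 2 ;;
  esac
}
```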

Cluster detail page

| Tab | Contents |
|---|---|
| Info | General info, fabric type, instance type, zone, creation date |
| Attached VMs | List of member VMs and the state of each |
| Metrics | Cluster-level GPU, CPU, network, and memory usage |

Using InfiniBand

After the cluster starts, member VMs are automatically connected over the fabric. MPI-based distributed training uses this for GPU-to-GPU communication.
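
One way to confirm from inside a member VM that the fabric is up is to count the ports that `ibstat` reports as active. The sketch below assumes the usual `ibstat` layout, in which each port has a `State: Active` line when its link is up; `count_active_ports` is a hypothetical helper, and the match string should be checked against the output on your own image.

```shell
# Hypothetical helper: reads ibstat output on stdin and prints the number
# of ports whose logical state is Active. Assumes each port block contains
# a line of the form "State: Active" (an assumption about ibstat output).
count_active_ports() {
  # grep -c prints 0 and exits non-zero when nothing matches; ignore that.
  grep -c 'State: Active' || true
}
```

Typical use would be `ibstat | count_active_ports`, treating a result of 0 as a sign that the VM is not connected to the fabric.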

# Check InfiniBand devices
ibstat

# Multi-node PyTorch distributed training (torchrun) example.
# Run this on every node; --node_rank must be unique per node (0 to 3 for 4 nodes).
torchrun --nproc_per_node=8 --nnodes=4 \
  --node_rank=0 \
  --master_addr=<master-VM-IP> \
  --master_port=29500 \
  train.py
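
The command above is for the node with rank 0; every other node runs the same command with its own `--node_rank`. As a rough sketch, the per-node commands can be generated with a small helper. `torchrun_cmd` and `world_size` are hypothetical names, the defaults mirror the example (4 nodes, 8 processes per node), and the IP address in the usage line is a placeholder.

```shell
# Hypothetical helper: prints the torchrun command for one node.
# Arguments: node_rank master_addr [nnodes] [nproc_per_node]
torchrun_cmd() {
  node_rank=$1; master_addr=$2
  nnodes=${3:-4}; nproc_per_node=${4:-8}
  printf 'torchrun --nproc_per_node=%s --nnodes=%s --node_rank=%s --master_addr=%s --master_port=29500 train.py\n' \
    "$nproc_per_node" "$nnodes" "$node_rank" "$master_addr"
}

# Total number of worker processes across the cluster:
# nnodes multiplied by nproc_per_node.
world_size() {
  echo $(( $1 * $2 ))
}
```

For example, `torchrun_cmd 1 10.0.0.10` prints the command for the second node, and `world_size 4 8` prints 32, the total process count for the example layout.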

See InfiniBand setup and benchmark for detailed configuration and benchmarks.


Use cases

  • Large-scale LLM pretraining: distributed training over tens to hundreds of GPUs
  • Multi-node fine-tuning: faster gradient synchronization
  • HPC scientific computing: MPI-based parallel simulation

Virtual clusters cannot use spot capacity

Create member VMs on on-demand or reserved capacity.


Next steps