Virtual cluster

Overview

A virtual cluster is an HPC environment in which multiple VMs are connected over a dedicated high-speed network fabric. It is used to relieve GPU-to-GPU communication bottlenecks in large-scale distributed training.

The workflow has three steps: create the cluster, attach separately created VMs to it, then start the cluster.

Prerequisites

Resource.VirtualCluster.CREATE permission (attaching a VM also requires Resource.VirtualMachine.UPDATE)
VMs to be attached must be in the same zone and have the same instance type as the cluster
A cluster needs at least two VMs to start

Fabric type

A virtual cluster supports two network fabrics.

Fabric	Description
InfiniBand	A dedicated high-speed network using the InfiniBand protocol. Suited to workloads that need the lowest latency and the highest bandwidth
Ethernet (RoCEv2)	A dedicated high-speed network using RoCEv2. Suited to environments where compatibility with Ethernet-based infrastructure is the priority

The right choice depends on your workload requirements and on which instance types are available.

Step 1: Create the cluster

Go to Compute > Virtual Clusters.
Click the Create Virtual Cluster button in the top right.
Configure the following:

Field	Description
Cluster name	An identifying name
Zone	The zone the cluster will run in
Instance type	The instance type for cluster members
Fabric type	InfiniBand or Ethernet (RoCEv2)

Click Create.

The newly created cluster starts out empty. Member VMs are created separately and then attached.

Step 2: Attach VMs

VMs added to the cluster must be in the same zone and have the same instance type as the cluster.

In Compute > Virtual Machines, create a VM to use as a member, or open an existing one.
In the VM edit screen, under Cluster, set Action to Attach to cluster.
Pick the target cluster and save.

To detach a VM, choose Detach from cluster on the same screen.

Eligibility for attachment

A VM can only be attached when it is in the Stopped state (the portal shows: "Only stopped VMs with the same instance type can be attached"). Spot and disaster recovery (DR) VMs cannot be attached to a cluster.
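
The attachment rules from the prerequisites and the note above can be collected into a single pre-flight check. The following Python sketch is illustrative only: the `Vm` and `Cluster` classes, their field names, and the zone and instance-type strings are hypothetical and do not reflect the platform's actual API, which this page does not describe.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Vm:
    # Hypothetical fields for illustration; not the platform's API schema.
    name: str
    zone: str
    instance_type: str
    state: str            # "Stopped" is the only state that allows attachment
    is_spot: bool = False
    is_dr: bool = False


@dataclass(frozen=True)
class Cluster:
    # Hypothetical fields mirroring the "Create Virtual Cluster" form.
    name: str
    zone: str
    instance_type: str


def attach_blockers(vm: Vm, cluster: Cluster) -> list[str]:
    """Return the reasons a VM cannot be attached; an empty list means eligible."""
    reasons = []
    if vm.zone != cluster.zone:
        reasons.append("zone differs from the cluster")
    if vm.instance_type != cluster.instance_type:
        reasons.append("instance type differs from the cluster")
    if vm.state != "Stopped":
        reasons.append("VM is not in the Stopped state")
    if vm.is_spot:
        reasons.append("spot VMs cannot be attached")
    if vm.is_dr:
        reasons.append("DR VMs cannot be attached")
    return reasons
```

Returning every blocking reason, rather than stopping at the first, lets a script report all the problems with a candidate VM in one pass.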

Step 3: Start the cluster

Once all member VMs are attached, start the cluster.

Starting requires at least 2 VMs

A cluster with fewer than two attached members cannot be started; the portal will show "At least 2 VMs are required".

Pick the target cluster in Compute > Virtual Clusters.
On the detail page, click Start cluster.
Once the status changes to Running, start the individual member VMs.

To stop the cluster, first shut down all running member VMs, then click Stop cluster.
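
The ordering constraints in this step can be sketched as three small checks. This is a minimal sketch under stated assumptions, not the platform's implementation: the function names are hypothetical, and "Running" as a VM state string is assumed from the wording on this page.

```python
MIN_MEMBERS = 2  # from the portal message "At least 2 VMs are required"


def can_start_cluster(member_vm_states):
    """A cluster can be started only with at least two attached members."""
    return len(member_vm_states) >= MIN_MEMBERS


def can_start_member_vm(cluster_state):
    """Member VMs are started only after the cluster reports Running."""
    return cluster_state == "Running"


def can_stop_cluster(member_vm_states):
    """The cluster can be stopped only when no member VM is still running."""
    return all(state != "Running" for state in member_vm_states)
```

Read together, the checks give the order of operations: start the cluster, then the VMs; stop the VMs, then the cluster.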

Cluster states

State	Meaning
Preparing	The cluster is being provisioned
Unallocated	The cluster is created but no resources are allocated
Allocated / Pending	Resources are allocated but not yet running
Running	The cluster is operating
Stopped / Deleted	The cluster has been stopped or deleted
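
A script that automates step 3 would need to wait for the Running state before starting member VMs. The polling sketch below shows one way to do that. It assumes a caller-supplied `get_state` function that returns one of the state names above as a string; this page does not describe an API or CLI for reading cluster state, so that function, the timeout, and the polling interval are all hypothetical.

```python
import time

# State names taken from the table above; "Allocated / Pending" and
# "Stopped / Deleted" are treated as separate names (an assumption).
CLUSTER_STATES = frozenset(
    {"Preparing", "Unallocated", "Allocated", "Pending", "Running", "Stopped", "Deleted"}
)


def wait_until_running(get_state, timeout_s=600.0, interval_s=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns "Running".

    Returns True on success. Raises TimeoutError if the deadline passes,
    RuntimeError if the cluster is deleted, ValueError on an unknown state.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state not in CLUSTER_STATES:
            raise ValueError(f"unknown cluster state: {state!r}")
        if state == "Running":
            return True
        if state == "Deleted":
            raise RuntimeError("cluster was deleted while waiting")
        if clock() >= deadline:
            raise TimeoutError(f"cluster still {state!r} after {timeout_s} s")
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.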

Cluster detail page

Tab	Contents
Info	General info, fabric type, instance type, zone, creation date
Attached VMs	List of member VMs and the state of each
Metrics	Cluster-level GPU, CPU, network, and memory usage

Using InfiniBand

After the cluster starts, member VMs are automatically connected over the fabric. Distributed training jobs, such as MPI-based workloads or PyTorch jobs launched with torchrun, use this fabric for GPU-to-GPU communication.

# Check InfiniBand devices
ibstat

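# Multi-node PyTorch distributed training (torchrun) example:
# 4 nodes x 8 processes per node = 32 processes in total.
# Run this on every member VM, setting --node_rank to 0, 1, 2 and 3 respectively.
torchrun --nproc_per_node=8 --nnodes=4 \
  --node_rank=0 \
  --master_addr=<master-VM-IP> \
  --master_port=29500 \
  train.py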
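
Because the torchrun command differs between nodes only in `--node_rank`, the per-node commands can be generated rather than typed by hand. The sketch below builds them as strings using the flags from the example above. The helper names are hypothetical, and the address `10.0.0.10` is a placeholder for the master VM's IP.

```python
import shlex


def torchrun_command(node_rank, nnodes, nproc_per_node, master_addr,
                     master_port=29500, script="train.py"):
    """Build the torchrun command line for one node of a multi-node job."""
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")
    args = [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]
    return shlex.join(args)


def world_size(nnodes, nproc_per_node):
    """Total number of training processes across the cluster."""
    return nnodes * nproc_per_node


if __name__ == "__main__":
    # One command per member VM; 10.0.0.10 is a placeholder address.
    for rank in range(4):
        print(torchrun_command(rank, nnodes=4, nproc_per_node=8, master_addr="10.0.0.10"))
```

Every node must use the same `--master_addr` and `--master_port`, which both point at the node running rank 0.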

See InfiniBand setup and benchmark for detailed configuration and benchmarks.

Use cases

Large-scale LLM pretraining: distributed training over tens to hundreds of GPUs
Multi-node fine-tuning: faster gradient synchronization
HPC scientific computing: MPI-based parallel simulation

Virtual clusters cannot use spot capacity

Create member VMs on on-demand or reserved capacity.

Next steps

InfiniBand setup and benchmark: how to validate performance
Parallel file system: shared storage for multi-node training
GPU ML training environment: from CUDA/PyTorch install to training
