
Operating spot VMs

Overview

Spot VMs run on ECI's idle GPU capacity at a discount compared with on-demand pricing. The discount varies with supply and demand, and a VM may be reclaimed when capacity tightens, so checkpointing and reclamation detection are mandatory.


Creating a spot VM

  1. Under Compute > Virtual Machines, click Create VM.
  2. On the Basic Info step, set the Pricing type to Spot. (Selecting spot automatically disables the Always-On and Disaster Recovery options.)
  3. In the instance-type list, pick a GPU type whose availability is Currently available.
  4. Fill in the remaining settings and create the VM.

Checking availability

Spot uses idle GPU capacity, so availability fluctuates frequently.

Option 1: From the VM creation screen

Once you select the spot pricing type, availability is shown inline next to each instance type.

Option 2: Infrastructure > Resource Status > Spot menu

Availability                       Description
Currently available ({n})          Capacity is reserved; you can create and run VMs
No capacity currently available    Capacity is exhausted; creating or running a VM will fail

Reclamation

When capacity runs short, ECI force-reclaims spot VMs.

Reclamation process

  1. Once reclamation is decided, the metadata API exposes the scheduled termination time.
  2. After a 1–2 minute grace period, the VM is force-reclaimed.

Detecting reclamation: the metadata API

From inside the VM, the command below tells you whether reclamation is scheduled.

curl -s --unix-socket /run/eci-guest-agent.sock \
  "http://localhost/vm/metadata?key=spot_termination_time"
  • No reclamation scheduled: HTTP 404 with the body 404 page not found
  • Reclamation scheduled: HTTP 200 with the termination time as the body (e.g. "2026-04-08T05:00:00+00:00")
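
For code that already runs in Python, the same check can be made without shelling out to curl. The following is a minimal sketch using only the standard library. The socket path and query come from the command above; the helper names (UnixHTTPConnection, parse_termination_time, get_termination_time, seconds_remaining) are hypothetical, and the parser assumes the response body is an ISO 8601 timestamp with a numeric UTC offset, possibly wrapped in quotes as in the example.

```python
import http.client
import socket
from datetime import datetime, timezone

SOCKET_PATH = "/run/eci-guest-agent.sock"
METADATA_PATH = "/vm/metadata?key=spot_termination_time"


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=2.0):
        super().__init__("localhost", timeout=timeout)
        self._socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self._socket_path)
        self.sock = sock


def parse_termination_time(status, body):
    """Return the scheduled termination time, or None if nothing is scheduled."""
    if status != 200:
        return None
    # Assumption: the body is an ISO 8601 timestamp, possibly wrapped in quotes.
    return datetime.fromisoformat(body.strip().strip('"'))


def get_termination_time(socket_path=SOCKET_PATH):
    """Query the guest agent once and parse its answer."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", METADATA_PATH)
        resp = conn.getresponse()
        return parse_termination_time(resp.status, resp.read().decode())
    finally:
        conn.close()


def seconds_remaining(termination_time, now=None):
    """Seconds left before the scheduled termination (never negative)."""
    now = now or datetime.now(timezone.utc)
    return max(0.0, (termination_time - now).total_seconds())
```

Calling get_termination_time() inside the VM returns None while no reclamation is scheduled. Once it returns a timestamp, seconds_remaining() gives a rough budget for the final checkpoint write.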

Reclamation watcher script (run in background)

#!/bin/bash
# spot_watcher.sh: poll the metadata API and checkpoint once reclamation is scheduled

while true; do
  # -w prints only the HTTP status code; the response body goes to /tmp/spot_response
  HTTP_CODE=$(curl -s -o /tmp/spot_response -w "%{http_code}" \
    --unix-socket /run/eci-guest-agent.sock \
    "http://localhost/vm/metadata?key=spot_termination_time")

  if [ "$HTTP_CODE" = "200" ]; then
    echo "[$(date)] Reclamation scheduled: $(cat /tmp/spot_response)"
    /path/to/save_checkpoint.sh  # call your checkpoint script
    break
  fi
  sleep 5
done

Make the script executable and start it in the background:

chmod +x spot_watcher.sh
nohup ./spot_watcher.sh &
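
If the training job is a Python process, an alternative to a separate shell watcher is to poll from a background thread and let the training loop react at the next step boundary. The sketch below is one possible arrangement, not an ECI-provided API: start_reclamation_watcher and should_checkpoint are hypothetical names, and poll is any callable you supply that returns a truthy value once the metadata API reports a scheduled termination.

```python
import threading
import time


def start_reclamation_watcher(poll, interval_s=5.0):
    """Call poll() every interval_s seconds; set the returned Event once it is truthy."""
    reclaim_event = threading.Event()

    def _watch():
        while not reclaim_event.is_set():
            try:
                if poll():
                    reclaim_event.set()
                    return
            except OSError:
                pass  # guest agent briefly unreachable; retry on the next tick
            time.sleep(interval_s)

    # Daemon thread: it never keeps the training process alive on its own.
    threading.Thread(target=_watch, daemon=True).start()
    return reclaim_event


def should_checkpoint(step, every_n, reclaim_event):
    """Checkpoint on the regular schedule, or right away once reclamation is scheduled."""
    return step % every_n == 0 or reclaim_event.is_set()
```

In a training loop, check should_checkpoint(step, 100, reclaim_event) after each step, and stop training after the save if reclaim_event.is_set(). Saving from the training thread itself means the checkpoint is always taken between steps rather than in the middle of an optimizer update.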

Saving checkpoints

Data inside the VM is lost on reclamation

Anything in VM memory or local ephemeral storage disappears when reclamation happens. Always save checkpoints to block storage on a regular schedule.

PyTorch checkpoint example

import os
import torch

CHECKPOINT_PATH = "/data/checkpoints/checkpoint.pt"

def save_checkpoint(model, optimizer, epoch, step, loss):
    os.makedirs(os.path.dirname(CHECKPOINT_PATH), exist_ok=True)
    tmp_path = CHECKPOINT_PATH + ".tmp"
    torch.save({
        'epoch': epoch,
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, tmp_path)
    # Rename only after the write completes, so a reclamation mid-write
    # never replaces the last good checkpoint with a truncated file
    os.replace(tmp_path, CHECKPOINT_PATH)
    print(f"[epoch {epoch}, step {step}] checkpoint saved")

def load_checkpoint(model, optimizer):
    if os.path.exists(CHECKPOINT_PATH):
        ckpt = torch.load(CHECKPOINT_PATH)
        model.load_state_dict(ckpt['model_state_dict'])
        optimizer.load_state_dict(ckpt['optimizer_state_dict'])
        return ckpt['epoch'], ckpt['step'] + 1  # resume at the step after the saved one
    return 0, 0  # start from scratch

# Training loop (model, optimizer, dataloader, total_epochs, and train_step are defined elsewhere)
start_epoch, start_step = load_checkpoint(model, optimizer)
for epoch in range(start_epoch, total_epochs):
    for step, batch in enumerate(dataloader):
        if epoch == start_epoch and step < start_step:
            continue  # skip batches already processed before the reclamation
        loss = train_step(model, batch)  # train_step is assumed to return the loss
        if step % 100 == 0:
            save_checkpoint(model, optimizer, epoch, step, loss)  # save every 100 steps

Saving only on reclamation detection can be too late

The grace period (1–2 minutes) may not be enough to finish writing. Combine reclamation detection with regular (every N steps) checkpointing.
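
A quick estimate shows whether a final save can fit in the grace period at all. The sketch below is a back-of-the-envelope check, not a measurement: it takes the short end of the 1–2 minute grace period (60 seconds), subtracts a detection delay equal to the watcher's 5-second polling interval, and compares the remainder with checkpoint size divided by write throughput. The function names and the throughput figure in the usage note are hypothetical; measure your own block storage before relying on the result.

```python
def estimated_write_seconds(checkpoint_bytes, write_bytes_per_second):
    """Time to write the checkpoint at a sustained throughput."""
    return checkpoint_bytes / write_bytes_per_second


def fits_in_grace_period(checkpoint_bytes, write_bytes_per_second,
                         grace_seconds=60.0, detection_delay_seconds=5.0):
    """True if the write is expected to finish before the VM is reclaimed.

    grace_seconds defaults to the short end of the 1-2 minute grace period;
    detection_delay_seconds is the worst case for a 5-second polling loop.
    """
    budget = grace_seconds - detection_delay_seconds
    return estimated_write_seconds(checkpoint_bytes, write_bytes_per_second) <= budget
```

For example, at an assumed 200 MB/s, fits_in_grace_period(2 * 10**9, 200 * 10**6) is True (about 10 seconds), while a 14 GB checkpoint needs about 70 seconds and does not fit. In that second case only the periodic checkpoints can be counted on.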


Limitations

  • Spot pricing applies only to GPU instance types; CPU-only instances have no spot option
  • Attached block storage, public IPs, and networking are billed at standard rates
  • Always-On cannot be used
  • Disaster Recovery (DR) cannot be used
  • Cannot be joined to a virtual cluster

FAQ

My VM shut down out of nowhere.

It was reclaimed because of spot capacity pressure. Set up metadata-API polling and save checkpoints regularly so you can resume after restarting.

The Run button is disabled.

There's no spot capacity right now (you'll see "Spot capacity is currently insufficient to run"). Check availability under Infrastructure > Resource Status > Spot and try again later.

The spot price changed.

The spot price is applied at VM start time, so a restart may pick up a different price.


Next steps