Operating spot VMs
Overview
Spot VMs let you run on ECI's idle GPU capacity at a discount versus on-demand. The discount varies with supply and demand, and a VM may be reclaimed when capacity tightens, so checkpointing and reclamation detection are mandatory.
Creating a spot VM
- Under Compute > Virtual Machines, click Create VM.
- On the Basic Info step, set the Pricing type to Spot. (Selecting spot automatically disables the Always-On and Disaster Recovery options.)
- In the instance-type list, pick a GPU type whose availability is Currently available.
- Fill in the remaining settings and create the VM.
Checking availability
Spot uses idle GPU capacity, so availability fluctuates frequently.
Option 1: From the VM creation screen
Once you select the spot pricing type, availability is shown inline next to each instance type.
Option 2: Infrastructure > Resource Status > Spot menu
| Availability | Description |
|---|---|
| Currently available ({n}) | Capacity is available; you can create and run VMs |
| No capacity currently available | Capacity is exhausted; creating or running a VM will fail |
Reclamation
When capacity runs short, ECI force-reclaims spot VMs.
Reclamation process
- Once reclamation is decided, the metadata API exposes the scheduled termination time.
- After a 1–2 minute grace period, the VM is force-reclaimed.
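To make the grace period concrete, the sketch below computes how many seconds remain before a scheduled termination. It assumes the termination time is an ISO 8601 string with an explicit UTC offset, such as 2026-04-08T05:00:00+00:00; the function name `seconds_until_reclamation` is illustrative and not part of any ECI tooling.

```python
from datetime import datetime, timezone

def seconds_until_reclamation(termination_time, now=None):
    """Return the seconds left before the scheduled termination (never negative)."""
    # Assumption: the timestamp is ISO 8601 with a UTC offset, optionally
    # wrapped in double quotes.
    deadline = datetime.fromisoformat(termination_time.strip().strip('"'))
    if now is None:
        now = datetime.now(timezone.utc)
    return max(0.0, (deadline - now).total_seconds())

# Example with a fixed "now" so the result is reproducible.
now = datetime(2026, 4, 8, 4, 58, 30, tzinfo=timezone.utc)
print(seconds_until_reclamation("2026-04-08T05:00:00+00:00", now))  # 90.0
```

A value like this can be used to decide whether a final checkpoint is worth attempting or whether the job should rely on the most recent periodic checkpoint.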
Detecting reclamation: the metadata API
From inside the VM, the command below tells you whether reclamation is scheduled.
curl -s --unix-socket /run/eci-guest-agent.sock \
  "http://localhost/vm/metadata?key=spot_termination_time"
- No reclamation scheduled: 404 page not found
- Reclamation scheduled: returns the termination time (e.g. "2026-04-08T05:00:00+00:00")
Reclamation watcher script (run in background)
#!/bin/bash
# spot_watcher.sh: poll the metadata API and run a checkpoint once reclamation is scheduled
while true; do
  # Quote the URL so the shell never treats "?" as a glob character
  HTTP_CODE=$(curl -s -o /tmp/spot_response -w "%{http_code}" \
    --unix-socket /run/eci-guest-agent.sock \
    "http://localhost/vm/metadata?key=spot_termination_time")
  if [ "$HTTP_CODE" = "200" ]; then
    echo "[$(date)] Reclamation scheduled: $(cat /tmp/spot_response)"
    /path/to/save_checkpoint.sh  # call your checkpoint script
    break
  fi
  sleep 5  # poll every 5 seconds
done
chmod +x spot_watcher.sh
nohup ./spot_watcher.sh &
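A separate shell process cannot see model state held in the training process's memory, so one alternative is to poll from inside the training process. The following is a minimal sketch using only the Python standard library. It assumes the guest agent answers plain HTTP on the socket path used in the curl command above, and that the 200 response body is the timestamp itself (possibly quoted). The names `UnixSocketHTTPConnection`, `get_spot_termination_time`, and `start_reclamation_watcher` are illustrative, not an ECI SDK.

```python
import http.client
import socket
import threading

SOCKET_PATH = "/run/eci-guest-agent.sock"
METADATA_PATH = "/vm/metadata?key=spot_termination_time"

class UnixSocketHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects over a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=5):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock

def get_spot_termination_time(socket_path=SOCKET_PATH):
    """Return the termination time string on HTTP 200, otherwise None."""
    conn = UnixSocketHTTPConnection(socket_path)
    try:
        conn.request("GET", METADATA_PATH)
        response = conn.getresponse()
        body = response.read()
        if response.status == 200:
            # Assumption: the body is the timestamp, possibly in double quotes.
            return body.decode("utf-8").strip().strip('"')
        return None  # e.g. 404: no reclamation scheduled
    finally:
        conn.close()

def start_reclamation_watcher(check_fn=get_spot_termination_time, interval_s=5.0):
    """Poll check_fn in a daemon thread; set the returned Event once it returns a value."""
    reclaim_event = threading.Event()

    def _watch():
        while not reclaim_event.is_set():
            try:
                if check_fn() is not None:
                    reclaim_event.set()
                    return
            except OSError:
                pass  # socket missing or refused; try again on the next poll
            reclaim_event.wait(interval_s)

    threading.Thread(target=_watch, daemon=True).start()
    return reclaim_event
```

In a training loop, call `reclaim_event = start_reclamation_watcher()` once at startup, then check `reclaim_event.is_set()` after each step and save a checkpoint before exiting when it becomes true.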
Saving checkpoints
Anything in VM memory or local ephemeral storage disappears when reclamation happens. Always save checkpoints to block storage on a regular schedule.
PyTorch checkpoint example
import os
import torch

CHECKPOINT_PATH = "/data/checkpoints/checkpoint.pt"  # must be on block storage

def save_checkpoint(model, optimizer, epoch, step, loss):
    os.makedirs(os.path.dirname(CHECKPOINT_PATH), exist_ok=True)
    tmp_path = CHECKPOINT_PATH + ".tmp"
    torch.save({
        'epoch': epoch,
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, tmp_path)
    # Rename atomically so a reclamation mid-write cannot corrupt the last good checkpoint
    os.replace(tmp_path, CHECKPOINT_PATH)
    print(f"[epoch {epoch}, step {step}] checkpoint saved")

def load_checkpoint(model, optimizer):
    if os.path.exists(CHECKPOINT_PATH):
        ckpt = torch.load(CHECKPOINT_PATH)
        model.load_state_dict(ckpt['model_state_dict'])
        optimizer.load_state_dict(ckpt['optimizer_state_dict'])
        return ckpt['epoch'], ckpt['step'] + 1  # resume after the last saved step
    return 0, 0  # start from scratch

# Training loop (model, optimizer, dataloader, total_epochs, and train_step are defined elsewhere)
start_epoch, start_step = load_checkpoint(model, optimizer)
for epoch in range(start_epoch, total_epochs):
    for step, batch in enumerate(dataloader):
        if epoch == start_epoch and step < start_step:
            continue  # skip batches already processed before the restart
        loss = train_step(model, batch)  # train_step is assumed to return the loss
        if step % 100 == 0:
            save_checkpoint(model, optimizer, epoch, step, loss)  # save every 100 steps
The grace period (1–2 minutes) may not be enough to finish writing. Combine reclamation detection with regular (every N steps) checkpointing.
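As a rough illustration of why the grace period may fall short, the sketch below estimates whether an emergency checkpoint can finish in the time left, and combines the reclamation signal with the every-N-steps schedule. The checkpoint size, write throughput, and safety factor are made-up example values, and both function names are hypothetical; measure your own checkpoint size and storage throughput before relying on such an estimate.

```python
def fits_in_grace_period(checkpoint_bytes, write_bytes_per_s, seconds_left,
                         safety_factor=1.5):
    """Return True if an emergency checkpoint is expected to finish in time."""
    estimated_s = checkpoint_bytes / write_bytes_per_s
    # Pad the estimate, since throughput under load is rarely the best case.
    return estimated_s * safety_factor <= seconds_left

def should_save(step, every_n_steps, reclaim_scheduled):
    """Save on the regular schedule, or immediately once reclamation is scheduled."""
    return reclaim_scheduled or step % every_n_steps == 0

# Illustrative numbers: a 12 GB checkpoint at ~200 MB/s takes about 60 s to write.
print(fits_in_grace_period(12e9, 200e6, seconds_left=120))  # True  (60 s * 1.5 = 90 s)
print(fits_in_grace_period(12e9, 200e6, seconds_left=60))   # False (90 s > 60 s)
```

When the estimate does not fit, the last periodic checkpoint is the recovery point, which is why the save interval should be short enough that losing the steps since then is acceptable.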
Limitations
- Spot pricing applies only to GPU instance types. CPU-only instances do not have a spot option, and attached block storage, public IPs, and networking are billed at standard rates.
- Always-On cannot be used
- Disaster Recovery (DR) cannot be used
- Cannot be joined to a virtual cluster
FAQ
My VM shut down out of nowhere.
It was reclaimed because of spot capacity pressure. Set up metadata-API polling and save checkpoints regularly so you can resume after restarting.
The Run button is disabled.
There's no spot capacity right now (you'll see "Spot capacity is currently insufficient to run"). Check availability under Infrastructure > Resource Status > Spot and try again later.
The spot price changed.
The spot price is applied at VM start time, so a restart may pick up a different price.
Next steps
- Terraform spot VM guide: automating spot VMs as IaC
- Pricing model: spot vs on-demand vs reserved