Terraform guide (spot VM)
Overview
Spot VMs are offered at a discount versus on-demand (the discount fluctuates with supply and demand), but they can be reclaimed when capacity tightens. Managing spot VMs as Terraform code gives you consistent reprovisioning and checkpoint restoration after reclamation.
The following restrictions apply to VMs created at the spot price:
- Always-On: always_on = true returns SPOT_ALWAYS_ON_NOT_ALLOWED
- Disaster recovery (DR): dr = true returns SPOT_DR_NOT_ALLOWED
- Virtual cluster: cannot be attached (single-VM only)
- Applies to GPU instances only: CPU-only instance types do not have a spot price
For the underlying behavior, see Operating spot VMs.
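As a rough pre-flight check for the two flag constraints above, a script along these lines could flag a spot definition that enables always_on or dr before terraform apply reaches the API. It is only a text-level sketch: check_spot_constraints is a hypothetical helper, not part of the provider, and it assumes the file it scans contains only spot resources, since it does not parse HCL or resolve variables.

```shell
#!/bin/bash
# check_spot_constraints FILE  (hypothetical helper)
# Returns 1 if FILE sets always_on or dr to true, 2 if FILE is unreadable,
# and 0 otherwise. Text match only: HCL is not parsed, variables are not resolved.
check_spot_constraints() {
  local file=$1
  if [ ! -r "$file" ]; then
    echo "$file: cannot read file" >&2
    return 2
  fi
  # Print offending lines (with line numbers) to stderr when a match is found.
  if grep -nE '^[[:space:]]*(always_on|dr)[[:space:]]*=[[:space:]]*true([[:space:]#]|$)' "$file" >&2; then
    echo "$file: spot VMs require always_on = false and dr = false" >&2
    return 1
  fi
  return 0
}

# Possible usage before applying:
#   check_spot_constraints spot_vm.tf && terraform apply
```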
Defining a VM with spot pricing
The basic flow is the same as in the Terraform overview, but set the pricing_type of the eci_pricing data source to "spot".
# spot_vm.tf
data "eci_instance_type" "h100_1" {
  name = "G-NHHS-80" # H100 SXM × 1
}

data "eci_pricing" "spot_pricing" {
  name         = "G-NHHS-80"
  pricing_type = "spot" # ← spot pricing
}

data "eci_pricing" "storage_pricing" {
  name         = "Block Storage"
  pricing_type = "ondemand"
}

data "eci_block_storage_image" "ubuntu" {
  name = "Ubuntu 24.04 LTS (AI/GPU) (50 GiB)" # GPU workload: CUDA and drivers preinstalled
}

resource "eci_virtual_machine" "spot_training" {
  name             = "spot-training-01"
  instance_type_id = data.eci_instance_type.h100_1.id
  pricing_id       = data.eci_pricing.spot_pricing.id
  username         = "elice"
  password         = var.vm_password

  # Spot constraints: both must be false
  always_on = false
  dr        = false

  # Restore checkpoint + resume training
  on_init_script = <<-EOT
    #!/bin/bash
    set -e
    # 1) Restore the latest checkpoint from object storage
    aws s3 sync s3://${var.checkpoint_bucket}/latest /workspace/checkpoints \
      --endpoint-url ${var.s3_endpoint} || true
    # 2) Resume training (in the background)
    nohup python /workspace/train.py \
      --resume /workspace/checkpoints/latest.pt \
      >> /workspace/train.log 2>&1 &
    # 3) Start the reclamation watcher
    /workspace/spot-watcher.sh &
  EOT

  tags = { managed-by = "terraform", workload = "spot-training" }
}

resource "eci_block_storage" "boot_disk" {
  attached_machine_id = eci_virtual_machine.spot_training.id
  name                = "spot-training-01-boot"
  size_gib            = 200
  pricing_id          = data.eci_pricing.storage_pricing.id
  image_id            = data.eci_block_storage_image.ubuntu.id
  dr                  = false
  tags                = { managed-by = "terraform" }
}

resource "eci_network_interface" "ni" {
  attached_subnet_id  = var.subnet_id
  attached_machine_id = eci_virtual_machine.spot_training.id
  name                = "spot-training-01-ni"
  dr                  = false
  tags                = { managed-by = "terraform" }
}

resource "eci_virtual_machine_allocation" "run" {
  machine_id = eci_virtual_machine.spot_training.id
  tags       = { managed-by = "terraform" }

  depends_on = [
    eci_block_storage.boot_disk,
    eci_network_interface.ni,
  ]
}
Reclamation watcher
The spot-watcher.sh called by on_init_script should either be baked into the image or downloaded at boot.
#!/bin/bash
# spot-watcher.sh: on reclamation notice (1-2 min before), save checkpoint
while true; do
  # Poll the guest agent; the URL is quoted so the shell does not treat "?" as a glob
  HTTP_CODE=$(curl -s -o /tmp/spot_response -w "%{http_code}" \
    --unix-socket /run/eci-guest-agent.sock \
    "http://localhost/vm/metadata?key=spot_termination_time")
  if [ "$HTTP_CODE" -eq 200 ]; then
    echo "[$(date)] reclamation scheduled: $(cat /tmp/spot_response): saving checkpoint"
    # SIGUSR1 to the training process → checkpoint save (handled by the app)
    pkill -SIGUSR1 -f "python.*train.py" || true
    sleep 30
    # Upload to object storage (replace CHECKPOINT_BUCKET with your bucket name)
    aws s3 sync /workspace/checkpoints \
      s3://CHECKPOINT_BUCKET/latest \
      --endpoint-url https://s3.elice.cloud
    break
  fi
  sleep 5
done
For the reclamation detection mechanism, see Operating spot VMs.
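The fixed sleep 30 in the watcher is an estimate of how long the checkpoint save takes. One alternative, sketched here, is to wait until the checkpoint file is actually newer than a marker file created just before the signal is sent. The helper name wait_for_newer_file and the 60-second budget are assumptions chosen for illustration; latest.pt is the file name used in on_init_script. The sketch also assumes the training app writes the checkpoint to a temporary file and renames it into place, so that a newer latest.pt is a complete one.

```shell
#!/bin/bash
# wait_for_newer_file FILE MARKER TIMEOUT_SECONDS  (hypothetical helper)
# Returns 0 as soon as FILE has a later modification time than MARKER,
# or 1 if that has not happened after TIMEOUT_SECONDS.
wait_for_newer_file() {
  local file=$1 marker=$2 timeout=$3 waited=0
  until [ "$file" -nt "$marker" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Possible use inside spot-watcher.sh, in place of "sleep 30":
#   marker=$(mktemp)
#   pkill -SIGUSR1 -f "python.*train.py" || true
#   wait_for_newer_file /workspace/checkpoints/latest.pt "$marker" 60 || true
```

With a 1-2 minute notice, the timeout bounds how much of that window is spent waiting before the upload starts; the trailing || true lets the upload proceed even if no fresh checkpoint appeared.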
Variable definitions
# variables.tf
variable "vm_password" {
  description = "VM password (10-256 chars, 3+ character classes)"
  sensitive   = true
}

variable "subnet_id" {
  description = "Subnet UUID"
}

variable "checkpoint_bucket" {
  description = "Object storage bucket name"
}

variable "s3_endpoint" {
  description = "Object storage endpoint"
  default     = "https://s3.elice.cloud"
}
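The description of vm_password states a length of 10-256 characters and at least three character classes. A pre-flight check such as the following could catch a weak value before terraform apply does. The helper name check_vm_password is hypothetical, and the four classes it counts (lowercase, uppercase, digits, punctuation) are an assumption; the source does not list which classes the platform recognizes.

```shell
#!/bin/bash
# check_vm_password PASSWORD  (hypothetical helper)
# Returns 0 if PASSWORD is 10-256 characters long and contains at least
# three of: lowercase, uppercase, digit, punctuation. Returns 1 otherwise.
# The choice of these four classes is an assumption, not a documented rule.
check_vm_password() {
  local pw=$1 classes=0
  local len=${#pw}
  if [ "$len" -lt 10 ] || [ "$len" -gt 256 ]; then
    return 1
  fi
  case $pw in *[[:lower:]]*) classes=$((classes + 1)) ;; esac
  case $pw in *[[:upper:]]*) classes=$((classes + 1)) ;; esac
  case $pw in *[[:digit:]]*) classes=$((classes + 1)) ;; esac
  case $pw in *[[:punct:]]*) classes=$((classes + 1)) ;; esac
  [ "$classes" -ge 3 ]
}

# Possible usage, with the password supplied through Terraform's
# TF_VAR_<name> environment variable convention:
#   check_vm_password "$TF_VAR_vm_password" && terraform apply
```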
Reprovisioning after reclamation
When a spot VM is reclaimed, you can run it again by recreating only the eci_virtual_machine_allocation resource. The VM resource and disk are preserved, so on_init_script runs again and the checkpoint is restored automatically.
# Recreate just the allocation (VM and disk are kept)
terraform apply -replace="eci_virtual_machine_allocation.run"
Allocation can fail again if capacity is still tight. Check availability under Infrastructure > Resource Status > Spot in the portal and retry.
During hours when reclamation happens frequently, run terraform apply -replace=... every 10–30 minutes from CI to keep training running. Configure reclamation notification emails as a separate alert.
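Because allocation can fail while capacity is tight, a scheduled job might wrap the command in a bounded retry. The sketch below shows one way to do that in the shell; retry_with_delay is a hypothetical helper, and the attempt count and delay in the usage comment are placeholders rather than recommended values.

```shell
#!/bin/bash
# retry_with_delay MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]  (hypothetical helper)
# Runs COMMAND until it succeeds, sleeping DELAY_SECONDS between attempts.
# Returns 0 on the first success, 1 once MAX_ATTEMPTS attempts have all failed.
retry_with_delay() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 0
}

# Possible CI usage (values are illustrative):
#   retry_with_delay 3 600 terraform apply -auto-approve \
#     -replace="eci_virtual_machine_allocation.run"
```

The -auto-approve flag skips the interactive confirmation, which a CI job cannot answer. Note that -replace forces the allocation to be recreated on every successful run, so a scheduled job would presumably need to confirm first that the VM is not currently running; how to check that depends on the provider and is not shown here.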
Next steps
- Operating spot VMs: reclamation detection and checkpoint patterns
- Terraform cheatsheet: common commands and resource patterns
- Official provider docs