
Terraform guide (spot VM)

Overview

Spot VMs are offered at a discount versus on-demand (the discount fluctuates with supply and demand), but they can be reclaimed when capacity tightens. Managing spot VMs as Terraform code gives you consistent reprovisioning and checkpoint restoration after reclamation.

Spot VM constraints

VMs created at the spot price cannot use the following:

  • Always-On: always_on = true returns SPOT_ALWAYS_ON_NOT_ALLOWED
  • Disaster recovery (DR): dr = true returns SPOT_DR_NOT_ALLOWED
  • Virtual cluster: cannot be attached (single-VM only)
  • CPU-only instance types: spot pricing is offered for GPU instances only, so CPU-only types have no spot price

For the underlying behavior, see Operating spot VMs.


Defining a VM with spot pricing

The basic flow is the same as in the Terraform overview, but set the pricing_type of the eci_pricing data source to "spot".

# spot_vm.tf

data "eci_instance_type" "h100_1" {
  name = "G-NHHS-80" # H100 SXM × 1
}

data "eci_pricing" "spot_pricing" {
  name         = "G-NHHS-80"
  pricing_type = "spot" # ← spot pricing
}

data "eci_pricing" "storage_pricing" {
  name         = "Block Storage"
  pricing_type = "ondemand"
}

data "eci_block_storage_image" "ubuntu" {
  name = "Ubuntu 24.04 LTS (AI/GPU) (50 GiB)" # GPU workload: CUDA and drivers preinstalled
}

resource "eci_virtual_machine" "spot_training" {
  name             = "spot-training-01"
  instance_type_id = data.eci_instance_type.h100_1.id
  pricing_id       = data.eci_pricing.spot_pricing.id
  username         = "elice"
  password         = var.vm_password

  # Spot constraints: both must be false
  always_on = false
  dr        = false

  # Restore checkpoint + resume training
  on_init_script = <<-EOT
    #!/bin/bash
    set -e

    # 1) Restore the latest checkpoint from object storage
    aws s3 sync s3://${var.checkpoint_bucket}/latest /workspace/checkpoints \
      --endpoint-url ${var.s3_endpoint} || true

    # 2) Resume training (in the background)
    nohup python /workspace/train.py \
      --resume /workspace/checkpoints/latest.pt \
      >> /workspace/train.log 2>&1 &

    # 3) Start the reclamation watcher
    /workspace/spot-watcher.sh &
  EOT

  tags = { managed-by = "terraform", workload = "spot-training" }
}

resource "eci_block_storage" "boot_disk" {
  attached_machine_id = eci_virtual_machine.spot_training.id
  name                = "spot-training-01-boot"
  size_gib            = 200
  pricing_id          = data.eci_pricing.storage_pricing.id
  image_id            = data.eci_block_storage_image.ubuntu.id
  dr                  = false
  tags                = { managed-by = "terraform" }
}

resource "eci_network_interface" "ni" {
  attached_subnet_id  = var.subnet_id
  attached_machine_id = eci_virtual_machine.spot_training.id
  name                = "spot-training-01-ni"
  dr                  = false
  tags                = { managed-by = "terraform" }
}

resource "eci_virtual_machine_allocation" "run" {
  machine_id = eci_virtual_machine.spot_training.id
  tags       = { managed-by = "terraform" }

  depends_on = [
    eci_block_storage.boot_disk,
    eci_network_interface.ni,
  ]
}

Reclamation watcher

The spot-watcher.sh called by on_init_script should either be baked into the image or downloaded at boot.

#!/bin/bash
# spot-watcher.sh: on reclamation notice (1-2 min before), save checkpoint

while true; do
  # Poll the guest agent; the URL is quoted so the shell never treats "?" as a glob
  HTTP_CODE=$(curl -s -o /tmp/spot_response -w "%{http_code}" \
    --unix-socket /run/eci-guest-agent.sock \
    "http://localhost/vm/metadata?key=spot_termination_time")

  if [ "$HTTP_CODE" = "200" ]; then
    echo "[$(date)] reclamation scheduled: $(cat /tmp/spot_response): saving checkpoint"

    # SIGUSR1 to the training process → checkpoint save (handled by the app)
    pkill -SIGUSR1 -f "python.*train.py" || true
    sleep 30

    # Upload to object storage (replace CHECKPOINT_BUCKET with your bucket name)
    aws s3 sync /workspace/checkpoints \
      s3://CHECKPOINT_BUCKET/latest \
      --endpoint-url https://s3.elice.cloud
    break
  fi
  sleep 5
done

For the reclamation detection mechanism, see Operating spot VMs.
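
The watcher above waits a fixed 30 seconds after sending SIGUSR1. As a hedged alternative, the sketch below polls until the checkpoint file is newer than a marker file created just before the signal. The helper name wait_for_fresh_checkpoint and the marker path are hypothetical, and the sketch assumes train.py writes latest.pt atomically (write to a temporary file, then rename); none of this is part of the ECI tooling.

```shell
#!/bin/bash
# wait_for_fresh_checkpoint FILE REFERENCE TIMEOUT_SECONDS
# Returns 0 as soon as FILE is newer than REFERENCE, or 1 if that has not
# happened after roughly TIMEOUT_SECONDS. Hypothetical helper, not an ECI tool.
wait_for_fresh_checkpoint() {
  local file=$1 reference=$2 timeout=$3
  local waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # -nt is false when FILE does not exist yet, so a missing file keeps waiting
    if [ "$file" -nt "$reference" ]; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}
```

In the watcher, this could stand in for sleep 30: run touch /tmp/spot_signal_sent before pkill, then call wait_for_fresh_checkpoint /workspace/checkpoints/latest.pt /tmp/spot_signal_sent 45 || true. Keep the timeout short enough that the upload still fits inside the 1-2 minute notice.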


Variable definitions

# variables.tf
variable "vm_password" {
  description = "VM password (10-256 chars, 3+ character classes)"
  sensitive   = true
}
variable "subnet_id" { description = "Subnet UUID" }
variable "checkpoint_bucket" { description = "Object storage bucket name" }
variable "s3_endpoint" {
  description = "Object storage endpoint"
  default     = "https://s3.elice.cloud"
}

Reprovisioning after reclamation

When a spot VM is reclaimed, you can run it again by recreating only the eci_virtual_machine_allocation resource. The VM resource and disk are preserved, so on_init_script runs again and the checkpoint is restored automatically.

# Recreate just the allocation (VM and disk are kept)
terraform apply -replace="eci_virtual_machine_allocation.run"

Allocation can fail again if capacity is still tight. Check availability under Infrastructure > Resource Status > Spot in the portal and retry.
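
Because allocation can keep failing while capacity is tight, a retry with increasing delays is one way to automate the "retry" step. The following is a minimal sketch, not an official tool: retry_with_backoff is a hypothetical helper, and the attempt count and delays are arbitrary values to tune.

```shell
#!/bin/bash
# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay after every failure. Hypothetical helper for illustration.
retry_with_backoff() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

For example, retry_with_backoff 5 60 terraform apply -auto-approve -replace="eci_virtual_machine_allocation.run" would try up to five times, waiting 60, 120, 240, and 480 seconds between attempts.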

Drive automatic reallocation from CI

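During hours when reclamation happens frequently, a CI job that runs every 10–30 minutes and executes terraform apply -replace=... can keep training running. Have the job confirm that the VM is actually down first: -replace recreates the allocation even when nothing has changed, which would likely interrupt a VM that is still running. Configure reclamation notification emails as a separate alert.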
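
A scheduled CI job of this kind could be sketched as follows. This is only an illustration under stated assumptions: reallocate_if_down and reallocate are hypothetical names, and the probe that decides whether the VM is up (for example, an SSH port check) is something you supply; the source does not prescribe one.

```shell
#!/bin/bash
# reallocate: recreate only the allocation (VM and disk are kept).
reallocate() {
  terraform apply -auto-approve -replace="eci_virtual_machine_allocation.run"
}

# reallocate_if_down PROBE_COMMAND [ARGS...]
# Runs the probe; if it succeeds the VM is assumed to be up and nothing is
# changed, otherwise the allocation is recreated. Hypothetical helper.
reallocate_if_down() {
  if "$@"; then
    echo "VM is reachable; leaving the allocation alone"
    return 0
  fi
  echo "VM is not reachable; recreating the allocation"
  reallocate
}
```

A CI step might then call reallocate_if_down nc -z -w 5 "$VM_IP" 22, where VM_IP is a hypothetical variable holding the VM's address. Checking first keeps the job from replacing an allocation whose VM is still training.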


Next steps