Terraform guide (spot VM)

Overview

Spot VMs are offered at a discount versus on-demand (the discount fluctuates with supply and demand), but they can be reclaimed when capacity tightens. Managing spot VMs as Terraform code gives you consistent reprovisioning and checkpoint restoration after reclamation.

Spot VM constraints

VMs created at the spot price are subject to the following constraints:

Always-On: always_on = true returns SPOT_ALWAYS_ON_NOT_ALLOWED
Disaster recovery (DR): dr = true returns SPOT_DR_NOT_ALLOWED
Virtual cluster: a spot VM cannot be attached to one (single-VM only)
GPU instances only: CPU-only instance types do not have a spot price

For the underlying behavior, see Operating spot VMs.
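
To catch the two flag constraints before the API rejects them, a text-level pre-flight check can help. The following is a minimal sketch, not part of the provider: check_spot_constraints is a hypothetical helper that searches .tf files for always_on or dr set to the literal true. It assumes the files it is given define only spot resources, and it cannot see values supplied through variables.

```shell
# check_spot_constraints FILE...
# Hypothetical pre-flight helper (not part of the provider).
# Prints each line that sets always_on or dr to the literal true.
# Returns 1 if any such line exists, 0 if none, 2 if grep itself failed.
check_spot_constraints() {
  local pattern='^[[:space:]]*(always_on|dr)[[:space:]]*=[[:space:]]*true'
  local status=0
  grep -nHE "$pattern" "$@" || status=$?
  case $status in
    0) echo "spot constraint violated: always_on and dr must be false" >&2
       return 1 ;;
    1) return 0 ;;   # no offending line found
    *) return 2 ;;   # e.g. a file could not be read
  esac
}

# Usage:
#   check_spot_constraints spot_vm.tf && terraform apply
```

A text search is deliberately simple here; it is a convenience for fast feedback and does not replace the checks the API performs.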

Defining a VM with spot pricing

The basic flow is the same as in the Terraform overview, but set the pricing_type of the eci_pricing data source to "spot".

# spot_vm.tf

data "eci_instance_type" "h100_1" {
  name = "G-NHHS-80"   # H100 SXM × 1
}

data "eci_pricing" "spot_pricing" {
  name         = "G-NHHS-80"
  pricing_type = "spot"     # ← spot pricing
}

data "eci_pricing" "storage_pricing" {
  name         = "Block Storage"
  pricing_type = "ondemand"
}

data "eci_block_storage_image" "ubuntu" {
  name = "Ubuntu 24.04 LTS (AI/GPU) (50 GiB)"   # GPU workload: CUDA and drivers preinstalled
}

resource "eci_virtual_machine" "spot_training" {
  name             = "spot-training-01"
  instance_type_id = data.eci_instance_type.h100_1.id
  pricing_id       = data.eci_pricing.spot_pricing.id
  username         = "elice"
  password         = var.vm_password

  # Spot constraints: both must be false
  always_on = false
  dr        = false

  # Restore checkpoint + resume training
  on_init_script = <<-EOT
    #!/bin/bash
    set -e

    # 1) Restore the latest checkpoint from object storage
    aws s3 sync s3://${var.checkpoint_bucket}/latest /workspace/checkpoints \
      --endpoint-url ${var.s3_endpoint} || true

    # 2) Resume training (in the background)
    nohup python /workspace/train.py \
      --resume /workspace/checkpoints/latest.pt \
      >> /workspace/train.log 2>&1 &

    # 3) Start the reclamation watcher
    /workspace/spot-watcher.sh &
  EOT

  tags = { managed-by = "terraform", workload = "spot-training" }
}

resource "eci_block_storage" "boot_disk" {
  attached_machine_id = eci_virtual_machine.spot_training.id
  name                = "spot-training-01-boot"
  size_gib            = 200
  pricing_id          = data.eci_pricing.storage_pricing.id
  image_id            = data.eci_block_storage_image.ubuntu.id
  dr                  = false
  tags                = { managed-by = "terraform" }
}

resource "eci_network_interface" "ni" {
  attached_subnet_id  = var.subnet_id
  attached_machine_id = eci_virtual_machine.spot_training.id
  name                = "spot-training-01-ni"
  dr                  = false
  tags                = { managed-by = "terraform" }
}

resource "eci_virtual_machine_allocation" "run" {
  machine_id = eci_virtual_machine.spot_training.id
  tags       = { managed-by = "terraform" }

  depends_on = [
    eci_block_storage.boot_disk,
    eci_network_interface.ni,
  ]
}

Reclamation watcher

The spot-watcher.sh called by on_init_script should either be baked into the image or downloaded at boot.

#!/bin/bash
# spot-watcher.sh: on reclamation notice (1-2 min before), save checkpoint

while true; do
  # Quote the URL so the shell never treats "?" as a glob character
  HTTP_CODE=$(curl -s -o /tmp/spot_response -w "%{http_code}" \
    --unix-socket /run/eci-guest-agent.sock \
    "http://localhost/vm/metadata?key=spot_termination_time")

  if [ "$HTTP_CODE" -eq 200 ]; then
    echo "[$(date)] reclamation scheduled ($(cat /tmp/spot_response)); saving checkpoint"

    # SIGUSR1 to the training process → checkpoint save (handled by the app)
    pkill -SIGUSR1 -f "python.*train.py" || true
    sleep 30

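    # Upload to object storage (replace CHECKPOINT_BUCKET with your bucket name;
    # this standalone script is not templated by Terraform)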
    aws s3 sync /workspace/checkpoints \
      s3://CHECKPOINT_BUCKET/latest \
      --endpoint-url https://s3.elice.cloud
    break
  fi
  sleep 5
done

For the reclamation detection mechanism, see Operating spot VMs.
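
The watcher waits a fixed 30 seconds after signalling the training process. Since the notice arrives only 1-2 minutes before reclamation, one alternative is to wait just until the checkpoint file has actually been rewritten. The sketch below assumes the training app rewrites /workspace/checkpoints/latest.pt when it handles SIGUSR1; wait_for_newer_file is a hypothetical helper, not something the platform provides.

```shell
# wait_for_newer_file FILE MARKER TIMEOUT_SEC
# Polls once per second until FILE has a newer modification time than MARKER.
# Returns 0 as soon as that is true, 1 if TIMEOUT_SEC elapses first.
wait_for_newer_file() {
  local file=$1 marker=$2 timeout=$3 waited=0
  while true; do
    [[ $file -nt $marker ]] && return 0
    (( waited >= timeout )) && return 1
    sleep 1
    waited=$((waited + 1))
  done
}

# Usage inside spot-watcher.sh, in place of the fixed "sleep 30":
#   marker=$(mktemp)   # its timestamp marks the moment just before the signal
#   pkill -SIGUSR1 -f "python.*train.py" || true
#   wait_for_newer_file /workspace/checkpoints/latest.pt "$marker" 60 \
#     || echo "checkpoint not refreshed within 60s; uploading what exists"
```

Keeping the timeout well below the notice window leaves time for the upload that follows.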

Variable definitions

# variables.tf
variable "vm_password" {
  description = "VM password (10-256 chars, 3+ character classes)"
  sensitive   = true
}
variable "subnet_id"          { description = "Subnet UUID" }
variable "checkpoint_bucket"  { description = "Object storage bucket name" }
variable "s3_endpoint" {
  description = "Object storage endpoint"
  default     = "https://s3.elice.cloud"
}
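
Sensitive values such as vm_password can be supplied through Terraform's TF_VAR_ environment variables instead of being written to a file. The sketch below checks the password rule from the variable description before exporting it. valid_vm_password is a hypothetical helper, and the four character classes it counts (lowercase, uppercase, digit, other) are an assumption, since this guide only says "3+ character classes".

```shell
# valid_vm_password PASSWORD
# Returns 0 if PASSWORD is 10-256 characters long and uses at least three of
# four character classes. The class list is an assumption, not a documented rule.
valid_vm_password() {
  local pw=$1 classes=0
  (( ${#pw} >= 10 && ${#pw} <= 256 )) || return 1
  [[ $pw =~ [[:lower:]] ]] && classes=$((classes + 1))
  [[ $pw =~ [[:upper:]] ]] && classes=$((classes + 1))
  [[ $pw =~ [[:digit:]] ]] && classes=$((classes + 1))
  [[ $pw =~ [^[:alnum:]] ]] && classes=$((classes + 1))
  (( classes >= 3 ))
}

# Usage (reads the password without echoing it, then hands it to Terraform):
#   read -rs -p "VM password: " pw
#   valid_vm_password "$pw" && export TF_VAR_vm_password="$pw"
```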

Reprovisioning after reclamation

When a spot VM is reclaimed, you can run it again by recreating only the eci_virtual_machine_allocation resource. The VM resource and disk are preserved, so on_init_script runs again and the checkpoint is restored automatically.

# Recreate just the allocation (VM and disk are kept)
terraform apply -replace="eci_virtual_machine_allocation.run"

Allocation can fail again if capacity is still tight. Check availability under Infrastructure > Resource Status > Spot in the portal and retry.
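
Retrying by hand can be wrapped in a small backoff loop. This is a sketch under stated assumptions: retry_with_backoff is a hypothetical helper, the attempt count and delays are arbitrary examples, and it assumes terraform exits non-zero when allocation fails.

```shell
# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SEC COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after every failed attempt.
retry_with_backoff() {
  local max=$1 delay=$2 attempt=1
  shift 2
  while true; do
    "$@" && return 0
    if (( attempt >= max )); then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Usage (5 attempts, waiting 60s, 120s, 240s, 480s between them):
#   retry_with_backoff 5 60 terraform apply -auto-approve \
#     -replace="eci_virtual_machine_allocation.run"
```

Doubling the delay avoids hammering the API while capacity is tight, at the cost of a slower restart once capacity returns.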

Drive automatic reallocation from CI

During hours when reclamation happens frequently, run terraform apply -replace=... every 10–30 minutes from CI to keep training running. Configure reclamation notification emails as a separate alert.
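
A scheduled job that always replaces the allocation may also touch a VM that is still running; this guide does not say what happens in that case. One cautious pattern is to run a health check first and replace only when it fails. In the sketch below, reallocate_if_down and the TERRAFORM_BIN override are hypothetical, and the health command is whatever check fits your setup.

```shell
# reallocate_if_down HEALTH_CMD [ARGS...]
# Runs HEALTH_CMD; if it succeeds the VM is assumed to be up and nothing happens.
# Otherwise the allocation is recreated. TERRAFORM_BIN (hypothetical knob)
# overrides the binary, which is handy for dry runs.
reallocate_if_down() {
  if "$@"; then
    echo "VM is up; skipping reallocation"
    return 0
  fi
  echo "VM is down; recreating allocation"
  "${TERRAFORM_BIN:-terraform}" apply -auto-approve \
    -replace="eci_virtual_machine_allocation.run"
}

# Usage from a scheduled CI job, assuming VM_IP holds the VM address and
# an open SSH port means the VM is running:
#   reallocate_if_down nc -z -w 5 "$VM_IP" 22
```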

Next steps

Operating spot VMs: reclamation detection and checkpoint patterns
Terraform cheatsheet: common commands and resource patterns
Official provider docs
