Run ML training on a GPU VM

Goal

By the end of this tutorial you will have:

Created an ECI GPU VM and connected to it via SSH
Verified the CUDA environment and run GPU operations from PyTorch
Run a simple training loop and saved a checkpoint

Prerequisites

An ECI account with quota for a GPU instance
An SSH client (built-in on macOS/Linux; WSL or PuTTY on Windows)

Step 1: Create the GPU VM

Under Compute > Virtual Machines, click Create VM.

Use the following settings:

Field	Recommended
Disk type	`Image`
Image	An image with CUDA/PyTorch preinstalled (Ubuntu + CUDA)
Instance type	`G-NHHS-80` (H100 80GB SXM × 1)
Public IP	Create new
Username/password	Set in this step (username must start with a lowercase letter; password must be at least 10 characters and use 3 or more character classes)
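If you generate credentials from a script, you can check them against the rule in the table before you submit the form. The sketch below is only an approximation: `check_credentials` is a hypothetical helper, and it assumes that "lowercase start" means the first character is a-z and that the character classes are lowercase letters, uppercase letters, digits, and symbols. The portal's actual validation may differ.

```python
import re
import string


def check_credentials(username: str, password: str) -> list[str]:
    """Return a list of problems; an empty list means both values pass."""
    problems = []

    # Assumption: "lowercase start" means the first character is a-z.
    if not re.match(r"[a-z]", username):
        problems.append("username must start with a lowercase letter")

    if len(password) < 10:
        problems.append("password must be at least 10 characters")

    # Assumption: the classes are lowercase, uppercase, digits, and symbols.
    classes = [
        any(c.islower() for c in password),
        any(c.isupper() for c in password),
        any(c.isdigit() for c in password),
        any(c in string.punctuation for c in password),
    ]
    if sum(classes) < 3:
        problems.append("password must use at least 3 character classes")

    return problems


if __name__ == "__main__":
    print(check_credentials("trainer", "Gpu-train-2024"))
    print(check_credentials("Trainer", "short"))
```

An empty list means the values satisfy the rule as interpreted here. Any other result lists what to fix.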

Click Create and wait until the status becomes Running (usually 1–2 minutes).

Step 2: SSH in and verify the environment

ssh <username>@<PUBLIC_IP>

If password authentication doesn't work

Ubuntu's default cloud-init configuration disables password authentication for SSH. Either enable it through the init script when you create the VM, or connect through the web console and change the setting there. For details, see SSH connection.

Verify GPU and CUDA:

nvidia-smi
# GPU name, VRAM, driver version

python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# True NVIDIA H100 80GB HBM3
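To go one step beyond the availability check, you can run an actual operation on the GPU. The following is a minimal sketch, not part of the original tutorial: `matmul_check` is a hypothetical helper that multiplies two random matrices and reports how long it took. It falls back to the CPU when no GPU is visible, so it also runs on your local machine.

```python
import time

import torch


def matmul_check(size: int = 1024) -> torch.Tensor:
    # Fall back to the CPU so the script still runs without a GPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)

    start = time.perf_counter()
    c = a @ b
    if device.type == "cuda":
        # CUDA kernels launch asynchronously; wait before reading the clock.
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    print(f"{size}x{size} matmul on {device}: {elapsed * 1000:.1f} ms")
    return c


if __name__ == "__main__":
    matmul_check()
```

The first call on a GPU includes one-time CUDA initialization, so treat the printed time as a sanity check rather than a benchmark.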

Step 3: Run a simple training loop

# train.py
import torch
import torch.nn as nn

device = torch.device("cuda")

# Simple linear regression model
model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

for epoch in range(100):
    # Random inputs and targets, created directly on the GPU
    x = torch.randn(64, 10, device=device)
    y = torch.randn(64, 1, device=device)

    pred = model(x)
    loss = criterion(pred, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f"Epoch {epoch}: loss={loss.item():.4f}")

# Save the checkpoint
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, 'checkpoint.pt')
print("Checkpoint saved")

python3 train.py
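A checkpoint is only useful if you can load it again. The sketch below shows one way to restore the state that train.py saved, assuming the same model and optimizer definitions and the same dictionary keys. `load_checkpoint` is a hypothetical helper name, not part of PyTorch.

```python
from pathlib import Path

import torch
import torch.nn as nn


def load_checkpoint(path: str, device: torch.device):
    # Rebuild the same model and optimizer that train.py created.
    model = nn.Linear(10, 1).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    # map_location lets the file load even on a machine without a GPU.
    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return model, optimizer, checkpoint["epoch"]


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if Path("checkpoint.pt").exists():
        model, optimizer, last_epoch = load_checkpoint("checkpoint.pt", device)
        print(f"Loaded checkpoint from epoch {last_epoch}")
    else:
        print("checkpoint.pt not found; run train.py first")
```

To continue training, start the next loop at `last_epoch + 1` and reuse the returned `model` and `optimizer`.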

Step 4: Monitor GPU usage

In a separate terminal, watch GPU usage in real time:

watch -n 1 nvidia-smi

In the ECI portal you can also see GPU utilization, VRAM, CPU, and memory as charts on the VM detail page's Metrics tab. To compare multiple VMs side by side, use the Metrics Explorer.
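If you would rather record GPU usage from a script than watch it, nvidia-smi can print selected fields as CSV through its `--query-gpu` and `--format` options. The sketch below parses that output. The function names and dictionary keys are illustrative choices, and the sample values in the demo are made up.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Parse `--format=csv,noheader,nounits` output, one GPU per line."""
    stats = []
    for line in csv_text.strip().splitlines():
        # int() tolerates the space that follows each comma.
        util, used, total = (int(field) for field in line.split(","))
        stats.append(
            {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def read_gpu_stats() -> list[dict]:
    """Run nvidia-smi on the VM and return one dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    # Illustrative sample in the shape nvidia-smi prints for a single GPU.
    print(parse_gpu_stats("37, 2048, 81559\n"))
```

On the VM, call `read_gpu_stats()` inside your training loop or from a separate process. The parser assumes every field is numeric, so a field that nvidia-smi reports as unavailable would raise a `ValueError`.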

Next steps

Run LLM inference: run a large language model with vLLM
Operating spot VMs: cut costs with reclamation detection and checkpointing
Checkpoint backup: automatic backups to object storage
