Run ML training on a GPU VM
Goal
By the end of this tutorial you will have:
- Created an ECI GPU VM and connected to it via SSH
- Verified the CUDA environment and run GPU operations from PyTorch
- Run a simple training loop and saved a checkpoint
Prerequisites
- An ECI account with quota for a GPU instance
- An SSH client (built-in on macOS/Linux; WSL or PuTTY on Windows)
Step 1: Create the GPU VM
- Under Compute > Virtual Machines, click Create VM.
- Use the following recommended settings:
  - Disk type: Image
  - Image: An image with CUDA/PyTorch preinstalled (Ubuntu + CUDA)
  - Instance type: G-NHHS-80 (H100 80GB SXM × 1)
  - Public IP: Create new
  - Username/password: Set in step 1 (username starts with a lowercase letter; password of 10+ characters drawn from 3+ character classes)
- Click Create and wait until the status becomes Running (usually 1–2 minutes).
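The username and password rules in the settings can be checked locally before you fill in the form. The following is a minimal sketch, not ECI's actual validation: the function name `check_credentials` is hypothetical, and the four character classes (lowercase, uppercase, digits, punctuation) are an assumption about what "classes" means here.

```python
import string


def check_credentials(username, password):
    """Return a list of rule violations; an empty list means both values pass.

    Assumption: the character classes are lowercase letters, uppercase
    letters, digits, and punctuation. The portal may define them differently.
    """
    problems = []
    if not username or username[0] not in string.ascii_lowercase:
        problems.append("username must start with a lowercase letter")
    if len(password) < 10:
        problems.append("password must be at least 10 characters long")
    classes_used = sum([
        any(c in string.ascii_lowercase for c in password),
        any(c in string.ascii_uppercase for c in password),
        any(c in string.digits for c in password),
        any(c in string.punctuation for c in password),
    ])
    if classes_used < 3:
        problems.append("password must use at least 3 character classes")
    return problems


if __name__ == "__main__":
    # Example values only; do not reuse them for a real VM.
    for problem in check_credentials("mluser", "Tr4in-on-H100"):
        print(problem)
```

An empty result means the values satisfy the rules as interpreted above; the portal remains the authority on what it accepts.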
Step 2: SSH in and verify the environment
ssh <username>@<PUBLIC_IP>
If password authentication doesn't work
Ubuntu's default cloud-init configuration disables password authentication for SSH. Either enable it through the init script when you create the VM, or connect through the web console and change the setting there. For details, see SSH connection.
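Once you are on the VM through the web console, one way to confirm what the SSH server will actually do is to read its effective configuration, which `sudo sshd -T` prints as lowercase `keyword value` lines. The sketch below parses that output; the function name `password_auth_enabled` is hypothetical, and it assumes you can run `sshd -T` with root privileges.

```python
from typing import Optional


def password_auth_enabled(sshd_t_output: str) -> Optional[bool]:
    """Read the effective PasswordAuthentication value from `sshd -T` output.

    Returns True or False when the keyword is present, and None when it is
    not found in the text.
    """
    for line in sshd_t_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].lower() == "passwordauthentication":
            return parts[1].strip().lower() == "yes"
    return None


if __name__ == "__main__":
    # Sample text in the shape `sudo sshd -T` produces; on the VM, pipe the
    # real output in instead of using this string.
    sample = "port 22\npasswordauthentication no\npubkeyauthentication yes\n"
    print(password_auth_enabled(sample))
```

A result of `False` is consistent with the cloud-init default described above, so password logins will be refused until the setting is changed and the SSH service is reloaded.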
Verify GPU and CUDA:
nvidia-smi
# GPU name, VRAM, driver version
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# True NVIDIA H100 80GB HBM3
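The two checks above can also be gathered into one script that reports a missing piece instead of raising an exception. This is a sketch under assumptions: the function name `describe_environment` and the dictionary keys are illustrative, and the small matrix multiplication is just one simple way to confirm that a GPU operation completes.

```python
import shutil

try:
    import torch
except ImportError:  # PyTorch is missing: report it instead of crashing
    torch = None


def describe_environment():
    """Collect the facts checked by hand in this step into one dictionary."""
    info = {
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "torch_installed": torch is not None,
        "cuda_available": False,
        "devices": [],
    }
    if torch is not None and torch.cuda.is_available():
        info["cuda_available"] = True
        info["torch_version"] = torch.__version__
        info["cuda_version"] = torch.version.cuda
        info["devices"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
        # Run a small matrix multiplication on the GPU as a smoke test.
        x = torch.randn(256, 256, device="cuda")
        info["matmul_ok"] = bool(torch.isfinite(x @ x).all())
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as `False` on a GPU VM, compare the driver version shown by `nvidia-smi` with the CUDA version your PyTorch build expects.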
Step 3: Run a simple training loop
# train.py
import torch
import torch.nn as nn

device = torch.device("cuda")

# Simple linear regression model
model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

for epoch in range(100):
    # Random inputs and targets, created directly on the GPU
    x = torch.randn(64, 10, device=device)
    y = torch.randn(64, 1, device=device)

    pred = model(x)
    loss = criterion(pred, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f"Epoch {epoch}: loss={loss.item():.4f}")

# Save the checkpoint after the final epoch
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, 'checkpoint.pt')
print("Checkpoint saved")
Run the script:
python3 train.py
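A checkpoint is only useful if it can be loaded again. The sketch below shows one way to restore `checkpoint.pt` and work out the epoch to continue from. It assumes the same model and optimizer definitions as `train.py`; the helper names `build` and `load_checkpoint` are hypothetical, and the CPU fallback is there only so the sketch also runs on a machine without a GPU.

```python
import os

import torch
import torch.nn as nn

# Fall back to CPU so the sketch also runs on a machine without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def build():
    """Recreate the model and optimizer exactly as train.py defines them."""
    model = nn.Linear(10, 1).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    return model, optimizer


def load_checkpoint(path, model, optimizer):
    """Restore model and optimizer state; return the next epoch to run."""
    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"] + 1


if __name__ == "__main__":
    model, optimizer = build()
    start_epoch = 0
    if os.path.exists("checkpoint.pt"):
        start_epoch = load_checkpoint("checkpoint.pt", model, optimizer)
    print(f"Resuming from epoch {start_epoch}")
```

Restoring the optimizer state along with the weights matters for Adam, because its running moment estimates would otherwise restart from zero.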
Step 4: Monitor GPU usage
In a separate terminal, watch GPU usage in real time:
watch -n 1 nvidia-smi
In the ECI portal you can also see GPU utilization, VRAM, CPU, and memory as charts on the VM detail page's Metrics tab. To compare multiple VMs side by side, use the Metrics Explorer.
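To record the same numbers from a script rather than watching them, `nvidia-smi` can emit machine-readable CSV through its `--query-gpu` and `--format` options. The following is a minimal sketch: the function names and dictionary keys are illustrative, and it assumes every queried field comes back as an integer (a GPU that reports a field as unsupported would need extra handling).

```python
import shutil
import subprocess

QUERY_FIELDS = ["utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (field.strip() for field in line.split(","))
        stats.append({
            "util_percent": int(util),
            "used_mib": int(used),
            "total_mib": int(total),
        })
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed statistics."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        for index, gpu in enumerate(query_gpu_stats()):
            print(f"GPU {index}: {gpu['util_percent']}% util, "
                  f"{gpu['used_mib']}/{gpu['total_mib']} MiB")
```

Calling `query_gpu_stats()` on a timer and appending the results to a file gives a simple usage log to set beside the training output.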
Next steps
- Run LLM inference: run a large language model with vLLM
- Operating spot VMs: cut costs with reclamation detection and checkpointing
- Checkpoint backup: automatic backups to object storage