
Run ML training on a GPU VM

Goal

By the end of this tutorial you will have:

  • Created an ECI GPU VM and connected to it via SSH
  • Verified the CUDA environment and run GPU operations from PyTorch
  • Run a simple training loop and saved a checkpoint

Prerequisites

  • An ECI account with quota for a GPU instance
  • An SSH client (built-in on macOS/Linux; WSL or PuTTY on Windows)

Step 1: Create the GPU VM

  1. Under Compute > Virtual Machines, click Create VM.

  2. Use the following settings:

    Field              Recommended
    Disk type          Image
    Image              An image with CUDA/PyTorch preinstalled (Ubuntu + CUDA)
    Instance type      G-NHHS-80 (H100 80GB SXM × 1)
    Public IP          Create new
    Username/password  Set in step 1 (lowercase start; password 10+ chars, 3+ classes)

  3. Click Create and wait until the status becomes Running (usually 1–2 minutes).


Step 2: SSH in and verify the environment

ssh <username>@<PUBLIC_IP>

If password authentication doesn't work

Ubuntu's default cloud-init configuration disables password authentication for SSH. Either enable it through the init script when you create the VM, or connect through the web console and change the setting there. (See: SSH connection)

Verify GPU and CUDA:

nvidia-smi
# GPU name, VRAM, driver version

python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# True NVIDIA H100 80GB HBM3
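
The one-liner above only confirms that PyTorch can see the GPU. To also run an actual GPU operation, a slightly longer check can help. The following is a minimal sketch, not part of the ECI image: the file name gpu_check.py and the helper names describe_environment and matmul_on are hypothetical, and the script falls back to the CPU if CUDA is unavailable.

```python
# gpu_check.py: hypothetical helper script, not shipped with the image
import torch


def describe_environment() -> dict:
    """Collect basic facts about the PyTorch/CUDA setup."""
    return {
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "device_count": torch.cuda.device_count(),
    }


def matmul_on(device: torch.device, size: int = 1024) -> float:
    """Multiply two random matrices on `device` and return the mean of the result."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    c = a @ b
    # .item() copies the value to the host, which waits for the GPU work to finish
    return c.mean().item()


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"matmul on {device}: mean={matmul_on(device):.4f}")
```

If the last line reports cpu instead of cuda, the driver or the PyTorch build is likely the problem rather than the training code.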

Step 3: Run a simple training loop

# train.py
import torch
import torch.nn as nn

device = torch.device("cuda")

# Simple linear regression model
model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

for epoch in range(100):
    # Random inputs and targets, created directly on the GPU
    x = torch.randn(64, 10, device=device)
    y = torch.randn(64, 1, device=device)

    pred = model(x)
    loss = criterion(pred, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f"Epoch {epoch}: loss={loss.item():.4f}")

# Save the checkpoint once training has finished
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, 'checkpoint.pt')
print("Checkpoint saved")

Run the script:

python3 train.py
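
The checkpoint stores the epoch number plus the model and optimizer state, so a later run can pick up where this one stopped. The following is a minimal sketch of loading it back, assuming checkpoint.pt was written by train.py above. The file name resume.py and the helper load_checkpoint are hypothetical.

```python
# resume.py: hypothetical follow-up script; assumes checkpoint.pt came from train.py
import torch
import torch.nn as nn


def load_checkpoint(path, model, optimizer, device):
    """Restore model and optimizer state; return the next epoch to run."""
    # map_location lets a checkpoint saved on a GPU load on whatever device is present
    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"] + 1


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Rebuild the same architecture and optimizer as train.py before loading state
    model = nn.Linear(10, 1).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    start_epoch = load_checkpoint("checkpoint.pt", model, optimizer, device)
    print(f"Resuming from epoch {start_epoch}")
```

The model and optimizer must be constructed with the same shapes as in train.py, because the state dictionaries hold only tensors and hyperparameters, not the architecture itself.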

Step 4: Monitor GPU usage

In a separate terminal, watch GPU usage in real time:

watch -n 1 nvidia-smi

In the ECI portal you can also see GPU utilization, VRAM, CPU, and memory as charts on the VM detail page's Metrics tab. To compare multiple VMs side by side, use the Metrics Explorer.
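
If you want the same numbers in a log file rather than on screen, nvidia-smi can be polled from a script. The following is a minimal sketch using nvidia-smi's CSV query mode, assuming nvidia-smi is on the PATH. The file name gpu_log.py and the helper names are hypothetical, and the parser assumes every field is numeric (a field reported as not available would raise an error).

```python
# gpu_log.py: hypothetical helper; assumes nvidia-smi is on PATH
import subprocess
import time

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text: str) -> list:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        # int() tolerates the space that follows each comma in the CSV output
        util, used, total = (int(field) for field in line.split(","))
        stats.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return stats


def read_gpu_stats() -> list:
    """Run nvidia-smi once and return the parsed stats."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    for _ in range(10):  # ten samples, one second apart
        for index, gpu in enumerate(read_gpu_stats()):
            print(f"GPU {index}: {gpu['util_pct']}% util, "
                  f"{gpu['mem_used_mib']}/{gpu['mem_total_mib']} MiB")
        time.sleep(1)
```

Redirecting the output (python3 gpu_log.py > gpu.log) while train.py runs in another terminal gives a simple record of utilization over the run.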


Next steps