VM diagnostic state

When the VM diagnostic state shows Unhealthy or Unknown, follow the steps below. For an overview of the states themselves, see Compute dashboard.

Unknown

The monitoring system can't receive state information from the VM.

Step 1: Check the monitoring agent

Verify that eci-guest-agent is running inside the VM:

sudo systemctl is-active eci-guest-agent.service

Output active: the service is healthy. If the diagnostic state hasn't changed after 5 minutes, go to Step 2.
Output inactive or failed: restart the service with the command below.

sudo systemctl restart eci-guest-agent.service

After restarting, re-run the is-active check. For more detail, run sudo systemctl status eci-guest-agent.service; a line reading Active: active (running) indicates a healthy service.
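
The decision in this step can also be scripted. The following is a minimal sketch, not part of the agent's own tooling: agent_next_step is a hypothetical helper name, and the "inspect" branch for any state other than active, inactive, or failed is an assumption added for illustration.

```shell
# Hypothetical helper: map `systemctl is-active` output to the action in Step 1.
agent_next_step() {
  case "$1" in
    active)          echo "wait" ;;     # healthy; wait 5 minutes, then go to Step 2
    inactive|failed) echo "restart" ;;  # restart the service, then re-run is-active
    *)               echo "inspect" ;;  # assumption: other states need a manual look
  esac
}

# Usage (inside the VM):
#   agent_next_step "$(systemctl is-active eci-guest-agent.service)"
```

A transitional state such as activating falls into the last branch, where sudo systemctl status eci-guest-agent.service is the natural next command.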

Step 2: Check host communication

If the agent is healthy but the state stays Unknown, test communication with the host directly.

sudo systemctl stop eci-guest-agent.service
socat VSOCK-LISTEN:11190 STDOUT

Healthy: after the command runs, the cursor waits with no output, and after a moment a request like {"command": "cpu-metric", "args": null} arrives.
Unhealthy: an error is printed immediately.

Restart the service after the test

If you don't stop the agent before testing, you'll see Address already in use. Once you have the result, press Ctrl+C to stop socat if it is still running, then start the service again to resume monitoring.

sudo systemctl start eci-guest-agent.service
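
Because it is easy to forget to start the agent again, the stop, test, and start sequence can be wrapped in one function. This is a sketch under assumptions: vsock_test is a hypothetical name, and the 60-second limit passed to timeout is an arbitrary choice, not a documented value.

```shell
# Hypothetical wrapper: stop the agent, listen once on the vsock port, and
# always start the agent again afterward.
vsock_test() {
  sudo systemctl stop eci-guest-agent.service || return 1
  # Assumption: 60 seconds is long enough for one request to arrive.
  timeout 60 socat VSOCK-LISTEN:11190 STDOUT
  rc=$?
  # Resume monitoring whether or not the test succeeded.
  sudo systemctl start eci-guest-agent.service
  return "$rc"
}

# Usage (inside the VM):
#   vsock_test
```

An error printed immediately, with a non-zero exit status, matches the Unhealthy case. Exit status 124 only means that timeout ended the listener; check whether a request was printed before that happened.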

Step 3: Contact support

If the steps above don't resolve it, contact Support with:

Service logs: sudo journalctl -eu eci-guest-agent.service --no-pager | tail
The error message from the communication test (if applicable)

Unhealthy

The GPU allocated to the VM is not detected or not operating correctly.

Step 1: Check that the GPU hardware is detected

lspci | grep -i nvidia

If no NVIDIA device is listed, the hardware isn't detected.
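
For use in a script, the same check can be expressed as an exit status. This is a minimal sketch; has_nvidia_device is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if lspci-style input mentions an NVIDIA device.
has_nvidia_device() {
  grep -qi nvidia
}

# Usage (inside the VM):
#   lspci | has_nvidia_device && echo "detected" || echo "not detected"
```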

Step 2: Check driver / CUDA compatibility

If you see CUDA initialize failed, check that the installed NVIDIA driver version is compatible with the CUDA Toolkit version in use. For detailed steps, see PyTorch CUDA compatibility issues and GPU driver FAQ.
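
To compare the two versions, it helps to print them side by side. The sketch below assumes that nvcc and nvidia-smi are on PATH and that nvcc --version prints a line of the form "Cuda compilation tools, release X.Y, ..."; cuda_toolkit_release is a hypothetical helper name.

```shell
# Hypothetical helper: extract the toolkit release (for example 12.2) from
# `nvcc --version` output read on stdin.
cuda_toolkit_release() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Usage (inside the VM):
#   nvcc --version | cuda_toolkit_release
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
```

The first command prints the CUDA Toolkit release and the second prints the driver version, which can then be checked against the pages referenced above.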

Step 3: Contact support

If the issue persists, contact Support with:

Output of lspci | grep -i nvidia
Output of nvidia-smi
The CUDA Toolkit version in use and the full error text
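
The two command outputs can be gathered in one pass. This is a sketch rather than an official collection tool: gpu_support_report is a hypothetical name and the section headers are arbitrary. The CUDA Toolkit version and the error text still need to be added by hand.

```shell
# Hypothetical helper: print the command outputs that Support asks for,
# continuing even if one of the commands fails.
gpu_support_report() {
  echo "== lspci | grep -i nvidia =="
  lspci 2>&1 | grep -i nvidia || echo "(no NVIDIA device listed)"
  echo "== nvidia-smi =="
  nvidia-smi 2>&1 || echo "(nvidia-smi failed or is not installed)"
}

# Usage (inside the VM):
#   gpu_support_report > gpu-report.txt
```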

Next steps

Compute dashboard: monitor unhealthy VMs in aggregate
GPU driver FAQ: nvidia-smi failures, driver / library mismatches
PyTorch CUDA compatibility issues: PyTorch can't see CUDA
