
VM diagnostic state

When the VM diagnostic state shows Unhealthy or Unknown, follow the steps below. For an overview of the states themselves, see Compute dashboard.


Unknown

The monitoring system can't receive state information from the VM.

Step 1: Check the monitoring agent

Verify that eci-guest-agent is running inside the VM:

sudo systemctl is-active eci-guest-agent.service
  • Output active: the service is running. If the state hasn't changed after 5 minutes, go to Step 2.
  • Output inactive or failed: restart the service:
sudo systemctl restart eci-guest-agent.service

After restarting, run the is-active command again. For more detail, including recent log lines, run sudo systemctl status eci-guest-agent.service; a line reading Active: active (running) indicates a healthy service.
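The check-and-restart sequence above can be sketched as a small shell function. This is a minimal sketch, not an official tool: the names SERVICE, SYSTEMCTL, and ensure_agent_active are hypothetical, and the SYSTEMCTL override exists only so the logic can be dry-run with a stub.

```shell
#!/usr/bin/env bash
# Sketch: check the guest agent and restart it once if it is not active.

SERVICE="eci-guest-agent.service"
# Hypothetical override hook for dry runs; by default this runs the same
# command used in the steps above.
SYSTEMCTL="${SYSTEMCTL:-sudo systemctl}"

ensure_agent_active() {
  local state
  # is-active prints the state (active, inactive, failed, ...) on stdout
  # and exits non-zero when the unit is not active, hence "|| true".
  state="$($SYSTEMCTL is-active "$SERVICE" 2>/dev/null || true)"
  if [ "$state" = "active" ]; then
    echo "$state"
    return 0
  fi

  echo "agent state is '${state:-unknown}'; restarting" >&2
  $SYSTEMCTL restart "$SERVICE" || return 1

  # Re-check after the restart and report the final state.
  state="$($SYSTEMCTL is-active "$SERVICE" 2>/dev/null || true)"
  echo "$state"
  [ "$state" = "active" ]
}
```

Calling ensure_agent_active prints the final state and returns non-zero if the agent is still not active after one restart, which is the point at which the steps above suggest looking at systemctl status.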

Step 2: Check host communication

If the agent is healthy but the state stays Unknown, test communication with the host directly.

Stop the agent first, then listen on its port:

sudo systemctl stop eci-guest-agent.service
socat VSOCK-LISTEN:11190 STDOUT
  • Healthy: the command waits without printing anything, and after a moment a request such as {"command": "cpu-metric", "args": null} arrives.
  • Unhealthy: an error is printed immediately.

If you don't stop the agent before testing, you'll see Address already in use.

After the test, start the service again to resume monitoring:

sudo systemctl start eci-guest-agent.service
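The communication test can also be wrapped so that the listener stops by itself and the agent is always started again. The following is a minimal sketch under stated assumptions: the 30-second wait, the "inconclusive" outcome for an empty result, and the function names are illustrative choices, not documented behavior. It relies only on socat as used above and on timeout from GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch: run the vsock listener test with a time limit and classify the result.

SERVICE="eci-guest-agent.service"
PORT=11190
WAIT_SECONDS="${WAIT_SECONDS:-30}"   # assumed wait, not an official value

# Classify whatever the listener printed (stdout and stderr combined).
classify_listener_output() {
  if [ -z "$1" ]; then
    echo "inconclusive"      # nothing arrived within the wait
  elif printf '%s\n' "$1" | grep -q '"command"'; then
    echo "healthy"           # a JSON request from the host arrived
  else
    echo "unhealthy"         # output without a request, e.g. an error message
  fi
}

run_host_comm_test() {
  local out
  # The agent must be stopped first, otherwise the port is already in use.
  sudo systemctl stop "$SERVICE" || return 1
  # timeout ends the listener if nothing arrives or the connection stays open.
  out="$(timeout "$WAIT_SECONDS" socat "VSOCK-LISTEN:${PORT}" STDOUT 2>&1 || true)"
  # Always bring the agent back so that monitoring resumes.
  sudo systemctl start "$SERVICE"
  printf '%s\n' "$out"
  classify_listener_output "$out"
}
```

Running run_host_comm_test prints the captured output followed by one of healthy, unhealthy, or inconclusive. Keep the captured output: it is the error message to send to Support if the test fails.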

Step 3: Contact support

If the steps above don't resolve it, contact Support with:

  • Service logs: sudo journalctl -eu eci-guest-agent.service --no-pager | tail
  • The error message from the communication test (if applicable)

Unhealthy

The GPU allocated to the VM is not detected or not operating correctly.

Step 1: Check that the GPU hardware is detected

lspci | grep -i nvidia

If no NVIDIA device is listed, the hardware isn't detected.
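For scripted checks, the same detection can be turned into a pass/fail result. This is a small sketch; the function name gpu_detection_summary is hypothetical, and it reads lspci-style output from standard input so it can be tried on saved output as well.

```shell
#!/usr/bin/env bash
# Sketch: summarize whether lspci-style output (stdin) lists an NVIDIA device.

gpu_detection_summary() {
  local count
  # grep -c prints the number of matching lines; it exits non-zero on
  # zero matches, so "|| true" keeps the count of 0 without failing.
  count="$(grep -ci nvidia || true)"
  if [ "$count" -gt 0 ]; then
    echo "detected: $count NVIDIA device line(s)"
  else
    echo "not detected"
    return 1
  fi
}
```

It would be used as lspci | gpu_detection_summary, returning non-zero when no NVIDIA device is listed.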

Step 2: Check driver / CUDA compatibility

If you see CUDA initialize failed, check that the installed NVIDIA driver is compatible with the CUDA Toolkit version in use. For detailed steps, see PyTorch CUDA compatibility issues and GPU driver FAQ.
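As a rough first check before reading the linked guides, the CUDA version in the nvidia-smi header can be compared with the toolkit release printed by nvcc --version. The sketch below assumes that nvcc is installed and on the PATH (the page does not say so), and it applies a simplified rule: the toolkit release should not be newer than the CUDA version the driver reports. That rule is a rule of thumb, not a full compatibility matrix, and all function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare the driver-reported CUDA version with the toolkit release.

# Extract "CUDA Version: X.Y" from nvidia-smi output read on stdin.
driver_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Extract "release X.Y" from nvcc --version output read on stdin.
toolkit_cuda_version() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p' | head -n 1
}

# True when $1 <= $2 in version order (relies on GNU sort -V).
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# $1 = toolkit release, $2 = CUDA version reported by the driver.
check_cuda_compat() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "unknown"
    return 2
  fi
  if version_le "$1" "$2"; then
    echo "ok"
  else
    echo "toolkit newer than driver supports"
    return 1
  fi
}
```

A possible invocation is check_cuda_compat "$(nvcc --version | toolkit_cuda_version)" "$(nvidia-smi | driver_cuda_version)". A result of unknown means one of the two versions could not be read, which is itself worth including in a support request.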

Step 3: Contact support

If the issue persists, contact Support with:

  • Output of lspci | grep -i nvidia
  • Output of nvidia-smi
  • The CUDA Toolkit version in use and the full error text
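The first two items can be gathered into a single file to attach to the request. This is a minimal sketch; the function name collect_gpu_report and the default file name gpu-report.txt are hypothetical. The CUDA Toolkit version and the full error text still need to be added by hand.

```shell
#!/usr/bin/env bash
# Sketch: save the lspci and nvidia-smi output listed above into one file.

collect_gpu_report() {
  local out="${1:-gpu-report.txt}"
  {
    echo "== lspci | grep -i nvidia =="
    # Record a placeholder when nothing matches or lspci is unavailable.
    lspci 2>&1 | grep -i nvidia || echo "(no NVIDIA device listed)"
    echo "== nvidia-smi =="
    # Keep any error text: a failing nvidia-smi is useful evidence.
    nvidia-smi 2>&1 || echo "(nvidia-smi failed or is not installed)"
  } > "$out"
}
```

Calling collect_gpu_report writes gpu-report.txt in the current directory; passing a path as the first argument writes there instead.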

Next steps