VM diagnostic state
When the VM diagnostic state shows Unhealthy or Unknown, follow the steps below. For an overview of the states themselves, see Compute dashboard.
Unknown
The monitoring system can't receive state information from the VM.
Step 1: Check the monitoring agent
Verify that eci-guest-agent is running inside the VM:
sudo systemctl is-active eci-guest-agent.service
- Output active: the service is healthy. If the state hasn't changed after 5 minutes, go to step 2.
- Output inactive or failed: restart the service:
sudo systemctl restart eci-guest-agent.service
After restarting, re-run the is-active check. For more detail, including recent log lines, run sudo systemctl status eci-guest-agent.service; a line reading Active: active (running) indicates a healthy service.
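The decision in this step can be sketched as a small shell function. This is a minimal sketch, not an official tool: the function name next_action and its action labels (wait, restart, inspect) are hypothetical. Only the states active, inactive, and failed come from the step above; any other state is simply routed to a manual look at systemctl status.

```shell
# Map the output of `systemctl is-active` to the action this step recommends.
# next_action and its labels are hypothetical names used for illustration.
next_action() {
  case "$1" in
    active)          echo "wait" ;;     # healthy: allow 5 minutes, then go to step 2
    inactive|failed) echo "restart" ;;  # restart the service, then re-check
    *)               echo "inspect" ;;  # any other state: check `systemctl status`
  esac
}

# Example use on the VM (commented out so defining the function has no side effects):
# state=$(systemctl is-active eci-guest-agent.service)
# if [ "$(next_action "$state")" = "restart" ]; then
#   sudo systemctl restart eci-guest-agent.service
# fi
```

Keeping the mapping separate from the systemctl calls makes it easy to try the logic with sample state strings before running it against the real service.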
Step 2: Check host communication
If the agent is healthy but the state stays Unknown, test communication with the host directly.
sudo systemctl stop eci-guest-agent.service
socat VSOCK-LISTEN:11190 STDOUT
If you don't stop the agent before testing, you'll see Address already in use. After the test, start the service again to resume monitoring.
sudo systemctl start eci-guest-agent.service
- Healthy: after the socat command runs, the cursor waits, and after a moment a request like {"command": "cpu-metric", "args": null} arrives.
- Unhealthy: an error is printed immediately.
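To make the healthy and unhealthy outcomes concrete, the sketch below classifies text captured from the listener. It assumes, based only on the sample request above, that a request from the host is a JSON object containing a "command" key; the real request format may differ. The function name classify_listener_output, the inconclusive label, and the 30-second capture window in the usage comment are all hypothetical.

```shell
# Classify text captured from `socat VSOCK-LISTEN:11190 STDOUT`.
# classify_listener_output is a hypothetical helper name.
classify_listener_output() {
  if [ -z "$1" ]; then
    # Nothing was captured: no request arrived during the capture window.
    echo "inconclusive"
  elif printf '%s\n' "$1" | grep -q '"command"[[:space:]]*:'; then
    # Assumption: a host request contains a "command" key, as in the sample.
    echo "healthy"
  else
    # Anything else, such as an immediate socat error message.
    echo "unhealthy"
  fi
}

# Example use on the VM, with the agent stopped first (commented out):
# out=$(timeout 30 socat VSOCK-LISTEN:11190 STDOUT 2>&1)
# classify_listener_output "$out"
```

An empty capture is reported as inconclusive rather than unhealthy, because the source describes the unhealthy case as an error printed immediately, not as silence.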
Step 3: Contact support
If the steps above don't resolve it, contact Support with:
- Service logs:
sudo journalctl -eu eci-guest-agent.service --no-pager | tail
- The error message from the communication test (if applicable)
Unhealthy
The GPU allocated to the VM is not detected or not operating correctly.
Step 1: Check that the GPU hardware is detected
lspci | grep -i nvidia
If no NVIDIA device is listed, the hardware isn't detected.
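The same check can be wrapped so that a script can branch on the result. This is a minimal sketch: has_nvidia_device is a hypothetical helper name, and it reads lspci output from standard input so it can be tried with sample text.

```shell
# Read `lspci` output on stdin; succeed (exit 0) if any NVIDIA device is listed.
# has_nvidia_device is a hypothetical helper name.
has_nvidia_device() {
  grep -qi nvidia
}

# Example use on the VM (commented out):
# if lspci | has_nvidia_device; then
#   echo "GPU hardware detected"
# else
#   echo "GPU hardware not detected"
# fi
```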
Step 2: Check driver / CUDA compatibility
If you see the error CUDA initialize failed, check that the installed NVIDIA driver is compatible with the CUDA Toolkit version in use. For detailed steps, see PyTorch CUDA compatibility issues and GPU driver FAQ.
Step 3: Contact support
If the issue persists, contact Support with:
- Output of lspci | grep -i nvidia
- Output of nvidia-smi
- The CUDA Toolkit version in use and the full error text
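The command outputs in the list above can be gathered into one file before contacting Support. The following is a minimal sketch under stated assumptions: report_section and the file name gpu-report.txt are hypothetical, and the CUDA Toolkit version and error text still need to be added by hand.

```shell
# Run a command and print its output under a labelled header.
# A missing or failing command is noted in the report instead of aborting it.
# report_section is a hypothetical helper name.
report_section() {
  title=$1
  shift
  printf '===== %s =====\n' "$title"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || printf '(command exited with status %d)\n' "$?"
  else
    printf '(%s not found)\n' "$1"
  fi
}

# Example use on the VM (commented out; gpu-report.txt is a hypothetical name):
# {
#   report_section "lspci" sh -c 'lspci | grep -i nvidia'
#   report_section "nvidia-smi" nvidia-smi
# } > gpu-report.txt
```

Recording a missing nvidia-smi or a non-zero exit status in the report is deliberate, since that failure is itself useful information for Support.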
Next steps
- Compute dashboard: monitor unhealthy VMs in aggregate
- GPU driver FAQ: nvidia-smi failures, driver / library mismatches
- PyTorch CUDA compatibility issues: PyTorch can't see CUDA