GPU driver
nvidia-smi isn't working
Cause
When the Ubuntu kernel is updated by apt upgrade or unattended-upgrades and you reboot, the NVIDIA driver DKMS rebuild can fail, leaving the kernel module unable to load.
Diagnose
# Kernel version
uname -r
# Whether the driver module is loaded
lsmod | grep nvidia
# DKMS build status
sudo dkms status
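The checks above can be combined into one test of whether the NVIDIA module was built for the running kernel. This is a minimal sketch, assuming `dkms status` prints one line per module and kernel pair that ends in `: <state>`. The function name `nvidia_dkms_state` is hypothetical, not part of DKMS.

```shell
# Hypothetical helper: read `dkms status` output on stdin and print the
# DKMS state (e.g. "installed") of the nvidia module for the given kernel,
# or "missing" when no entry exists for that kernel.
nvidia_dkms_state() {
    kernel="$1"
    # Keep only nvidia lines that mention this exact kernel release,
    # then strip everything up to the final ": " to leave the state.
    state=$(grep '^nvidia' | grep -F -- ", $kernel," | head -n 1 | sed 's/.*: *//')
    echo "${state:-missing}"
}
```

Usage: `sudo dkms status | nvidia_dkms_state "$(uname -r)"`. A result of `missing` suggests the rebuild never ran for the new kernel, which matches the cause described above.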
Fix
# Rebuild DKMS for the current kernel
KERNEL=$(uname -r)
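# Extract the driver version; handles both "nvidia, X, ..." and "nvidia/X, ..." output formats
DRIVER=$(dkms status | grep -m1 '^nvidia' | sed -E 's|^nvidia[/, ]+([^,: ]+).*|\1|')
sudo dkms install "nvidia/$DRIVER" -k "$KERNEL"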
# Reload the module
sudo modprobe nvidia
nvidia-smi
If it still fails after the rebuild, restart the VM.
Prevent recurrence
- Run apt commands inside tmux so a dropped SSH session doesn't interrupt the DKMS build.
tmux # or tmux attach
sudo apt upgrade
- Disable automatic kernel updates by removing unattended-upgrades.
sudo apt remove unattended-upgrades
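To make the tmux habit harder to forget, a small guard can refuse to run a command outside tmux. This is a sketch only: it relies on tmux exporting the `TMUX` environment variable to the shells it starts, and the function names `in_tmux` and `run_in_tmux_only` are hypothetical.

```shell
# Succeeds only when the shell is running inside a tmux session.
# tmux sets the TMUX variable for processes it starts.
in_tmux() {
    [ -n "${TMUX:-}" ]
}

# Hypothetical wrapper: run the given command only when inside tmux,
# otherwise print a hint to stderr and return a non-zero status.
run_in_tmux_only() {
    if in_tmux; then
        "$@"
    else
        echo "Not inside tmux; run 'tmux' or 'tmux attach' first." >&2
        return 1
    fi
}
```

Usage: `run_in_tmux_only sudo apt upgrade`.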
Driver/Library version mismatch error
Cause
This happens when the NVIDIA driver was upgraded manually inside the VM, or a package install replaced the driver files while the old kernel module was still loaded.
Diagnose
nvidia-smi
# NVIDIA-SMI has failed because it couldn't communicate with NVIDIA driver
# or
# Driver Version: X CUDA Version: Y
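# Version of the kernel module that is currently loaded
cat /proc/driver/nvidia/version
# CUDA version PyTorch was built against (not the driver version)
python3 -c "import torch; print(torch.version.cuda)"
If the loaded kernel module version and the installed userspace driver version disagree, you have a mismatch.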
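One way to check for a mismatch is to compare the version of the loaded kernel module with the version of the module installed on disk. The sketch below assumes the on-disk module was installed by the same package upgrade as the userspace libraries, so it is a proxy rather than a definitive test. The function names are hypothetical; `modinfo -F version` prints the version field of a module file.

```shell
# Extract the first dotted version number (e.g. 535.129.03) from stdin.
first_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Compare two version strings and print a one-line verdict.
report_mismatch() {
    loaded="$1"
    on_disk="$2"
    if [ "$loaded" = "$on_disk" ]; then
        echo "match: $loaded"
    else
        echo "mismatch: loaded=$loaded on-disk=$on_disk"
    fi
}

# Hypothetical entry point: loaded module version vs. on-disk module version.
check_nvidia_versions() {
    loaded=$(first_version < /proc/driver/nvidia/version)
    on_disk=$(modinfo -F version nvidia)
    report_mismatch "$loaded" "$on_disk"
}
```

Run `check_nvidia_versions` on the affected VM. A `mismatch` line indicates that the driver files changed underneath a module that is still loaded.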
Fix
Try unloading the driver modules and reloading them in reverse order.
# Unload (reverse dependency order)
sudo modprobe -r nvidia_drm
sudo modprobe -r nvidia_modeset
sudo modprobe -r nvidia_uvm
sudo modprobe -r nvidia
# Load (the reverse of the unload order)
sudo modprobe nvidia
sudo modprobe nvidia_uvm
sudo modprobe nvidia_modeset
sudo modprobe nvidia_drm
nvidia-smi
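The unload sequence above can also be derived from `lsmod`, so that only modules that are actually loaded are touched. This is a minimal sketch; the function name `loaded_nvidia_modules` is hypothetical, and the module list is the same four names used in the steps above.

```shell
# Hypothetical helper: read `lsmod` output on stdin and print the NVIDIA
# modules that are loaded, in the unload order used above
# (dependents first, the core nvidia module last).
loaded_nvidia_modules() {
    # Skip the header line and keep only the module-name column.
    loaded=$(awk 'NR > 1 { print $1 }')
    for mod in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        if printf '%s\n' "$loaded" | grep -qx "$mod"; then
            echo "$mod"
        fi
    done
}
```

Usage: `for mod in $(lsmod | loaded_nvidia_modules); do sudo modprobe -r "$mod" || break; done`. Note that `modprobe -r` refuses to remove a module that is still in use, so stop any processes using the GPU first.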
If it still fails, reboot from within the VM. Use an OS reboot rather than a stop from the portal, so that NVMe cache data is preserved.
sudo reboot
We recommend keeping the driver version that ships with the default ECI image; do not upgrade it yourself. If the driver is already broken and the steps above don't recover it, the fastest fix is to recreate the VM from the default ECI image.