Run LLM inference on a GPU VM

Goal

By the end of this tutorial you will have:

Created a GPU VM from the PyTorch-preinstalled image
Run a Hugging Face LLM with vLLM
Queried the model through its OpenAI-compatible API endpoint

Prerequisites

An H100 or A100 instance (at least 40 GB of VRAM)
A Hugging Face account and token (for gated models)
A public IP
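
As a rough sanity check on the VRAM requirement, the memory needed for the weights alone can be estimated as parameter count times bytes per parameter. The sketch below assumes 16-bit weights (2 bytes each) and nominal parameter counts; the function name is illustrative, and the estimate ignores the KV cache and activations, which need additional headroom.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate the memory taken by model weights alone, in decimal GB.

    Assumes every parameter is stored at the same width (2 bytes for
    FP16/BF16). KV cache and activation memory are not included.
    """
    return num_params * bytes_per_param / 1e9


# Nominal sizes for the two models used later in this tutorial.
for name, params in [("7B model", 7e9), ("8B model", 8e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights at 16-bit")
```

Either model fits on a 40 GB card under these assumptions, with the remainder available for the KV cache.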

Step 1: Create the VM

Field	Recommended
Image	Ubuntu with CUDA and PyTorch preinstalled
Instance type	`G-NHHS-80` (H100 80GB SXM × 1) or higher
Block storage	500 GiB or more (for model storage)
Public IP	Create new

Step 2: Install vLLM

pip install vllm

When using the PyTorch-preinstalled image

CUDA, cuDNN, and PyTorch are already installed, so you only need to add vLLM.
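
After the install it can be useful to confirm which versions ended up in the environment, since pip may change the PyTorch build if vLLM's requirements call for a different one. The following is a minimal sketch using only the standard library; the helper names are illustrative and not part of vLLM.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages: Tuple[str, ...] = ("torch", "vllm")) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in packages}


print(report())
```

A `None` entry means the package is not installed in the Python environment that ran the check.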

Step 3: Run the model

# Example: Mistral 7B
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --port 8000 \
  --host 0.0.0.0

Gated Hugging Face models (e.g. Llama) require an access token from an account that has been granted access to the model:

export HUGGING_FACE_HUB_TOKEN="hf_your_token"
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --port 8000 \
  --host 0.0.0.0
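
The server does not answer requests until the weights are downloaded and loaded, which can take several minutes on the first run. One way to wait for it is to poll the `/v1/models` endpoint used later in this tutorial. This is a sketch under assumptions: the function name, timeout, and polling interval are illustrative choices, not vLLM settings.

```python
import http.client
import time
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll GET {base_url}/v1/models until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout, or HTTP error: the server is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Run on the VM itself, `wait_until_ready("http://127.0.0.1:8000")` returns `True` once the model is being served, or `False` if the timeout passes first.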

Step 4: Configure the firewall

Firewall rules are configured per virtual network. Open the detail page of the virtual network the VM belongs to and allow inbound TCP 8000.

Pick the target virtual network under Network > Virtual Networks.
In Firewall rules on the detail page, click Add rule (or Add first rule).
Add a rule with these values:

Field	Value
Protocol	`TCP`
Source	`<admin-IP>/32` (or `0.0.0.0/0` to allow any source address)
Destination	The VM IP or subnet CIDR
Start port	`8000`
Action	`ACCEPT`

Changes take effect within one minute.
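
To confirm that the rule is active, a plain TCP connection test from the admin machine is enough. The sketch below uses the standard `socket` module; the function name is illustrative. A `False` result can also mean that the vLLM server is not running, so check the server first.

```python
import socket


def port_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Refused, unreachable, or timed out (a dropped packet shows up as a timeout).
        return False
```

For example, `port_open("<PUBLIC_IP>", 8000)` should return `True` from an allowed source address once the rule has taken effect.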

Step 5: Query the API

# List models
curl http://<PUBLIC_IP>:8000/v1/models

# Generate text
curl http://<PUBLIC_IP>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "What is ECI?",
    "max_tokens": 200
  }'
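
The same completions request can be sent from Python without extra dependencies. This is a minimal sketch using the standard library; the helper names are illustrative, and it assumes the response follows the OpenAI completions format, where the generated text is in `choices[0].text`.

```python
import json
import urllib.request


def build_completion_request(model: str, prompt: str, max_tokens: int = 200) -> bytes:
    """Encode the JSON body sent to /v1/completions."""
    body = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return json.dumps(body).encode("utf-8")


def extract_text(response_body: bytes) -> str:
    """Return the generated text of the first choice in a completions response."""
    return json.loads(response_body)["choices"][0]["text"]


def complete(base_url: str, model: str, prompt: str, max_tokens: int = 200) -> str:
    """POST a completion request to the server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=build_completion_request(model, prompt, max_tokens),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_text(resp.read())
```

Usage mirrors the curl call: `complete("http://<PUBLIC_IP>:8000", "mistralai/Mistral-7B-Instruct-v0.2", "What is ECI?")`.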

Python client:

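from openai import OpenAI

# Point the OpenAI client at the vLLM server instead of api.openai.com.
# The client requires a non-empty api_key; vLLM only checks it if the
# server was started with --api-key.
client = OpenAI(
    base_url="http://<PUBLIC_IP>:8000/v1",
    api_key="not-needed"
)

# The model name must match the --model value passed to the server.
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)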
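
For longer answers it is often nicer to print tokens as they arrive. The sketch below uses the `stream=True` option of the OpenAI client's chat completions call; the `stream_reply` helper is illustrative and takes the client as an argument, so it works with the `client` object created above.

```python
def stream_reply(client, model: str, prompt: str) -> str:
    """Stream a chat completion, print text as it arrives, and return the full reply."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices
        delta = chunk.choices[0].delta.content
        if delta:  # content is None on role-only and final chunks
            parts.append(delta)
            print(delta, end="", flush=True)
    print()
    return "".join(parts)
```

Call it as `stream_reply(client, "mistralai/Mistral-7B-Instruct-v0.2", "Hello!")`.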

Next steps

Deploy an API server with FastAPI: expose the model as a REST API
InfiniBand benchmark: verify communication performance for multi-GPU distributed inference
