Run LLM inference on a GPU VM
Goal
By the end of this tutorial you will have:
- Created a GPU VM from the PyTorch-preinstalled image
- Run a Hugging Face LLM with vLLM
- Queried the model through its OpenAI-compatible API endpoint
Prerequisites
- An H100 or A100 instance (at least 40 GB of VRAM)
- A Hugging Face account and token (for gated models)
- A public IP
Step 1: Create the VM
| Field | Recommended |
|---|---|
| Image | Ubuntu with CUDA and PyTorch preinstalled |
| Instance type | G-NHHS-80 (H100 80GB SXM × 1) or higher |
| Block storage | 500 GiB or more (for model storage) |
| Public IP | Create new |
Step 2: Install vLLM
pip install vllm
Note: on the PyTorch-preinstalled image, CUDA, cuDNN, and PyTorch are already installed, so vLLM is the only package you need to add.
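As a quick sanity check after the install, the sketch below reports which of the expected packages cannot be found in the current Python environment. It uses only the standard library. The helper name `missing_packages` is hypothetical, and the package list (`torch`, `vllm`) is simply the two packages this step relies on.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["torch", "vllm"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("torch and vllm are both importable")
```

This only confirms that the packages are present. It does not import them, so it says nothing about whether the GPU is visible to PyTorch.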
Step 3: Run the model
# Example: Mistral 7B
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --port 8000 \
  --host 0.0.0.0
Gated Hugging Face models (e.g. Llama) require a token:
export HUGGING_FACE_HUB_TOKEN="hf_your_token"
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --port 8000 \
  --host 0.0.0.0
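The first start can take a while because the model weights have to be downloaded and loaded onto the GPU. A minimal sketch of waiting for the server from a second shell on the VM, assuming the `/v1/models` endpoint used later in this tutorial starts answering with HTTP 200 only once the model is loaded. The names `wait_until_ready` and `_http_ok`, the `probe` hook, and the timeout values are all hypothetical choices for illustration.

```python
import time
import urllib.error
import urllib.request


def _http_ok(url):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, etc.: not ready yet.
        return False


def wait_until_ready(url, timeout_s=600.0, interval_s=5.0, probe=None):
    """Poll `url` until the probe succeeds or `timeout_s` elapses.

    `probe` is an optional callable (url -> bool), mainly useful for testing.
    """
    probe = probe or _http_ok
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)


# Example (run on the VM while the server is starting):
# wait_until_ready("http://localhost:8000/v1/models")
```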
Step 4: Configure the firewall
Firewall rules are configured per virtual network. Open the detail page of the virtual network the VM belongs to and allow inbound TCP 8000.
- Pick the target virtual network under Network > Virtual Networks.
- In Firewall rules on the detail page, click Add rule (or Add first rule).
- Add a rule with these values:

| Field | Value |
|---|---|
| Protocol | TCP |
| Source | <admin-IP>/32 (or 0.0.0.0/0) |
| Destination | The VM IP or subnet CIDR |
| Start port | 8000 |
| Action | ACCEPT |
Changes take effect within one minute.
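To confirm that the rule is active, you can test from your own machine whether TCP port 8000 accepts connections. The following is a small sketch using only the standard library; `port_is_open` is a hypothetical helper name, and the 3-second timeout is an arbitrary choice.

```python
import socket


def port_is_open(host, port, timeout_s=3.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example (replace the placeholder with the VM's public IP):
# port_is_open("<PUBLIC_IP>", 8000)
```

A False result can mean either that the firewall rule has not taken effect yet or that the vLLM server is not listening, so check both.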
Step 5: Query the API
# List models
curl http://<PUBLIC_IP>:8000/v1/models
# Generate text
curl http://<PUBLIC_IP>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "What is ECI?",
    "max_tokens": 200
  }'
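The "model" value in a request has to match an ID the server is actually serving, which is what the model list returns. A minimal sketch of extracting those IDs, assuming the response follows the OpenAI-style list shape (an object with a "data" array of entries that each carry an "id"). `served_model_ids` is a hypothetical helper name, and the sample payload is hand-written for illustration, not captured server output.

```python
import json


def served_model_ids(response_body):
    """Extract model IDs from an OpenAI-style /v1/models JSON response body."""
    payload = json.loads(response_body)
    return [entry["id"] for entry in payload.get("data", [])]


# Hand-written sample in the OpenAI list format (not real server output).
sample = (
    '{"object": "list", "data": '
    '[{"id": "mistralai/Mistral-7B-Instruct-v0.2", "object": "model"}]}'
)
print(served_model_ids(sample))
```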
Python client:
from openai import OpenAI

client = OpenAI(
    base_url="http://<PUBLIC_IP>:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Next steps
- Deploy an API server with FastAPI: expose the model as a REST API
- InfiniBand benchmark: verify communication performance for multi-GPU distributed inference