Run LLM inference on a GPU VM

Goal

By the end of this tutorial you will have:

  • Created a GPU VM from the PyTorch-preinstalled image
  • Run a Hugging Face LLM with vLLM
  • Queried the model through its OpenAI-compatible API endpoint

Prerequisites

  • An H100 or A100 instance (at least 40 GB of VRAM)
  • A Hugging Face account and token (for gated models)
  • A public IP
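
To see how the 40 GB guideline relates to model size, the weights alone can be estimated as parameter count times bytes per parameter. The sketch below assumes 16-bit weights (2 bytes per parameter) and round parameter counts (7e9, 8e9, 70e9) chosen for illustration; it ignores the KV cache and activations, which need additional VRAM on top of the weights.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Round parameter counts used only for illustration (assumption).
for label, params in [("7B", 7e9), ("8B", 8e9), ("70B", 70e9)]:
    print(f"{label}: about {weight_memory_gib(params):.1f} GiB of 16-bit weights")
```

By this estimate a 7B or 8B model leaves ample headroom on a 40 GB or 80 GB GPU, while a 70B model in 16-bit precision would not fit on a single 80 GB card.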

Step 1: Create the VM

Field           Recommended
Image           Ubuntu with CUDA and PyTorch preinstalled
Instance type   G-NHHS-80 (H100 80 GB SXM × 1) or higher
Block storage   500 GiB or more (for model storage)
Public IP       Create new

Step 2: Install vLLM

pip install vllm

Note: when using the PyTorch-preinstalled image, CUDA, cuDNN, and PyTorch are already installed, so you only need to add vLLM.
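
A quick way to confirm the install is to check that both packages can be found and that PyTorch sees the GPU. This is a minimal sketch, not an official check; the helper names `missing_packages` and `cuda_available` are made up for this example.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def cuda_available():
    """Return True/False from PyTorch, or None if PyTorch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch  # imported lazily so the check also runs without PyTorch

    return torch.cuda.is_available()


if __name__ == "__main__":
    print("Missing packages:", missing_packages(["torch", "vllm"]) or "none")
    print("CUDA available:", cuda_available())
```

If either package is reported missing, or CUDA is reported unavailable, revisit the image choice in Step 1 before starting the server.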


Step 3: Run the model

# Example: Mistral 7B
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --port 8000 \
  --host 0.0.0.0

Gated Hugging Face models (e.g. Llama) require a token:

export HUGGING_FACE_HUB_TOKEN="hf_your_token"
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --port 8000 \
  --host 0.0.0.0
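
On first start the server has to download and load the model before it answers requests, so it can help to poll until it responds. The sketch below polls the `/v1/models` endpoint used later in this tutorial; the helper names, the 60 attempts, and the 5-second delay are assumptions for illustration, not vLLM defaults.

```python
import time
import urllib.request


def server_ready(url, timeout=5.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib.error.URLError is a subclass of OSError).
        return False


def wait_until_ready(check, attempts=60, delay=5.0, sleep=time.sleep):
    """Call `check()` up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if check():
            return True
        sleep(delay)
    return False


# Usage on the VM itself, once the server process has been started:
# ready = wait_until_ready(lambda: server_ready("http://127.0.0.1:8000/v1/models"))
```

Polling from the VM over 127.0.0.1 keeps this check independent of the firewall rule configured in the next step.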

Step 4: Configure the firewall

Firewall rules are configured per virtual network. Open the detail page of the virtual network the VM belongs to and allow inbound TCP traffic on port 8000.

  1. Pick the target virtual network under Network > Virtual Networks.

  2. In Firewall rules on the detail page, click Add rule (or Add first rule).

  3. Add a rule with these values:

    Field         Value
    Protocol      TCP
    Source        <admin-IP>/32 (or 0.0.0.0/0)
    Destination   The VM IP or subnet CIDR
    Start port    8000
    Action        ACCEPT

Changes take effect within one minute.
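
To check the rule from your own machine, a plain TCP connection attempt is enough. The sketch below is an illustration only; `port_open` is a hypothetical helper name, and the check succeeds only when the rule is active and the vLLM server is listening on the port.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered (timed out), or unresolvable host.
        return False


# Usage from the admin machine; replace <PUBLIC_IP> with the VM's public IP:
# print(port_open("<PUBLIC_IP>", 8000))
```

A result of False right after adding the rule may just mean the change has not taken effect yet, so retry after a minute.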


Step 5: Query the API

# List models
curl http://<PUBLIC_IP>:8000/v1/models

# Generate text
curl http://<PUBLIC_IP>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "What is ECI?",
    "max_tokens": 200
  }'

Python client:

from openai import OpenAI

client = OpenAI(
    base_url="http://<PUBLIC_IP>:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
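
If the `openai` package is not installed, the same completions request as the curl example can be sent with the Python standard library. This is a minimal sketch; `build_completion_request` and `completion_text` are hypothetical helper names, and the response parsing assumes the usual OpenAI-style completions shape with a `choices` list whose items carry a `text` field.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=200):
    """Build a POST request for the /completions endpoint under `base_url`."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def completion_text(response_body):
    """Extract the generated text of the first choice from a JSON response body."""
    return json.loads(response_body)["choices"][0]["text"]


# Usage; replace <PUBLIC_IP> with the VM's public IP:
# req = build_completion_request(
#     "http://<PUBLIC_IP>:8000/v1",
#     "mistralai/Mistral-7B-Instruct-v0.2",
#     "What is ECI?",
# )
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(completion_text(resp.read()))
```

Keeping request construction separate from the network call makes the payload easy to inspect before anything is sent.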

Next steps