Alerts

Overview

Alerts automatically send a notification when a metric such as GPU utilization, memory, or disk usage crosses a threshold you've set. Use them to detect training completion, catch problems early, and prevent cost leaks from idle resources.

An alert is made up of three components:

┌─────────────┐  fires   ┌────────────────┐  runs   ┌─────────┐
│ Alert rule  │ ───────▶ │ Alert event    │ ──────▶ │ Action  │
└─────────────┘          └────────────────┘         └─────────┘
   Condition              History entry             Send email

Component	Role	Page
Alert rule	Defines the metric threshold and evaluation	Monitoring > Alerts > Alert Rules
Action	Defines who is notified and how when the rule fires	Monitoring > Alerts > Actions
Alert event	The history of when the rule fired	Monitoring > Alerts > Events

Prerequisites

The Alert.AlertRule.CREATE and Alert.AlertActionTemplate.CREATE permissions
At least one action, created before the alert rule that uses it

Evaluation state and rule state

An alert rule has two kinds of state.

Evaluation state	Meaning
Ok	The metric does not breach the threshold
Alert	The metric breaches the threshold; an alert has fired
NoData	Not enough data to evaluate

Rule state	Meaning
Activated	Periodically evaluated; fires when conditions are met
Paused	Evaluation is suspended; no alerts will fire (settings are kept)

Step 1: Create an action

First, create the action (currently email only) that runs when an alert rule fires.

Go to Monitoring > Alerts > Actions > + Create Action.
Fill in:

Field	Description
Action name	Identifier (e.g. `GPU failure alert`)
Action type	`Email` (webhooks coming soon)
Recipient email	Add users via search, or enter comma-separated addresses (at least one)

Click Create.

Only email actions are supported today

Integrations with Slack webhooks and other external services are planned.

Step 2: Create an alert rule

Go to Monitoring > Alerts > Alert Rules > + Create Alert Rule.
Fill in the following:

Basic info

Field	Example
Name	`GPU utilization warning`
Description	`Alert when GPU utilization exceeds 90%`

Target resource

Field	Description
Resource type	`Virtual Machine` or `Object Storage`
Resource	The VM or bucket to monitor
Metric	Depends on the resource type (see below)

VM metrics

Category	Metric
GPU	Utilization (%), memory used/total (MiB), temperature (°C), power (W), clock speed (MHz)
CPU	User · System · Idle · I/O Wait · IRQ · Soft IRQ · Steal · Guest · Nice (%)
Memory	Total · Available · Active · Inactive · Cached · Buffers · Free (KiB) · Swap
Network	RX/TX bandwidth (Bps) · bytes · packets · drops · errors
Storage	Usage (Bytes) · read/write throughput (Bps) · IOPS · latency (sec)

Object storage metrics

Metric	Description
Usage	Bucket size in bytes
Object count	Number of objects in the bucket
GET / PUT / COPY / LIST	API request counts
Multipart uploads	Number of multipart upload operations

Evaluation conditions

Field	Description
Aggregation	`avg` / `sum` / `min` / `max` (`AlertAggEnum`)
Aggregation interval	`1m` / `5m` / `15m` / `30m` / `1h`
Operator	`>` / `>=` / `<` / `<=`
Threshold	The value that triggers the alert (e.g. `5`, `90`, `0.85`)
Datapoints to evaluate	How many recent datapoints to inspect (e.g. `5`): `datapoints_to_evaluate`
Datapoints to alert	How many of those must breach to fire (e.g. `3`): `datapoints_to_alert`

Example: with interval 5m, evaluate 5, and alert 3, the rule inspects the five most recent completed 5-minute datapoints (the last 25 minutes) and fires if 3 or more of them breach the threshold.
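The "M of N datapoints" rule in the example above can be sketched in Python as follows. This is a minimal illustration, not the service's implementation: the function name `evaluate_rule`, the `OPERATORS` table, and the choice to return `NoData` when fewer than `datapoints_to_evaluate` datapoints exist are assumptions. Only `datapoints_to_evaluate` and `datapoints_to_alert` come from the form fields.

```python
import operator

# Comparison operators offered in the rule form.
OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def evaluate_rule(datapoints, op, threshold, datapoints_to_evaluate, datapoints_to_alert):
    """Return "Ok", "Alert", or "NoData" for a list of aggregated datapoints.

    `datapoints` is ordered oldest to newest, one value per completed interval.
    """
    if len(datapoints) < datapoints_to_evaluate:
        # Assumption: too few datapoints means the rule cannot be evaluated yet.
        return "NoData"
    window = datapoints[-datapoints_to_evaluate:]
    breaches = sum(1 for value in window if OPERATORS[op](value, threshold))
    return "Alert" if breaches >= datapoints_to_alert else "Ok"


# Five 5-minute averages of GPU utilization (%), threshold > 90, evaluate 5, alert 3.
print(evaluate_rule([88, 93, 95, 89, 97], ">", 90, 5, 3))  # Alert (3 breaches)
print(evaluate_rule([88, 93, 85, 89, 97], ">", 90, 5, 3))  # Ok (2 breaches)
```

The breaching datapoints do not need to be consecutive in this sketch; any 3 of the 5 are enough.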

Only completed intervals are evaluated

The system aggregates data over each interval into one value before evaluating. An in-progress interval has incomplete data and is excluded from evaluation.

For example, with a 5-minute interval and an 80% GPU utilization threshold, suppose the 10:15–10:20 interval just started and GPU utilization briefly spiked to 95%. The breach is not counted until the 5-minute window closes and the actual average is computed. If that average turns out to be 72%, the interval is classified as Ok.
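To make the "completed intervals only" behavior concrete, here is a small Python sketch under stated assumptions: it uses `avg` aggregation, aligns intervals to multiples of 300 seconds, and works on plain `(seconds, value)` pairs. The function name `completed_interval_averages` and the sample data are hypothetical and do not reflect how the service stores metrics.

```python
from collections import defaultdict

INTERVAL_SECONDS = 5 * 60  # 5-minute aggregation interval


def completed_interval_averages(samples, now):
    """Average (timestamp, value) samples per interval, skipping the in-progress one.

    `samples` holds (seconds, value) pairs; `now` is the current time in seconds.
    Returns {interval_start: average} for intervals that have fully elapsed.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        start = timestamp - timestamp % INTERVAL_SECONDS
        buckets[start].append(value)
    current_start = now - now % INTERVAL_SECONDS
    return {
        start: sum(values) / len(values)
        for start, values in sorted(buckets.items())
        if start < current_start  # the interval containing `now` is still open
    }


samples = [(0, 70), (150, 80), (310, 95)]
# At t=320 the second interval has just started, so the 95% spike is not counted yet.
print(completed_interval_averages(samples, now=320))  # {0: 75.0}

# Once the interval closes, its average (72%) is what gets compared to the threshold.
samples += [(450, 60), (590, 61)]
print(completed_interval_averages(samples, now=600))  # {0: 75.0, 300: 72.0}
```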

Preview diagram

The preview diagram on the right of the form shows how your conditions will be evaluated and updates live as you change them.

Green dots: datapoints within the threshold
Yellow dots: datapoints that breached the threshold
Red dots: points where the alert actually fired (datapoints-to-alert reached)
Dashed line: the threshold

Link the action

Pick the action you just created (you can select more than one).

Click Create.

Use-case walkthroughs

GPU overload alert

In Monitoring > Alerts > Actions, create an email action with recipients.
Click Alert Rules > + Create Alert Rule.
Pick the VM to monitor as the target resource.
Set the metric to GPU utilization, operator >, threshold 90.
Set the evaluation: interval 5 minutes, evaluate 5, alert 3.
Check the diagram on the right to confirm the rule behaves as intended.
Select the email action and click Create.

Object storage capacity alert

If you don't have one already, create an email action first.
In Create Alert Rule, set the resource type to Object Storage.
Pick the bucket to monitor.
Set the metric to Usage, operator >=, threshold to your target capacity (Bytes).
Set evaluation and action, then click Create.
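Because the Usage metric is reported in bytes, the threshold has to be entered as a byte count. The helper below is a hypothetical convenience for working that number out; it assumes you think of your capacity target in GiB (1 GiB = 1024³ bytes).

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def gib_to_bytes(gib):
    """Convert a capacity in GiB to the byte value to enter in the Threshold field."""
    return int(gib * GIB)


print(gib_to_bytes(500))  # 536870912000
```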

Investigating an alert

In Monitoring > Alerts > Alert Rules, click a rule whose evaluation state is Alert.
Check recent state transitions on the Events tab.
Clicking an event jumps to the Chart tab, showing the metric around the firing time.
Once you've identified the cause, resolve it on the affected resource.

Common alert rules

Goal	Metric	Agg / interval	Operator	Threshold	Evaluate / Alert
Detect training completion	GPU utilization	avg / 5m	`<`	5	2 / 2
GPU OOM risk	GPU memory usage	max / 1m	`>`	90	3 / 3
Detect VM abnormal shutdown	CPU utilization	avg / 5m	`<`	1	1 / 1
Disk-full risk	Disk utilization	max / 5m	`>`	85	1 / 1
Idle GPU (cost leak)	GPU utilization	avg / 1h	`<`	30	1 / 1
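The presets in the table above can also be written down as data, which is handy for reviewing them or keeping them in version control. The sketch below is illustrative only: `AlertRuleSpec`, `INTERVAL_MINUTES`, and `window_minutes` are hypothetical names, and the check that "alert" cannot exceed "evaluate" is an assumption drawn from the field descriptions. Only `datapoints_to_evaluate` and `datapoints_to_alert` are names taken from the form.

```python
from dataclasses import dataclass

INTERVAL_MINUTES = {"1m": 1, "5m": 5, "15m": 15, "30m": 30, "1h": 60}


@dataclass(frozen=True)
class AlertRuleSpec:
    """Hypothetical record mirroring the Evaluation conditions fields."""

    goal: str
    metric: str
    aggregation: str  # avg / sum / min / max
    interval: str  # one of INTERVAL_MINUTES
    operator: str  # > / >= / < / <=
    threshold: float
    datapoints_to_evaluate: int
    datapoints_to_alert: int

    def __post_init__(self):
        if self.aggregation not in {"avg", "sum", "min", "max"}:
            raise ValueError(f"unknown aggregation: {self.aggregation}")
        if self.interval not in INTERVAL_MINUTES:
            raise ValueError(f"unknown interval: {self.interval}")
        if self.operator not in {">", ">=", "<", "<="}:
            raise ValueError(f"unknown operator: {self.operator}")
        # Assumption: "alert" counts breaches inside the window, so it cannot exceed "evaluate".
        if not 1 <= self.datapoints_to_alert <= self.datapoints_to_evaluate:
            raise ValueError("datapoints_to_alert must be between 1 and datapoints_to_evaluate")

    def window_minutes(self):
        """Length of the evaluation window: interval x datapoints to evaluate."""
        return INTERVAL_MINUTES[self.interval] * self.datapoints_to_evaluate


COMMON_RULES = [
    AlertRuleSpec("Detect training completion", "GPU utilization", "avg", "5m", "<", 5, 2, 2),
    AlertRuleSpec("GPU OOM risk", "GPU memory usage", "max", "1m", ">", 90, 3, 3),
    AlertRuleSpec("Detect VM abnormal shutdown", "CPU utilization", "avg", "5m", "<", 1, 1, 1),
    AlertRuleSpec("Disk-full risk", "Disk utilization", "max", "5m", ">", 85, 1, 1),
    AlertRuleSpec("Idle GPU (cost leak)", "GPU utilization", "avg", "1h", "<", 30, 1, 1),
]

for rule in COMMON_RULES:
    print(f"{rule.goal}: window of {rule.window_minutes()} min")
```

The window length shows how much recent history each preset looks at: 10 minutes for training completion, 60 minutes for idle GPU.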

Checking alert events

When an alert rule fires, an entry is automatically recorded under Monitoring > Alerts > Events.

Column	Description
Fired at	When the alert was triggered
Rule	The alert rule that fired
State transition	`Ok → Alert` / `Alert → Ok`, etc.
Threshold / breach count	Evaluation breach details

Click a row to open the detail page with the metric chart.
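The State transition column can be read as the difference between consecutive evaluation results. The Python sketch below derives such transitions from a sequence of evaluation states. It assumes that an event is recorded whenever the evaluation state changes; the function name `state_transitions` and the sample history are hypothetical.

```python
def state_transitions(states):
    """Return (index, "Old → New") for each change in a sequence of evaluation states."""
    transitions = []
    for index in range(1, len(states)):
        previous, current = states[index - 1], states[index]
        if previous != current:
            transitions.append((index, f"{previous} → {current}"))
    return transitions


history = ["Ok", "Ok", "Alert", "Alert", "Ok"]
print(state_transitions(history))  # [(2, 'Ok → Alert'), (4, 'Alert → Ok')]
```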

Managing alert rules

In Monitoring > Alerts > Alert Rules:

Activate / Pause: toggle to suspend or resume evaluation (a paused rule keeps its settings)
Edit: change thresholds, conditions, or actions
Delete: remove the rule permanently

A paused rule isn't evaluated and produces no alert events. Pausing rules tied to VMs that are under maintenance is a good way to avoid false alerts.

Rule detail page tabs

Click a rule in the list to open the detail page, which has four tabs:

Tab	Contents
Overview	Evaluation state, last evaluation time, a natural-language summary of the condition, target resource (click to jump to the resource page)
Chart	Metric chart with the threshold as a dashed line and markers where alerts fired. Range: 30 minutes to 48 hours
Actions	Actions linked to this rule. Click to jump to the action's detail page
Events	State-transition history (Ok→Alert, Alert→Ok). Clicking an event jumps to the chart at that time

FAQ

Evaluation state shows "NoData"

Confirm the target VM is in the Running state
Confirm the rule is Activated (paused rules are not evaluated)
Right after creating a rule, NoData can appear briefly until enough data is collected

I'm not receiving alert emails

Confirm the rule has an action attached
Confirm the action's recipient email is correct
Check your spam / junk folder

What's the difference between "datapoints to evaluate" and "datapoints to alert"?

Datapoints to evaluate: how many recent datapoints to look at (the window size)
Datapoints to alert: how many of those must breach the threshold to fire (the trigger condition)
Setting them equal means every point must breach (strict). Setting "alert" to 1 means a single breach is enough (sensitive)

Can I delete an action?

Yes, but deleting an action that rules still use silences those rules. Before deleting, check the "Rules using this action" section on the action detail page.

Next steps

Metrics Explorer: inspect metric trends before picking a threshold
Compute dashboard: see every VM's state at a glance
Audit log: track changes to alert rules
