Alerts
Overview
Alerts automatically send a notification when a metric — GPU utilization, memory, disk, etc. — crosses a threshold you've set. They are used to detect training completion, catch problems early, and prevent idle-resource cost leaks.
An alert is made up of three components:
┌─────────────┐  fires   ┌────────────────┐  runs   ┌─────────┐
│ Alert rule  │ ───────▶ │  Alert event   │ ──────▶ │ Action  │
└─────────────┘          └────────────────┘         └─────────┘
   Condition               History entry            Send email
| Component | Role | Page |
|---|---|---|
| Alert rule | Defines the metric, threshold, and evaluation conditions | Monitoring > Alerts > Alert Rules |
| Action | Defines who is notified and how when the rule fires | Monitoring > Alerts > Actions |
| Alert event | The history of when the rule fired | Monitoring > Alerts > Events |
- Requires the Alert.AlertRule.CREATE and Alert.AlertActionTemplate.CREATE permissions
- Create the action first, before creating an alert rule
Evaluation state and rule state
An alert rule has two kinds of state.
| Evaluation state | Meaning |
|---|---|
| Ok | The metric is within the threshold |
| Alert | Threshold exceeded; an alert has fired |
| NoData | Not enough data to evaluate |
| Rule state | Meaning |
|---|---|
| Activated | Periodically evaluated; fires when conditions are met |
| Paused | Evaluation is suspended; no alerts will fire (settings are kept) |
Step 1: Create an action
First, create the action that runs when an alert fires. Email is currently the only action type.
- Go to Monitoring > Alerts > Actions > + Create Action.
- Fill in:
| Field | Description |
|---|---|
| Action name | Identifier (e.g. GPU failure alert) |
| Action type | Email (webhooks coming soon) |
| Recipient email | Add users via search, or enter comma-separated addresses (at least one) |
- Click Create.
Integrations with Slack webhooks and other external services are planned.
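The recipient field accepts comma-separated addresses and needs at least one entry. As a rough sketch of how such a string could be cleaned up before pasting it into the form: the helper name `parse_recipients` and the loose address check are illustrative assumptions, not the validation the console actually performs.

```python
def parse_recipients(raw: str) -> list[str]:
    """Split a comma-separated recipient string into cleaned addresses.

    Hypothetical helper: the loose '@' check is an assumption made for
    illustration, not the product's real validation logic.
    """
    # Drop surrounding whitespace and empty entries (e.g. a trailing comma).
    recipients = [part.strip() for part in raw.split(",") if part.strip()]
    if not recipients:
        raise ValueError("at least one recipient email is required")
    for address in recipients:
        local, sep, domain = address.partition("@")
        if not (local and sep and "." in domain):
            raise ValueError(f"not a plausible email address: {address!r}")
    return recipients


print(parse_recipients("ml-ops@example.com, oncall@example.com"))
```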
Step 2: Create an alert rule
- Go to Monitoring > Alerts > Alert Rules > + Create Alert Rule.
- Fill in the following:
Basic info
| Field | Example |
|---|---|
| Name | GPU utilization warning |
| Description | Alert when GPU utilization exceeds 90% |
Target resource
| Field | Description |
|---|---|
| Resource type | Virtual Machine or Object Storage |
| Resource | The VM or bucket to monitor |
| Metric | Depends on the resource type (see below) |
VM metrics
| Category | Metric |
|---|---|
| GPU | Utilization (%), memory used/total (MiB), temperature (°C), power (W), clock speed (MHz) |
| CPU | User · System · Idle · I/O Wait · IRQ · Soft IRQ · Steal · Guest · Nice (%) |
| Memory | Total · Available · Active · Inactive · Cached · Buffers · Free (KiB) · Swap |
| Network | RX/TX bandwidth (Bps) · bytes · packets · drops · errors |
| Storage | Usage (Bytes) · read/write throughput (Bps) · IOPS · latency (sec) |
Object storage metrics
| Metric | Description |
|---|---|
| Usage | Bucket size in bytes |
| Object count | Number of objects in the bucket |
| GET / PUT / COPY / LIST | API request counts |
| Multipart uploads | Number of multipart upload operations |
Evaluation conditions
| Field | Description |
|---|---|
| Aggregation | avg / sum / min / max (AlertAggEnum) |
| Aggregation interval | 1m / 5m / 15m / 30m / 1h |
| Operator | > / >= / < / <= |
| Threshold | The value that triggers the alert (e.g. 5, 90, 0.85) |
| Datapoints to evaluate | How many recent datapoints to inspect (e.g. 5): datapoints_to_evaluate |
| Datapoints to alert | How many of those must breach to fire (e.g. 3): datapoints_to_alert |
Example: interval 5m, evaluate 5, alert 3 → the rule inspects the five most recent 5-minute datapoints (the last 25 minutes); if 3 or more of them breach the threshold, it fires an alert.
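The "M of N" check in this example can be sketched as follows. The function name `evaluate_rule` and the choice to return NoData when fewer datapoints exist than the window size are assumptions made for illustration; the service's actual evaluator may differ.

```python
import operator

# Map the operator symbols from the form to Python comparisons.
COMPARATORS = {">": operator.gt, ">=": operator.ge,
               "<": operator.lt, "<=": operator.le}


def evaluate_rule(datapoints, op, threshold,
                  datapoints_to_evaluate, datapoints_to_alert):
    """Return 'Ok', 'Alert', or 'NoData' for aggregated datapoints (oldest first).

    Assumption: fewer datapoints than the window size yields 'NoData'.
    """
    if len(datapoints) < datapoints_to_evaluate:
        return "NoData"
    # Only the most recent N datapoints are inspected.
    window = datapoints[-datapoints_to_evaluate:]
    breaches = sum(1 for value in window if COMPARATORS[op](value, threshold))
    return "Alert" if breaches >= datapoints_to_alert else "Ok"


# Five 5-minute averages of GPU utilization (%): a 25-minute window.
gpu_util = [92.0, 88.0, 95.0, 91.0, 85.0]
print(evaluate_rule(gpu_util, ">", 90, 5, 3))  # 3 of 5 breach -> Alert
print(evaluate_rule(gpu_util, ">", 90, 5, 4))  # only 3 breach -> Ok
```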
The system aggregates data over each interval into one value before evaluating. An in-progress interval has incomplete data and is excluded from evaluation.
For example, with a 5-minute interval and an 80% GPU utilization threshold, suppose the 10:15–10:20 interval just started and GPU utilization briefly spiked to 95%. The breach is not counted until the 5-minute window closes and the actual average is computed. If that average turns out to be 72%, the interval is classified as Ok.
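A minimal sketch of that behavior, assuming raw samples arrive as `(timestamp_seconds, value)` pairs and intervals are aligned to multiples of the interval length. The function name and the sample values are made up for illustration.

```python
from statistics import mean

AGGREGATIONS = {"avg": mean, "sum": sum, "min": min, "max": max}


def aggregate_closed_intervals(samples, interval_s, now_s, agg="avg"):
    """Aggregate (timestamp_s, value) samples into one value per closed interval.

    The interval containing `now_s` is still in progress and is skipped.
    Returns a list of (interval_start_s, aggregated_value), oldest first.
    """
    current_start = now_s - (now_s % interval_s)
    buckets = {}
    for ts, value in samples:
        start = ts - (ts % interval_s)
        if start >= current_start:
            continue  # in-progress interval: incomplete data, excluded
        buckets.setdefault(start, []).append(value)
    return [(start, AGGREGATIONS[agg](values))
            for start, values in sorted(buckets.items())]


# Timestamps are seconds since midnight: 36900 is 10:15:00, 37200 is 10:20:00.
samples = [(36930, 95.0), (37000, 65.0), (37080, 64.0), (37160, 64.0)]

print(aggregate_closed_intervals(samples, 300, now_s=36960))  # 10:16, window open -> []
print(aggregate_closed_intervals(samples, 300, now_s=37200))  # 10:20 -> [(36900, 72.0)]
```

The early spike to 95% only contributes to the 72% average once the window has closed, so with an 80% threshold this interval counts as Ok.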
Preview diagram
The preview diagram on the right of the form shows how your conditions will be evaluated and updates live as you change them.
- Green dots: datapoints within the threshold
- Yellow dots: datapoints that breached the threshold
- Red dots: points where the alert actually fired (datapoints-to-alert reached)
- Dashed line: the threshold
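The dot colors can be reasoned about with a small sketch. It assumes a `>` rule and treats a point as red when it breaches and the window ending at that point already contains enough breaches; how the preview actually decides which points are red is not specified here, so this is an interpretation rather than the real rendering logic.

```python
def classify_datapoints(values, threshold,
                        datapoints_to_evaluate, datapoints_to_alert):
    """Label each datapoint 'green', 'yellow', or 'red' for a '>' rule.

    Assumption: a point is 'red' when it breaches and the window of the last
    `datapoints_to_evaluate` points ending at it holds at least
    `datapoints_to_alert` breaches.
    """
    breached = [value > threshold for value in values]
    colors = []
    for i, is_breach in enumerate(breached):
        window = breached[max(0, i - datapoints_to_evaluate + 1): i + 1]
        if not is_breach:
            colors.append("green")
        elif sum(window) >= datapoints_to_alert:
            colors.append("red")
        else:
            colors.append("yellow")
    return colors


# Threshold 90, evaluate 5, alert 3.
print(classify_datapoints([70, 92, 95, 60, 93, 96], 90, 5, 3))
# -> ['green', 'yellow', 'yellow', 'green', 'red', 'red']
```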
Link the action
Pick the action you just created (you can select more than one).
- Click Create.
Use-case walkthroughs
GPU overload alert
- In Monitoring > Alerts > Actions, create an email action with recipients.
- Click Alert Rules > + Create Alert Rule.
- Pick the VM to monitor as the target resource.
- Set the metric to GPU utilization, operator >, threshold 90.
- Set the evaluation: interval 5 minutes, evaluate 5, alert 3.
- Check the diagram on the right to confirm the rule behaves as intended.
- Select the email action and click Create.
Object storage capacity alert
- If you don't have one already, create an email action first.
- In Create Alert Rule, set the resource type to Object Storage.
- Pick the bucket to monitor.
- Set the metric to Usage, operator >=, threshold to your target capacity (Bytes).
- Set evaluation and action, then click Create.
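Because the Usage threshold is entered in bytes, it can help to compute the number rather than type it by hand. The sketch below assumes binary units (1 GiB = 1024³ bytes); the 500 GiB budget and 80% fraction are illustrative values, not product limits.

```python
GIB = 1024 ** 3  # bytes in one gibibyte (binary units are an assumption)


def usage_threshold_bytes(capacity_gib: float, fraction: float) -> int:
    """Return the byte threshold for a fraction of a capacity given in GiB."""
    return round(capacity_gib * GIB * fraction)


# Alert once a bucket reaches 80% of a 500 GiB budget.
print(usage_threshold_bytes(500, 0.8))  # -> 429496729600
```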
Investigating an alert
- In Monitoring > Alerts > Alert Rules, click a rule whose evaluation state is Alert.
- Check recent state transitions on the Events tab.
- Clicking an event jumps to the Chart tab, showing the metric around the firing time.
- Once you've identified the cause, take the appropriate action.
Common alert rules
| Goal | Metric | Agg / interval | Operator | Threshold | Evaluate / Alert |
|---|---|---|---|---|---|
| Detect training completion | GPU utilization | avg / 5m | < | 5 | 2 / 2 |
| GPU OOM risk | GPU memory usage | max / 1m | > | 90 | 3 / 3 |
| Detect VM abnormal shutdown | CPU utilization | avg / 5m | < | 1 | 1 / 1 |
| Disk-full risk | Disk utilization | max / 5m | > | 85 | 1 / 1 |
| Idle GPU (cost leak) | GPU utilization | avg / 1h | < | 30 | 1 / 1 |
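One way to compare these rules is by how much breaching data each needs before it can fire: the aggregation interval multiplied by the datapoints-to-alert count. The sketch below encodes the table as data; the `RuleTemplate` class and its method are hypothetical names for illustration, and actual notification delay may be longer depending on when evaluation runs.

```python
from dataclasses import dataclass

INTERVAL_MINUTES = {"1m": 1, "5m": 5, "15m": 15, "30m": 30, "1h": 60}


@dataclass(frozen=True)
class RuleTemplate:
    goal: str
    aggregation: str
    interval: str
    operator: str
    threshold: float
    datapoints_to_evaluate: int
    datapoints_to_alert: int

    def min_minutes_to_fire(self) -> int:
        """Shortest span of breaching data needed before the rule can fire."""
        return INTERVAL_MINUTES[self.interval] * self.datapoints_to_alert


TEMPLATES = [
    RuleTemplate("Detect training completion", "avg", "5m", "<", 5, 2, 2),
    RuleTemplate("GPU OOM risk", "max", "1m", ">", 90, 3, 3),
    RuleTemplate("Detect VM abnormal shutdown", "avg", "5m", "<", 1, 1, 1),
    RuleTemplate("Disk-full risk", "max", "5m", ">", 85, 1, 1),
    RuleTemplate("Idle GPU (cost leak)", "avg", "1h", "<", 30, 1, 1),
]

for template in TEMPLATES:
    print(f"{template.goal}: {template.min_minutes_to_fire()} min")
```

Under these assumptions the training-completion rule needs 10 minutes of low utilization, while the idle-GPU rule needs a full hour.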
Checking alert events
When an alert rule fires, an entry is automatically recorded under Monitoring > Alerts > Events.
| Column | Description |
|---|---|
| Fired at | When the alert was triggered |
| Rule | The alert rule that fired |
| State transition | Ok → Alert / Alert → Ok, etc. |
| Threshold / breach count | Evaluation breach details |
Click a row to open the detail page with the metric chart.
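The State transition column can be thought of as the differences between consecutive evaluation results. A minimal sketch, assuming an event is recorded only when the evaluation state changes; the function name and the timestamps are illustrative.

```python
def state_transitions(evaluations):
    """Return (timestamp, 'Old → New') for each change of evaluation state.

    `evaluations` is a list of (timestamp, state) pairs, oldest first.
    Assumption: an event is recorded only when the state changes.
    """
    events = []
    previous = None
    for timestamp, state in evaluations:
        if previous is not None and state != previous:
            events.append((timestamp, f"{previous} → {state}"))
        previous = state
    return events


history = [("10:00", "Ok"), ("10:05", "Ok"), ("10:10", "Alert"),
           ("10:15", "Alert"), ("10:20", "Ok")]
print(state_transitions(history))
# -> [('10:10', 'Ok → Alert'), ('10:20', 'Alert → Ok')]
```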
Managing alert rules
In Monitoring > Alerts > Alert Rules:
- Activate / Pause: toggle to suspend or resume evaluation (the rule and its settings are kept)
- Edit: change thresholds, conditions, or actions
- Delete: remove the rule permanently
A paused rule isn't evaluated and produces no alert events. Pausing rules tied to VMs that are under maintenance is a good way to avoid false alerts.
Rule detail page tabs
Click a rule in the list to open the detail page, which has four tabs:
| Tab | Contents |
|---|---|
| Overview | Evaluation state, last evaluation time, a natural-language summary of the condition, target resource (click to jump to the resource page) |
| Chart | Metric chart with the threshold as a dashed line and markers where alerts fired. Range: 30 minutes to 48 hours |
| Actions | Actions linked to this rule. Click to jump to the action's detail page |
| Events | State-transition history (Ok→Alert, Alert→Ok). Clicking an event jumps to the chart at that time |
FAQ
Evaluation state shows "NoData"
- Confirm the target VM is in the Running state
- Confirm the rule is Activated (paused rules are not evaluated)
- Right after creating a rule, NoData can appear briefly until enough data is collected
I'm not receiving alert emails
- Confirm the rule has an action attached
- Confirm the action's recipient email is correct
- Check your spam / junk folder
What's the difference between "datapoints to evaluate" and "datapoints to alert"?
- Datapoints to evaluate: how many recent datapoints to look at (the window size)
- Datapoints to alert: how many of those must breach the threshold to fire (the trigger condition)
- Setting them equal means every point must breach (strict). Setting "alert" to 1 means a single breach is enough (sensitive)
Can I delete an action?
Yes, but deleting an action that rules are still using will silence those rules. Check the Rules using this action section on the action detail page first.
Next steps
- Metrics Explorer: inspect metric trends before picking a threshold
- Compute dashboard: see every VM's state at a glance
- Audit log: track changes to alert rules