
Alerts

Overview

Alerts automatically send a notification when a metric (GPU utilization, memory, disk, etc.) crosses a threshold you've set. Use them to detect training completion, catch problems early, and keep idle resources from running up costs.

An alert is made up of three components:

┌─────────────┐  fires   ┌─────────────┐  runs   ┌────────┐
│ Alert rule  │ ───────▶ │ Alert event │ ──────▶ │ Action │
└─────────────┘          └─────────────┘         └────────┘
   Condition              History entry          Send email

Component | Role | Page
Alert rule | Defines the metric threshold and evaluation | Monitoring > Alerts > Alert Rules
Action | Defines who is notified and how when the rule fires | Monitoring > Alerts > Actions
Alert event | The history of when the rule fired | Monitoring > Alerts > Events

Prerequisites
  • Permissions: Alert.AlertRule.CREATE and Alert.AlertActionTemplate.CREATE
  • An existing action: create the action before you create an alert rule

Evaluation state and rule state

An alert rule has two kinds of state.

Evaluation state | Meaning
Ok | The metric is within the threshold
Alert | Threshold exceeded; an alert has fired
NoData | Not enough data to evaluate

Rule state | Meaning
Activated | Periodically evaluated; fires when conditions are met
Paused | Evaluation is suspended; no alerts will fire (settings are kept)
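
The two kinds of state are independent of each other, which a small model can make concrete. The sketch below is illustrative only: the AlertRule class, its fields, and the helper functions are hypothetical names, not the product's API. It also assumes a new rule starts in NoData until enough data is collected.

```python
from dataclasses import dataclass
from enum import Enum


class EvaluationState(Enum):
    OK = "Ok"
    ALERT = "Alert"
    NO_DATA = "NoData"


class RuleState(Enum):
    ACTIVATED = "Activated"
    PAUSED = "Paused"


@dataclass
class AlertRule:
    """Hypothetical in-memory model of an alert rule (not the product's API)."""
    name: str
    threshold: float
    rule_state: RuleState = RuleState.ACTIVATED
    # Assumption: a new rule reports NoData until enough datapoints exist.
    evaluation_state: EvaluationState = EvaluationState.NO_DATA


def is_evaluated(rule: AlertRule) -> bool:
    """Only activated rules are evaluated and can fire."""
    return rule.rule_state is RuleState.ACTIVATED


def pause(rule: AlertRule) -> None:
    """Suspend evaluation; the threshold and other settings are left untouched."""
    rule.rule_state = RuleState.PAUSED


rule = AlertRule(name="GPU utilization warning", threshold=90.0)
pause(rule)
print(is_evaluated(rule), rule.threshold)
```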

Step 1: Create an action

First, create the action (currently email only) that runs when an alert rule fires.

  1. Go to Monitoring > Alerts > Actions > + Create Action.

  2. Fill in:

    Field | Description
    Action name | Identifier (e.g. GPU failure alert)
    Action type | Email (webhooks coming soon)
    Recipient email | Add users via search, or enter comma-separated addresses (at least one)

  3. Click Create.
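
If you build the recipient list in a script before pasting it into the form, the "comma-separated, at least one" rule is easy to check up front. This is a minimal sketch under stated assumptions: parse_recipients is a hypothetical helper, not part of the product, and the "@" check is a deliberately loose sanity test rather than full address validation.

```python
from typing import List


def parse_recipients(raw: str) -> List[str]:
    """Split a comma-separated string into a de-duplicated list of addresses.

    Hypothetical helper for illustration; not part of the product.
    """
    recipients: List[str] = []
    for part in raw.split(","):
        address = part.strip()
        if not address:
            continue  # ignore empty entries such as a trailing comma
        # Loose sanity check only (assumption): one "@" with text on both sides.
        local, sep, domain = address.partition("@")
        if not (sep and local and domain and "@" not in domain):
            raise ValueError(f"not an email address: {address!r}")
        if address not in recipients:
            recipients.append(address)
    if not recipients:
        raise ValueError("at least one recipient is required")
    return recipients


print(parse_recipients("ops@example.com, ml-team@example.com,"))
```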

Only email actions are supported today

Integrations with Slack webhooks and other external services are planned.


Step 2: Create an alert rule

  1. Go to Monitoring > Alerts > Alert Rules > + Create Alert Rule.
  2. Fill in the following:

Basic info

Field | Example
Name | GPU utilization warning
Description | Alert when GPU utilization exceeds 90%

Target resource

Field | Description
Resource type | Virtual Machine or Object Storage
Resource | The VM or bucket to monitor
Metric | Depends on the resource type (see below)

VM metrics

Category | Metric
GPU | Utilization (%), memory used/total (MiB), temperature (°C), power (W), clock speed (MHz)
CPU | User · System · Idle · I/O Wait · IRQ · Soft IRQ · Steal · Guest · Nice (%)
Memory | Total · Available · Active · Inactive · Cached · Buffers · Free (KiB) · Swap
Network | RX/TX bandwidth (Bps) · bytes · packets · drops · errors
Storage | Usage (Bytes) · read/write throughput (Bps) · IOPS · latency (sec)

Object storage metrics

Metric | Description
Usage | Bucket size in bytes
Object count | Number of objects in the bucket
GET / PUT / COPY / LIST | API request counts
Multipart uploads | Number of multipart upload operations

Evaluation conditions

Field | Description
Aggregation | avg / sum / min / max (AlertAggEnum)
Aggregation interval | 1m / 5m / 15m / 30m / 1h
Operator | > / >= / < / <=
Threshold | The value that triggers the alert (e.g. 5, 90, 0.85)
Datapoints to evaluate | How many recent datapoints to inspect (e.g. 5): datapoints_to_evaluate
Datapoints to alert | How many of those must breach to fire (e.g. 3): datapoints_to_alert

Example: with interval 5m, evaluate 5, and alert 3, the rule looks at the 5 most recent 5-minute datapoints (the last 25 minutes) and fires if 3 or more of them breach the threshold.
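
The "evaluate N, alert M" logic can be sketched in a few lines. This is not the service's implementation: the evaluate function and the OPERATORS table are hypothetical, and returning NoData when fewer than datapoints_to_evaluate values exist is an assumption based on the NoData description above. Each value stands for one already-aggregated interval, oldest first.

```python
import operator
from typing import Sequence

# The four comparison operators offered in the form.
OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def evaluate(values: Sequence[float], op: str, threshold: float,
             datapoints_to_evaluate: int, datapoints_to_alert: int) -> str:
    """Return "Alert", "Ok", or "NoData" for the most recent datapoints.

    Illustrative sketch only; the real evaluator may differ.
    """
    if not 1 <= datapoints_to_alert <= datapoints_to_evaluate:
        raise ValueError("need 1 <= datapoints_to_alert <= datapoints_to_evaluate")
    if len(values) < datapoints_to_evaluate:
        return "NoData"  # assumption: too few datapoints to fill the window
    window = values[-datapoints_to_evaluate:]  # the most recent N datapoints
    breaches = sum(1 for v in window if OPERATORS[op](v, threshold))
    return "Alert" if breaches >= datapoints_to_alert else "Ok"


# Interval 5m, evaluate 5, alert 3, rule "GPU utilization > 90".
print(evaluate([92, 88, 95, 91, 70], ">", 90, 5, 3))  # three breaches: Alert
print(evaluate([92, 88, 95, 85, 70], ">", 90, 5, 3))  # two breaches: Ok
```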

Only completed intervals are evaluated

The system aggregates data over each interval into one value before evaluating. An in-progress interval has incomplete data and is excluded from evaluation.

For example, with a 5-minute interval and an 80% GPU utilization threshold, suppose the 10:15–10:20 interval has just started and GPU utilization briefly spikes to 95%. The spike is not counted as a breach; the interval is evaluated only after the 5-minute window closes and its average is computed. If that average turns out to be 72%, the interval is classified as Ok.
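
The same scenario can be reproduced with a short sketch that buckets raw samples into fixed intervals and skips any interval that has not yet closed. The function name, the sample values, and the choice to align buckets to the Unix epoch are all assumptions made for illustration; the service's internal bucketing is not documented here.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, Iterable, List, Tuple

EPOCH = datetime(1970, 1, 1)


def completed_interval_averages(samples: Iterable[Tuple[datetime, float]],
                                interval: timedelta,
                                now: datetime) -> Dict[datetime, float]:
    """Average samples per interval, keeping only intervals that ended by `now`."""
    buckets: Dict[datetime, List[float]] = {}
    for ts, value in samples:
        start = ts - (ts - EPOCH) % interval  # floor the timestamp to its interval
        buckets.setdefault(start, []).append(value)
    return {
        start: mean(values)
        for start, values in sorted(buckets.items())
        if start + interval <= now  # an in-progress interval is excluded
    }


def at(minute: int, second: int = 0) -> datetime:
    return datetime(2024, 1, 1, 10, minute, second)


five_min = timedelta(minutes=5)
# Hypothetical GPU utilization samples inside the 10:15-10:20 interval.
samples = [(at(15, 30), 95), (at(16, 30), 60), (at(17, 30), 65),
           (at(18, 30), 70), (at(19, 30), 70)]

# At 10:16 only the 95% spike exists, but the interval is still open.
print(completed_interval_averages(samples[:1], five_min, now=at(16)))
# At 10:20 the interval has closed; its average is 72, below the 80% threshold.
print(completed_interval_averages(samples, five_min, now=at(20)))
```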

Preview diagram

The preview diagram on the right of the form shows how your conditions will be evaluated and updates live as you change them.

  • Green dots: datapoints within the threshold
  • Yellow dots: datapoints that breached the threshold
  • Red dots: points where the alert actually fired (datapoints-to-alert reached)
  • Dashed line: the threshold

  3. Pick the action you just created (you can select more than one).

  4. Click Create.

Use-case walkthroughs

GPU overload alert

  1. In Monitoring > Alerts > Actions, create an email action with recipients.
  2. Click Alert Rules > + Create Alert Rule.
  3. Pick the VM to monitor as the target resource.
  4. Set the metric to GPU utilization, operator >, threshold 90.
  5. Set the evaluation: interval 5 minutes, evaluate 5, alert 3.
  6. Check the diagram on the right to confirm the rule behaves as intended.
  7. Select the email action and click Create.

Object storage capacity alert

  1. If you don't have one already, create an email action first.
  2. In Create Alert Rule, set the resource type to Object Storage.
  3. Pick the bucket to monitor.
  4. Set the metric to Usage, operator >=, threshold to your target capacity (Bytes).
  5. Set evaluation and action, then click Create.
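
Because the Usage threshold is entered in bytes, it helps to convert the target capacity before filling in the form. The helpers below are hypothetical and only do arithmetic; whether you think of capacity in binary units (GiB) or decimal units (GB) is your choice, so both constants are shown.

```python
GIB = 1024 ** 3  # binary gibibyte
GB = 1000 ** 3   # decimal gigabyte


def gib_to_bytes(gib: int) -> int:
    """Convert a capacity in GiB to bytes."""
    return gib * GIB


def percent_of(total_bytes: int, percent: int) -> int:
    """Return `percent` percent of a byte count, rounded down."""
    return total_bytes * percent // 100


print(gib_to_bytes(500))                    # 536870912000
print(500 * GB)                             # 500000000000
print(percent_of(gib_to_bytes(1024), 80))   # 80% of 1 TiB: 879609302220
```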

Investigating an alert

  1. In Monitoring > Alerts > Alert Rules, click a rule whose evaluation state is Alert.
  2. Check recent state transitions on the Events tab.
  3. Clicking an event jumps to the Chart tab, showing the metric around the firing time.
  4. Once you've identified the cause, take the appropriate action.

Common alert rules

Goal | Metric | Agg / interval | Operator | Threshold | Evaluate / Alert
Detect training completion | GPU utilization | avg / 5m | < | 5 | 2 / 2
GPU OOM risk | GPU memory usage | max / 1m | > | 90 | 3 / 3
Detect VM abnormal shutdown | CPU utilization | avg / 5m | < | 1 | 1 / 1
Disk-full risk | Disk utilization | max / 5m | > | 85 | 1 / 1
Idle GPU (cost leak) | GPU utilization | avg / 1h | < | 30 | 1 / 1
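
One way to compare these presets is by how much time each one inspects: the aggregation interval multiplied by the datapoints to evaluate. The sketch below restates the table as plain data and computes that window. The dictionary keys are illustrative rather than an API payload format; only datapoints_to_evaluate and datapoints_to_alert are names taken from the form description above.

```python
INTERVAL_MINUTES = {"1m": 1, "5m": 5, "15m": 15, "30m": 30, "1h": 60}

# The common rules from the table, as plain data (keys are illustrative).
COMMON_RULES = [
    {"goal": "Detect training completion", "metric": "GPU utilization",
     "aggregation": "avg", "interval": "5m", "operator": "<", "threshold": 5,
     "datapoints_to_evaluate": 2, "datapoints_to_alert": 2},
    {"goal": "GPU OOM risk", "metric": "GPU memory usage",
     "aggregation": "max", "interval": "1m", "operator": ">", "threshold": 90,
     "datapoints_to_evaluate": 3, "datapoints_to_alert": 3},
    {"goal": "Detect VM abnormal shutdown", "metric": "CPU utilization",
     "aggregation": "avg", "interval": "5m", "operator": "<", "threshold": 1,
     "datapoints_to_evaluate": 1, "datapoints_to_alert": 1},
    {"goal": "Disk-full risk", "metric": "Disk utilization",
     "aggregation": "max", "interval": "5m", "operator": ">", "threshold": 85,
     "datapoints_to_evaluate": 1, "datapoints_to_alert": 1},
    {"goal": "Idle GPU (cost leak)", "metric": "GPU utilization",
     "aggregation": "avg", "interval": "1h", "operator": "<", "threshold": 30,
     "datapoints_to_evaluate": 1, "datapoints_to_alert": 1},
]


def window_minutes(rule: dict) -> int:
    """Minutes of data the rule inspects: interval x datapoints to evaluate."""
    return INTERVAL_MINUTES[rule["interval"]] * rule["datapoints_to_evaluate"]


for rule in COMMON_RULES:
    print(f'{rule["goal"]}: {window_minutes(rule)} min window')
```

Short windows such as the 3-minute GPU OOM rule react quickly, while the 1-hour idle-GPU rule trades speed for fewer false positives.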

Checking alert events

When an alert rule fires, an entry is automatically recorded under Monitoring > Alerts > Events.

Column | Description
Fired at | When the alert was triggered
Rule | The alert rule that fired
State transition | Ok → Alert / Alert → Ok, etc.
Threshold / breach count | Evaluation breach details

Click a row to open the detail page with the metric chart.
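
To illustrate what the State transition column represents, the sketch below derives transitions from a chronological series of evaluation states. It assumes one event per change of state, which matches the "state-transition history" wording but is not a documented guarantee; state_transitions and the sample history are hypothetical.

```python
from typing import List, Sequence, Tuple


def state_transitions(history: Sequence[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Given (time, evaluation_state) pairs in chronological order,
    return (time, "Previous → Current") for every change of state."""
    events = []
    for (_, previous), (when, current) in zip(history, history[1:]):
        if previous != current:
            events.append((when, f"{previous} → {current}"))
    return events


history = [("10:00", "Ok"), ("10:05", "Ok"), ("10:10", "Alert"),
           ("10:15", "Alert"), ("10:20", "Ok")]
for when, transition in state_transitions(history):
    print(when, transition)
```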


Managing alert rules

In Monitoring > Alerts > Alert Rules:

  • Activate / Pause: toggle to suspend or resume evaluation (the rule itself is kept)
  • Edit: change thresholds, conditions, or actions
  • Delete: remove the rule permanently

A paused rule isn't evaluated and produces no alert events. Pausing rules tied to VMs that are under maintenance is a good way to avoid false alerts.

Rule detail page tabs

Click a rule in the list to open the detail page, which has four tabs:

Tab | Contents
Overview | Evaluation state, last evaluation time, a natural-language summary of the condition, target resource (click to jump to the resource page)
Chart | Metric chart with the threshold as a dashed line and markers where alerts fired. Range: 30 minutes to 48 hours
Actions | Actions linked to this rule. Click to jump to the action's detail page
Events | State-transition history (Ok → Alert, Alert → Ok). Clicking an event jumps to the chart at that time

FAQ

Evaluation state shows "NoData"

  • Confirm the target VM is in the Running state
  • Confirm the rule is Activated (paused rules are not evaluated)
  • Right after creating a rule, NoData can appear briefly until enough data is collected

I'm not receiving alert emails

  • Confirm the rule has an action attached
  • Confirm the action's recipient email is correct
  • Check your spam / junk folder

What's the difference between "datapoints to evaluate" and "datapoints to alert"?

  • Datapoints to evaluate: how many recent datapoints to look at (the window size)
  • Datapoints to alert: how many of those must breach the threshold to fire (the trigger condition)
  • Setting them equal means every point must breach (strict). Setting "alert" to 1 means a single breach is enough (sensitive)

Can I delete an action?

Yes, but deleting an action that rules are still using will silence those rules. Before deleting, check the "Rules using this action" section on the action detail page.


Next steps