Metrics Explorer

Overview

The Metrics Explorer is a tool for comparing GPU, CPU, memory, network, and disk metrics in time-series charts. Use it to monitor training, diagnose bottlenecks, detect idle resources, and track anomalous patterns. You can build multiple queries per chart, share the view by URL, or export to CSV.

Prerequisites

Metric.Metric.READ permission

How to open it

Entry point	Purpose
Monitoring > Metrics	Compare multiple VMs and metrics on one screen (the full explorer)
Compute > Virtual Machines > pick a VM > Metrics tab	Preconfigured charts for a single VM
Compute > Dashboard > click a VM	Jumps to that VM's Metrics tab

Key metrics

Metric	Description	Use
GPU utilization (%)	GPU SM core utilization	GPU usage during training
GPU memory usage	GPU VRAM consumption	Catch OOM risk early
GPU memory / SM clock	GPU operating clocks	Diagnose throttling
GPU power / temperature	Power draw and temperature	Whether the hardware is at its limits
CPU utilization (%)	CPU core utilization	Spot data-loader bottlenecks
Memory utilization (%)	System RAM utilization	Detect memory pressure
Network I/O	Inbound / outbound traffic	Confirm data transfer rates
Block storage usage	Disk utilization	Watch for full disks

Building queries

Each chart can hold multiple queries, letting you overlay several metrics on the same chart for comparison.

1. Add a query

Click Add query in the top right of the chart to add a new query row.

Each query has these fields:

Field	Description
Resource type	Virtual Machine / Virtual Cluster, etc.
VM	The target VM (search and pick)
Metric	GPU utilization, GPU memory, etc.
Split by	`All (aggregate)` or `Per item` (e.g. plot GPU 0 and GPU 1 separately)

2. Unit consistency

A single chart can only hold metrics with the same unit (you can't mix a % metric with a MB/s metric, for example). For different units, add another chart.

3. Duplicate / delete queries

The Duplicate / Delete buttons on each query row make quick variations easy (e.g. apply the same metric to a different VM).

Managing charts

Action	Description
Add chart	Up to 8 charts per page (`MAX_CHARTS`)
Duplicate chart	Copy an existing chart's queries and settings
Move chart	Reorder with the up/down buttons
Delete chart	Remove the chart
Expand / collapse	Temporarily shrink the chart area

Each chart accepts up to 12 queries (`MAX_QUERIES`).
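
Before building a large layout, it can help to check that it fits these limits and the one-unit-per-chart rule. The sketch below is only a planning aid, not part of the product: `plan_charts`, the VM names, and the unit strings are hypothetical, and the MB/s unit for Network I/O is an assumption. Only the two limits come from this page.

```python
from collections import defaultdict

MAX_CHARTS = 8    # charts per page (limit stated on this page)
MAX_QUERIES = 12  # queries per chart (limit stated on this page)


def plan_charts(queries):
    """Group (label, unit) pairs into charts that each hold a single unit.

    Hypothetical helper: it only mirrors the documented rules so a layout
    can be checked on paper before clicking it together in the UI.
    """
    by_unit = defaultdict(list)
    for label, unit in queries:
        by_unit[unit].append(label)

    charts = []
    for unit, labels in by_unit.items():
        # Split one unit's queries into chunks that fit the per-chart limit.
        for start in range(0, len(labels), MAX_QUERIES):
            charts.append({"unit": unit, "queries": labels[start:start + MAX_QUERIES]})

    if len(charts) > MAX_CHARTS:
        raise ValueError(f"layout needs {len(charts)} charts; a page holds {MAX_CHARTS}")
    return charts


# Illustrative queries; labels and units are made up for this example.
wanted = [
    ("vm-a GPU utilization", "%"),
    ("vm-a CPU utilization", "%"),
    ("vm-a Network I/O", "MB/s"),
    ("vm-b GPU utilization", "%"),
]

for chart in plan_charts(wanted):
    print(chart["unit"], chart["queries"])
```

Here the three % queries share one chart and the MB/s query gets its own, matching the unit rule described under Building queries.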

Chart settings

Use each chart's Settings panel to adjust visualization.

Chart type: Line / Area / Bar / Table
Chart title: Auto / custom / hidden
Legend position: Bottom / right
Y-axis range: Auto, or specify min and max
Y-axis order: Swap left/right axes
Hover card: Detail on mouse-over

Keeping charts on the same time range makes them easier to compare visually.

Time range and zoom

Use the top global toolbar to change the time range and aggregation interval.

Range	Good for
Last 1 hour	Monitoring a training run in progress
Last 6 hours	Reviewing a short training run
Last 24 hours	Reviewing overnight runs
Last 7d / 30d	Long-term trend analysis

Drag on the chart to zoom in. The Revert range button steps back through your zoom history.

URL sharing

The Share button in the top toolbar copies a URL that encodes the current chart layout and time range. Send it to a teammate or attach it to an incident report to reproduce the same view exactly.

For incident analysis

When you spot an anomaly, zoom in, copy the share link, and paste it into Channel Talk or Slack. The recipient opens the same view in one click.

CSV download

The Download CSV button in the top toolbar exports the current chart's time series. From there you can dig in with Excel, Python pandas, or other tools.

CSV columns: timestamp + the value of each query.
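
As a rough illustration of working with the export, the sketch below reads a file in that shape and lists the timestamps where GPU utilization is low, alongside CPU utilization at the same moment. The column headers, timestamp format, sample values, and the 10% threshold are assumptions made for the example; check them against a real export. It uses only the standard library, and the same filter is straightforward in pandas.

```python
import csv
import io

# Made-up sample in the documented shape: a timestamp column, then one
# column per query. Real header names and timestamp format may differ.
SAMPLE = """timestamp,vm-a GPU utilization (%),vm-a CPU utilization (%)
2024-05-01T00:00:00Z,92.0,35.5
2024-05-01T00:01:00Z,4.0,97.0
2024-05-01T00:02:00Z,3.5,98.5
2024-05-01T00:03:00Z,88.0,40.0
"""

GPU_COL = "vm-a GPU utilization (%)"  # assumed header name
CPU_COL = "vm-a CPU utilization (%)"  # assumed header name


def low_gpu_rows(csv_file, gpu_col, threshold=10.0):
    """Return the rows whose GPU utilization is below `threshold` percent."""
    rows = []
    for row in csv.DictReader(csv_file):
        if not row[gpu_col]:
            continue  # skip samples with no value for this query
        if float(row[gpu_col]) < threshold:
            rows.append(row)
    return rows


# For a downloaded file, replace io.StringIO(SAMPLE) with open(path, newline="").
for row in low_gpu_rows(io.StringIO(SAMPLE), GPU_COL):
    print(row["timestamp"], "GPU", row[GPU_COL], "CPU", row[CPU_COL])
```

Low GPU utilization next to high CPU utilization, as in the two middle rows of the sample, is the pattern that suggests a CPU-bound input pipeline.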

Saved metrics

You can save frequently used chart layouts under Saved metrics for quick recall.

Stored in the browser

Saved metrics are stored only in your current browser and are not available on other devices or browsers. For layouts you want to share with your team, use URL sharing.

Create an alert from a chart

When you want a threshold-based alert on something like GPU utilization, click Create alert on the chart. You're sent to the alert creation page with the current query and range pre-filled.

Alert rules that already apply to the chart are drawn as threshold lines; the View alert rule → link jumps to the detail page.

Tips

When GPU utilization is low

Data loader bottleneck → raise num_workers, enable prefetching
Batch size too small → increase the batch
CPU-bound work → compare with the CPU chart, profile
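
For the data loader case, a minimal sketch of the settings involved, assuming a PyTorch training script. `loader_kwargs` is a hypothetical helper, and the worker cap of 8 and prefetch factor of 4 are arbitrary starting points rather than recommendations; tune them while watching the CPU and GPU charts. The helper only builds a dictionary, so it runs without PyTorch installed.

```python
import os


def loader_kwargs(num_workers=None, prefetch_factor=4):
    """Build keyword arguments for torch.utils.data.DataLoader.

    Hypothetical helper for illustration; the default values are guesses.
    """
    if num_workers is None:
        # Starting guess: one worker per CPU core, capped at 8.
        num_workers = min(8, os.cpu_count() or 1)

    kwargs = {"num_workers": num_workers}
    if num_workers > 0:
        # DataLoader accepts these two options only when worker processes are used.
        kwargs["prefetch_factor"] = prefetch_factor  # batches loaded ahead per worker
        kwargs["persistent_workers"] = True          # keep workers alive between epochs
    return kwargs


print(loader_kwargs(num_workers=4))

# With PyTorch installed, the result is passed straight through, for example:
#   loader = DataLoader(dataset, batch_size=64, **loader_kwargs())
```

After a change like this, compare GPU utilization before and after on the same chart. If CPU utilization is already near 100%, adding workers is unlikely to help, and profiling the preprocessing is the better next step.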

When GPU memory is near 100%

Reduce batch size, enable mixed precision (fp16/bf16)
Enable gradient checkpointing
Move to a larger GPU instance type

When system memory is near 100%

Reduce data caching
Try pin_memory=False in the DataLoader
Move to a larger-memory instance

Next steps

Alerts: notifications when metrics cross a threshold
Compute dashboard: aggregated view across all VMs
Audit log: check whether a metric anomaly was caused by an intended change
