Metrics Explorer

Overview

The Metrics Explorer is a tool for comparing GPU, CPU, memory, network, and disk metrics in time-series charts. Use it to monitor training, diagnose bottlenecks, detect idle resources, and track anomalous patterns. You can build multiple queries per chart, share the view by URL, or export to CSV.

Prerequisites
  • Metric.Metric.READ permission

How to open it

Entry point | Purpose
Monitoring > Metrics | Compare multiple VMs and metrics on one screen (the full explorer)
Compute > Virtual Machines > pick a VM > Metrics tab | Preconfigured charts for a single VM
Compute > Dashboard > click a VM | Jumps to that VM's Metrics tab

Key metrics

Metric | Description | Use
GPU utilization (%) | GPU SM core utilization | GPU usage during training
GPU memory usage | GPU VRAM consumption | Catch OOM risk early
GPU memory / SM clock | GPU operating clocks | Diagnose throttling
GPU power / temperature | Power draw and temperature | Whether the hardware is at its limits
CPU utilization (%) | CPU core utilization | Spot data-loader bottlenecks
Memory utilization (%) | System RAM utilization | Detect memory pressure
Network I/O | Inbound / outbound traffic | Confirm data transfer rates
Block storage usage | Disk utilization | Watch for full disks

Building queries

Each chart can hold multiple queries, letting you overlay several metrics on the same chart for comparison.

1. Add a query

Click Add query in the top right of the chart to add a new query row.

Each query has these fields:

Field | Description
Resource type | Virtual Machine / Virtual Cluster, etc.
VM | The target VM (search and pick)
Metric | GPU utilization, GPU memory, etc.
Split by | All (aggregate) or Per item (e.g. plot GPU 0 and GPU 1 separately)

2. Unit consistency

A single chart can only hold metrics with the same unit (you can't mix a % metric with a MB/s metric, for example). For different units, add another chart.

3. Duplicate / delete queries

The Duplicate / Delete buttons on each query row make quick variations easy (e.g. apply the same metric to a different VM).


Managing charts

Action | Description
Add chart | Up to 8 charts per page (MAX_CHARTS)
Duplicate chart | Copy an existing chart's queries and settings
Move chart | Reorder with the up/down buttons
Delete chart | Remove the chart
Expand / collapse | Temporarily shrink the chart area

Each chart accepts up to 12 queries (MAX_QUERIES).
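
The same-unit rule and the two limits can be checked before you start building a layout. The following is a minimal sketch, not part of the product: plan_charts, the query labels, and the unit strings are hypothetical, and only the values 8 (MAX_CHARTS) and 12 (MAX_QUERIES) come from this page.

```python
from collections import defaultdict

MAX_CHARTS = 8    # charts per page, as stated on this page
MAX_QUERIES = 12  # queries per chart, as stated on this page


def plan_charts(queries):
    """Group (label, unit) pairs into charts that each hold one unit.

    `queries` is a list of (label, unit) tuples. The labels and unit
    strings are hypothetical examples, not values read from the product.
    """
    by_unit = defaultdict(list)
    for label, unit in queries:
        by_unit[unit].append(label)

    charts = []
    for unit, labels in by_unit.items():
        # Split a long same-unit list into chunks of at most MAX_QUERIES.
        for start in range(0, len(labels), MAX_QUERIES):
            charts.append({"unit": unit, "queries": labels[start:start + MAX_QUERIES]})

    if len(charts) > MAX_CHARTS:
        raise ValueError(f"layout needs {len(charts)} charts, limit is {MAX_CHARTS}")
    return charts


layout = plan_charts([
    ("vm-a GPU utilization", "%"),
    ("vm-a CPU utilization", "%"),
    ("vm-a Network I/O", "MB/s"),
])
for chart in layout:
    print(chart["unit"], chart["queries"])
```

Here the two % queries land on one chart and the MB/s query gets its own, which mirrors the "add another chart" advice for mixed units.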


Chart settings

Use each chart's Settings panel to adjust visualization.

  • Chart type: Line / Area / Bar / Table
  • Chart title: Auto / custom / hidden
  • Legend position: Bottom / right
  • Y-axis range: Auto, or specify min and max
  • Y-axis order: Swap left/right axes
  • Hover card: Detail on mouse-over

Keeping charts on the same time range makes visual comparison easier.


Time range and zoom

Use the top global toolbar to change the time range and aggregation interval.

Range | Good for
Last 1 hour | Monitoring a training run in progress
Last 6 hours | Short training-flow review
Last 24 hours | Reviewing overnight runs
Last 7d / 30d | Long-term trend analysis

Drag on the chart to zoom in. The Revert range button steps back through your zoom history.
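
One way to picture the zoom history is as a stack: each drag pushes the range you are leaving, and Revert range pops it back. The sketch below is only a mental model under that assumption; ZoomHistory and its methods are hypothetical names and do not describe how the explorer is implemented.

```python
class ZoomHistory:
    """Stack-style model of zoom and revert (an assumption, not product code)."""

    def __init__(self, start, end):
        self.current = (start, end)
        self._previous = []  # ranges to return to, most recent last

    def zoom(self, start, end):
        # Remember the range being left so revert() can restore it.
        self._previous.append(self.current)
        self.current = (start, end)

    def revert(self):
        # Step back one zoom level; stay put once the original range is reached.
        if self._previous:
            self.current = self._previous.pop()
        return self.current


history = ZoomHistory(0, 3600)   # e.g. the last hour, in seconds
history.zoom(1200, 2400)
history.zoom(1500, 1800)
print(history.revert())  # (1200, 2400)
print(history.revert())  # (0, 3600)
```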


URL sharing

The Share button in the top toolbar copies a URL that encodes the current chart layout and time range. Send it to a teammate or attach it to an incident report to reproduce the same view exactly.

For incident analysis

When you spot an anomaly: zoom → copy share link → drop in Channel Talk / Slack. The recipient opens the same view in one click.


CSV download

The Download CSV button in the top toolbar exports the current chart's time series. From there you can dig in with Excel, Python pandas, or other tools.

CSV columns: timestamp + the value of each query.
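
As a starting point for analysis outside the explorer, the sketch below reads an export with Python's standard csv module and flags queries that look idle. The column names, timestamp format, sample values, and the 5% cutoff are all assumptions made for illustration; check them against a real export before relying on the result.

```python
import csv
import io

# Hypothetical export: a timestamp column plus one column per query.
SAMPLE_CSV = """timestamp,vm-a GPU utilization,vm-b GPU utilization
2024-01-01T00:00:00Z,92.0,3.0
2024-01-01T00:01:00Z,88.5,2.5
2024-01-01T00:02:00Z,90.5,4.0
"""


def column_means(csv_text):
    """Return the mean of every query column, skipping blank cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = {name: [] for name in reader.fieldnames if name != "timestamp"}
    for row in reader:
        for name in values:
            if row[name]:
                values[name].append(float(row[name]))
    return {name: sum(v) / len(v) for name, v in values.items() if v}


def idle_queries(csv_text, threshold=5.0):
    """Names of queries whose mean is below `threshold` (an assumed cutoff)."""
    return [name for name, mean in column_means(csv_text).items() if mean < threshold]


print(column_means(SAMPLE_CSV))
print(idle_queries(SAMPLE_CSV))
```

For a downloaded file, pass the file's contents instead of SAMPLE_CSV, for example the text returned by open(path, newline="").read().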


Saved metrics

You can save frequently used chart layouts under Saved metrics for quick recall.

Stored in the browser

Saved metrics are stored only in your current browser and are not available on other devices or browsers. For layouts you want to share with your team, use URL sharing.


Create an alert from a chart

When you want a threshold-based alert on something like GPU utilization, click Create alert on the chart. You're sent to the alert creation page with the current query and range pre-filled.

Alert rules that already apply to the chart are drawn as threshold lines; the View alert rule → link jumps to the detail page.


Tips

When GPU utilization is low

  • Data loader bottleneck → raise num_workers, enable prefetching
  • Batch size too small → increase the batch
  • CPU-bound work → compare with the CPU chart, profile

When GPU memory is near 100%

  • Reduce batch size, enable mixed precision (fp16/bf16)
  • Enable gradient checkpointing
  • Move to a larger GPU instance type

When system memory is near 100%

  • Reduce data caching
  • Try pin_memory=False in the DataLoader
  • Move to a larger-memory instance
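
The three checklists above can be folded into a small triage helper. This is a sketch only: suggest_actions is a hypothetical function, and the 30% and 95% cutoffs are assumed stand-ins for "low" and "near 100%", not thresholds defined by the product.

```python
def suggest_actions(gpu_util, gpu_mem, sys_mem, low=30.0, high=95.0):
    """Map utilization percentages to the tips listed above.

    `low` and `high` are assumed cutoffs for "low" and "near 100%".
    """
    actions = []
    if gpu_util < low:
        actions += [
            "raise num_workers and enable prefetching",
            "increase the batch size",
            "compare with the CPU chart and profile",
        ]
    if gpu_mem >= high:
        actions += [
            "reduce batch size or enable mixed precision (fp16/bf16)",
            "enable gradient checkpointing",
            "move to a larger GPU instance type",
        ]
    if sys_mem >= high:
        actions += [
            "reduce data caching",
            "try pin_memory=False in the DataLoader",
            "move to a larger-memory instance",
        ]
    return actions


# Example reading: GPU mostly idle while system RAM is nearly full.
for action in suggest_actions(gpu_util=12.0, gpu_mem=40.0, sys_mem=97.0):
    print("-", action)
```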

Next steps

  • Alerts: notifications when metrics cross a threshold
  • Compute dashboard: aggregated view across all VMs
  • Audit log: check whether a metric anomaly was caused by an intended change