Metrics Explorer
Overview
The Metrics Explorer is a tool for comparing GPU, CPU, memory, network, and disk metrics in time-series charts. Use it to monitor training, diagnose bottlenecks, detect idle resources, and track anomalous patterns. You can build multiple queries per chart, share the view by URL, or export to CSV.
Required permission: Metric.Metric.READ
How to open it
| Entry point | Purpose |
|---|---|
| Monitoring > Metrics | Compare multiple VMs and metrics on one screen (the full explorer) |
| Compute > Virtual Machines > pick a VM > Metrics tab | Preconfigured charts for a single VM |
| Compute > Dashboard > click a VM | Jumps to that VM's Metrics tab |
Key metrics
| Metric | Description | Use |
|---|---|---|
| GPU utilization (%) | GPU SM core utilization | GPU usage during training |
| GPU memory usage | GPU VRAM consumption | Catch OOM risk early |
| GPU memory / SM clock | GPU operating clocks | Diagnose throttling |
| GPU power / temperature | Power draw and temperature | Whether the hardware is at its limits |
| CPU utilization (%) | CPU core utilization | Spot data-loader bottlenecks |
| Memory utilization (%) | System RAM utilization | Detect memory pressure |
| Network I/O | Inbound / outbound traffic | Confirm data transfer rates |
| Block storage usage | Disk utilization | Watch for full disks |
Building queries
Each chart can hold multiple queries, letting you overlay several metrics on the same chart for comparison.
1. Add a query
Click Add query in the top right of the chart to add a new query row.
Each query has these fields:
| Field | Description |
|---|---|
| Resource type | Virtual Machine / Virtual Cluster, etc. |
| VM | The target VM (search and pick) |
| Metric | GPU utilization, GPU memory, etc. |
| Split by | All (aggregate) or Per item (e.g. plot GPU 0 and GPU 1 separately) |
2. Unit consistency
A single chart can only hold metrics with the same unit (you can't mix a % metric with a MB/s metric, for example). For different units, add another chart.
3. Duplicate / delete queries
The Duplicate / Delete buttons on each query row make quick variations easy (e.g. apply the same metric to a different VM).
Managing charts
| Action | Description |
|---|---|
| Add chart | Up to 8 charts per page (MAX_CHARTS) |
| Duplicate chart | Copy an existing chart's queries and settings |
| Move chart | Reorder with the up/down buttons |
| Delete chart | Remove the chart |
| Expand / collapse | Temporarily shrink the chart area |
Each chart accepts up to 12 queries (MAX_QUERIES).
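The limits above interact with the unit-consistency rule from Building queries: queries with different units need separate charts, each chart holds at most MAX_QUERIES queries, and a page holds at most MAX_CHARTS charts. The sketch below shows one way to check ahead of time whether a set of queries fits on a single page. The Query class, the metric keys, and the UNITS mapping are hypothetical names made up for this illustration and are not part of the product.

```python
from collections import defaultdict
from dataclasses import dataclass

MAX_CHARTS = 8    # charts per page, as documented above
MAX_QUERIES = 12  # queries per chart, as documented above

# Hypothetical metric -> unit mapping, for illustration only.
UNITS = {
    "gpu_utilization": "%",
    "cpu_utilization": "%",
    "memory_utilization": "%",
    "network_in": "MB/s",
    "network_out": "MB/s",
}


@dataclass(frozen=True)
class Query:
    vm: str
    metric: str
    split_by: str = "All"  # "All" or "Per item"


def plan_charts(queries):
    """Group queries by unit, then split each group into charts of at most MAX_QUERIES."""
    by_unit = defaultdict(list)
    for q in queries:
        by_unit[UNITS[q.metric]].append(q)

    charts = []
    for unit, group in by_unit.items():
        for i in range(0, len(group), MAX_QUERIES):
            charts.append((unit, group[i:i + MAX_QUERIES]))

    if len(charts) > MAX_CHARTS:
        raise ValueError(f"{len(charts)} charts needed, page limit is {MAX_CHARTS}")
    return charts
```

For example, 14 GPU utilization queries plus one network query come out as three charts: two % charts (12 and 2 queries) and one MB/s chart.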
Chart settings
Use each chart's Settings panel to adjust visualization.
- Chart type: Line / Area / Bar / Table
- Chart title: Auto / custom / hidden
- Legend position: Bottom / right
- Y-axis range: Auto, or specify min and max
- Y-axis order: Swap left/right axes
- Hover card: Detail on mouse-over
Aligning charts to the same time range makes visual comparison easier.
Time range and zoom
Use the top global toolbar to change the time range and aggregation interval.
| Range | Good for |
|---|---|
| Last 1 hour | Monitoring a training run in progress |
| Last 6 hours | Reviewing a short training run |
| Last 24 hours | Reviewing overnight runs |
| Last 7d / 30d | Long-term trend analysis |
Drag on the chart to zoom in. The Revert range button steps back through your zoom history.
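The zoom and revert behavior described above can be pictured as a stack of time ranges: each drag pushes a narrower range, and Revert range pops back to the previous one. This is only an illustrative model, not the explorer's actual implementation; the ZoomHistory class and its method names are invented for this sketch, and ranges are plain numbers (for example, seconds).

```python
class ZoomHistory:
    """Stack of (start, end) time ranges; illustrative model only."""

    def __init__(self, start, end):
        # The initial range is the one chosen in the global toolbar.
        self._stack = [(start, end)]

    @property
    def current(self):
        return self._stack[-1]

    def zoom(self, start, end):
        """Push a narrower range that lies inside the current one."""
        cur_start, cur_end = self.current
        if not (cur_start <= start < end <= cur_end):
            raise ValueError("zoom range must lie inside the current range")
        self._stack.append((start, end))

    def revert(self):
        """Step back one zoom level; the initial range is never removed."""
        if len(self._stack) > 1:
            self._stack.pop()
        return self.current
```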
URL sharing
The Share button in the top toolbar copies a URL that encodes the current chart layout and time range. Send it to a teammate or attach it to an incident report to reproduce the same view exactly.
When you spot an anomaly: zoom → copy share link → drop in Channel Talk / Slack. The recipient opens the same view in one click.
CSV download
The Download CSV button in the top toolbar exports the current chart's time series. From there you can dig in with Excel, Python pandas, or other tools.
CSV columns: timestamp + the value of each query.
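As a starting point for analysis, here is a minimal sketch that reads an export with Python's standard csv module and averages each query column. It assumes the first column is literally named timestamp, that timestamps are ISO 8601 strings, and that an empty cell means a missing sample. None of these details are confirmed by this page, and the column names and values in SAMPLE are made up, so check them against a real download.

```python
import csv
import io
from datetime import datetime
from statistics import mean

# Made-up sample standing in for a downloaded file; real column names will differ.
SAMPLE = """timestamp,vm-a GPU utilization,vm-b GPU utilization
2024-05-01T00:00:00,91.0,12.5
2024-05-01T00:01:00,88.5,
2024-05-01T00:02:00,93.5,10.5
"""


def load_series(fileobj):
    """Return (timestamps, {column: [float or None, ...]}) from an exported CSV."""
    reader = csv.DictReader(fileobj)
    value_cols = [c for c in reader.fieldnames if c != "timestamp"]
    timestamps = []
    series = {c: [] for c in value_cols}
    for row in reader:
        timestamps.append(datetime.fromisoformat(row["timestamp"]))
        for c in value_cols:
            cell = row[c]
            # Assumption: an empty cell means no sample for that interval.
            series[c].append(float(cell) if cell else None)
    return timestamps, series


def column_means(series):
    """Mean of each column, ignoring missing samples."""
    return {c: mean(v for v in vals if v is not None) for c, vals in series.items()}


timestamps, series = load_series(io.StringIO(SAMPLE))
print(column_means(series))
```

To read a real file, pass open("export.csv", newline="") instead of the StringIO object. With pandas, pd.read_csv(path, parse_dates=["timestamp"]) loads the same shape of data into a DataFrame under the same column-name assumption.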
Saved metrics
You can save frequently used chart layouts under Saved metrics for quick recall.
Saved metrics are stored only in your current browser and are not available on other devices or browsers. For layouts you want to share with your team, use URL sharing.
Create an alert from a chart
When you want a threshold-based alert on something like GPU utilization, click Create alert on the chart. You're sent to the alert creation page with the current query and range pre-filled.
Alert rules that already apply to the chart are drawn as threshold lines; the View alert rule → link jumps to the detail page.
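Before committing to a threshold, it can help to replay it against exported data and see how often it would have been crossed. The sketch below finds runs where a series stays past a threshold for a minimum number of consecutive samples. It is an offline sanity check only: how the alerting service actually evaluates rules is not described here, and the function name, parameters, and sample values are all illustrative.

```python
def sustained_breaches(values, threshold, min_points, above=True):
    """Return (start, end) index pairs where values stay past the threshold
    for at least min_points consecutive samples. end is exclusive."""
    runs = []
    start = None
    for i, v in enumerate(values):
        breached = v > threshold if above else v < threshold
        if breached and start is None:
            start = i  # a new run begins
        elif not breached and start is not None:
            if i - start >= min_points:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the series.
    if start is not None and len(values) - start >= min_points:
        runs.append((start, len(values)))
    return runs


# GPU utilization samples (%); illustrative values only.
gpu_util = [95, 90, 8, 5, 4, 92, 3, 2, 1, 0]
print(sustained_breaches(gpu_util, threshold=10, min_points=3, above=False))
# → [(2, 5), (6, 10)]
```

Setting above=False looks for idle periods (utilization below the threshold), while the default above=True suits metrics such as GPU memory or temperature.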
Tips
When GPU utilization is low
- Data loader bottleneck → raise num_workers, enable prefetching
- Batch size too small → increase the batch size
- CPU-bound work → compare with the CPU chart, profile
When GPU memory is near 100%
- Reduce batch size, enable mixed precision (fp16/bf16)
- Enable gradient checkpointing
- Move to a larger GPU instance type
When system memory is near 100%
- Reduce data caching
- Try pin_memory=False in the DataLoader
- Move to a larger-memory instance
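The DataLoader-related tips above (num_workers, prefetching, pin_memory) map onto keyword arguments of PyTorch's torch.utils.data.DataLoader. The helper below is a small sketch that builds those arguments without importing PyTorch. The function name and the specific values are illustrative starting points, not tuned recommendations.

```python
def dataloader_kwargs(num_workers, prefetch_factor=2, pin_memory=True):
    """Build keyword arguments for torch.utils.data.DataLoader.

    prefetch_factor and persistent_workers are only valid with worker
    processes, so they are omitted when num_workers is 0.
    """
    kwargs = {"num_workers": num_workers, "pin_memory": pin_memory}
    if num_workers > 0:
        kwargs["prefetch_factor"] = prefetch_factor
        kwargs["persistent_workers"] = True
    return kwargs


# Low GPU utilization with a busy data loader: more workers, deeper prefetch.
fast_input = dataloader_kwargs(num_workers=8, prefetch_factor=4)

# System RAM near 100%: turn off pinned host memory.
low_memory = dataloader_kwargs(num_workers=4, pin_memory=False)

# Usage, assuming PyTorch is installed and `dataset` is defined:
#   from torch.utils.data import DataLoader
#   loader = DataLoader(dataset, batch_size=64, **fast_input)
```

After changing these settings, compare the GPU utilization, CPU utilization, and memory utilization charts over the same time range to confirm the change had the intended effect.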
Next steps
- Alerts: notifications when metrics cross a threshold
- Compute dashboard: aggregated view across all VMs
- Audit log: check whether a metric anomaly was caused by an intended change