
50 alerts a week. Two of them matter.

On-call for a GPU fleet is a special kind of drowning: every layer — DCGM, ECC, thermal, power, NVLink, the BMC, the scheduler — alerts on its own, and almost none of it needs you. Here’s why the firehose exists, which handful of alerts actually mean a node is dying, and how to cut the rest without missing the one that matters.

The math of a GPU fleet

A GPU node is not one thing that alerts. It’s a stack of them. The GPU emits Xid errors and ECC counts through DCGM. The baseboard management controller reports thermal and power. NVLink and the fabric have their own faults. The scheduler complains when a job dies. Multiply by hundreds or thousands of nodes, across Dell/Supermicro/Pegatron hardware that each behaves a little differently, and the event volume is enormous.

And it’s not paranoia: the hardware really does fail at scale. Meta’s published Llama 3 training run logged 419 unexpected interruptions over a 54-day window across 16,384 H100s, roughly one every three hours, and attributed about 78% of them to confirmed or suspected hardware issues. So the fleet genuinely needs watching. The problem isn’t too little signal. It’s that the signal arrives buried in a firehose, and the tools forward all of it.

A typical infrastructure on-call sees on the order of 50 alerts a week — and only a few percent are actually actionable. The other ~95% is what trains people to mute the channel.

Why the noise exists

Most of the firehose is structural, not a tuning mistake. A huge share of GPU Xids are the workload, not the hardware: preemptive channel cleanup when a process is killed or exits after an earlier error (Xid 45), application memory faults (Xid 31), graphics engine exceptions (Xid 13). They fire constantly and rarely say anything about fleet health. Meanwhile the monitoring tool, unable to tell which is which, forwards all of it to stay safe. That’s the deeper cause: monitoring offloads the judgment onto you, so it over-alerts. We wrote a whole manifesto on this.

The few that actually matter

Out of a noisy hour, the alerts that genuinely need a person are few and specific:

  • Uncorrectable memory + bus faults. Double-bit ECC (Xid 48), GPU fallen off the bus (Xid 79), contained/uncontained ECC (Xid 94/95), NVLink errors (Xid 74). Drain, and RMA on recurrence.
  • A thermal or power cascade. One root event that rolls up dozens of correlated downstream alerts across a rack — the kind that looks like 48 problems and is really one.
  • A rising correctable-error trend. Single-bit ECC rate or row-remap count climbing — the quiet signal that predicts an uncorrectable failure hours ahead, and the easiest to miss in the noise.

Two or three of these a week, buried in ~50 alerts. The whole job is finding them without drowning — and without muting the channel so hard you miss the Xid 79 at 3am.
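
To make the third item concrete, here is a minimal sketch of trending a correctable-error counter instead of paging on each increment. It assumes you already sample a cumulative single-bit ECC (or row-remap) counter once per fixed window, however your exporter exposes it. The function names, the reset handling, and the three-window rule are illustrative assumptions, not a recommended policy.

```python
from typing import Sequence


def window_deltas(counter_samples: Sequence[int]) -> list[int]:
    """Turn a cumulative error counter into per-window increments.

    Assumption: a drop in the counter means it was reset (for example by a
    reboot or driver reload), so the new value is taken as the increment.
    """
    deltas = []
    for prev, cur in zip(counter_samples, counter_samples[1:]):
        deltas.append(cur - prev if cur >= prev else cur)
    return deltas


def is_rising(deltas: Sequence[int], min_windows: int = 3) -> bool:
    """True if the last `min_windows` increments are strictly increasing."""
    if len(deltas) < min_windows:
        return False
    tail = list(deltas[-min_windows:])
    return all(later > earlier for earlier, later in zip(tail, tail[1:]))


# Hypothetical samples: both GPUs log correctable errors in every window,
# but only the second one is accelerating.
steady = [100, 102, 104, 106, 108]
climbing = [100, 101, 103, 107, 115]

steady_is_rising = is_rising(window_deltas(steady))      # False
climbing_is_rising = is_rising(window_deltas(climbing))  # True
```

A per-event alert would fire equally often on both GPUs. The rate of change is what separates them, which is why this signal is easier to read as a trend than as a page.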

Why “just tune the thresholds” doesn’t work

The usual advice — raise thresholds, add inhibition rules, group and silence in Alertmanager — helps for a week. Then an incident slips through a threshold you raised, it scares you, and you turn the sensitivity back up. That’s the tuning treadmill: every team runs it, nobody gets off it, because a static rule can’t tell a flapping sensor from the first sign of a dying HBM stack. The judgment is the hard part, and a threshold doesn’t have any.
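
A small sketch of why the static rule falls short, using made-up temperature readings and a made-up limit of 85. Both series spend the same number of samples over the limit, so a threshold rule treats them identically. Counting upward crossings is one extra piece of context that happens to separate them here; it is shown only to illustrate the gap, not as a complete answer.

```python
from typing import Sequence


def samples_over_limit(readings: Sequence[float], limit: float) -> int:
    """What a static threshold sees: how many samples exceed the limit."""
    return sum(1 for r in readings if r > limit)


def upward_crossings(readings: Sequence[float], limit: float) -> int:
    """How many times the series moves from at-or-below the limit to above it."""
    return sum(
        1 for prev, cur in zip(readings, readings[1:]) if prev <= limit < cur
    )


LIMIT = 85.0  # hypothetical threshold

flapping = [84, 86, 84, 86, 84, 86, 84, 86]   # sensor bouncing around the limit
degrading = [80, 82, 84, 85, 86, 87, 88, 89]  # steady climb past the limit

# The threshold rule cannot tell these apart: 4 samples over in each.
flapping_over = samples_over_limit(flapping, LIMIT)
degrading_over = samples_over_limit(degrading, LIMIT)

# The shape differs: 4 separate crossings versus 1 sustained excursion.
flapping_cross = upward_crossings(flapping, LIMIT)
degrading_cross = upward_crossings(degrading, LIMIT)
```

Raising the limit to quiet the flapping series would also delay the alert on the degrading one, which is the treadmill in miniature.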

The fix: let the system decide

The durable answer is to move the judgment into the system — without moving it out of view. Resolve the flapping, transient, and app-level noise. Trend the correctable signals instead of paging on each one. Correlate the cascade down to its root. And surface only the two or three that are real — each with its cause and a recommended next step. The default isn’t a notification; the system does the triage and shows its work. That’s what Plexus does, on the DCGM/Prometheus/Thanos/ClickHouse you already run — no migration.
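
As a rough illustration of the correlation step, the sketch below groups alerts from the same rack that arrive close together and treats the earliest one as a candidate root. This is a deliberately naive stand-in, not a description of how Plexus works: the `Alert` fields, the 120-second window, and the "earliest wins" rule are all assumptions, and a real system would also need topology (which PDU or cooling loop feeds which nodes).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    ts: float   # seconds since epoch
    rack: str
    node: str
    name: str


def group_cascades(alerts: list[Alert], window_s: float = 120.0) -> list[list[Alert]]:
    """Group alerts per rack; start a new group when the gap between
    consecutive alerts in that rack exceeds `window_s`.

    Each group is sorted by time, so group[0] is the candidate root.
    """
    by_rack: dict[str, list[Alert]] = defaultdict(list)
    for alert in alerts:
        by_rack[alert.rack].append(alert)

    groups: list[list[Alert]] = []
    for rack in sorted(by_rack):
        current: list[Alert] = []
        for alert in sorted(by_rack[rack], key=lambda a: a.ts):
            if current and alert.ts - current[-1].ts > window_s:
                groups.append(current)
                current = []
            current.append(alert)
        groups.append(current)
    return groups


# Hypothetical alerts: one power event in rack r12 followed by throttling.
alerts = [
    Alert(5.0, "r12", "n03", "thermal_throttle"),
    Alert(0.0, "r12", "pdu", "pdu_power_drop"),
    Alert(9.0, "r12", "n07", "thermal_throttle"),
    Alert(1000.0, "r12", "n03", "fan_speed_low"),
    Alert(3.0, "r07", "n01", "job_failed"),
]
groups = group_cascades(alerts)
roots = [g[0].name for g in groups]
```

Five alerts collapse into three groups, and the three-alert group in r12 surfaces as one item headed by the power event rather than as three separate notifications.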

The bar that makes that safe: when it does page you, the signal is trustworthy enough that you act without re-checking — and when it doesn’t, you can open any held alert and see exactly why. A quiet you can’t verify is just a missed incident waiting to happen, so every call is logged and reversible.

Questions

How many alerts does a GPU fleet actually generate?

More than you’d expect, because every layer alerts independently: DCGM/GPU (Xid, ECC, throttling), the BMC, thermal, power, NVLink, the network, the scheduler. Surveys of infrastructure on-call put a typical engineer at around 50 alerts a week, with only a few percent actually actionable. On a large fleet the raw event count is far higher, and most of it should be absorbed before it ever reaches a person.

Which GPU alerts actually need a human?

The uncorrectable and bus-level faults — double-bit ECC (Xid 48), a GPU falling off the bus (Xid 79), contained/uncontained ECC (Xid 94/95), NVLink errors (Xid 74) — plus a thermal cascade or a rising correctable-error trend that predicts failure hours out. Everything in the 13/31/43/45 family is usually the workload, not the hardware. See the Xid reference.
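
The split above can be written down as a small lookup. This is a sketch under stated assumptions: the two code sets are exactly the ones named in this guide, the "hold" and "review" labels are hypothetical, and the log-line pattern assumes the usual `NVRM: Xid (<device>): <code>, ...` shape of NVIDIA kernel messages, which you should check against your own driver's output.

```python
import re
from typing import Optional

# Codes this guide treats as needing a person.
HARDWARE_XIDS = {48, 74, 79, 94, 95}
# Codes this guide treats as usually caused by the workload. "Usually" matters:
# they are held for trending, not discarded.
WORKLOAD_XIDS = {13, 31, 43, 45}

# Assumed kernel log shape; verify against real lines from your fleet.
XID_LINE = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<xid>\d+)")


def parse_xid(line: str) -> Optional[tuple[str, int]]:
    """Return (device, xid) from a kernel log line, or None if it has no Xid."""
    match = XID_LINE.search(line)
    if match is None:
        return None
    return match.group("device"), int(match.group("xid"))


def triage_xid(xid: int) -> str:
    """Map an Xid code to a hypothetical triage label."""
    if xid in HARDWARE_XIDS:
        return "page"
    if xid in WORKLOAD_XIDS:
        return "hold"
    return "review"  # anything not covered here still gets looked at


sample = "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
parsed = parse_xid(sample)
```

Anything outside the two sets falls through to "review" rather than being silenced, since an unlisted code is unknown, not safe.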

How do you reduce GPU alert fatigue without missing a real failure?

Not by raising thresholds — that’s the tuning treadmill you reverse after the next scare. The durable answer is to let the system do the triage: resolve the flapping, transient, and app-level noise, trend the correctable signals, and surface only the few that are real, each root-caused — while showing its work, so you can see why anything was held. That’s transparent triage — and the bar is that when it does page you, you believe it.