GPU Xid error monitoring — which ones mean a node is dying

Common Xid codes

A representative reference. Severity depends on your driver and GPU generation — always cross-check against NVIDIA’s official Xid table.

Xid	Name	Meaning	Action
Xid 13	Graphics Engine Exception	A graphics/compute engine exception; usually an application bug, occasionally failing hardware.	usually app error
Xid 31	GPU memory page fault	An illegal memory access by the workload (MMU fault); almost always an application or framework bug.	usually app error
Xid 43	GPU stopped processing	A channel was aborted after a workload-side error; the GPU itself is usually fine.	usually app error
Xid 45	Preemptive channel teardown	Channel torn down on a normal process exit or kill; high-volume and usually benign.	usually benign
Xid 48	Double Bit ECC Error (DBE)	An uncorrectable double-bit memory error; drain the node, RMA on recurrence.	drain / RMA
Xid 63	ECC row-remap / page-retirement recorded	A memory row was successfully remapped (or page retired); not fatal alone, but trend it.	watch (early signal)
Xid 64	ECC row-remap / page-retirement failure	A remap or page-retirement attempt failed; treat as failing memory and drain.	drain / RMA
Xid 74	NVLink Error	An NVLink / NVSwitch fault; often hardware, can cascade across a node; isolate and drain.	drain / RMA
Xid 79	GPU has fallen off the bus	The GPU is no longer reachable on PCIe; a serious hardware fault; drain immediately.	drain / RMA
Xid 92	High single-bit ECC error rate	Correctable single-bit errors above threshold; a rising rate predicts failure; trend it.	watch (early signal)
Xid 94	Contained ECC error	An uncorrectable error contained to one application; drain after the job; RMA on recurrence.	drain / RMA
Xid 95	Uncontained ECC error	An uncorrectable error that escaped containment; may have corrupted other work; drain now.	drain / RMA
Xid 119 / 120	GSP RPC timeout / error	The GPU System Processor (GSP firmware) timed out or errored; sometimes transient, sometimes needs a reset.	watch (early signal)
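
For triage in code, the table above can be encoded as a small lookup. This is a minimal sketch in Python: the names `XID_ACTIONS` and `triage` are illustrative and not part of any NVIDIA tool, and returning "unknown" for unlisted codes is an assumption, not official guidance.

```python
# Action buckets copied from the reference table; adjust them for your
# driver version and GPU generation.
XID_ACTIONS = {
    13: "app", 31: "app", 43: "app",          # usually the workload
    45: "benign",                             # normal teardown noise
    48: "drain", 64: "drain", 74: "drain",    # uncorrectable / link faults
    79: "drain", 94: "drain", 95: "drain",
    63: "watch", 92: "watch",                 # trend these
    119: "watch", 120: "watch",
}


def triage(xid: int) -> str:
    """Return the action bucket for an Xid code.

    Codes missing from the table fall back to "unknown" (an assumption),
    so a human can look them up in NVIDIA's Xid documentation.
    """
    return XID_ACTIONS.get(xid, "unknown")
```

Keeping the mapping as plain data makes it easy to review against NVIDIA's table whenever the driver is upgraded.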

The noise problem

The trap with Xid monitoring is alerting on the wrong axis. The high-frequency codes — 45 on every process teardown, 31 on every workload memory bug — bury the rare ones that mean a card is failing. Page on all of them and on-call learns to mute the channel. Page on none and you miss Xid 79.

The fix is to absorb the benign and trend the correctable, so a human only hears about the node that’s actually dying. That is exactly what Plexus does by default — it decides signal from noise, suppresses the flapping and app-level Xids, and surfaces the uncorrectable and bus-level ones with the correlated context and a recommended action attached.


Questions

What is an Xid error?

An Xid is an error report from the NVIDIA driver, printed to the kernel log (dmesg) and exposed through NVML/DCGM. The Xid number identifies the class of problem — anything from an application bug to failing GPU memory to a card that has dropped off the PCIe bus. The number alone doesn’t tell you severity; context does.
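
To act on these reports you first have to pull the Xid number out of the kernel log. The sketch below assumes the common `NVRM: Xid (PCI:<address>): <number>, ...` line shape; the exact text after the number varies by driver version, so the pattern matches only the prefix. The function name `parse_xid` and the sample line are illustrative.

```python
import re
from typing import Optional, Tuple

# Assumed line shape: "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, ..."
# Only the prefix is matched because the trailing detail differs by driver.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+)")


def parse_xid(line: str) -> Optional[Tuple[str, int]]:
    """Return (pci_address, xid) for a kernel log line, or None if absent."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# Illustrative sample line, not captured from a real system.
sample = "[ 1234.567890] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
result = parse_xid(sample)
```

Returning the PCI address alongside the code matters because the same Xid on two different GPUs in one node is a different situation from a repeat on one card.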

Which Xid errors actually require draining a node?

The uncorrectable-memory and bus-level ones: 48 (double-bit ECC), 79 (fallen off the bus), 94 / 95 (contained / uncontained ECC), 64 (remap failure), and most 74 (NVLink) events. Watch — don’t necessarily drain — on 63, 92, 119/120, where a trend is the signal. The 13/31/43/45 family is usually the workload, not the hardware.

Why are most Xid alerts noise?

Because the high-frequency Xids (45, 31, 13, 43) are dominated by application behavior and normal process teardown, while the ones that matter (48, 79, 94/95) are comparatively rare and easy to bury. A naive rule that pages on “any Xid” trains on-call to ignore the channel — which is how a real Xid 79 gets missed at 3am.
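
The imbalance is easy to see with a toy example. The event stream below is invented for illustration (it is not measured data), and `DRAIN_XIDS` follows the drain list given on this page; the point is only to compare how many pages each rule would send.

```python
# Drain-worthy codes as listed on this page; tune for your own fleet.
DRAIN_XIDS = {48, 64, 74, 79, 94, 95}


def count_pages(events, rule):
    """Count how many events a paging rule would alert on."""
    return sum(1 for xid in events if rule(xid))


# Hypothetical day of Xid events on one cluster: mostly teardown and
# application faults, plus a single GPU that fell off the bus.
events = [45] * 200 + [31] * 40 + [13] * 5 + [43] * 4 + [79]

naive_pages = count_pages(events, lambda xid: True)               # page on any Xid
aware_pages = count_pages(events, lambda xid: xid in DRAIN_XIDS)  # page on drain codes only
```

Under these made-up numbers the naive rule sends 250 pages and the severity-aware rule sends one, which is the Xid 79 that matters.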

How do you catch a failing GPU before it takes a job down?

Trend the early-warning signals, single-bit ECC rate (Xid 92) and row remaps (Xid 63), rather than alerting on each one. A rising rate often precedes uncorrectable failure, sometimes by hours. That’s the difference between draining a node on your schedule and losing a multi-GPU training run to it.
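
One simple way to trend rather than alert is to compare the event count in the most recent window with the window before it. The sketch below is an assumption-laden illustration: the function name `is_rising`, the one-hour window, the 2x factor, and the minimum of three events are placeholder values, not recommended thresholds.

```python
from typing import Iterable


def is_rising(
    timestamps: Iterable[float],
    now: float,
    window_s: float = 3600.0,   # placeholder window length in seconds
    factor: float = 2.0,        # placeholder growth factor
    min_events: int = 3,        # ignore tiny counts to avoid flapping
) -> bool:
    """Return True if watch-level Xid events for one GPU are accelerating.

    Compares the count in (now - window_s, now] against the count in the
    preceding window of the same length.
    """
    ts = list(timestamps)
    recent = sum(1 for t in ts if now - window_s < t <= now)
    previous = sum(1 for t in ts if now - 2 * window_s < t <= now - window_s)
    # max(previous, 1) keeps a quiet previous window from triggering on one event.
    return recent >= min_events and recent >= factor * max(previous, 1)
```

Feed it the timestamps of Xid 63 and Xid 92 events per GPU, and schedule a drain when it returns True instead of paging on each individual event.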
