NVIDIA Xid errors are the GPU’s way of telling you something went wrong — but most of them are noise, and a few mean a node is about to take a training run down with it. Here’s what the common codes mean, which ones to act on, and how to stop drowning in the rest.
The table below is a representative reference, not an exhaustive list. Severity depends on your driver and GPU generation, so always cross-check against NVIDIA’s official Xid table.
| Xid | Name | Meaning | Action |
|---|---|---|---|
| Xid 13 | Graphics Engine Exception | A graphics/compute engine exception; usually an application bug, occasionally failing hardware. | usually app error |
| Xid 31 | GPU memory page fault | An illegal memory access by the workload (MMU fault); almost always an application or framework bug. | usually app error |
| Xid 43 | GPU stopped processing | A channel was aborted after a workload-side error; the GPU itself is usually fine. | usually app error |
| Xid 45 | Preemptive channel teardown | Channel torn down on a normal process exit or kill; high-volume and usually benign. | usually benign |
| Xid 48 | Double Bit ECC Error (DBE) | An uncorrectable double-bit memory error; drain the node, RMA on recurrence. | drain / RMA |
| Xid 63 | ECC row-remap / page-retirement recorded | A memory row was successfully remapped (or page retired); not fatal alone, but trend it. | watch (early signal) |
| Xid 64 | ECC row-remap / page-retirement failure | A remap or page-retirement attempt failed; treat as failing memory and drain. | drain / RMA |
| Xid 74 | NVLink Error | An NVLink / NVSwitch fault; often hardware, can cascade across a node; isolate and drain. | drain / RMA |
| Xid 79 | GPU has fallen off the bus | The GPU is no longer reachable on PCIe; a serious hardware fault; drain immediately. | drain / RMA |
| Xid 92 | High single-bit ECC error rate | Correctable single-bit errors above threshold; a rising rate predicts failure; trend it. | watch (early signal) |
| Xid 94 | Contained ECC error | An uncorrectable error contained to one application; drain after the job; RMA on recurrence. | drain / RMA |
| Xid 95 | Uncontained ECC error | An uncorrectable error that escaped containment; may have corrupted other work; drain now. | drain / RMA |
| Xid 119 / 120 | GSP RPC timeout / error | The GPU System Processor (GSP firmware) timed out or errored; sometimes transient, sometimes needs a reset. | watch (early signal) |
The trap with Xid monitoring is alerting on the wrong axis. The high-frequency codes — 45 on every process teardown, 31 on every workload memory bug — bury the rare ones that mean a card is failing. Page on all of them and on-call learns to mute the channel. Page on none and you miss Xid 79.
The fix is to absorb the benign and trend the correctable, so a human only hears about the node that’s actually dying. That is what Plexus does by default: it separates signal from noise, suppresses the flapping and app-level Xids, and surfaces the uncorrectable and bus-level ones with the correlated context and a recommended action attached.
An Xid is an error report from the NVIDIA driver, printed to the kernel log (dmesg) and exposed through NVML/DCGM. The Xid number identifies the class of problem — anything from an application bug to failing GPU memory to a card that has dropped off the PCIe bus. The number alone doesn’t tell you severity; context does.
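To make the kernel-log path concrete, here is a minimal sketch of pulling the Xid number out of a log line. It assumes the commonly seen `NVRM: Xid (PCI:<address>): <number>, <details>` shape; the exact format varies by driver version, so treat the regex and the names `XID_RE`, `XidEvent`, and `parse_xid_line` as illustrative rather than as a stable interface.

```python
import re
from typing import NamedTuple, Optional

# Assumed line shape: "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, ..."
# Some drivers omit the "PCI:" prefix, so it is optional here.
XID_RE = re.compile(
    r"NVRM: Xid \((?:PCI:)?(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+)(?:, (?P<detail>.*))?"
)


class XidEvent(NamedTuple):
    pci: str     # PCI address of the reporting GPU
    xid: int     # Xid number, i.e. the class of problem
    detail: str  # free-form remainder of the line (may be empty)


def parse_xid_line(line: str) -> Optional[XidEvent]:
    """Return an XidEvent if the line contains an Xid report, else None."""
    match = XID_RE.search(line)
    if match is None:
        return None
    detail = (match.group("detail") or "").strip()
    return XidEvent(match.group("pci"), int(match.group("xid")), detail)
```

Feeding it the output of `dmesg` line by line yields one event per Xid report. The detail string is kept because the number alone does not settle severity.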
Act immediately on the uncorrectable-memory and bus-level Xids: 48 (double-bit ECC), 79 (fallen off the bus), 94 / 95 (contained / uncontained ECC), 64 (remap failure), and most 74 (NVLink) events. Watch, but don’t necessarily drain, on 63, 92, and 119/120, where a trend is the signal. The 13/31/43/45 family is usually the workload, not the hardware.
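That grouping can be written down as a small lookup. This is a sketch of the triage described here, not NVIDIA's or any product's policy: the names `Action`, `triage`, and `should_page` are hypothetical, Xid 74 is treated as drain-by-default even though only most NVLink events warrant it, and unlisted codes are returned as unknown so they get a human look instead of being dropped.

```python
from enum import Enum


class Action(Enum):
    DRAIN = "drain"      # uncorrectable memory or bus-level fault
    WATCH = "watch"      # early signal: trend it, don't page per event
    APP = "app"          # usually the workload, not the hardware
    UNKNOWN = "unknown"  # not in this reference; check NVIDIA's table


# Groups taken from the text; 74 is a simplification ("most" NVLink events).
DRAIN_XIDS = frozenset({48, 64, 74, 79, 94, 95})
WATCH_XIDS = frozenset({63, 92, 119, 120})
APP_XIDS = frozenset({13, 31, 43, 45})


def triage(xid: int) -> Action:
    """Map an Xid number to the coarse action suggested by the reference."""
    if xid in DRAIN_XIDS:
        return Action.DRAIN
    if xid in WATCH_XIDS:
        return Action.WATCH
    if xid in APP_XIDS:
        return Action.APP
    return Action.UNKNOWN


def should_page(xid: int) -> bool:
    """Page a human only for the drain-class Xids."""
    return triage(xid) is Action.DRAIN
```

Because severity depends on driver and GPU generation, the three sets are the part to adjust per fleet.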
Xid alerting gets noisy because the high-frequency Xids (45, 31, 13, 43) are dominated by application behavior and normal process teardown, while the ones that matter (48, 79, 94/95) are comparatively rare and easy to bury. A naive rule that pages on “any Xid” trains on-call to ignore the channel, which is how a real Xid 79 gets missed at 3am.
Trend the correctable signals, single-bit ECC rate (Xid 92) and row remaps (Xid 63), rather than alerting on each one. A rising rate can precede uncorrectable failure by hours. That’s the difference between draining a node on your schedule and losing a multi-GPU training run to it.
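One simple way to trend instead of alerting per event is to compare the event count in the most recent window with the count in the window before it. The sketch below does that for a single GPU and a single Xid. The class name `RateTrend`, the one-hour window, and the minimum of three events are illustrative assumptions, not thresholds from NVIDIA, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque
from typing import Deque


class RateTrend:
    """Tracks event timestamps and flags a rising rate between two windows."""

    def __init__(self, window_s: float = 3600.0, min_events: int = 3) -> None:
        self.window_s = window_s      # length of each comparison window
        self.min_events = min_events  # ignore very small counts
        self._times: Deque[float] = deque()

    def record(self, ts: float) -> None:
        """Record one correctable event (e.g. an Xid 63 or 92) at time ts."""
        self._times.append(ts)
        # Keep only what the two comparison windows can still need.
        cutoff = ts - 2 * self.window_s
        while self._times and self._times[0] < cutoff:
            self._times.popleft()

    def is_rising(self, now: float) -> bool:
        """True if the latest window has more events than the previous one."""
        w = self.window_s
        recent = sum(1 for t in self._times if now - w < t <= now)
        previous = sum(1 for t in self._times if now - 2 * w < t <= now - w)
        return recent >= self.min_events and recent > previous
```

A scheduler hook could call `is_rising` periodically and mark the node for a planned drain when it returns true, which keeps individual correctable events out of the paging path.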