GPU Xid errors — which ones mean a node is dying

NVIDIA Xid errors are the GPU’s way of telling you something went wrong — but most of them are noise, and a few mean a node is about to take a training run down with it. Here’s what the common codes mean, which ones to act on, and how to stop drowning in the rest.

Common Xid codes

A representative reference. Severity depends on your driver and GPU generation — always cross-check against NVIDIA’s official Xid table.

| Xid | Meaning | Action |
| --- | --- | --- |
| Xid 13 | Graphics Engine Exception: a graphics/compute engine exception; usually an application bug, occasionally failing hardware. | usually app error |
| Xid 31 | GPU memory page fault: an illegal memory access by the workload (MMU fault); almost always an application or framework bug. | usually app error |
| Xid 43 | GPU stopped processing: a channel was aborted after a workload-side error; the GPU itself is usually fine. | usually app error |
| Xid 45 | Preemptive channel teardown: channel torn down on a normal process exit or kill; high-volume and usually benign. | usually benign |
| Xid 48 | Double Bit ECC Error (DBE): an uncorrectable double-bit memory error; drain the node, RMA on recurrence. | drain / RMA |
| Xid 63 | ECC row-remap / page-retirement recorded: a memory row was successfully remapped (or page retired); not fatal alone, but trend it. | watch (early signal) |
| Xid 64 | ECC row-remap / page-retirement failure: a remap or page-retirement attempt failed; treat as failing memory and drain. | drain / RMA |
| Xid 74 | NVLink Error: an NVLink / NVSwitch fault; often hardware, can cascade across a node; isolate and drain. | drain / RMA |
| Xid 79 | GPU has fallen off the bus: the GPU is no longer reachable on PCIe; a serious hardware fault; drain immediately. | drain / RMA |
| Xid 92 | High single-bit ECC error rate: correctable single-bit errors above threshold; a rising rate predicts failure; trend it. | watch (early signal) |
| Xid 94 | Contained ECC error: an uncorrectable error contained to one application; drain after the job; RMA on recurrence. | drain / RMA |
| Xid 95 | Uncontained ECC error: an uncorrectable error that escaped containment; may have corrupted other work; drain now. | drain / RMA |
| Xid 119 / 120 | GSP RPC timeout / error: the GPU System Processor (GSP firmware) timed out or errored; sometimes transient, sometimes needs a reset. | watch (early signal) |

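
The table above can be carried into tooling as a plain lookup. The following is a minimal Python sketch that only mirrors this guide's table; it is not NVIDIA's official severity classification. The names `Action`, `XID_ACTIONS`, and `action_for` are hypothetical, and treating unknown codes as "watch" is an assumption made here for illustration.

```python
from enum import Enum


class Action(Enum):
    APP_ERROR = "usually app error"
    BENIGN = "usually benign"
    WATCH = "watch (early signal)"
    DRAIN = "drain / RMA"


# Mirrors the reference table in this guide, one entry per Xid code.
XID_ACTIONS = {
    13: Action.APP_ERROR,
    31: Action.APP_ERROR,
    43: Action.APP_ERROR,
    45: Action.BENIGN,
    48: Action.DRAIN,
    63: Action.WATCH,
    64: Action.DRAIN,
    74: Action.DRAIN,
    79: Action.DRAIN,
    92: Action.WATCH,
    94: Action.DRAIN,
    95: Action.DRAIN,
    119: Action.WATCH,
    120: Action.WATCH,
}


def action_for(xid: int) -> Action:
    """Return the table's default action for an Xid code.

    Codes not in the table fall back to WATCH (an assumption, not a rule
    from NVIDIA's documentation).
    """
    return XID_ACTIONS.get(xid, Action.WATCH)
```

For example, `action_for(79)` returns `Action.DRAIN` and `action_for(45)` returns `Action.BENIGN`. Because severity depends on driver and GPU generation, a real deployment would likely keep this mapping in configuration rather than in code.
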
The noise problem

The trap with Xid monitoring is alerting on the wrong axis. The high-frequency codes — 45 on every process teardown, 31 on every workload memory bug — bury the rare ones that mean a card is failing. Page on all of them and on-call learns to mute the channel. Page on none and you miss Xid 79.

The fix is to absorb the benign and trend the correctable, so a human only hears about the node that’s actually dying. That is what Plexus does by default: it separates signal from noise, suppresses the flapping and app-level Xids, and surfaces the uncorrectable and bus-level ones with correlated context and a recommended action attached.
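
To make the "absorb, trend, page" split concrete, here is a minimal Python sketch of a routing policy. It illustrates the idea only and is not a description of how Plexus is implemented. The class name `XidRouter`, the three code sets, and the 600-second cooldown are all hypothetical choices.

```python
from dataclasses import dataclass, field

ABSORB = {13, 31, 43, 45}        # app-level or benign: count, never page
TREND = {63, 92, 119, 120}       # correctable or transient: trend, do not page
PAGE = {48, 64, 74, 79, 94, 95}  # uncorrectable or bus-level: page a human


@dataclass
class XidRouter:
    cooldown_s: float = 600.0  # hypothetical flap-suppression window
    last_page: dict = field(default_factory=dict)  # (node, xid) -> last page time
    counts: dict = field(default_factory=dict)     # (node, xid) -> events seen

    def route(self, node: str, xid: int, ts: float) -> str:
        """Return 'absorb', 'trend', 'page', 'suppress', or 'review'."""
        key = (node, xid)
        # Every event is counted, even if it never reaches a human.
        self.counts[key] = self.counts.get(key, 0) + 1
        if xid in ABSORB:
            return "absorb"
        if xid in TREND:
            return "trend"
        if xid in PAGE:
            last = self.last_page.get(key)
            # Repeats of the same code on the same node inside the
            # cooldown are suppressed so one fault does not page repeatedly.
            if last is not None and ts - last < self.cooldown_s:
                return "suppress"
            self.last_page[key] = ts
            return "page"
        return "review"  # unknown code: leave for a human to classify
```

With this sketch, a burst of Xid 79 events on one node produces a single page followed by suppressed repeats, while the same code on a different node still pages immediately.
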

Questions

What is an Xid error?

An Xid is an error report from the NVIDIA driver, printed to the kernel log (dmesg) and exposed through NVML/DCGM. The Xid number identifies the class of problem — anything from an application bug to failing GPU memory to a card that has dropped off the PCIe bus. The number alone doesn’t tell you severity; context does.
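
As an illustration of reading Xids from the kernel log, the Python sketch below extracts the PCI address and Xid number from a log line. It assumes the commonly seen `NVRM: Xid (PCI:...): <number>, ...` layout; the exact text after the number varies by driver version, so the pattern deliberately ignores it. The sample line is illustrative, not captured from a real system, and the names `XID_RE`, `XidEvent`, and `parse_xid` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Matches the "NVRM: Xid (PCI:0000:3b:00): 79," prefix of a kernel log line.
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+),")


class XidEvent(NamedTuple):
    pci: str  # PCI address of the reporting GPU, as printed by the driver
    xid: int  # numeric Xid code


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the Xid event in a kernel log line, or None if there is none."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return XidEvent(pci=m.group("pci"), xid=int(m.group("xid")))


# Illustrative sample line (not real output).
sample = "[1234.567] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
event = parse_xid(sample)
```

Here `event` holds the PCI address `PCI:0000:3b:00` and the code `79`. The number is only the starting point: as noted above, severity comes from context such as recurrence and which other codes appear alongside it.
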

Which Xid errors actually require draining a node?

The uncorrectable-memory and bus-level ones: 48 (double-bit ECC), 79 (fallen off the bus), 94 / 95 (contained / uncontained ECC), 64 (remap failure), and most 74 (NVLink) events. For 63, 92, and 119/120, watch rather than drain by default; the trend is the signal. The 13/31/43/45 family is usually the workload, not the hardware.

Why are most Xid alerts noise?

Because the high-frequency Xids (45, 31, 13, 43) are dominated by application behavior and normal process teardown, while the ones that matter (48, 79, 94/95) are comparatively rare and easy to bury. A naive rule that pages on “any Xid” trains on-call to ignore the channel — which is how a real Xid 79 gets missed at 3am.

How do you catch a failing GPU before it takes a job down?

Trend the correctable signals, single-bit ECC rate (Xid 92) and row remaps (Xid 63), rather than alerting on each one. A rising rate can precede uncorrectable failure by hours. That’s the difference between draining a node on your schedule and losing a multi-GPU training run to it.
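
One simple way to trend rather than alert per event is to compare the event count in the most recent window with the count in the window before it. The Python sketch below does that for a sorted list of event timestamps (in seconds) for one GPU and one Xid code. The window length, minimum count, and ratio are hypothetical defaults chosen for illustration, not tuned thresholds.

```python
from bisect import bisect_left


def count_in(timestamps, start, end):
    """Count sorted timestamps t with start <= t < end."""
    return bisect_left(timestamps, end) - bisect_left(timestamps, start)


def is_rising(timestamps, now, window_s=3600.0, min_events=3, ratio=2.0):
    """Flag a rising rate of correctable events.

    Compares the latest window [now - window_s, now) with the one before it.
    Returns True only when the latest window has at least `min_events`
    events and at least `ratio` times as many as the previous window.
    """
    recent = count_in(timestamps, now - window_s, now)
    previous = count_in(timestamps, now - 2 * window_s, now - window_s)
    # max(previous, 1) avoids treating a single first event as an infinite rise.
    return recent >= min_events and recent >= ratio * max(previous, 1)
```

A steady stream of events, even a high one, does not trip this check; only an increase between the two windows does. That matches the goal of flagging the node whose error rate is climbing rather than every node that has ever logged a correctable error.
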