
Xid 95: Uncontained ECC error

An uncorrectable error that escaped containment — may have corrupted other work; drain now.

What it means

Xid 95 (Ampere and later) is an uncorrectable ECC error that the GPU could not contain to a single application, so other contexts on the device may have been affected. It is more serious than Xid 94, the contained form of the same error, and the node should be taken out of service as soon as it appears.

Typical kernel-log signature

NVRM: Xid (PCI:0000:65:00): 95, pid=12345, Uncontained: ECC error
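
To pick this line out of a kernel log automatically, a small parser along these lines may help. It is a minimal sketch: the regular expression is modelled only on the sample line above, the function name `parse_xid` is hypothetical, and the exact message layout can vary by driver version.

```python
import re
from typing import Optional

# Pattern modelled on the sample signature above. Real driver output may
# differ between driver versions, so treat the layout as an assumption.
XID_LINE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<detail>.*)"
)


def parse_xid(line: str) -> Optional[dict]:
    """Return the PCI address, Xid number and trailing detail, or None."""
    match = XID_LINE.search(line)
    if match is None:
        return None
    return {
        "pci": match.group("pci"),
        "xid": int(match.group("xid")),
        "detail": match.group("detail").strip(),
    }


sample = "NVRM: Xid (PCI:0000:65:00): 95, pid=12345, Uncontained: ECC error"
event = parse_xid(sample)
if event is not None and event["xid"] == 95:
    print(f"Uncontained ECC error on GPU at PCI {event['pci']}")
```

Matching on the numeric Xid rather than the free-text description keeps the check stable if the wording of the message changes.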

How to diagnose it

  1. `nvidia-smi -q -d ECC` to confirm the uncorrectable event.
  2. Treat results of any jobs running on the GPU at the time as suspect.
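
The confirmation step can be scripted as well. The sketch below swaps the human-readable `nvidia-smi -q -d ECC` report for the CSV query form of `nvidia-smi`, which is easier to parse; the field names come from `nvidia-smi --help-query-gpu`, while the helper names (`parse_ecc_csv`, `gpus_with_uncorrected`, `query_ecc`) and the sample CSV text are illustrative assumptions, not captured output.

```python
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu`.
QUERY = (
    "index,"
    "ecc.errors.uncorrected.volatile.total,"
    "ecc.errors.uncorrected.aggregate.total"
)


def parse_ecc_csv(text):
    """Map GPU index -> (volatile, aggregate) uncorrected counts.

    Non-numeric values (for example when ECC is unsupported) become None.
    """
    counts = {}
    for row in text.strip().splitlines():
        fields = [field.strip() for field in row.split(",")]
        if len(fields) != 3 or not fields[0].isdigit():
            continue  # skip rows that do not look like data
        index, volatile, aggregate = fields
        counts[int(index)] = tuple(
            int(value) if value.isdigit() else None
            for value in (volatile, aggregate)
        )
    return counts


def gpus_with_uncorrected(counts):
    """Return the sorted GPU indices that report any uncorrected ECC error."""
    return sorted(
        index
        for index, (volatile, aggregate) in counts.items()
        if (volatile or 0) > 0 or (aggregate or 0) > 0
    )


def query_ecc():
    """Run nvidia-smi on the local host and parse its CSV output."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ecc_csv(result.stdout)
```

On a GPU host, `gpus_with_uncorrected(query_ecc())` would list the devices to investigate. Any job that ran on those devices around the time of the Xid still needs to be treated as suspect, whatever the counters say.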

What to do

Drain the node now and reset the GPU. Re-run any work that shared the device. RMA on recurrence.

The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
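
As an illustration of that path, the sketch below builds the command sequence for one node as a dry run. It assumes a Kubernetes-managed fleet (the page itself does not name a scheduler), and the function name `remediation_plan`, the node name and the GPU index are hypothetical. The RMA step has no command because it is normally a ticket with the vendor rather than something run on the host.

```python
import shlex


def remediation_plan(node, gpu_index, recurrence):
    """Return ordered (step, command) pairs; command is None for manual steps."""
    plan = [
        # Stop new work from landing on the node.
        ("cordon", ["kubectl", "cordon", node]),
        # Evict running jobs; DaemonSet pods are left in place.
        ("drain", ["kubectl", "drain", node,
                   "--ignore-daemonsets", "--delete-emptydir-data"]),
        # Reset the GPU so the fault can be cleared or remapped.
        ("reset", ["nvidia-smi", "--gpu-reset", "-i", str(gpu_index)]),
    ]
    if recurrence:
        # The fault came back after an earlier reset: escalate to RMA.
        plan.append(("rma", None))
    return plan


# Dry run: print the commands instead of executing them.
for step, command in remediation_plan("gpu-node-17", 3, recurrence=False):
    rendered = shlex.join(command) if command else "(manual step)"
    print(f"{step}: {rendered}")
```

Keeping the plan as data makes it straightforward to log, review or require approval before anything touches a production node.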

Signal or noise?

Real, urgent signal — the kind triage must always surface, never resolve away.

That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.

Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.