An uncorrectable error that escaped containment — may have corrupted other work; drain now.
Xid 95 (Ampere and later) is an uncorrectable ECC error that the GPU could not contain to a single application, so other contexts on the device may have been affected. It is more serious than Xid 94, the contained form of the same error, and demands immediate attention.
NVRM: Xid (PCI:0000:65:00): 95, pid=12345, Uncontained: ECC error
Drain the node now and reset the GPU. Re-run any work that shared the device. RMA on recurrence.
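To act on a line like the sample above automatically, it first has to be parsed. The following is a minimal sketch, assuming kernel log lines follow the shape of that sample; real driver output can carry extra fields, so the pattern is deliberately loose. The names `XidEvent`, `parse_xid`, and `is_uncontained_ecc` are hypothetical helpers for illustration, not part of any NVIDIA or Plexus API.

```python
import re
from typing import NamedTuple, Optional

# Pattern modeled on the sample log line only (assumption). Lines that do
# not match are reported as None rather than guessed at.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9A-Fa-f:.]+)\): (?P<xid>\d+),"
    r"(?: pid=(?P<pid>[^,]+),)?"
    r"\s*(?P<detail>.*)"
)


class XidEvent(NamedTuple):
    pci: str            # PCI address of the reporting GPU
    xid: int            # numeric Xid code
    pid: Optional[str]  # kept as text; the field may be absent
    detail: str         # free-form message after the fixed fields


def parse_xid(line: str) -> Optional[XidEvent]:
    """Extract the Xid fields from one kernel log line, or return None."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return XidEvent(m["pci"], int(m["xid"]), m["pid"], m["detail"].strip())


def is_uncontained_ecc(event: XidEvent) -> bool:
    """True for Xid 95, the uncontained ECC error described here."""
    return event.xid == 95
```

Matching on the numeric code rather than the message text keeps the check stable if the wording of the detail field differs between driver versions.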
The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
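The decision path above can be sketched as a small planning function. This is an illustration under stated assumptions, not a definitive runbook: the step names are labels only, `plan_actions` and `rma_threshold` are hypothetical, and the choice to go to RMA instead of another reset on recurrence is one reading of "RMA if it comes back".

```python
from typing import List


def plan_actions(prior_occurrences: int, rma_threshold: int = 1) -> List[str]:
    """Return the ordered steps for a hardware-class Xid on one GPU.

    prior_occurrences: how many times this GPU has already hit the fault.
    rma_threshold: hypothetical policy knob; at or above this many prior
        occurrences the fault counts as a recurrence.
    """
    if prior_occurrences < 0:
        raise ValueError("prior_occurrences must be non-negative")

    # Always stop new work from landing, then move running jobs away.
    steps = ["cordon", "drain"]

    if prior_occurrences >= rma_threshold:
        # The fault came back after an earlier reset: replace the hardware.
        steps.append("rma")
    else:
        # First occurrence: a reset may clear or remap the fault.
        steps.append("reset")
    return steps
```

For example, `plan_actions(0)` yields `["cordon", "drain", "reset"]`, and `plan_actions(1)` yields `["cordon", "drain", "rma"]`. Keeping the plan as plain data leaves the mapping to actual scheduler or driver commands to whatever tooling the fleet already uses.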
Real, urgent signal — the kind triage must always surface, never resolve away.
That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.
Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.