An uncorrectable error contained to one application — drain after the job; RMA on recurrence.
Xid 94 (Ampere and later) is an uncorrectable ECC error that the GPU's error containment isolated to the application that touched the bad memory, sparing other workloads. The faulting application is killed; other workloads can keep running until the GPU is reset.
NVRM: Xid (PCI:0000:65:00): 94, pid=12345, Contained: ECC error
Drain after the current job and reset the GPU to remap the affected row. RMA if contained errors recur on the same device.
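To act on a line like the one above, a log watcher first has to pull out the PCI address, Xid code, and pid. The following is a minimal sketch, not a definitive parser: the regular expression is modeled only on the sample line in this entry, and real driver output can differ between driver versions. `XidEvent` and `parse_xid` are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumption: the pattern mirrors the sample line in this entry. Real NVRM
# output varies by driver version, so validate against your own kernel logs.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>[^,]+), (?P<message>.*)"
)


class XidEvent(NamedTuple):
    """Hypothetical container for the fields of one Xid log line."""
    pci: str       # PCI address as printed by the driver
    xid: int       # numeric Xid code, e.g. 94
    pid: str       # kept as text, since it may not always be numeric
    message: str   # free-text remainder of the line


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the parsed event, or None if the line is not an Xid report."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return XidEvent(
        pci=m["pci"],
        xid=int(m["xid"]),
        pid=m["pid"].strip(),
        message=m["message"].strip(),
    )


if __name__ == "__main__":
    sample = "NVRM: Xid (PCI:0000:65:00): 94, pid=12345, Contained: ECC error"
    print(parse_xid(sample))
```

Keeping `pid` as a string is a deliberate choice in this sketch, so a line with a non-numeric pid field is still parsed rather than dropped.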
The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
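The decision path above can be sketched as a small planning function. This is an illustration under stated assumptions, not a scheduler integration: `GpuState`, `plan_actions`, and the `recurrence_threshold` parameter are hypothetical, and the source only says "on recurrence" without fixing a count.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuState:
    """Hypothetical snapshot of one GPU at the time a hardware-class Xid fires."""
    prior_events: int   # earlier occurrences of the same fault on this device
    jobs_running: bool  # whether work is still scheduled on the GPU


def plan_actions(state: GpuState, recurrence_threshold: int = 1) -> List[str]:
    """Return the ordered steps: cordon, drain, reset, then RMA on recurrence.

    Assumption: recurrence means at least `recurrence_threshold` prior events
    on the same device; pick the value that matches your own RMA policy.
    """
    actions = ["cordon"]            # stop new work landing on the GPU
    if state.jobs_running:
        actions.append("drain")     # let running jobs finish or move them off
    actions.append("reset")         # clear or remap the fault
    if state.prior_events >= recurrence_threshold:
        actions.append("rma")       # the fault came back: replace the device
    return actions


if __name__ == "__main__":
    print(plan_actions(GpuState(prior_events=0, jobs_running=True)))
    print(plan_actions(GpuState(prior_events=2, jobs_running=False)))
```

Returning a plain ordered list keeps the policy separate from whatever tooling actually cordons nodes or resets GPUs, so the same logic can be unit-tested without hardware.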
Real signal. Less urgent than Xid 95 (it was contained) but still a hardware fault that needs a reset.
That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.
Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.