Xid 94: Contained ECC error

An uncorrectable error contained to one application — drain after the job; RMA on recurrence.

What it means

Xid 94 (Ampere and later) is an uncorrectable ECC error that the GPU's error-containment mechanism isolated to the application that touched the bad memory, sparing other workloads. The faulting application is killed; other workloads on the GPU can keep running until the GPU is reset.

Typical kernel-log signature

NVRM: Xid (PCI:0000:65:00): 94, pid=12345, Contained: ECC error
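To pick this signature out of a kernel log, a minimal Python sketch might look like the following. The pattern is modeled only on the sample line above, and the function name `parse_xid` is an illustrative choice; real driver output can vary by version and may carry extra trailing fields, so treat this as a starting point rather than a complete parser.

```python
import re
from typing import Optional

# Pattern modeled on the sample line above (an assumption): other driver
# versions may format the PCI address or the trailing message differently.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+), pid=(?P<pid>[^,]+), (?P<msg>.*)"
)

def parse_xid(line: str) -> Optional[dict]:
    """Return the Xid fields from one kernel-log line, or None if no match."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return {
        "pci": m["pci"],
        "xid": int(m["xid"]),
        # Kept as a string because the pid field is not guaranteed to be numeric.
        "pid": m["pid"].strip(),
        "message": m["msg"].strip(),
    }

sample = "NVRM: Xid (PCI:0000:65:00): 94, pid=12345, Contained: ECC error"
event = parse_xid(sample)
if event is not None and event["xid"] == 94:
    print(f"Contained ECC error on {event['pci']} (pid {event['pid']})")
```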

How to diagnose it

  1. `nvidia-smi -q -d ECC`: confirm the uncorrectable (DBE) count and the contained event.
  2. `nvidia-smi -q -d ROW_REMAPPER`: check remap status.
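For scripting the remap check across a fleet, the CSV query form of `nvidia-smi` is usually easier to parse than the `-q` report. The sketch below assumes the `--query-remapped-rows` option with the fields shown and assumes the pending and failure columns print as `Yes`/`No`; confirm both against `nvidia-smi --help-query-remapped-rows` on your driver. The helper names and the sample row are hypothetical.

```python
import csv
import io
import subprocess

# Field names assumed from nvidia-smi's remapped-rows query; verify locally.
FIELDS = [
    "gpu_bus_id",
    "remapped_rows.uncorrectable",
    "remapped_rows.pending",
    "remapped_rows.failure",
]
QUERY_CMD = [
    "nvidia-smi",
    "--query-remapped-rows=" + ",".join(FIELDS),
    "--format=csv,noheader",
]

def _truthy(value: str) -> bool:
    # Accept a few spellings, since the exact output format is an assumption.
    return value.strip().lower() in {"yes", "1", "true"}

def parse_remap_csv(text: str) -> list:
    """Parse headerless CSV rows into one dict per GPU."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != len(FIELDS):
            continue  # skip blank or malformed lines
        bus_id, uncorrectable, pending, failure = (f.strip() for f in rec)
        rows.append({
            "gpu_bus_id": bus_id,
            "uncorrectable": int(uncorrectable) if uncorrectable.isdigit() else None,
            "pending": _truthy(pending),   # remap waiting for a GPU reset
            "failure": _truthy(failure),   # remap could not be completed
        })
    return rows

def query_remap_status() -> list:
    """Run nvidia-smi on this host and parse its output (not called here)."""
    out = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_remap_csv(out.stdout)

# Hypothetical sample row, for illustration only.
sample_output = "00000000:65:00.0, 1, Yes, No\n"
status = parse_remap_csv(sample_output)
```

A row with `pending` set suggests the remap is waiting on a reset; a row with `failure` set is a stronger signal that the device may need replacing.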

What to do

Drain after the current job and reset the GPU to remap the affected row. RMA if contained errors recur on the same device.

The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work from landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if the fault comes back.
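The decision path above can be sketched as a small planning function. This is an illustration under stated assumptions, not a prescribed policy: the `Action` enum, `plan_actions`, and the recurrence threshold of one prior contained error on the same device are all hypothetical names and values, and the sketch interprets "RMA on recurrence" as replacing the reset step once a fault has recurred.

```python
from enum import Enum
from typing import List

class Action(Enum):
    CORDON = "cordon"  # stop new work from landing
    DRAIN = "drain"    # let running jobs finish or move them off
    RESET = "reset"    # reset the GPU to clear or remap the fault
    RMA = "rma"        # return the device for replacement

# Assumption: a single earlier contained error on the same device
# counts as recurrence. Tune this to your fleet's policy.
RECURRENCE_THRESHOLD = 1

def plan_actions(prior_contained_events: int) -> List[Action]:
    """Return the ordered steps for a new Xid 94 on one GPU.

    prior_contained_events is the number of earlier contained errors
    recorded for the same device (hypothetical bookkeeping).
    """
    steps = [Action.CORDON, Action.DRAIN]
    if prior_contained_events >= RECURRENCE_THRESHOLD:
        steps.append(Action.RMA)
    else:
        steps.append(Action.RESET)
    return steps

first_time = plan_actions(prior_contained_events=0)
recurring = plan_actions(prior_contained_events=1)
print([a.value for a in first_time])
print([a.value for a in recurring])
```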

Signal or noise?

Real signal. Less urgent than Xid 95 (it was contained) but still a hardware fault that needs a reset.

That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.

Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.