An illegal memory access by the workload (MMU fault) — almost always an application or framework bug.
Xid 31 is an MMU fault: the application tried to read or write memory it didn't have a valid mapping for — a bad pointer, an out-of-bounds access, or a use-after-free in the CUDA code or framework. The GPU itself is typically healthy.
NVRM: Xid (PCI:0000:65:00): 31, pid=12345, MMU Fault ... FAULT_PDE ACCESS_TYPE_VIRT_READ
Fix the workload. Only investigate hardware if the same GPU faults across many unrelated jobs.
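One way to apply the "same GPU across many unrelated jobs" test is to count distinct processes per GPU in the kernel log. The sketch below is illustrative only: the regular expression is modelled on the sample line above and may need adjusting for other driver versions, distinct pids are used as a rough stand-in for "unrelated jobs", and both the function names and the threshold of 5 are hypothetical.

```python
import re
from collections import defaultdict

# Matches the prefix of the sample line: PCI address, Xid code, optional numeric pid.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+)(?:, pid=(\d+))?")

def parse_xid(line):
    """Return (pci_address, xid_code, pid_or_None), or None if the line is not an Xid record."""
    m = XID_RE.search(line)
    if m is None:
        return None
    pci, code, pid = m.groups()
    return pci, int(code), int(pid) if pid is not None else None

def gpus_worth_a_hardware_look(lines, min_distinct_pids=5):
    """PCI addresses whose Xid 31 faults span at least `min_distinct_pids` distinct pids.

    The threshold is an assumed value, not a vendor recommendation.
    """
    pids_by_gpu = defaultdict(set)
    for line in lines:
        rec = parse_xid(line)
        if rec is None:
            continue
        pci, code, pid = rec
        if code == 31 and pid is not None:
            pids_by_gpu[pci].add(pid)
    return sorted(pci for pci, pids in pids_by_gpu.items()
                  if len(pids) >= min_distinct_pids)
```

A GPU that shows up here is only a candidate for a closer look: one buggy job that restarts repeatedly also produces many pids, so mapping pids back to scheduler job IDs would be a stronger signal where that data is available.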
The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
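The decision path can be written down as a small state machine. This is a minimal sketch under stated assumptions: `Step` and `next_step` are hypothetical names, and returning `None` once the reset has held is an added convention, not something the path above specifies.

```python
from enum import Enum

class Step(Enum):
    CORDON = "cordon"  # stop new work landing on the GPU
    DRAIN = "drain"    # let running jobs finish or move them off
    RESET = "reset"    # reset the GPU
    RMA = "rma"        # return the part if the fault comes back

def next_step(last_done, fault_recurred=False):
    """Next action for a GPU with a hardware-class Xid.

    `last_done` is the last completed Step, or None if nothing has been done yet.
    """
    if last_done is None:
        return Step.CORDON
    if last_done is Step.CORDON:
        return Step.DRAIN
    if last_done is Step.DRAIN:
        return Step.RESET
    if last_done is Step.RESET and fault_recurred:
        return Step.RMA
    return None  # reset held, or the GPU is already in RMA: nothing further on this path
```

Keeping the sequence in one function makes the ordering explicit: RMA is only reachable after a reset has been tried and the fault has recurred.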
Noise for fleet health — it's a code bug surfaced as a GPU error. Absorb it unless it co-occurs with ECC/bus faults.
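The "absorb unless it co-occurs" rule can be sketched as a simple check against other Xid events on the same GPU. Everything here is an assumption for illustration: the set of codes treated as ECC/bus faults (48, 79, 94, 95), the 10-minute window, and the function name. Check the codes against NVIDIA's Xid table for your driver and GPU generation before relying on them.

```python
# Assumed ECC/bus-class Xid codes; adjust to your driver and GPU generation.
HARDWARE_XIDS = {48, 79, 94, 95}

def triage_xid31(fault_time, other_events, window_s=600):
    """Decide whether an Xid 31 should be absorbed as noise or surfaced.

    `fault_time` is the Xid 31 timestamp in seconds.
    `other_events` is an iterable of (timestamp_seconds, xid_code) on the same GPU.
    `window_s` is an assumed co-occurrence window, not a vendor value.
    """
    for ts, code in other_events:
        if code in HARDWARE_XIDS and abs(ts - fault_time) <= window_s:
            return "surface"
    return "absorb"
```

For example, an Xid 31 at t=1000 with an Xid 79 at t=1200 on the same GPU would be surfaced, while the same Xid 31 with no nearby ECC or bus events would be absorbed.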
That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.
Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.