
Xid 31: GPU memory page fault

An illegal memory access by the workload (MMU fault) — almost always an application or framework bug.

What it means

Xid 31 is an MMU fault: the application tried to read or write memory it didn't have a valid mapping for — a bad pointer, an out-of-bounds access, or a use-after-free in the CUDA code or framework. The GPU itself is typically healthy.

Typical kernel-log signature
NVRM: Xid (PCI:0000:65:00): 31, pid=12345, MMU Fault ... FAULT_PDE ACCESS_TYPE_VIRT_READ
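
To work with this signature programmatically, a minimal sketch of a parser is shown here. It assumes the line starts with the same `NVRM: Xid (PCI:...): <code>, pid=<pid>` prefix as the example above and that the pid is numeric; other driver versions may format the line differently. The names `XidEvent` and `parse_xid_line` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Matches only the prefix shown in the sample line; the trailing fault
# details are left unparsed because their layout varies.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), pid=(?P<pid>\d+)"
)


class XidEvent(NamedTuple):
    pci: str   # PCI address of the GPU that reported the fault
    xid: int   # Xid code, 31 for an MMU fault
    pid: int   # process ID named in the log line


def parse_xid_line(line: str) -> Optional[XidEvent]:
    """Return the PCI address, Xid code and pid from one kernel-log line, or None."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return XidEvent(match.group("pci"), int(match.group("xid")), int(match.group("pid")))
```

The extracted pid is the starting point for the diagnosis steps: it identifies the process or container to look at first.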
How to diagnose it
  1. Identify the offending process/container from the pid in the log.
  2. Reproduce under `compute-sanitizer` to find the illegal access in the code.
  3. Confirm it follows the workload, not the GPU.
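
For step 2, one way to reproduce the fault is to wrap the workload's command line with `compute-sanitizer` and its `memcheck` tool. The sketch below only builds the command; the helper name `sanitizer_command`, the log file name, and the `train.py` workload are hypothetical, and it assumes `compute-sanitizer` from the CUDA toolkit is on the `PATH`.

```python
import shlex
from typing import List


def sanitizer_command(app_argv: List[str], log_file: str = "sanitizer.log") -> List[str]:
    """Prefix an application command line with compute-sanitizer's memcheck tool."""
    return [
        "compute-sanitizer",
        "--tool", "memcheck",       # memcheck reports illegal memory accesses
        "--log-file", log_file,     # write the report to a file instead of stdout
        "--error-exitcode", "1",    # exit non-zero if the tool reports an error
        *app_argv,
    ]


if __name__ == "__main__":
    # "train.py" stands in for whatever workload triggered the Xid.
    print(shlex.join(sanitizer_command(["python", "train.py"])))
```

The returned list can be passed to `subprocess.run` on a host with the GPU. Expect the run to be much slower than normal, so a reduced input that still triggers the fault is usually worth preparing.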
What to do

Fix the workload. Only investigate hardware if the same GPU faults across many unrelated jobs.
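
The "same GPU, many unrelated jobs" rule can be sketched as a simple count. This is an illustration under stated assumptions: distinct job IDs stand in for "unrelated jobs", the threshold of three is arbitrary, and `gpus_to_investigate` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def gpus_to_investigate(
    faults: Iterable[Tuple[str, str]],   # one (gpu_id, job_id) pair per Xid 31 event
    min_distinct_jobs: int = 3,          # assumed threshold, tune per fleet
) -> List[str]:
    """Return GPUs whose Xid 31 events span at least min_distinct_jobs different jobs."""
    jobs_per_gpu: Dict[str, Set[str]] = defaultdict(set)
    for gpu_id, job_id in faults:
        jobs_per_gpu[gpu_id].add(job_id)
    return sorted(
        gpu for gpu, jobs in jobs_per_gpu.items() if len(jobs) >= min_distinct_jobs
    )
```

A GPU that faults repeatedly under one job does not appear in the result, which matches the guidance that the fault then follows the workload rather than the hardware.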

If the hardware does turn out to be implicated, follow the datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon the node to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if the fault comes back.

Signal or noise?

Noise for fleet health — it's a code bug surfaced as a GPU error. Absorb it unless it co-occurs with ECC/bus faults.
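
The co-occurrence rule can be sketched as a filter over a stream of Xid events. The set of hardware-class codes used here, 48 (double-bit ECC error) and 79 (GPU has fallen off the bus), is an illustrative assumption and not a complete list, and the one-hour window and the names `Event` and `xid31_worth_surfacing` are likewise hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    gpu: str    # GPU identifier, e.g. a PCI address
    xid: int    # Xid code
    ts: float   # event time in seconds


# Assumed examples of ECC/bus Xids; check NVIDIA's Xid table for your driver.
HARDWARE_XIDS = frozenset({48, 79})


def xid31_worth_surfacing(events: Iterable[Event], window_s: float = 3600.0) -> List[Event]:
    """Keep only Xid 31 events with a hardware-class Xid on the same GPU within window_s."""
    all_events = list(events)
    hardware = [e for e in all_events if e.xid in HARDWARE_XIDS]
    return [
        e for e in all_events
        if e.xid == 31
        and any(h.gpu == e.gpu and abs(h.ts - e.ts) <= window_s for h in hardware)
    ]
```

Every Xid 31 that the filter drops is treated as an application bug and absorbed; the few it keeps are the ones worth a closer look at the GPU.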

That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.

Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.