
Xid 64: ECC row-remap / page-retirement failure

A remap or page-retirement attempt failed — treat as failing memory and drain.

What it means

Xid 64 means the GPU tried to remap a row (or retire a page) and the operation failed. Unlike Xid 63, this is not a successful recovery — the device could not protect itself, so the memory should be considered unreliable.

Typical kernel-log signature

NVRM: Xid (PCI:0000:65:00): 64, pid='N/A', Row Remapper: Failed to remap row
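
To spot this signature in a saved kernel log, a small filter along the following lines may help. It is a minimal sketch: the regular expression is modelled only on the sample line above, and the names `XID_RE`, `RELATED_XIDS`, and `find_xids` are hypothetical. Driver versions can format Xid lines differently, so adjust the pattern to what your logs actually contain.

```python
import re

# Pattern modelled on the sample line above (an assumption, not a spec).
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")

# Xid 64 plus the companion Xids this page says to look for.
RELATED_XIDS = {48, 64, 94, 95}


def find_xids(log_text):
    """Return (pci_address, xid) pairs for the Xids of interest, in log order."""
    hits = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match is None:
            continue
        xid = int(match.group("xid"))
        if xid in RELATED_XIDS:
            hits.append((match.group("pci"), xid))
    return hits
```

Feeding it the text of `dmesg` or a journal export gives one tuple per matching line, which is enough to see whether Xid 64 arrived alone or alongside other ECC events on the same PCI address.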

How to diagnose it

  1. Run `nvidia-smi -q -d ROW_REMAPPER` and check for a remap failure flag and pending remaps.
  2. Look for accompanying Xid 48, 94, or 95 (the ECC-error Xids) in the same kernel log.
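
The first diagnostic step can be scripted. The sketch below assumes the report prints `Pending` and `Remapping Failure Occurred` as `Key : Yes/No` lines, which is how recent drivers lay out the row-remapper section; verify the labels against your own driver's output before relying on it. The function names are hypothetical.

```python
import subprocess


def parse_row_remapper(report):
    """Extract the pending and failure flags from ROW_REMAPPER report text.

    Flags are aggregated across all GPUs in the report: either flag is True
    if any GPU reports "Yes" for it.
    """
    flags = {"pending": False, "failure": False}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # section headers have no "key : value" shape
        key = key.strip()
        value = value.strip().lower()
        # Field labels below are assumed from typical nvidia-smi output.
        if key == "Pending" and value == "yes":
            flags["pending"] = True
        elif key == "Remapping Failure Occurred" and value == "yes":
            flags["failure"] = True
    return flags


def query_row_remapper():
    """Run nvidia-smi on this host and parse its row-remapper report."""
    result = subprocess.run(
        ["nvidia-smi", "-q", "-d", "ROW_REMAPPER"],
        capture_output=True, text=True, check=True,
    )
    return parse_row_remapper(result.stdout)
```

A `failure` value of `True` matches the condition this page describes; `pending` alone usually means a remap is waiting for a GPU reset.
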
What to do

Drain the node. The GPU's self-protection has failed; plan an RMA if it persists after a reset.

The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
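
As a rough illustration of that path, the sketch below builds the commands for each stage without running them. It assumes a Kubernetes-managed node (hence `kubectl cordon` and `kubectl drain`) and a GPU reset through `nvidia-smi --gpu-reset`; the function names, the `"monitor"` outcome, and the choice of drain flags are assumptions for illustration, not part of any official runbook.

```python
def remediation_steps(node, gpu_index):
    """Return the ordered (stage, command) pairs for a hardware-class Xid.

    Commands are returned as argument lists so a caller can review or log
    them before executing anything.
    """
    return [
        # Stop new work from landing on the node.
        ("cordon", ["kubectl", "cordon", node]),
        # Evict running jobs; DaemonSet pods are left in place.
        ("drain", ["kubectl", "drain", node, "--ignore-daemonsets"]),
        # Reset the GPU to clear or remap the fault.
        ("reset", ["nvidia-smi", "--gpu-reset", "-i", str(gpu_index)]),
    ]


def next_action(xid_seen_after_reset):
    """Decide what follows the reset: RMA on recurrence, otherwise keep watching."""
    return "rma" if xid_seen_after_reset else "monitor"
```

For example, `remediation_steps("gpu-node-01", 0)` yields the cordon, drain, and reset commands in order, and `next_action(True)` returns `"rma"` once the Xid reappears after the reset.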

Signal or noise?

Real signal. Rare, and consequential — must not be absorbed.

That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.

Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.