Xid 48: Double Bit ECC Error (DBE)

An uncorrectable double-bit memory error — drain the node, RMA on recurrence.

What it means

Xid 48 is a double-bit ECC error: two bit flips in the same word that ECC can detect but cannot correct. It is a genuine hardware fault. The affected memory page is retired (or row remapped on Ampere and later), and any work that touched it is suspect.
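
To make "detect but cannot correct" concrete, here is a toy single-error-correct, double-error-detect (SECDED) code over 4 data bits. It is a minimal sketch for illustration only: real GPU memory ECC protects much wider words, and the names `encode` and `check` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds an overall parity bit; indices 1..7 form a Hamming(7,4)
    code with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1s even
    return code


def check(code):
    """Classify a received codeword as 'ok', 'single' or 'double'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall_ok = sum(code) % 2 == 0
    if syndrome == 0 and overall_ok:
        return "ok"
    if not overall_ok:
        return "single"  # odd number of flips: the syndrome locates the bit
    return "double"      # even number of flips, nonzero syndrome: detect only


word = encode(0b1011)
one_flip = word[:]
one_flip[5] ^= 1
two_flips = one_flip[:]
two_flips[6] ^= 1
print(check(word), check(one_flip), check(two_flips))
```

With two flips the overall parity looks clean while the syndrome is nonzero, so the decoder knows the word is bad but has no way to tell which bits changed. That is the situation Xid 48 reports.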

Typical kernel-log signature

NVRM: Xid (PCI:0000:65:00): 48, pid='N/A', Double Bit ECC
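
A line in this shape can be matched with a small regular expression. The sketch below assumes the log format shown above; the exact fields after the Xid number vary by driver version, so it only extracts the PCI address and the Xid number. `XID_LINE` and `parse_xid` are hypothetical names.

```python
import re

# Matches the "NVRM: Xid (PCI:<address>): <number>," prefix of a kernel-log line.
XID_LINE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
)


def parse_xid(line):
    """Return (pci_address, xid_number), or None if the line is not an Xid line."""
    match = XID_LINE.search(line)
    if match is None:
        return None
    return match.group("pci"), int(match.group("xid"))


sample = "NVRM: Xid (PCI:0000:65:00): 48, pid='N/A', Double Bit ECC"
print(parse_xid(sample))
```

Feeding it the output of the kernel log (for example from `dmesg`) and keeping only results whose number is 48 gives the list of affected PCI addresses.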

How to diagnose it

  1. `nvidia-smi -q -d ECC` — check aggregate double-bit (uncorrectable) counts.
  2. `nvidia-smi -q -d ROW_REMAPPER` (Ampere+) — confirm remapping status and remaining resources.
  3. Note whether it's a one-off or recurring on the same GPU.
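
For scripted checks, the uncorrectable counters can also be read in CSV form. This is a sketch under assumptions: it relies on the `ecc.errors.uncorrected.aggregate.total` and `ecc.errors.uncorrected.volatile.total` query fields, whose availability depends on driver version and GPU (run `nvidia-smi --help-query-gpu` to confirm), and the helper names are hypothetical. The parser is kept separate from the command so it can be exercised without a GPU.

```python
import csv
import subprocess

QUERY = (
    "index,"
    "ecc.errors.uncorrected.aggregate.total,"
    "ecc.errors.uncorrected.volatile.total"
)


def parse_ecc_csv(text):
    """Parse 'index, aggregate, volatile' rows into {index: {...}}.

    Non-numeric values such as '[N/A]' (ECC disabled or unsupported)
    become None instead of raising.
    """
    rows = {}
    for rec in csv.reader(text.strip().splitlines(), skipinitialspace=True):
        if len(rec) != 3:
            continue
        index, aggregate, volatile = (field.strip() for field in rec)
        rows[int(index)] = {
            "aggregate": int(aggregate) if aggregate.isdigit() else None,
            "volatile": int(volatile) if volatile.isdigit() else None,
        }
    return rows


def query_ecc():
    """Run nvidia-smi and return the parsed uncorrectable-error counters."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_ecc_csv(out)
```

A GPU whose aggregate count keeps rising across resets is the recurring case that step 3 asks about.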

What to do

Drain the node and reset the GPU so that the pending page retirement (or row remap, on Ampere and later) takes effect. If double-bit errors recur on the same GPU, or remapping resources are exhausted, schedule an RMA.

The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
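
The decision path can be written down as a small pure function. This is a sketch, not a prescribed policy: the "more than one event means recurrence" threshold and the names `next_actions`, `dbe_events_on_gpu` and `remap_failed` are assumptions made for illustration.

```python
def next_actions(dbe_events_on_gpu, remap_failed=False):
    """Return the ordered actions for a GPU given its Xid 48 history.

    dbe_events_on_gpu: number of Xid 48 events seen on this GPU so far,
                       including the current one.
    remap_failed:      True if remapping resources are exhausted or a
                       remap has failed.
    """
    if dbe_events_on_gpu < 1:
        return []
    actions = ["cordon", "drain", "reset"]
    if dbe_events_on_gpu > 1 or remap_failed:
        actions.append("rma")
    return actions


print(next_actions(1))                     # first occurrence
print(next_actions(2))                     # recurrence on the same GPU
print(next_actions(1, remap_failed=True))  # remapping resources exhausted
```

How each action is carried out depends on the scheduler, which the text does not name. On Kubernetes, for instance, cordon and drain would typically map to `kubectl cordon <node>` and `kubectl drain <node> --ignore-daemonsets`, and the reset to `nvidia-smi --gpu-reset -i <index>`.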

Signal or noise?

This is real signal — exactly the kind that must survive the noise. The risk isn't false alarms; it's a single Xid 48 getting lost among hundreds of benign Xids.

That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.
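
As a generic illustration of surfacing the few events that matter (not a description of how Plexus or any other product works), the sketch below counts hardware-class Xids per GPU and tallies everything else as suppressed. The default set containing only 48, the function name `surface`, and the other Xid numbers in the sample are assumptions; which Xids a fleet treats as benign is its own judgment.

```python
from collections import Counter


def surface(events, hardware_xids=frozenset({48})):
    """Split (pci_address, xid) events into alerts and a suppressed count.

    Returns ({pci_address: hardware_event_count}, suppressed_count).
    """
    alerts = Counter()
    suppressed = 0
    for pci, xid in events:
        if xid in hardware_xids:
            alerts[pci] += 1
        else:
            suppressed += 1
    return dict(alerts), suppressed


# Hypothetical sample: one Xid 48 buried among 250 other events.
events = (
    [("0000:65:00", 13)] * 200
    + [("0000:65:00", 48)]
    + [("0000:17:00", 31)] * 50
)
print(surface(events))
```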

Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.