Severity: watch (early signal)

Xid 63: ECC row-remap / page-retirement recorded

A memory row remap (or page retirement) was recorded. Not fatal alone, but trend it.

What it means

Xid 63 records that the driver logged a row remap (Ampere and later) or a page retirement (earlier architectures) in response to memory errors. On Ampere and later the row is marked for remapping, and the remap takes effect at the next GPU reset. A single event isn't an emergency, but remapping resources are finite, and a climbing remap count is an early indicator that the device is degrading.

Typical kernel-log signature
NVRM: Xid (PCI:0000:65:00): 63, pid='N/A', Row Remapper: New row marked for remapping
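To count these events per device, the log line has to be turned into structured fields first. The following is a minimal sketch, assuming the line layout matches the sample above; `XidEvent` and `parse_xid_line` are illustrative names, and real driver output may carry additional fields, so the tail is captured as free text.

```python
import re
from typing import NamedTuple, Optional

# Pattern modeled on the sample line above (an assumption, not a driver spec).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<detail>.*)"
)


class XidEvent(NamedTuple):
    pci: str     # PCI address as printed in the log
    xid: int     # numeric Xid code, e.g. 63
    detail: str  # remainder of the message, kept verbatim


def parse_xid_line(line: str) -> Optional[XidEvent]:
    """Return the parsed event, or None if the line is not an Xid report."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return XidEvent(
        pci=match.group("pci"),
        xid=int(match.group("xid")),
        detail=match.group("detail").strip(),
    )
```

Filtering on `event.xid == 63` and grouping by `event.pci` gives a per-GPU event count that can be trended.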

How to diagnose it

  1. `nvidia-smi -q -d ROW_REMAPPER`: read the remapped-row count and remaining capacity.
  2. Track the count over time, not the single event.
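The two steps above can be sketched as a small collector that reads the report and flattens it into fields you can store with a timestamp. The `SAMPLE` layout and its field labels are assumptions based on typical output and can differ by driver version and GPU generation, so check them against your own hosts. `parse_row_remapper` and `read_row_remapper` are illustrative names, and the parser assumes a report for a single GPU.

```python
import subprocess

# Assumed report layout for one GPU; labels may vary by driver version.
SAMPLE = """\
    Remapped Rows
        Correctable Error                 : 3
        Uncorrectable Error               : 1
        Pending                           : Yes
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 636 bank(s)
            High                          : 4 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
"""


def parse_row_remapper(text: str) -> dict:
    """Collect 'label : value' pairs from a single-GPU report into a dict."""
    fields = {}
    for raw in text.splitlines():
        label, sep, value = raw.partition(" : ")
        if not sep:
            continue  # section headers carry no value
        label, value = label.strip(), value.strip()
        first = value.split()[0] if value else ""
        # Keep leading integers as ints ("636 bank(s)" -> 636), else raw text.
        fields[label] = int(first) if first.isdigit() else value
    return fields


def read_row_remapper() -> dict:
    """Run the query from step 1 and parse its output (needs nvidia-smi)."""
    result = subprocess.run(
        ["nvidia-smi", "-q", "-d", "ROW_REMAPPER"],
        capture_output=True, text=True, check=True,
    )
    return parse_row_remapper(result.stdout)
```

Storing the correctable and uncorrectable counts alongside a timestamp on each poll is enough to build the time series that step 2 asks for.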
What to do

Watch. Schedule maintenance (drain at the next window) if remaps are accumulating or remaining resources are low; immediate drain isn't required for an isolated event.

The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.

Signal or noise?

The classic case for trending instead of paging: each event is harmless, but the slope is the signal. Page on the trend, absorb the singletons.
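One way to make "the slope is the signal" concrete is a least-squares slope over recent remap counts. This is a minimal sketch, not a recommended policy: the samples are `(unix_seconds, remap_count)` pairs, and the `page_rate` threshold of one remap per day is a hypothetical value you would tune for your fleet.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, int]  # (unix timestamp in seconds, cumulative remap count)

SECONDS_PER_DAY = 86400.0


def remaps_per_day(samples: Sequence[Sample]) -> float:
    """Least-squares slope of remap count against time, in remaps per day."""
    n = len(samples)
    if n < 2:
        return 0.0  # a single observation has no trend
    mean_t = sum(t for t, _ in samples) / n
    mean_c = sum(c for _, c in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return 0.0  # all samples share one timestamp
    cov = sum((t - mean_t) * (c - mean_c) for t, c in samples)
    return cov / var_t * SECONDS_PER_DAY


def classify(samples: Sequence[Sample], page_rate: float = 1.0) -> str:
    """Return 'page' when the trend meets the (hypothetical) threshold."""
    return "page" if remaps_per_day(samples) >= page_rate else "absorb"
```

A GPU that logged one remap and then stayed flat produces a small slope and is absorbed, while a GPU that gains remaps every day crosses the threshold even though no single event looked alarming.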

That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.

Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.