A memory row was successfully remapped (or page retired) — not fatal alone, but trend it.
Xid 63 records that the GPU remapped a memory row (Ampere and later) or retired a page (earlier architectures) in response to memory errors. On Ampere and later the row is marked for remapping, and the remap takes effect at the next GPU reset. A single event isn't an emergency, but remapping resources are finite, and a climbing remap count is an early indicator that the device is degrading.
NVRM: Xid (PCI:0000:65:00): 63, pid='N/A', Row Remapper: New row marked for remapping
Watch. Schedule maintenance (drain at the next window) if remaps are accumulating or remaining resources are low; immediate drain isn't required for an isolated event.
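To count remap events, the sample log line above first has to be turned into structured data. The sketch below is one way to do that in Python. The regular expression is written against that single sample line only, and the names `XID_RE`, `XidEvent`, and `parse_xid` are hypothetical; other driver versions may format the message differently, so check the pattern against your own kernel logs.

```python
import re
from typing import NamedTuple, Optional

# Assumption: the layout matches the sample line shown above. Other driver
# versions may add or reorder fields, so this pattern is illustrative only.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<detail>.*)"
)


class XidEvent(NamedTuple):
    pci: str     # PCI address of the GPU, as printed in the log
    xid: int     # numeric Xid code
    detail: str  # remainder of the message, left unparsed


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return an XidEvent for an NVRM Xid log line, or None if it does not match."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return XidEvent(m["pci"], int(m["xid"]), m["detail"])


sample = (
    "NVRM: Xid (PCI:0000:65:00): 63, pid='N/A', "
    "Row Remapper: New row marked for remapping"
)
event = parse_xid(sample)
```

Keeping `detail` as a raw string avoids guessing at fields that the sample does not show.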
The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
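The decision path above can be written down as an ordered plan. This is a minimal Python sketch of the ordering only; `Step` and `plan_for_gpu` are hypothetical names, and the actual cordon, drain, and reset commands depend on your scheduler and tooling, so none are shown here.

```python
from enum import Enum
from typing import List


class Step(Enum):
    CORDON = "cordon"  # stop new work landing on the node
    DRAIN = "drain"    # let running jobs finish or move them off
    RESET = "reset"    # reset the GPU to clear or remap the fault
    RMA = "rma"        # return the part if the fault comes back


def plan_for_gpu(recurred_after_reset: bool) -> List[Step]:
    """Return the ordered steps; RMA is added only when the fault recurred."""
    steps = [Step.CORDON, Step.DRAIN, Step.RESET]
    if recurred_after_reset:
        steps.append(Step.RMA)
    return steps
```

For a first occurrence, `plan_for_gpu(False)` stops at the reset; passing `True` appends the RMA step.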
The classic case for trending instead of paging: each event is harmless, but the slope is the signal. Page on the trend, absorb the singletons.
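One simple way to approximate "page on the trend, absorb the singletons" is to count Xid 63 events per GPU in a trailing window and alert only when the count crosses a threshold. The Python sketch below does that; the function name `should_page`, the 7-day window, and the threshold of 3 are illustrative assumptions, not recommended values.

```python
from datetime import datetime, timedelta
from typing import Iterable


def should_page(
    event_times: Iterable[datetime],
    now: datetime,
    window: timedelta = timedelta(days=7),  # assumed window, tune per fleet
    threshold: int = 3,                     # assumed threshold, tune per fleet
) -> bool:
    """Return True when at least `threshold` events fall inside the trailing window."""
    recent = [t for t in event_times if now - window <= t <= now]
    return len(recent) >= threshold


now = datetime(2024, 1, 31, 12, 0)

# A single event inside the window is absorbed.
isolated = [now - timedelta(days=2)]

# Three events inside the same window count as a trend.
accumulating = [now - timedelta(days=d) for d in (1, 3, 6)]

page_isolated = should_page(isolated, now)
page_accumulating = should_page(accumulating, now)
```

A windowed count is cruder than fitting a slope to the cumulative remap counter, but it is easy to reason about and maps directly onto a windowed alert rule.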
That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.
Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.