Severity: drain / RMA

Xid 74: NVLink Error

An NVLink / NVSwitch fault — often hardware, can cascade across a node; isolate and drain.

What it means

Xid 74 reports an error on an NVLink connection or through an NVSwitch. It can stem from a degraded link, a connector/seating issue, or a switch fault, and because NVLink ties GPUs together, one bad link can disrupt multi-GPU jobs across the whole node.

Typical kernel-log signature
NVRM: Xid (PCI:0000:65:00): 74, pid='N/A', NVLink: ... link error
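
To spot this signature in collected kernel logs, a small parser can help. The following is a minimal sketch, assuming log lines follow the shape of the sample above; real driver output varies by version, and the names `XID_RE`, `parse_xid_line`, and `is_nvlink_xid` are hypothetical helpers, not part of any NVIDIA tool.

```python
import re
from typing import Optional

# Pattern modeled on the sample line above (an assumption; formats differ by driver).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<detail>.*)"
)


def parse_xid_line(line: str) -> Optional[dict]:
    """Return the PCI address, Xid code, and trailing detail, or None if no match."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return {
        "pci": match.group("pci"),
        "xid": int(match.group("xid")),
        "detail": match.group("detail").strip(),
    }


def is_nvlink_xid(line: str) -> bool:
    """True when the line parses as an Xid event with code 74."""
    parsed = parse_xid_line(line)
    return parsed is not None and parsed["xid"] == 74
```

Feeding each line of `dmesg` or journal output through `is_nvlink_xid` gives a quick filter; the PCI address in the parsed result identifies which GPU to inspect.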

How to diagnose it

  1. `nvidia-smi nvlink -e` — read per-link error counters.
  2. `nvidia-smi nvlink -s` — check link status/state.
  3. On NVSwitch systems, inspect fabric-manager logs.

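
The first two steps can be scripted so their output is captured alongside the incident. This is a rough sketch, assuming `nvidia-smi` is on the PATH; it stores raw output rather than parsing it, because the text layout differs between driver versions. `COMMANDS`, `run_command`, and `collect_nvlink_diagnostics` are hypothetical names. The fabric-manager step is left out because the log location depends on the installation.

```python
import subprocess
from typing import Callable, Dict, List

# The two nvidia-smi invocations named in the steps above.
COMMANDS: Dict[str, List[str]] = {
    "error_counters": ["nvidia-smi", "nvlink", "-e"],
    "link_status": ["nvidia-smi", "nvlink", "-s"],
}


def run_command(argv: List[str]) -> str:
    """Run a command and return its stdout; launch failures come back as text."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<failed to run {' '.join(argv)}: {exc}>"
    return result.stdout


def collect_nvlink_diagnostics(
    runner: Callable[[List[str]], str] = run_command,
) -> Dict[str, str]:
    """Run each diagnostic command and return its raw output keyed by name."""
    return {name: runner(argv) for name, argv in COMMANDS.items()}
```

The `runner` parameter exists so the collection logic can be exercised on a machine without GPUs by passing a stub in place of `run_command`.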
What to do

Isolate the affected GPU/link and drain the node. Reseat or service the link; persistent NVLink errors after a reset warrant an RMA.

The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
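
The decision path above can be written as a small pure function. This is an illustrative sketch only: the `Action` enum, the `next_action` function, and its boolean inputs are hypothetical, and the `RETURN_TO_SERVICE` outcome is an assumption about what happens when the fault does not recur, which the path above leaves implicit.

```python
from enum import Enum


class Action(Enum):
    CORDON = "cordon"
    DRAIN = "drain"
    RESET = "reset"
    RMA = "rma"
    RETURN_TO_SERVICE = "return_to_service"  # assumed outcome, not stated in the text


def next_action(
    cordoned: bool,
    drained: bool,
    reset_done: bool,
    recurred_after_reset: bool,
) -> Action:
    """Return the next step in the cordon -> drain -> reset -> RMA path."""
    if not cordoned:
        return Action.CORDON  # stop new work landing on the node
    if not drained:
        return Action.DRAIN  # move running jobs off
    if not reset_done:
        return Action.RESET  # try to clear or remap the fault
    # After a reset, only a recurrence escalates to RMA.
    return Action.RMA if recurred_after_reset else Action.RETURN_TO_SERVICE
```

Keeping the function free of side effects means the actual cordon, drain, and reset commands, which depend on the scheduler in use, stay in a separate layer.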

Signal or noise?

Usually real, and it can masquerade as many downstream job failures — exactly the kind of root event transparent triage should collapse into one explained signal.
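
One way to picture that collapse is to attach job failures to a preceding Xid 74 on the same node. The sketch below is a simplified illustration under stated assumptions, not a description of any particular product: the `Event` type, the `kind` labels, and the 600-second window are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    node: str
    timestamp: float  # seconds since some common epoch
    kind: str  # "xid74" or "job_failure" (labels assumed for this sketch)


def collapse_failures(
    events: List[Event], window_s: float = 600.0
) -> Tuple[List[Tuple[Event, List[Event]]], List[Event]]:
    """Group job failures under the first Xid 74 on the same node that
    precedes them by at most window_s; return (groups, unexplained)."""
    roots = sorted(
        (e for e in events if e.kind == "xid74"), key=lambda e: e.timestamp
    )
    groups: List[Tuple[Event, List[Event]]] = [(root, []) for root in roots]
    unexplained: List[Event] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.kind != "job_failure":
            continue
        for root, children in groups:
            same_node = root.node == ev.node
            if same_node and 0 <= ev.timestamp - root.timestamp <= window_s:
                children.append(ev)
                break
        else:
            # No root event explains this failure; keep it as its own signal.
            unexplained.append(ev)
    return groups, unexplained
```

Each group then reads as one root event with its downstream failures attached, while failures on other nodes or outside the window stay separate.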

That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.

Related: Xid 79

Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.