An NVLink / NVSwitch fault — often hardware, can cascade across a node; isolate and drain.
Xid 74 reports an error on an NVLink connection or through an NVSwitch. It can stem from a degraded link, a connector/seating issue, or a switch fault, and because NVLink ties GPUs together, one bad link can disrupt multi-GPU jobs across the whole node.
NVRM: Xid (PCI:0000:65:00): 74, pid='N/A', NVLink: ... link error
Isolate the affected GPU/link and drain the node. Reseat or service the link; persistent NVLink errors after a reset warrant an RMA.
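A log line of the shape shown above can be picked out of kernel logs with a small parser. The following is a minimal sketch, assuming the `NVRM: Xid (PCI:...): <number>, <detail>` layout of the sample line; the exact message text varies by driver version, and the names `XID_RE`, `XidEvent`, and `parse_xid` are hypothetical helpers, not part of any NVIDIA tool.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, taken from the sample line: "NVRM: Xid (PCI:<bus id>): <xid>, <detail>"
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9A-Fa-f:.]+)\): (?P<xid>\d+), (?P<detail>.*)"
)


class XidEvent(NamedTuple):
    pci: str     # PCI bus id of the reporting GPU
    xid: int     # Xid number, e.g. 74 for NVLink errors
    detail: str  # remainder of the message, kept verbatim


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the parsed Xid event, or None if the line is not an Xid report."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return XidEvent(
        pci=match.group("pci"),
        xid=int(match.group("xid")),
        detail=match.group("detail").strip(),
    )


sample = "NVRM: Xid (PCI:0000:65:00): 74, pid='N/A', NVLink: ... link error"
event = parse_xid(sample)
```

Here `event.xid == 74` identifies the line as an NVLink error and `event.pci` names the GPU to isolate. The detail string is left unparsed because its contents differ between driver releases.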
The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
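The decision path above can be written down as an ordered plan. This is a minimal sketch, not a definitive runbook: the `Step` enum and `plan` function are hypothetical names, and the `RETURN_TO_SERVICE` outcome for a fault that does not recur is an assumption added to complete the branch; the source only states the RMA side.

```python
from enum import Enum
from typing import List


class Step(Enum):
    CORDON = "cordon"    # stop new work landing on the node
    DRAIN = "drain"      # move running jobs off the node
    RESET = "reset"      # reset the GPU to clear or remap the fault
    RMA = "rma"          # replace hardware when the fault comes back
    RETURN_TO_SERVICE = "return_to_service"  # assumption: not named in the source


def plan(recurs_after_reset: bool) -> List[Step]:
    """Ordered steps for a hardware-class Xid: cordon, drain, reset, then decide."""
    steps = [Step.CORDON, Step.DRAIN, Step.RESET]
    # The final step depends only on whether the Xid recurs after the reset.
    steps.append(Step.RMA if recurs_after_reset else Step.RETURN_TO_SERVICE)
    return steps


steps_for_recurring_fault = [s.value for s in plan(recurs_after_reset=True)]
```

How each step is carried out depends on the scheduler in use. On Kubernetes, for example, the first two steps usually map to `kubectl cordon` and `kubectl drain`.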
An Xid 74 is usually a real fault, and it can masquerade as many downstream job failures: exactly the kind of root event that transparent triage should collapse into one explained signal.
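To illustrate what collapsing downstream failures into one root event can look like, here is a minimal sketch that attaches job failures to the most recent Xid 74 on the same node within a time window. It is an assumed heuristic for illustration only, not a description of how any particular tool works; the `Event` type, the `collapse` function, and the 300-second window are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    node: str
    timestamp: float  # seconds since epoch
    kind: str         # "xid74" or "job_failure"


def collapse(events: List[Event], window_s: float = 300.0) -> Tuple[List[dict], List[Event]]:
    """Group job failures under the latest Xid 74 on the same node within window_s."""
    incidents: List[dict] = []
    latest_root: Dict[str, int] = {}  # node -> index of its most recent incident
    unexplained: List[Event] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.kind == "xid74":
            incidents.append({"root": ev, "symptoms": []})
            latest_root[ev.node] = len(incidents) - 1
            continue
        idx = latest_root.get(ev.node)
        if idx is not None and ev.timestamp - incidents[idx]["root"].timestamp <= window_s:
            incidents[idx]["symptoms"].append(ev)  # explained by the root event
        else:
            unexplained.append(ev)  # other node, or too long after the Xid
    return incidents, unexplained


events = [
    Event("node-a", 100.0, "xid74"),
    Event("node-a", 130.0, "job_failure"),
    Event("node-a", 160.0, "job_failure"),
    Event("node-b", 140.0, "job_failure"),
    Event("node-a", 1000.0, "job_failure"),
]
incidents, unexplained = collapse(events)
```

In this example the two failures on `node-a` shortly after the Xid fold into a single incident, while the failure on `node-b` and the much later failure on `node-a` stay separate for their own triage.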
That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.
Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.