The GPU is no longer reachable on PCIe — a serious hardware fault; drain immediately.
Xid 79 means the driver can no longer communicate with the GPU over PCIe — it has effectively disappeared from the system. It is one of the most serious Xids and frequently points to a power delivery problem, overheating, or a physical/seating fault. The GPU is unusable until recovered.
NVRM: Xid (PCI:0000:65:00): 79, pid='N/A', GPU has fallen off the bus
Drain the node immediately; it's effectively down. Power-cycle; reseat the card and power connectors if it recurs. Repeated Xid 79 on the same GPU is an RMA. A measurable share of large H100 fleets hit this in year one.
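To act on this event automatically, the kernel log line first has to be turned into structured fields. The following is a minimal sketch of that step, not an official parser. The regular expression is modeled only on the sample line above, so other driver versions may need a different pattern, and the names `XidEvent` and `parse_xid_line` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern modeled on the single sample line on this page (assumption):
# other driver versions may format Xid messages differently.
XID_PATTERN = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9A-Fa-f:.]+)\): (?P<xid>\d+),\s*(?P<detail>.*)"
)


@dataclass
class XidEvent:
    pci: str     # PCI address as printed in the log, e.g. "0000:65:00"
    xid: int     # numeric Xid code, e.g. 79
    detail: str  # remainder of the message after the Xid code


def parse_xid_line(line: str) -> Optional[XidEvent]:
    """Return an XidEvent if the line contains an NVRM Xid message, else None."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    return XidEvent(
        pci=match["pci"],
        xid=int(match["xid"]),
        detail=match["detail"].strip(),
    )


sample = "NVRM: Xid (PCI:0000:65:00): 79, pid='N/A', GPU has fallen off the bus"
event = parse_xid_line(sample)
```

Keeping the PCI address alongside the Xid code matters because the recurrence check ("repeated Xid 79 on the same GPU") has to be keyed per GPU rather than per node.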
The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
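The decision path above can be sketched as a small planning function. This is an illustration under stated assumptions, not a production policy. It reads "RMA on recurrence" as replacing the reset step once the same GPU has already shown the fault, and the set `HARDWARE_CLASS_XIDS` contains only Xid 79 because that is the only code this page covers. `plan_actions` and the step names are hypothetical labels, not commands from any real tool.

```python
from typing import List

# Assumption: only Xid 79 is listed because it is the only code discussed
# here; extend the set after checking NVIDIA's official Xid table.
HARDWARE_CLASS_XIDS = {79}


def plan_actions(xid: int, prior_occurrences: int) -> List[str]:
    """Return ordered remediation steps for one Xid event on one GPU.

    prior_occurrences is the number of earlier events with the same Xid
    on the same GPU (0 means this is the first time).
    """
    if xid not in HARDWARE_CLASS_XIDS:
        # Not treated as hardware-class in this sketch: no automatic action.
        return []

    # Always stop new work landing and move running jobs off first.
    steps = ["cordon", "drain"]

    if prior_occurrences == 0:
        # First occurrence: a reset may clear or remap the fault.
        steps.append("reset")
    else:
        # The fault came back after an earlier reset: escalate to RMA.
        steps.append("rma")
    return steps


first_time = plan_actions(79, prior_occurrences=0)
recurrence = plan_actions(79, prior_occurrences=1)
```

Returning an ordered list rather than executing anything keeps the policy easy to test and lets an operator approve each step before it runs.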
Unambiguous signal. The failure mode to fear isn't a false Xid 79 — it's a real one buried under hundreds of benign Xids at 3am.
That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.
Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.