
Xid 79: GPU has fallen off the bus

The GPU is no longer reachable on PCIe — a serious hardware fault; drain immediately.

What it means

Xid 79 means the driver can no longer communicate with the GPU over PCIe — it has effectively disappeared from the system. It is one of the most serious Xids and frequently points to a power delivery problem, overheating, or a physical/seating fault. The GPU is unusable until recovered.

Typical kernel-log signature

NVRM: Xid (PCI:0000:65:00): 79, pid='N/A', GPU has fallen off the bus

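A signature like this is straightforward to match in a log pipeline. The following is a minimal sketch, assuming lines shaped like the example above; the names `XidEvent`, `parse_xid_line`, and `fallen_off_bus` are illustrative, and the exact field layout after the Xid number can differ between driver versions.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Matches the "NVRM: Xid (PCI:<address>): <number>, <detail>" shape shown above.
# Assumption: other driver versions may format the trailing detail differently.
XID_LINE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9A-Fa-f:.]+)\): (?P<xid>\d+),\s*(?P<detail>.*)"
)


class XidEvent(NamedTuple):
    pci: str      # PCI address exactly as logged, lower-cased
    xid: int      # numeric Xid code
    detail: str   # remainder of the message


def parse_xid_line(line: str) -> Optional[XidEvent]:
    """Return the Xid event found in a kernel-log line, or None if there is none."""
    match = XID_LINE.search(line)
    if match is None:
        return None
    return XidEvent(
        match.group("pci").lower(),
        int(match.group("xid")),
        match.group("detail").strip(),
    )


def fallen_off_bus(lines: Iterable[str]) -> List[str]:
    """Return PCI addresses that logged Xid 79, in first-seen order, without repeats."""
    seen: List[str] = []
    for line in lines:
        event = parse_xid_line(line)
        if event is not None and event.xid == 79 and event.pci not in seen:
            seen.append(event.pci)
    return seen
```

The input could come from `dmesg` or `journalctl -k`; because the pattern uses `search`, a leading timestamp or hostname does not prevent a match.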
How to diagnose it

  1. `nvidia-smi`: the GPU is typically missing or shows ERR!.
  2. `lspci | grep -i nvidia`: confirm whether the device still enumerates on PCIe.
  3. Check power draw, PSU health, and inlet/junction temperatures leading up to the event; inspect the BMC/IPMI logs.

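The first two checks can be combined into one comparison: which NVIDIA GPUs still enumerate on PCIe but are no longer reported by the driver. The sketch below is a hedged illustration, not a definitive tool. It assumes `lspci -D -d 10de:` (10de is NVIDIA's PCI vendor ID) and `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader` are available on the host; the function names and the choice of device classes to count as GPUs are assumptions made for this example.

```python
import re
import subprocess
from typing import List, Optional, Set

# Assumption: GPUs show up in lspci under one of these class names.
GPU_CLASSES = ("3D controller", "VGA compatible controller")

# Accepts a 4- or 8-hex-digit PCI domain (nvidia-smi prints 8, lspci -D prints 4).
BDF = re.compile(r"^(?:[0-9a-f]{4})?([0-9a-f]{4}):([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])$")


def normalize_bdf(text: str) -> Optional[str]:
    """Normalize a bus ID to 'dddd:bb:dd.f'; return None if text is not a bus ID."""
    match = BDF.match(text.strip().lower())
    return f"{match.group(1)}:{match.group(2)}" if match else None


def parse_lspci(output: str) -> Set[str]:
    """Bus IDs of NVIDIA GPU-class devices in `lspci -D -d 10de:` output."""
    found = set()
    for line in output.splitlines():
        if any(cls in line for cls in GPU_CLASSES):
            bdf = normalize_bdf(line.split()[0])
            if bdf is not None:
                found.add(bdf)
    return found


def parse_nvidia_smi(output: str) -> Set[str]:
    """Bus IDs in nvidia-smi query output; lines that are not bus IDs are skipped."""
    ids = (normalize_bdf(line) for line in output.splitlines())
    return {bdf for bdf in ids if bdf is not None}


def missing_from_driver(lspci_out: str, smi_out: str) -> Set[str]:
    """GPUs that still enumerate on PCIe but that the driver no longer reports."""
    return parse_lspci(lspci_out) - parse_nvidia_smi(smi_out)


def run(cmd: List[str]) -> str:
    """Return stdout of cmd, or '' if the tool is missing or does not finish in time."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=30).stdout
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return ""


def check_host() -> Set[str]:
    """Run both commands on this host and return the bus IDs the driver has lost."""
    lspci_out = run(["lspci", "-D", "-d", "10de:"])
    smi_out = run(["nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader"])
    return missing_from_driver(lspci_out, smi_out)
```

A GPU that has dropped off the bus may also vanish from `lspci` entirely, in which case this difference is empty. Comparing `len(parse_lspci(...))` against the number of GPUs the node is supposed to have covers that case. The timeout is there because `nvidia-smi` can be slow to return on a host with an unreachable GPU.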
What to do

Drain the node immediately; it is effectively down. Power-cycle the host first. If the Xid recurs, reseat the card and its power connectors. Repeated Xid 79 on the same GPU warrants an RMA. A measurable share of large H100 fleets hit this in year one.

The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
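The decision path above can be written down as a small pure function, which makes the policy easy to test and review. This is a sketch under stated assumptions: the `Action` enum, the `next_actions` name, and the `rma_threshold` parameter are hypothetical, and the default of treating the first recurrence as the RMA trigger is an assumption rather than a fixed rule.

```python
from enum import Enum
from typing import List


class Action(Enum):
    CORDON = "cordon"  # stop new work landing on the node
    DRAIN = "drain"    # move running jobs off the node
    RESET = "reset"    # reset the GPU (for Xid 79, in practice a power-cycle)
    RMA = "rma"        # return the hardware to the vendor


def next_actions(prior_events: int, rma_threshold: int = 1) -> List[Action]:
    """Actions for a new hardware-class Xid on a GPU.

    prior_events: how many times this GPU has already logged the same Xid.
    rma_threshold: number of prior events at which a new one counts as a
        recurrence (assumption: 1, meaning the second event triggers an RMA).
    """
    if prior_events < 0 or rma_threshold < 1:
        raise ValueError("prior_events must be >= 0 and rma_threshold >= 1")

    # Cordon and drain happen on every occurrence, before anything else.
    actions = [Action.CORDON, Action.DRAIN]
    if prior_events >= rma_threshold:
        actions.append(Action.RMA)    # the fault came back: replace the hardware
    else:
        actions.append(Action.RESET)  # first occurrence: try to clear the fault
    return actions
```

For example, `next_actions(0)` yields cordon, drain, reset, while `next_actions(1)` yields cordon, drain, RMA. Keeping the function free of side effects leaves the actual cordon and drain mechanics to whatever scheduler the fleet uses.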

Signal or noise?

Unambiguous signal. The failure mode to fear isn't a false Xid 79 — it's a real one buried under hundreds of benign Xids at 3am.

That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.

Related: Xid 74

Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.