Xid 45: Preemptive channel teardown

Channel torn down on a normal process exit or kill — high-volume and usually benign.

What it means

Xid 45 fires when channels are cleaned up as a process exits — including normal job completion, a Ctrl-C, or an OOM kill. It is one of the highest-volume Xids and on its own says nothing about hardware health.

Typical kernel-log signature

NVRM: Xid (PCI:0000:65:00): 45, pid=12345, Ch 00000008, engmask 00000101
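
For log pipelines, a line like the one above can be parsed with a small regular expression. This is a minimal sketch modeled only on that sample line; the names `XID_RE`, `XidEvent`, and `parse_xid` are illustrative, and other driver versions may emit additional or differently formatted fields, so check the pattern against your own logs.

```python
import re
from typing import NamedTuple, Optional

# Pattern modeled on the sample line above (an assumption, not an official grammar).
# The pid field is optional because some log lines may not carry a numeric pid.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
    r"(?:, pid=(?P<pid>\d+))?"
)


class XidEvent(NamedTuple):
    pci: str            # PCI address exactly as printed in the log
    xid: int            # Xid code, e.g. 45
    pid: Optional[int]  # owning process id, if present and numeric


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the parsed Xid event, or None if the line is not an Xid line."""
    m = XID_RE.search(line)
    if m is None:
        return None
    pid = m.group("pid")
    return XidEvent(m.group("pci"), int(m.group("xid")), int(pid) if pid else None)
```

Keeping the pid is the useful part here: it is the key for matching an Xid 45 against a process exit.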

How to diagnose it

  1. Correlate with job lifecycle: does it line up with jobs ending or being killed?
  2. Only investigate if it appears without any corresponding process exit.
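
The two diagnosis steps can be sketched as a simple join between Xid 45 events and process exits. Everything here is hypothetical scaffolding: `Xid45`, `unexplained_xid45`, the `exits` mapping (pid to exit timestamp, for example from scheduler accounting), and the 30-second window are assumptions to adapt, not part of any NVIDIA tooling.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Xid45:
    pid: int   # pid reported in the Xid log line
    ts: float  # event time, seconds since epoch


def unexplained_xid45(
    events: List[Xid45],
    exits: Dict[int, float],
    window_s: float = 30.0,  # assumed tolerance between exit and teardown
) -> List[Xid45]:
    """Return Xid 45 events with no exit of the same pid within window_s seconds."""
    unexplained = []
    for ev in events:
        exit_ts = exits.get(ev.pid)
        if exit_ts is None or abs(exit_ts - ev.ts) > window_s:
            unexplained.append(ev)
    return unexplained


events = [Xid45(pid=12345, ts=1000.0), Xid45(pid=222, ts=2000.0)]
exits = {12345: 1001.5}  # pid 12345 exited 1.5 s after its Xid 45
print(unexplained_xid45(events, exits))  # only the pid 222 event remains
```

Only the events this returns are worth a closer look; the rest line up with a process ending.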

What to do

Generally ignore. Drain only if Xid 45 appears alongside genuine hardware faults.
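
As a rough sketch of that rule, the function below looks at the set of Xids seen on one GPU in some recent window. `HARDWARE_XIDS` is an illustrative assumption (48, a double-bit ECC error, and 79, GPU fallen off the bus); a real fleet would define its own set from NVIDIA's Xid table. `action_for` and its return strings are likewise hypothetical.

```python
from typing import Iterable

XID_TEARDOWN = 45

# Illustrative assumption: Xids this site treats as hardware faults.
# Replace with a set derived from NVIDIA's official Xid table.
HARDWARE_XIDS = frozenset({48, 79})


def action_for(xids_seen: Iterable[int]) -> str:
    """Map the Xids seen on one GPU to 'drain', 'ignore', or 'review'."""
    seen = set(xids_seen)
    if seen & HARDWARE_XIDS:
        return "drain"    # Xid 45 or not, the hardware fault drives the decision
    if seen <= {XID_TEARDOWN}:
        return "ignore"   # nothing but teardown noise (or nothing at all)
    return "review"       # other Xids present: judge them on their own terms
```

Note that Xid 45 never changes the outcome by itself: the decision is driven entirely by what else appears next to it.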

The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
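
The escalation ladder above can be written down as an ordered plan. This is only a sketch of the ordering described in the text; `Step` and `plan` are hypothetical names, and the actual cordon, drain, and reset commands depend on your scheduler and driver tooling.

```python
from enum import Enum
from typing import List


class Step(Enum):
    CORDON = "cordon"  # stop new work landing on the node
    DRAIN = "drain"    # move running jobs off
    RESET = "reset"    # reset the GPU to clear or remap the fault
    RMA = "rma"        # return the hardware


def plan(fault_recurred_after_reset: bool) -> List[Step]:
    """Return the ordered steps for a hardware-class Xid."""
    steps = [Step.CORDON, Step.DRAIN, Step.RESET]
    if fault_recurred_after_reset:
        steps.append(Step.RMA)  # RMA only when the fault comes back
    return steps
```

Xid 45 alone never enters this path; it applies only once a hardware-class Xid has been identified.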

Signal or noise?

The textbook noise Xid — it fires constantly during normal operation. A monitor that pages on Xid 45 is broken by design.

That judgment, absorbing the noise and surfacing the few events that are real, is what Plexus does by default on your existing DCGM / Prometheus / Thanos stack.

Related: Xid 13

Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.