Usually app error

Xid 43: GPU stopped processing

A channel was aborted after a workload-side error — the GPU itself is usually fine.

What it means

Xid 43 indicates the GPU stopped processing a particular channel after an error in the workload running on it. It generally accompanies an application fault rather than a hardware failure.

Typical kernel-log signature

NVRM: Xid (PCI:0000:65:00): 43, pid=12345, GPU stopped processing

How to diagnose it

  1. Find the job that owned the channel and inspect its logs.
  2. Check for an accompanying Xid 13/31 from the same pid.
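
The two steps above can be sketched as a small log scan. This is a minimal illustration, assuming kernel-log lines follow the signature shown on this page; real driver output varies by driver version, so the regular expression is an assumption, and the helper names (xids_by_pid, companions_of_xid43) and the sample lines are hypothetical.

```python
import re
from collections import defaultdict

# Assumed pattern, modelled on the signature shown on this page.
# Real NVRM lines differ between driver versions; adjust as needed.
XID_LINE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), pid=(?P<pid>\d+)"
)

def xids_by_pid(log_lines):
    """Return {pid: set of Xid codes} for every line that matches XID_LINE."""
    seen = defaultdict(set)
    for line in log_lines:
        m = XID_LINE.search(line)
        if m:
            seen[int(m.group("pid"))].add(int(m.group("xid")))
    return dict(seen)

def companions_of_xid43(log_lines, companions=(13, 31)):
    """For each pid that logged Xid 43, list the companion Xids it also logged."""
    return {
        pid: sorted(codes & set(companions))
        for pid, codes in xids_by_pid(log_lines).items()
        if 43 in codes
    }

# Hypothetical sample lines; the trailing text on the Xid 31 line is made up.
sample = [
    "NVRM: Xid (PCI:0000:65:00): 31, pid=12345, example text",
    "NVRM: Xid (PCI:0000:65:00): 43, pid=12345, GPU stopped processing",
    "NVRM: Xid (PCI:0000:17:00): 43, pid=777, GPU stopped processing",
]
print(companions_of_xid43(sample))
```

The pid in each result is the starting point for step 1: map it back to the job that owned the channel using whatever your scheduler records, then read that job's logs.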

What to do

Treat Xid 43 as a workload-side fault. An isolated Xid 43 rarely requires draining the node.

The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
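
One way to encode that decision path is a small function. This is a sketch under stated assumptions, not a definitive policy: the Action names and next_action are hypothetical, and the set of hardware-class Xids is supplied by the caller because this page does not list them. The values 48 and 79 in the usage lines are placeholders; take the real set from NVIDIA's Xid table for your driver and GPU generation.

```python
from enum import Enum

class Action(Enum):
    ABSORB = "absorb"                      # app-level only: no node action
    CORDON_DRAIN_RESET = "cordon, drain, reset"
    RMA = "rma"

def next_action(xids_on_gpu, hardware_xids, recurred_after_reset):
    """Pick the next step for one GPU.

    xids_on_gpu:          Xid codes recently seen on this GPU
    hardware_xids:        codes the operator treats as hardware-class (assumption)
    recurred_after_reset: True if the hardware fault came back after a reset
    """
    hardware_hits = set(xids_on_gpu) & set(hardware_xids)
    if not hardware_hits:
        # Only app-level codes such as Xid 43: absorb.
        return Action.ABSORB
    if recurred_after_reset:
        # The fault survived a reset: escalate to RMA.
        return Action.RMA
    # First occurrence of a hardware-class Xid: cordon, drain, then reset.
    return Action.CORDON_DRAIN_RESET

HARDWARE_XIDS = {48, 79}  # placeholder set, an assumption for illustration

print(next_action({43}, HARDWARE_XIDS, recurred_after_reset=False))
print(next_action({43, 79}, HARDWARE_XIDS, recurred_after_reset=False))
print(next_action({43, 79}, HARDWARE_XIDS, recurred_after_reset=True))
```

Keeping the hardware set as a parameter means the policy can change with driver or GPU generation without touching the logic.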

Signal or noise?

App-level. Absorb unless paired with a hardware Xid.

That judgment (absorb the noise, surface the few alerts that are real) is what Plexus does by default on top of your existing DCGM / Prometheus / Thanos stack. See why a GPU fleet throws 50 alerts a week and only two matter.

Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.