A channel was aborted after a workload-side error — the GPU itself is usually fine.
Xid 43 indicates the GPU stopped processing a particular channel after an error in the workload running on it. It generally accompanies an application fault rather than a hardware failure.
NVRM: Xid (PCI:0000:65:00): 43, pid=12345, GPU stopped processing
Treat as workload-side. The node rarely needs draining for an isolated Xid 43.
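To act on a line like the sample above, a collector first has to pull out the PCI address, the Xid number, and the pid. The following is a minimal sketch, assuming the line has the same shape as the sample; real driver output varies by driver version and GPU generation, and the names `XidEvent` and `parse_xid_line` are hypothetical helpers, not part of any NVIDIA tool.

```python
import re
from typing import NamedTuple, Optional

# Pattern modelled on the sample line in this article only (assumption):
#   NVRM: Xid (PCI:0000:65:00): 43, pid=12345, GPU stopped processing
XID_LINE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9A-Fa-f:.]+)\): (?P<xid>\d+)"
    r"(?:, pid=(?P<pid>[^,\s]+))?"
    r"(?:, (?P<detail>.*))?"
)


class XidEvent(NamedTuple):
    pci: str            # PCI address as printed by the driver
    xid: int            # Xid number, e.g. 43
    pid: Optional[str]  # kept as text; it may be absent or non-numeric
    detail: str         # free-form remainder of the message


def parse_xid_line(line: str) -> Optional[XidEvent]:
    """Return the parsed event, or None if the line is not an Xid message."""
    m = XID_LINE.search(line)
    if m is None:
        return None
    return XidEvent(
        pci=m.group("pci"),
        xid=int(m.group("xid")),
        pid=m.group("pid"),
        detail=(m.group("detail") or "").strip(),
    )
```

Using `search` rather than `match` lets the same function accept lines that carry a kernel timestamp or hostname prefix in front of `NVRM:`.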
The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
App-level. Absorb unless paired with a hardware Xid.
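The "absorb unless paired" rule can be sketched as a small function over the Xid events seen on one GPU. This is an illustration under stated assumptions, not a definitive policy: the set of hardware-class Xids (here 48 and 79), the 300-second pairing window, and the name `triage_xid43` are all placeholders chosen for the example, and the real classification should come from NVIDIA's Xid table for your driver and GPU generation.

```python
from typing import Iterable, Tuple

# Illustrative only (assumption): which Xids count as hardware-class
# must be taken from NVIDIA's official table for your driver and GPU.
ILLUSTRATIVE_HARDWARE_XIDS = frozenset({48, 79})


def triage_xid43(
    events: Iterable[Tuple[float, int]],
    window_s: float = 300.0,
    hardware_xids: frozenset = ILLUSTRATIVE_HARDWARE_XIDS,
) -> str:
    """Classify Xid 43 activity on a single GPU.

    events: (timestamp_seconds, xid) pairs for that GPU.
    Returns "escalate" if any hardware-class Xid falls within window_s
    seconds of an Xid 43, "absorb" if Xid 43 appears alone, and
    "no-xid43" if there is no Xid 43 to judge.
    """
    events = list(events)
    xid43_times = [t for t, xid in events if xid == 43]
    if not xid43_times:
        return "no-xid43"
    for t, xid in events:
        if xid in hardware_xids and any(abs(t - t43) <= window_s for t43 in xid43_times):
            return "escalate"
    return "absorb"
```

For example, `triage_xid43([(100.0, 43)])` returns `"absorb"`, while `triage_xid43([(100.0, 43), (160.0, 79)])` returns `"escalate"`. A hardware-class Xid outside the window is still handled by the hardware decision path on its own; this function only decides what to do with the Xid 43.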
That judgment (absorb the noise, surface the few that are real) is what Plexus applies by default on top of your existing DCGM / Prometheus / Thanos stack. See why a GPU fleet throws 50 alerts a week and only two matter.
Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.