A graphics/compute engine exception — usually an application bug, occasionally failing hardware.
Xid 13 is raised when the GPU's graphics or compute engine hits an exception — frequently an illegal instruction or out-of-bounds memory access originating in the workload. It is one of the most common Xids and is dominated by application behavior, not hardware faults.
NVRM: Xid (PCI:0000:65:00): 13, pid=12345, Graphics Engine Exception
Treat as an application bug first: fix the workload. Suspect hardware only if Xid 13 keeps recurring across unrelated jobs on the same GPU; then drain and run extended diagnostics.
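The "recurring across unrelated jobs on the same GPU" test can be approximated from kernel logs. The sketch below is a minimal illustration, not a definitive detector: it assumes log lines follow the sample format shown above (real driver output varies by version), it uses distinct pids as a stand-in for "unrelated jobs", and the function name `suspect_gpus` and the threshold `min_distinct_pids` are hypothetical.

```python
import re
from collections import defaultdict

# Matches the sample line format shown above; actual NVRM output may differ
# by driver version (assumption).
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+), pid=(\d+)")


def suspect_gpus(log_lines, xid=13, min_distinct_pids=3):
    """Return PCI addresses where `xid` was raised by at least
    `min_distinct_pids` different processes.

    Distinct pids are only a proxy for "unrelated jobs"; the threshold
    of 3 is an illustrative default, not a vendor recommendation.
    """
    pids_by_gpu = defaultdict(set)
    for line in log_lines:
        match = XID_RE.search(line)
        if match and int(match.group(2)) == xid:
            pids_by_gpu[match.group(1)].add(int(match.group(3)))
    return sorted(
        gpu for gpu, pids in pids_by_gpu.items()
        if len(pids) >= min_distinct_pids
    )
```

A GPU returned by this function is a candidate for draining and extended diagnostics; one that never appears is more likely just running a buggy workload.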
The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
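As a rough illustration of that path, the sketch below builds the ordered steps as a dry-run plan and does not execute anything. It assumes the GPU node is managed by Kubernetes (the page does not say which scheduler is in use) and that `kubectl` and `nvidia-smi` are available. The function name `remediation_plan` and its parameters are hypothetical.

```python
def remediation_plan(node, gpu_index, recurred_after_reset=False):
    """Return the ordered (step, command) pairs for one pass of the path.

    Commands are argv lists for a dry run; nothing is executed here.
    Assumes a Kubernetes-managed node (assumption, not from the source).
    """
    steps = [
        # Stop new work from landing on the node.
        ("cordon", ["kubectl", "cordon", node]),
        # Evict running jobs; DaemonSet pods are left in place.
        ("drain", ["kubectl", "drain", node, "--ignore-daemonsets"]),
    ]
    if recurred_after_reset:
        # The fault came back after a reset: no command to run,
        # open a hardware return with the vendor instead.
        steps.append(("rma", None))
    else:
        # First occurrence: reset the GPU to clear the fault.
        steps.append(("reset", ["nvidia-smi", "--gpu-reset", "-i", str(gpu_index)]))
    return steps
```

Keeping the plan as data makes it easy to log or review before anyone runs it, which matters when the wrong node name would drain healthy capacity.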
High-frequency and almost always the workload. Paging on every Xid 13 is how on-call learns to mute the channel — absorb it unless it correlates with a hardware signal.
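One way to express "absorb it unless it correlates with a hardware signal" is a time-window check on the same GPU. This is a minimal sketch under stated assumptions: what counts as a hardware signal is left to the caller, and the name `should_page` and the 10-minute default window are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def should_page(xid13_time, hardware_signal_times, window=timedelta(minutes=10)):
    """Page only if a hardware signal on the same GPU falls within
    `window` of the Xid 13 event; otherwise absorb the alert.

    The caller decides which events count as hardware signals; the
    default window is illustrative, not a recommended value.
    """
    return any(abs(xid13_time - t) <= window for t in hardware_signal_times)


# Usage: an Xid 13 with no nearby hardware signal is absorbed.
event = datetime(2024, 1, 1, 12, 0)
print(should_page(event, []))                             # False
print(should_page(event, [datetime(2024, 1, 1, 12, 4)]))  # True
```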
That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.
Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.