
Xid 13: Graphics Engine Exception

A graphics/compute engine exception — usually an application bug, occasionally failing hardware.

What it means

Xid 13 is raised when the GPU's graphics or compute engine hits an exception — frequently an illegal instruction or out-of-bounds memory access originating in the workload. It is one of the most common Xids and is dominated by application behavior, not hardware faults.

Typical kernel-log signature
NVRM: Xid (PCI:0000:65:00): 13, pid=12345, Graphics Engine Exception
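
To work with this signature programmatically, the line can be split into its fields. The sketch below is modelled only on the sample line above; the regex, the `parse_xid` name, and the returned field names are illustrative assumptions, and real driver output varies by version (some drivers add extra fields or report a non-numeric pid), so treat it as a starting point rather than a complete parser.

```python
import re

# Pattern modelled on the sample line above; actual driver output varies by version.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
    r"(?:, pid=(?P<pid>[^,]+))?"   # pid is kept as a string; it is not always numeric
    r"(?:, (?P<detail>.*))?"       # free-text remainder of the message
)

def parse_xid(line):
    """Return a dict of fields for an Xid log line, or None if the line is not an Xid."""
    m = XID_RE.search(line)
    if m is None:
        return None
    pid = m.group("pid")
    return {
        "pci": m.group("pci"),
        "xid": int(m.group("xid")),
        "pid": pid.strip() if pid else None,
        "detail": (m.group("detail") or "").strip(),
    }

sample = "NVRM: Xid (PCI:0000:65:00): 13, pid=12345, Graphics Engine Exception"
event = parse_xid(sample)
print(event)
```
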
How to diagnose it
  1. Check whether it tracks a single job/user/container (app bug) or recurs across different workloads on the same physical GPU (possible hardware).
  2. Run `dcgmi diag -r 2` on the node to rule out a hardware fault.
  3. Correlate with any ECC or thermal Xids in the same window.
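
The first check can be sketched as a grouping problem: collect Xid 13 occurrences per physical GPU and count how many distinct jobs triggered them. Everything here is an assumption for illustration: the `classify_xid13` name, the `(gpu_id, job_id)` event shape, the verdict labels, and the threshold of three distinct jobs, which should be tuned to the fleet.

```python
from collections import defaultdict

def classify_xid13(events, min_distinct_jobs=3):
    """events: iterable of (gpu_id, job_id) pairs, one per Xid 13 occurrence.

    A GPU that faults under several unrelated jobs is flagged as suspect;
    a GPU that faults under one job only looks like an application bug.
    The threshold is a hypothetical default, not a vendor recommendation.
    """
    jobs_by_gpu = defaultdict(set)
    for gpu_id, job_id in events:
        jobs_by_gpu[gpu_id].add(job_id)
    return {
        gpu: "suspect-hardware" if len(jobs) >= min_distinct_jobs else "likely-app-bug"
        for gpu, jobs in jobs_by_gpu.items()
    }

events = [
    ("node1/0000:65:00", "job-a"),
    ("node1/0000:65:00", "job-b"),
    ("node1/0000:65:00", "job-c"),
    ("node1/0000:17:00", "job-d"),
    ("node1/0000:17:00", "job-d"),  # same job repeating on the same GPU
]
verdicts = classify_xid13(events)
print(verdicts)
```

Keying on a job or container identifier rather than a raw pid is a deliberate choice in this sketch, since pids are reused and a single job may spawn many processes.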
What to do

Treat as an application bug first — fix the workload. Suspect hardware only if Xid 13 keeps recurring across unrelated jobs on the same GPU; then drain and run extended diagnostics.

The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
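
As a rough illustration of that path, the sketch below builds the commands for one pass without running them. It assumes a Kubernetes-managed fleet (hence `kubectl cordon` and `kubectl drain`) and that `nvidia-smi --gpu-reset` is run on the affected node itself; the function name, the `seen_before` flag, and the action labels are hypothetical, and a real runbook would add health checks between steps.

```python
def remediation_plan(node, gpu_index, seen_before):
    """Return (commands, follow_up) for one pass of cordon -> drain -> reset.

    Nothing is executed here; the caller decides how and where to run each command.
    seen_before: True if the same fault already recurred after an earlier reset.
    """
    steps = [
        ["kubectl", "cordon", node],                          # stop new work landing
        ["kubectl", "drain", node, "--ignore-daemonsets"],    # move running jobs off
        ["nvidia-smi", "--gpu-reset", "-i", str(gpu_index)],  # run on the node itself
    ]
    follow_up = "open-rma" if seen_before else "uncordon-and-watch"
    return steps, follow_up

steps, follow_up = remediation_plan("gpu-node-07", 2, seen_before=False)
for cmd in steps:
    print(" ".join(cmd))
print(follow_up)
```
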

Signal or noise?

High-frequency and almost always the workload. Paging on every Xid 13 is how on-call learns to mute the channel — absorb it unless it correlates with a hardware signal.
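
One way to express "absorb it unless it correlates with a hardware signal" is a time-window check against other Xids on the same GPU. This is a minimal sketch under stated assumptions: the set of hardware-class Xid numbers is illustrative and should be confirmed against NVIDIA's Xid table for your driver and GPU generation, and the 10-minute window and the `should_page` name are arbitrary choices.

```python
# Illustrative set of ECC / bus-level Xids; confirm against NVIDIA's official Xid table.
HARDWARE_XIDS = {48, 63, 64, 79, 94, 95}

def should_page(xid13_ts, gpu_id, other_events, window_s=600):
    """Page only if a hardware-class Xid hit the same GPU within window_s seconds.

    other_events: iterable of (timestamp_s, gpu_id, xid) tuples.
    """
    return any(
        g == gpu_id and x in HARDWARE_XIDS and abs(ts - xid13_ts) <= window_s
        for ts, g, x in other_events
    )

history = [
    (1000, "gpu0", 48),  # hardware-class event close in time on gpu0
    (5000, "gpu1", 79),  # hardware-class event on gpu1, but far outside the window
]
print(should_page(1200, "gpu0", history))
print(should_page(1200, "gpu1", history))
```
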

That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.

Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.