Severity: watch (early signal)

Xid 119 / 120: GSP RPC timeout / error

The GPU System Processor (GSP) timed out or returned an error. Sometimes this is transient; sometimes the GPU needs a reset.

What it means

Xid 119 (GSP RPC timeout) and Xid 120 (GSP error) involve the GPU System Processor — the on-board microcontroller that runs GSP firmware on recent drivers. They can be transient hiccups or signal a firmware/driver problem that wedges the GPU.

Typical kernel-log signature
NVRM: Xid (PCI:0000:65:00): 119, pid='N/A', Timeout waiting for RPC from GSP
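
A minimal sketch of pulling Xid 119/120 events out of kernel-log text, assuming lines shaped like the sample above. The regular expression and the names `XID_LINE`, `GSP_XIDS`, and `parse_gsp_xid` are illustrative, not part of any NVIDIA tool, and the exact fields after the Xid number may differ between driver versions.

```python
import re

# Modeled on the sample line above; the PCI and trailing fields can vary by driver.
XID_LINE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<detail>.*)"
)

# The two GSP-related Xids this page covers.
GSP_XIDS = {119: "GSP RPC timeout", 120: "GSP error"}


def parse_gsp_xid(line):
    """Return a dict for an Xid 119/120 line, or None for any other line."""
    match = XID_LINE.search(line)
    if match is None:
        return None
    xid = int(match.group("xid"))
    if xid not in GSP_XIDS:
        return None
    return {
        "pci": match.group("pci"),
        "xid": xid,
        "kind": GSP_XIDS[xid],
        "detail": match.group("detail").strip(),
    }


sample = "NVRM: Xid (PCI:0000:65:00): 119, pid='N/A', Timeout waiting for RPC from GSP"
event = parse_gsp_xid(sample)
```

Feeding each line of `dmesg` or journal output through `parse_gsp_xid` yields one record per GSP event, which can then be grouped by PCI address or node.
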
How to diagnose it
  1. Check whether the GPU recovered on its own or is now unresponsive (`nvidia-smi`).
  2. Review driver/GSP version against NVIDIA's known-issue notes for your driver branch.
  3. Look for a pattern across GPUs/nodes (suggests a driver/firmware issue) vs. a single device.
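
Steps 1 and 3 above can be sketched in Python as follows. This is an illustration under stated assumptions: the function names, the 30-second timeout, and the `min_nodes` threshold are hypothetical choices, not NVIDIA guidance. The timeout is there because `nvidia-smi` may hang rather than fail when a GPU is wedged. Step 2 is left out because it means reading release notes.

```python
import subprocess
from collections import defaultdict


def gpu_responds(timeout_s=30):
    """Step 1: True if `nvidia-smi -L` exits cleanly within the timeout."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        # Hung or missing binary: treat the GPU as unresponsive.
        return False
    return result.returncode == 0


def classify_pattern(events, min_nodes=3):
    """Step 3: classify (node, gpu) event pairs; min_nodes is an assumed threshold."""
    gpus_by_node = defaultdict(set)
    for node, gpu in events:
        gpus_by_node[node].add(gpu)
    if not gpus_by_node:
        return "none"
    if len(gpus_by_node) >= min_nodes:
        return "fleet-wide"  # points toward a driver/firmware issue
    total_gpus = sum(len(gpus) for gpus in gpus_by_node.values())
    if total_gpus == 1:
        return "single-device"
    return "multi-device"
```

For example, `classify_pattern([("node-a", 0), ("node-a", 0)])` reports a single device with repeated events, while the same Xid on three or more nodes is reported as fleet-wide.
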
What to do

If the event is isolated and the GPU recovered on its own, watch it. If the GPU is wedged or the Xid recurs, drain the node and then reset the GPU; persistent cases may need a driver update or, as a documented workaround, disabling GSP firmware.

The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
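
The decision path above can be written down as a small function. This is a sketch, not a policy: `GpuState`, `next_actions`, the recurrence counter, and the `rma_after` threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class GpuState:
    responsive: bool   # did the GPU recover on its own?
    recurrences: int   # earlier Xid 119/120 events on this GPU (assumed counter)


def next_actions(state, rma_after=2):
    """Return the ordered actions for one GPU; rma_after is an assumed threshold."""
    if state.responsive and state.recurrences == 0:
        # Isolated and self-recovered: no intervention, keep an eye on it.
        return ["watch"]
    # Wedged or recurring: stop new work, move running jobs off, then reset.
    actions = ["cordon", "drain", "reset"]
    if state.recurrences >= 1:
        # Persistent cases may need a driver update or the GSP workaround.
        actions.append("review-driver")
    if state.recurrences >= rma_after:
        actions.append("rma")
    return actions
```

With these defaults, a wedged GPU with no history gets `["cordon", "drain", "reset"]`, and one that has already faulted twice gets the driver review and RMA steps as well.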

Signal or noise?

Mixed: transient events are noise; a persistent or fleet-wide pattern is signal. This is a good case for correlating across the fleet before deciding.

That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.

Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.