The GPU System Processor (GSP) timed out or reported an error. Sometimes this is transient; sometimes the GPU needs a reset.
Xid 119 (GSP RPC timeout) and Xid 120 (GSP error) involve the GPU System Processor — the on-board microcontroller that runs GSP firmware on recent drivers. They can be transient hiccups or signal a firmware/driver problem that wedges the GPU.
NVRM: Xid (PCI:0000:65:00): 119, pid='N/A', Timeout waiting for RPC from GSP
If isolated and self-recovered, watch. If the GPU is wedged or it recurs, reset the GPU (drain first); persistent cases may need a driver update or, as a documented workaround, disabling GSP firmware.
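To act on a log line like the one above, it helps to pull out the PCI address and the Xid number first. The following is a minimal sketch, assuming kernel log lines follow the same shape as the single sample shown here; the regular expression, the `parse_xid_line` helper, and the `gsp_related` field are illustrative names, not part of any NVIDIA tool, and other driver versions may format the line differently.

```python
import re
from typing import Optional

# Pattern modeled on the one sample line in this article (an assumption);
# real logs may vary by driver version.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9A-Fa-f:.]+)\): (?P<xid>\d+),\s*(?P<detail>.*)"
)

# The two GSP-related Xids discussed in this article.
GSP_XIDS = {119: "GSP RPC timeout", 120: "GSP error"}


def parse_xid_line(line: str) -> Optional[dict]:
    """Return the PCI address, Xid number, and trailing detail, or None if no match."""
    m = XID_RE.search(line)
    if m is None:
        return None
    xid = int(m.group("xid"))
    return {
        "pci": m.group("pci"),
        "xid": xid,
        "detail": m.group("detail").strip(),
        "gsp_related": xid in GSP_XIDS,
    }


sample = "NVRM: Xid (PCI:0000:65:00): 119, pid='N/A', Timeout waiting for RPC from GSP"
print(parse_xid_line(sample))
```

The parsed record can then be fed into whatever counting or alerting logic decides between "watch" and "reset".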
The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
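The decision path above can be sketched as a small function. This is an illustration under stated assumptions, not a prescribed policy: the `plan_actions` name, its three inputs, and the rule that "more than one occurrence" counts as recurrence are all hypothetical choices made for the example.

```python
from typing import List


def plan_actions(wedged: bool, occurrences: int, recurred_after_reset: bool) -> List[str]:
    """Map a GPU's fault state to the ordered steps described in the text.

    wedged: the GPU is unresponsive and did not self-recover.
    occurrences: how many times the Xid was seen on this GPU (threshold is an assumption).
    recurred_after_reset: the fault came back after a reset was already done.
    """
    if recurred_after_reset:
        # Fault survived a reset: take the GPU out of service and open an RMA.
        return ["cordon", "drain", "rma"]
    if wedged or occurrences > 1:
        # Stop new work, let running jobs finish, then reset the GPU.
        return ["cordon", "drain", "reset"]
    # Isolated and self-recovered: just keep an eye on it.
    return ["watch"]


print(plan_actions(wedged=False, occurrences=1, recurred_after_reset=False))
print(plan_actions(wedged=True, occurrences=1, recurred_after_reset=False))
print(plan_actions(wedged=False, occurrences=3, recurred_after_reset=True))
```

For GSP Xids specifically, a persistent case may call for a driver update or the GSP-firmware workaround rather than an RMA, so the last branch would need adjusting per Xid class.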
Mixed: transient occurrences are noise, while a persistent or fleet-wide pattern is signal. That makes Xid 119 and 120 a good case for correlating across the fleet before deciding.
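One way to picture that correlation is to count, within a time window, how many distinct GPUs reported Xid 119 or 120 and how often each one repeated. The sketch below is only an illustration: the `classify_fleet_pattern` name, the `(host, pci, xid)` event shape, and the `min_gpus` and `min_repeats` thresholds are assumptions chosen for the example, not recommended values.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def classify_fleet_pattern(
    events: Iterable[Tuple[str, str, int]],
    min_gpus: int = 3,      # hypothetical threshold for "fleet-wide"
    min_repeats: int = 2,   # hypothetical threshold for "persistent"
) -> str:
    """Classify GSP Xid events seen in one time window.

    events: (host, pci_address, xid) tuples.
    """
    per_gpu = defaultdict(int)
    for host, pci, xid in events:
        if xid in (119, 120):
            per_gpu[(host, pci)] += 1

    if len(per_gpu) >= min_gpus:
        return "fleet-wide"   # many GPUs affected: look at driver or firmware
    if any(count >= min_repeats for count in per_gpu.values()):
        return "persistent"   # same GPU keeps hitting it: candidate for reset
    if per_gpu:
        return "transient"    # isolated event: watch
    return "none"


window = [
    ("node-a", "0000:65:00", 119),
    ("node-a", "0000:65:00", 119),
    ("node-b", "0000:17:00", 79),  # unrelated Xid, ignored here
]
print(classify_fleet_pattern(window))
```

In practice the window length and thresholds would depend on fleet size and on how noisy a given driver and GPU generation is.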
That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.
Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.