Severity: watch (early signal)

Xid 92: High single-bit ECC error rate

Correctable single-bit errors above threshold — a rising rate predicts failure; trend it.

What it means

Xid 92 fires when the rate of correctable (single-bit) ECC errors crosses a threshold. The errors themselves are corrected and don't corrupt data, but a climbing single-bit rate is one of the best early predictors of an eventual uncorrectable failure.

Typical kernel-log signature

NVRM: Xid (PCI:0000:65:00): 92, pid='N/A', High single-bit ECC error rate
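
To trend this event rather than read it by eye, the log line can be parsed into fields. The following is a minimal sketch that assumes the line shape shown above. The names `XID_RE` and `parse_xid` are hypothetical, and other driver versions may format the line differently.

```python
import re

# Matches the NVRM Xid line shape shown above (assumed; formats vary by driver).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<detail>.*)"
)

def parse_xid(line):
    """Return the PCI address, Xid number, and trailing detail, or None if no match."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return {
        "pci": m.group("pci"),
        "xid": int(m.group("xid")),
        "detail": m.group("detail"),
    }

if __name__ == "__main__":
    sample = "NVRM: Xid (PCI:0000:65:00): 92, pid='N/A', High single-bit ECC error rate"
    print(parse_xid(sample))
```

Keying the result on the PCI address makes it straightforward to count Xid 92 events per GPU over time.
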

How to diagnose it

  1. `nvidia-smi -q -d ECC`: read the aggregate and volatile single-bit (correctable) counts.
  2. Track the rate over days; compare against fleet baseline.
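
The second step can be sketched as a rate calculation over periodically recorded aggregate correctable counts. This is an illustration only: it assumes you already collect timestamped counter samples per GPU, and the function names, the `factor` multiplier, and the baseline value are hypothetical choices rather than NVIDIA-defined thresholds.

```python
from datetime import datetime

def errors_per_day(samples):
    """samples: list of (datetime, aggregate_correctable_count), oldest first.

    Returns the errors-per-day rate for each consecutive pair of samples.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        days = (t1 - t0).total_seconds() / 86400
        if days <= 0 or c1 < c0:
            continue  # skip out-of-order samples or a counter that was reset
        rates.append((c1 - c0) / days)
    return rates

def above_baseline(rates, fleet_baseline_per_day, factor=3.0):
    """True if the most recent rate exceeds factor x the fleet baseline (assumed rule)."""
    return bool(rates) and rates[-1] > factor * fleet_baseline_per_day

if __name__ == "__main__":
    # Made-up sample values for illustration only.
    samples = [
        (datetime(2024, 1, 1), 10),
        (datetime(2024, 1, 2), 12),
        (datetime(2024, 1, 3), 20),
    ]
    rates = errors_per_day(samples)
    print(rates, above_baseline(rates, fleet_baseline_per_day=1.0))
```

Using the aggregate count rather than the volatile one is a design choice here: the volatile count resets on driver reload, which would show up as a counter drop.
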

What to do

Watch and trend. Plan a maintenance drain if the rate is rising sharply; no immediate action for a stable low rate.
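
One way to express this rule in code is to compare the recent half of a daily-rate series against the earlier half. This is a sketch under stated assumptions: `recommend`, `low_rate`, and `sharp_ratio` are hypothetical names and thresholds that you would tune against your own fleet.

```python
def recommend(daily_rates, low_rate=1.0, sharp_ratio=2.0):
    """Return "plan_drain" when the rate is rising sharply, otherwise "watch".

    daily_rates: correctable errors per day, oldest first.
    low_rate and sharp_ratio are illustrative thresholds, not vendor values.
    """
    if len(daily_rates) < 2:
        return "watch"  # not enough history to call a trend
    half = len(daily_rates) // 2
    earlier = sum(daily_rates[:half]) / half
    recent = sum(daily_rates[half:]) / (len(daily_rates) - half)
    if recent <= low_rate:
        return "watch"  # stable low rate: no immediate action
    if recent > sharp_ratio * max(earlier, 1e-9):
        return "plan_drain"  # rising sharply: schedule a maintenance drain
    return "watch"  # elevated but not accelerating: keep trending

if __name__ == "__main__":
    print(recommend([1, 1, 5, 9]))
    print(recommend([0.5, 0.4, 0.5, 0.6]))
```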

The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
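
The decision path above can be written as a small ordered sequence. This sketch only models the ordering. `Step` and `decision_path` are hypothetical names, and the actual cordon, drain, and reset actions depend on your scheduler and tooling.

```python
from enum import Enum

class Step(Enum):
    CORDON = "cordon"  # stop new work landing on the node
    DRAIN = "drain"    # let running jobs finish or move them off
    RESET = "reset"    # reset the GPU to clear or remap the fault
    RMA = "rma"        # return the hardware if the fault comes back

def decision_path(recurred_after_reset):
    """Return the ordered steps; RMA is added only when the fault recurs."""
    steps = [Step.CORDON, Step.DRAIN, Step.RESET]
    if recurred_after_reset:
        steps.append(Step.RMA)
    return steps

if __name__ == "__main__":
    print([s.value for s in decision_path(recurred_after_reset=True)])
```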

Signal or noise?

The highest-value 'quiet' signal — easy to ignore because nothing is broken yet, but the trend is what lets you drain on your schedule instead of losing a job. Trend it; don't page on each one.

That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.

Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.