Correctable single-bit errors above threshold — a rising rate predicts failure; trend it.
Xid 92 fires when the rate of correctable (single-bit) ECC errors crosses a threshold. The errors themselves are corrected and don't corrupt data, but a climbing single-bit rate is one of the best early predictors of an eventual uncorrectable failure.
NVRM: Xid (PCI:0000:65:00): 92, pid='N/A', High single-bit ECC error rate
Watch and trend. Plan a maintenance drain if the rate is rising sharply; no immediate action for a stable low rate.
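To trend these events you first need to pull them out of the kernel log. The following is a minimal sketch of a parser for a line shaped like the sample above. The regular expression is modeled only on that one sample, and the names `XID_RE`, `XidEvent`, and `parse_xid` are hypothetical; real driver output can differ by driver version, so treat the pattern as an assumption to check against your own logs.

```python
import re
from typing import NamedTuple, Optional

# Assumption: pattern derived from the single sample line shown above.
# Other driver versions may format the PCI address or trailing fields differently.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<detail>.*)"
)


class XidEvent(NamedTuple):
    pci: str      # PCI address as printed in the log line
    xid: int      # numeric Xid code, e.g. 92
    detail: str   # everything after the code, left unparsed


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return an XidEvent for a matching log line, or None if it does not match."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return XidEvent(
        pci=match.group("pci"),
        xid=int(match.group("xid")),
        detail=match.group("detail").strip(),
    )


if __name__ == "__main__":
    sample = "NVRM: Xid (PCI:0000:65:00): 92, pid='N/A', High single-bit ECC error rate"
    print(parse_xid(sample))
```

Counting parsed events per PCI address over time gives one possible input for trending; the detail string is kept whole because its layout is not guaranteed.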
The datacenter decision path for hardware-class Xids: cordon → drain → reset → RMA on recurrence. Cordon to stop new work landing, drain running jobs, reset the GPU to clear or remap the fault, and RMA if it comes back.
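The decision path above can be sketched as a small per-GPU tracker. This is an illustration only: `RemediationTracker` and its method are hypothetical names, the step strings are labels rather than real scheduler or driver commands, and "recurrence" is assumed here to mean any later fault on a GPU that has already been reset once. A production policy would likely add a time window.

```python
from collections import defaultdict
from typing import Dict, List


class RemediationTracker:
    """Tracks resets per GPU and returns the next steps for a hardware-class fault."""

    def __init__(self) -> None:
        # gpu_id -> number of resets already attempted for this GPU
        self._resets: Dict[str, int] = defaultdict(int)

    def on_hardware_fault(self, gpu_id: str) -> List[str]:
        # Always stop new work landing, then move running jobs off.
        steps = ["cordon", "drain"]
        if self._resets[gpu_id] == 0:
            # First occurrence: try a reset to clear or remap the fault.
            steps.append("reset")
            self._resets[gpu_id] += 1
        else:
            # Assumption: a fault after a previous reset counts as recurrence.
            steps.append("rma")
        return steps


if __name__ == "__main__":
    tracker = RemediationTracker()
    print(tracker.on_hardware_fault("gpu-0"))  # first fault on this GPU
    print(tracker.on_hardware_fault("gpu-0"))  # fault came back after the reset
```

Keeping the reset count per GPU is what separates a one-off fault from a recurring one; the tracker only decides the sequence and leaves execution to whatever tooling the fleet already uses.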
This is the highest-value 'quiet' signal. It is easy to ignore because nothing is broken yet, but the trend is what lets you drain on your schedule instead of losing a job. Trend it; don't page on each one.
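One way to trend the signal instead of paging on each event is to compare the recent error rate with the earlier rate over the same observation window. The sketch below assumes you already have samples of a cumulative correctable-error counter as `(unix_seconds, count)` pairs. The function names and the thresholds `rise_factor` and `min_rate` are hypothetical and would need tuning per fleet and GPU generation.

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (unix seconds, cumulative correctable-error count)


def rate_per_hour(samples: List[Sample]) -> float:
    """Average errors per hour between the first and last sample."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    # Clamp at zero: a counter that went backwards (e.g. after a reset)
    # should not be read as a negative error rate.
    return max(0.0, (c1 - c0) * 3600.0 / (t1 - t0))


def classify_trend(
    samples: List[Sample],
    rise_factor: float = 2.0,  # hypothetical: recent rate must be >= 2x the earlier rate
    min_rate: float = 1.0,     # hypothetical: ignore rates below 1 error per hour
) -> str:
    """Return 'plan_drain' for a sharply rising rate, otherwise 'watch'."""
    if len(samples) < 4:
        return "watch"  # too little history to call a trend
    mid = len(samples) // 2
    earlier = rate_per_hour(samples[: mid + 1])
    recent = rate_per_hour(samples[mid:])
    if recent >= min_rate and recent >= rise_factor * earlier:
        return "plan_drain"
    return "watch"


if __name__ == "__main__":
    stable = [(0, 0), (3600, 1), (7200, 2), (10800, 3), (14400, 4)]
    rising = [(0, 0), (3600, 1), (7200, 2), (10800, 10), (14400, 20)]
    print(classify_trend(stable), classify_trend(rising))
```

The two outcomes mirror the guidance above: a stable low rate stays on "watch", and only a sharp rise produces a drain recommendation, so no single event triggers a page.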
That judgment — absorb the noise, surface the few that are real — is what Plexus does by default on your existing DCGM / Prometheus / Thanos. See why a GPU fleet throws 50 alerts a week and only two matter.
Severity here is an operations judgment for datacenter fleets and depends on driver and GPU generation — always cross-check NVIDIA’s official Xid table.