Skip to content

fix(validator): fail closed on unreadable PAT store instead of wiping it#1486

Merged
entrius merged 4 commits into
testfrom
fix/pat-store-fail-closed
Jun 25, 2026
Merged

fix(validator): fail closed on unreadable PAT store instead of wiping it#1486
entrius merged 4 commits into
testfrom
fix/pat-store-fail-closed

Conversation

@anderdc

@anderdc anderdc commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Fixes #1481

Targets test. Verified against origin/test @ 1c813f5; full suite green (uv run pytest tests/ → 940 passed), ruff lint + format clean, pyright clean on changed files.

What this fixes

The root cause in #1481 (Proof 2): a single failed read of miner_pats.json silently and permanently erases every stored PAT.

_read_file() caught (json.JSONDecodeError, OSError) and returned [], so it could not tell "the store is empty" from "the read just failed" (a corrupt file, a transient I/O blip, a partial write). The next save_pat() read that [], upserted the one incoming PAT, and atomically overwrote the file — wiping every other miner's stored PAT. Those miners are then scored 0 by that validator every round (No stored PAT for miner {uid}, inspections.py:134) until each manually re-posts, with no error and no signal.

The change (validator-side, ~a handful of lines)

  • _read_file() fails loud — returns [] only when the file genuinely does not exist; raises on a corrupt/unreadable file.
  • save_pat() / get_pat_by_uid() fail closed — they propagate that error, so the write path can no longer mistake a failed read for an empty store and overwrite it. A failed read raises before any write; the existing store is left intact.
  • handle_pat_broadcast rejects gracefully — catches the store-read error and returns a clear "PAT store temporarily unavailable; please retry" rejection, so the miner retries instead of the axon surfacing a raw error.
  • load_all_pats() stays tolerant — it is the read-only scoring snapshot; an unhandled error there would stop the validator (the run loop's except sits outside while True) or zero everyone. It now logs loudly and returns [], so a transient read failure recovers on the next round and never wipes anything.

Why this is the appropriate fix

The store loss is the actual bug, and stopping the wipe protects every miner on every validator with no miner action — it doesn't ask honest miners to run a re-broadcast daemon to paper over validator-side data loss. It applies the same principle already adopted in #931 / #932 / #1107: a transient condition must not produce a permanent wrong outcome. Failing closed and loud also surfaces a genuinely corrupt store to the operator instead of silently degrading it down to a single entry.

Tests

tests/validator/test_pat_storage.py (#1481 Proof 2 → regression coverage):

  • a transient read error makes save_pat raise and leaves the on-disk store fully intact (all prior entries survive; the incoming UID is never written);
  • a corrupt store is left as-is on save, not overwritten down to one entry.

tests/validator/test_pat_handler.py:

  • an unreadable store makes handle_pat_broadcast reject (accepted=False, "temporarily unavailable") rather than wiping the store or raising.
uv run pytest tests/   -> 940 passed
ruff check / format    -> clean
pyright (changed)      -> 0 errors

Follow-up (not in this PR)

This closes the code path that lets a well-configured validator wipe itself. It does not cover PAT loss from a validator restarting without a persistent ./data volume, or stuck in a crash loop — those are operator config, not code. The follow-up is to reach out to the validators currently missing coverage (the vali-116 / rt21-64 cases in #1481) to confirm they mount ./data:/app/data and aren't crash-looping.

A single failed read of miner_pats.json (corrupt file or transient I/O)
made _read_file return [], and the next save_pat upserted into that []
and atomically overwrote the store — erasing every other miner's stored
PAT, who were then silently scored 0 until each re-posted.

_read_file now raises on a corrupt/unreadable file (returns [] only when
the file genuinely does not exist). save_pat/get_pat_by_uid propagate, so
the write path fails closed and the broadcast handler rejects (miner
retries) instead of overwriting. load_all_pats stays tolerant so a
transient read failure cannot crash the scoring round.

Fixes #1481
anderdc and others added 3 commits June 24, 2026 16:44
handle_pat_check read the PAT store via get_pat_by_uid without guarding
it, so once get_pat_by_uid was changed to propagate read errors, a
momentarily unreadable store would throw a raw axon error out of the
check handler. Mirror handle_pat_broadcast: catch the store-read error
and report has_pat=False / pat_valid=None with a retry message, instead
of throwing or misreporting 'no PAT stored' (which would push the miner
to needlessly re-broadcast against an unreadable store).
@entrius entrius merged commit 309de8d into test Jun 25, 2026
3 checks passed
@entrius entrius deleted the fix/pat-store-fail-closed branch June 25, 2026 20:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] gitt miner post coverage silently and permanently erodes after validator restarts — valid miners are de-scored with no signal and no recovery

3 participants