fix(validator): fail closed on unreadable PAT store instead of wiping it by anderdc · Pull Request #1486 · entrius/gittensor

anderdc · 2026-06-16T17:47:52Z

Targets test. Verified against origin/test @ 1c813f5; full suite green (uv run pytest tests/ → 940 passed), ruff lint + format clean, pyright clean on changed files.

What this fixes

The root cause in #1481 (Proof 2): a single failed read of miner_pats.json silently and permanently erases every stored PAT.

_read_file() caught (json.JSONDecodeError, OSError) and returned [], so it could not tell "the store is empty" from "the read just failed" (a corrupt file, a transient I/O blip, a partial write). The next save_pat() read that [], upserted the one incoming PAT, and atomically overwrote the file — wiping every other miner's stored PAT. Those miners are then scored 0 by that validator every round (No stored PAT for miner {uid}, inspections.py:134) until each manually re-posts, with no error and no signal.

The change (validator-side, ~a handful of lines)

_read_file() fails loud — returns [] only when the file genuinely does not exist; raises on a corrupt/unreadable file.
save_pat() / get_pat_by_uid() fail closed — they propagate that error, so the write path can no longer mistake a failed read for an empty store and overwrite it. A failed read raises before any write; the existing store is left intact.
handle_pat_broadcast rejects gracefully — catches the store-read error and returns a clear "PAT store temporarily unavailable; please retry" rejection, so the miner retries instead of the axon surfacing a raw error.
load_all_pats() stays tolerant — it is the read-only scoring snapshot; an unhandled error there would stop the validator (the run loop's except sits outside while True) or zero everyone. It now logs loudly and returns [], so a transient read failure recovers on the next round and never wipes anything.

Why this is the appropriate fix

The store loss is the actual bug, and stopping the wipe protects every miner on every validator with no miner action — it doesn't ask honest miners to run a re-broadcast daemon to paper over validator-side data loss. It applies the same principle already adopted in #931 / #932 / #1107: a transient condition must not produce a permanent wrong outcome. Failing closed and loud also surfaces a genuinely corrupt store to the operator instead of silently degrading it down to a single entry.

Tests

tests/validator/test_pat_storage.py (#1481 Proof 2 → regression coverage):

a transient read error makes save_pat raise and leaves the on-disk store fully intact (all prior entries survive; the incoming UID is never written);
a corrupt store is left as-is on save, not overwritten down to one entry.

tests/validator/test_pat_handler.py:

an unreadable store makes handle_pat_broadcast reject (accepted=False, "temporarily unavailable") rather than wiping the store or raising.

uv run pytest tests/   -> 940 passed
ruff check / format    -> clean
pyright (changed)      -> 0 errors

Follow-up (not in this PR)

This closes the code path that lets a well-configured validator wipe itself. It does not cover PAT loss from a validator restarting without a persistent ./data volume, or stuck in a crash loop — those are operator config, not code. The follow-up is to reach out to the validators currently missing coverage (the vali-116 / rt21-64 cases in #1481) to confirm they mount ./data:/app/data and aren't crash-looping.

A single failed read of miner_pats.json (corrupt file or transient I/O) made _read_file return [], and the next save_pat upserted into that [] and atomically overwrote the store — erasing every other miner's stored PAT, who were then silently scored 0 until each re-posted. _read_file now raises on a corrupt/unreadable file (returns [] only when the file genuinely does not exist). save_pat/get_pat_by_uid propagate, so the write path fails closed and the broadcast handler rejects (miner retries) instead of overwriting. load_all_pats stays tolerant so a transient read failure cannot crash the scoring round. Fixes #1481

handle_pat_check read the PAT store via get_pat_by_uid without guarding it, so once get_pat_by_uid was changed to propagate read errors, a momentarily unreadable store would throw a raw axon error out of the check handler. Mirror handle_pat_broadcast: catch the store-read error and report has_pat=False / pat_valid=None with a retry message, instead of throwing or misreporting 'no PAT stored' (which would push the miner to needlessly re-broadcast against an unreadable store).

anderdc mentioned this pull request Jun 16, 2026

feat(cli): add gitt miner ensure and retry post broadcasts so PAT coverage self-heals #1482

Closed

anderdc and others added 3 commits June 24, 2026 16:44

Merge branch 'test' into fix/pat-store-fail-closed

673ab28

Merge branch 'test' into fix/pat-store-fail-closed

6890d98

entrius approved these changes Jun 25, 2026

View reviewed changes

entrius merged commit 309de8d into test Jun 25, 2026
3 checks passed

entrius deleted the fix/pat-store-fail-closed branch June 25, 2026 20:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(validator): fail closed on unreadable PAT store instead of wiping it#1486

fix(validator): fail closed on unreadable PAT store instead of wiping it#1486
entrius merged 4 commits into
testfrom
fix/pat-store-fail-closed

anderdc commented Jun 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

anderdc commented Jun 16, 2026

What this fixes

The change (validator-side, ~a handful of lines)

Why this is the appropriate fix

Tests

Follow-up (not in this PR)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants