feat(cli): add gitt miner ensure and retry post broadcasts so PAT coverage self-heals #1482
JSONbored wants to merge 1 commit
…overage self-heals

A single `gitt miner post` is one-shot and best-effort: a validator briefly unreachable during the broadcast is silently dropped (no retry, still reported as success), and once any validator loses its stored PAT (e.g. a restart without persistent storage) nothing restores coverage. The miner is silently scored 0 by that validator until a manual re-post.

- post: retry validators that don't respond (--retries, default 2) so a transient blip isn't a silent, permanent coverage gap.
- new `gitt miner ensure`: probe coverage, then re-broadcast the PAT ONLY to validators currently missing it (no PAT sent to ones that already have it). Cheap to cron; exits non-zero if any reachable validator is still uncovered. --watch SECONDS runs it as a self-healing loop (re-syncs the metagraph each round) so coverage recovers without external cron.
- factor shared _probe_pat / _broadcast_pat_with_retry helpers; add regression tests.
Thanks for the detailed writeup and reproductions in #1481. The diagnosis and telemetry are solid and were what we worked from.

We're going a different direction on the fix, though. This PR makes the recovery reliable but leaves the cause in place: the store still wipes on a single failed read (your own Proof 2), and the remedy here is for every honest miner to keep re-asserting coverage … We'd rather stop the wipe at the source. #1486 makes the validator PAT store fail closed on an unreadable read instead of overwriting it, which protects every miner on every validator with no miner-side action, and turns your Proof 2 into a regression test. The remaining loss vectors (restart without a persistent …

Closing.
Fixes #1481
Summary
A miner's PAT coverage silently and permanently erodes. `gitt miner post` is a one-shot, best-effort broadcast: a validator briefly unreachable during it is silently dropped, with no retry, and the command still reports success. And once any validator loses its stored PAT (e.g. it restarts without persistent `./data`), there is no mechanism to restore coverage: the miner is scored 0 by that validator every round until they remember to manually re-post. (Root cause, live evidence, and two runnable reproductions in #1481.)

This makes coverage reliable and self-healing, miner-side only:

- `gitt miner post` now retries validators that don't respond (`--retries`, default 2), so a transient blip during the broadcast isn't a silent, permanent coverage gap. An explicit accept/reject is final; only no-responses are retried.
- `gitt miner ensure` probes current coverage, then re-broadcasts the PAT only to validators that are missing a valid one (no PAT is sent to validators that already have it). It's cheap and non-spammy, safe to run on a schedule (cron), and exits non-zero if any reachable validator is still uncovered. Running it after a validator restart restores coverage automatically instead of silently de-scoring the miner.
- `--watch SECONDS` runs it as a self-healing loop (re-syncing the metagraph each round to pick up validators that (re)join), so coverage recovers without external cron.
- `_probe_pat` / `_broadcast_pat_with_retry` helpers; `post` is refactored onto the latter (net −39 lines there).

This aligns with the existing "miner must re-post" remedy (surfaced in `inspections.py` when a validator has no stored PAT) by making it reliable and automatable, and with the transient-failure resilience already adopted in #931 / #932 / #1107 (a transient/operational event must not cause a permanent wrong outcome).

Behavior
`gitt miner ensure --json` (validator UID 1 was missing; UIDs 0 and 2 already had a valid PAT):

    {
      "success": true,
      "total_validators": 3,
      "already_valid": 2,
      "reposted": 1,
      "now_valid": 3,
      "still_missing": [],
      "results": [ ... ],
      "skipped": []
    }

Only the missing validator is re-sent the PAT; the two that already had it are untouched. Exits non-zero (for cron/alerting) if any reachable validator stays uncovered.
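The probe-then-repost cycle behind that output can be sketched roughly as follows. This is a minimal, synchronous illustration and not the PR's implementation: `ensure_coverage`, `probe`, and `repost` are hypothetical names, the real `_probe_pat` / `_broadcast_pat_with_retry` helpers are not reproduced, and the meanings given to `reposted` (accepted re-posts) and `skipped` (unreachable validators) are assumptions inferred from the field names.

```python
from typing import Callable, Dict, Iterable, List, Optional


def ensure_coverage(
    validator_uids: Iterable[int],
    probe: Callable[[int], Optional[bool]],
    repost: Callable[[int], bool],
) -> Dict[str, object]:
    """Run one coverage cycle (hypothetical sketch).

    probe(uid)  -> True (valid PAT stored), False (missing), None (unreachable).
    repost(uid) -> True if the validator accepted the re-sent PAT.
    """
    already_valid: List[int] = []
    missing: List[int] = []
    skipped: List[int] = []  # assumption: unreachable validators land here

    for uid in validator_uids:
        state = probe(uid)
        if state is None:
            skipped.append(uid)
        elif state:
            already_valid.append(uid)
        else:
            missing.append(uid)

    # The PAT is sent only to validators that reported it missing.
    results = [{"uid": uid, "accepted": repost(uid)} for uid in missing]
    still_missing = [r["uid"] for r in results if not r["accepted"]]
    reposted = len(results) - len(still_missing)

    return {
        "success": not still_missing,
        "total_validators": len(already_valid) + len(missing) + len(skipped),
        "already_valid": len(already_valid),
        "reposted": reposted,
        "now_valid": len(already_valid) + reposted,
        "still_missing": still_missing,
        "results": results,
        "skipped": skipped,
    }


# Usage: UID 1 lost its PAT; UIDs 0 and 2 still hold a valid one.
stored = {0: True, 1: False, 2: True}
sent: List[int] = []


def fake_repost(uid: int) -> bool:
    sent.append(uid)
    stored[uid] = True
    return True


summary = ensure_coverage([0, 1, 2], stored.get, fake_repost)
# sent == [1]; summary["success"] is True; a CLI wrapper could then
# exit non-zero whenever summary["still_missing"] is non-empty.
```

Keeping the probe and the re-post as separate steps is what lets the cycle skip validators that already hold a valid PAT, which is why scheduling it frequently stays cheap.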
Tests
`tests/cli/test_miner_commands.py`:

- `_broadcast_pat_with_retry`: retries only no-response validators; an explicit rejection is final (not retried); a persistent no-response is surfaced in the result, not silently dropped.
- `gitt miner ensure`: re-broadcasts to exactly the validators missing a PAT and never to the ones that have it; exits non-zero when a validator stays uncovered.
- `--watch`: loops the coverage cycle, re-syncs the metagraph between rounds, and exits cleanly on `KeyboardInterrupt`.
- `post` JSON-envelope test: its dendrite mock is now axon-aware (a no-response validator stays no-response across retries; previously the mock returned fixed responses regardless of axons).

Notes

- … test.
- Miner-side CLI only: no validator or contract changes.
- Running `ensure` on a schedule is cheap: it sends no PAT to validators that already hold a valid one.
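The retry rule exercised by the first test above (retry only no-responses, treat an explicit accept or reject as final, surface a persistent no-response) can be sketched as below. This is a simplified synchronous stand-in under stated assumptions: `broadcast_with_retry` and `send` are hypothetical names, and the actual `_broadcast_pat_with_retry` helper, which talks to validators over the network, may be structured differently.

```python
from typing import Callable, Dict, Iterable, Optional


def broadcast_with_retry(
    uids: Iterable[int],
    send: Callable[[int], Optional[bool]],
    retries: int = 2,
) -> Dict[int, Optional[bool]]:
    """Send to every validator, retrying only the ones that did not respond.

    send(uid) -> True (accepted), False (rejected), None (no response).
    A value of None in the returned mapping means the validator never answered.
    """
    outcome: Dict[int, Optional[bool]] = {uid: None for uid in uids}
    pending = list(outcome)

    # One initial attempt plus up to `retries` further attempts.
    for _attempt in range(1 + retries):
        if not pending:
            break
        for uid in pending:
            outcome[uid] = send(uid)
        # Explicit accept/reject is final; only no-responses stay pending.
        pending = [uid for uid in pending if outcome[uid] is None]

    return outcome


# Usage: UID 0 accepts, UID 1 rejects, UID 2 answers on its second
# attempt, UID 3 never answers.
calls = {0: 0, 1: 0, 2: 0, 3: 0}


def fake_send(uid: int) -> Optional[bool]:
    calls[uid] += 1
    if uid == 0:
        return True
    if uid == 1:
        return False
    if uid == 2:
        return True if calls[uid] >= 2 else None
    return None


result = broadcast_with_retry([0, 1, 2, 3], fake_send, retries=2)
# result == {0: True, 1: False, 2: True, 3: None}
# calls  == {0: 1, 1: 1, 2: 2, 3: 3}
```

Leaving the unanswered validator in the result as `None`, instead of dropping it, is what keeps a persistent outage visible to the caller.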