feat(cli): add gitt miner ensure and retry post broadcasts so PAT coverage self-heals #1482

Closed
JSONbored wants to merge 1 commit into entrius:test from JSONbored:fix/pat-coverage-recovery

Conversation

@JSONbored

Contributor

Fixes #1481

Targets test. Verified against origin/test @ 1c813f5 (2026-06-15); full suite green (uv run pytest tests/ → 945 passed), ruff lint + format clean.

Summary

A miner's PAT coverage silently and permanently erodes. gitt miner post is a one-shot, best-effort broadcast: a validator briefly unreachable during it is silently dropped — no retry, and the command still reports success. And once any validator loses its stored PAT (e.g. it restarts without persistent ./data), there is no mechanism to restore coverage — the miner is scored 0 by that validator every round until they remember to manually re-post. (Root cause, live evidence, and two runnable reproductions in #1481.)

This makes coverage reliable and self-healing, miner-side only:

  • gitt miner post now retries validators that don't respond (--retries, default 2), so a transient blip during the broadcast isn't a silent, permanent coverage gap. An explicit accept/reject is final; only no-responses are retried.
  • New gitt miner ensure — probes current coverage, then re-broadcasts the PAT only to validators that are missing a valid one (no PAT is sent to validators that already have it). It's cheap and non-spammy, safe to run on a schedule (cron), and exits non-zero if any reachable validator is still uncovered. Running it after a validator restart restores coverage automatically instead of silently de-scoring the miner. --watch SECONDS runs it as a self-healing loop (re-syncing the metagraph each round to pick up validators that (re)join), so coverage recovers without external cron.
  • Shared _probe_pat / _broadcast_pat_with_retry helpers; post is refactored onto the latter (net −39 lines there).
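
The retry rule in the first bullet can be sketched as follows. This is a minimal illustration, not the PR's `_broadcast_pat_with_retry`: the function name, the `send` callback, and the three-valued response (`True` accepted, `False` rejected, `None` no response) are all assumptions made for the example, as is reading `--retries 2` as "one initial attempt plus up to two retries".

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical response convention (not from the PR):
#   True  -> validator explicitly accepted the PAT
#   False -> validator explicitly rejected the PAT
#   None  -> validator did not respond
Send = Callable[[int], Optional[bool]]


def broadcast_with_retry(
    uids: Iterable[int], send: Send, retries: int = 2
) -> Dict[int, Optional[bool]]:
    """Send to every uid, then re-send only to uids that gave no response."""
    results: Dict[int, Optional[bool]] = {uid: None for uid in uids}
    pending = list(results)
    # One initial attempt plus up to `retries` further attempts.
    for _attempt in range(retries + 1):
        if not pending:
            break
        for uid in pending:
            results[uid] = send(uid)
        # An explicit accept or reject is final; only no-responses stay pending.
        pending = [uid for uid in pending if results[uid] is None]
    # A uid still mapped to None is reported to the caller, not dropped.
    return results
```

A validator that never answers stays in the result with `None`, so the caller can surface it instead of reporting an unqualified success.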

This aligns with the existing "miner must re-post" remedy (surfaced in inspections.py when a validator has no stored PAT) by making it reliable and automatable, and with the transient-failure-resilience already adopted in #931 / #932 / #1107 (a transient/operational event must not cause a permanent wrong outcome).

Behavior

gitt miner ensure --json (validator UID 1 was missing; UIDs 0 and 2 already had a valid PAT):

{ "success": true, "total_validators": 3, "already_valid": 2,
  "reposted": 1, "now_valid": 3, "still_missing": [], "results": [ ... ], "skipped": [] }

Only the missing validator is re-sent the PAT; the two that already had it are untouched. Exits non-zero (for cron/alerting) if any reachable validator stays uncovered.
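
One way to read the envelope above is as the result of a probe-then-repost cycle. The sketch below is an illustration under stated assumptions, not the PR's code: `ensure_coverage`, `probe`, `post`, and `exit_code` are hypothetical names, the field meanings are inferred from the example output, and the elided `results` field is left out.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical callbacks (not from the PR):
#   probe(uid) -> True (valid PAT stored), False (missing or invalid), None (unreachable)
#   post(uid)  -> True (accepted), False (rejected), None (no response)
Probe = Callable[[int], Optional[bool]]
Post = Callable[[int], Optional[bool]]


def ensure_coverage(uids: Iterable[int], probe: Probe, post: Post) -> dict:
    """Probe every validator, then re-post only to those missing a valid PAT."""
    before: Dict[int, Optional[bool]] = {uid: probe(uid) for uid in uids}
    skipped: List[int] = [uid for uid, ok in before.items() if ok is None]
    missing: List[int] = [uid for uid, ok in before.items() if ok is False]
    already_valid = sum(1 for ok in before.values() if ok is True)

    # Validators that already hold a valid PAT are never sent one.
    accepted = [uid for uid in missing if post(uid) is True]
    still_missing = [uid for uid in missing if uid not in accepted]

    return {
        "success": not still_missing,
        "total_validators": len(before),
        "already_valid": already_valid,
        "reposted": len(accepted),
        "now_valid": already_valid + len(accepted),
        "still_missing": still_missing,
        "skipped": skipped,
    }


def exit_code(summary: dict) -> int:
    """Non-zero when any reachable validator is still uncovered (for cron/alerting)."""
    return 0 if summary["success"] else 1
```

With UIDs 0 and 2 probing as valid and UID 1 as missing, this reproduces the counts in the example: `already_valid` 2, `reposted` 1, `now_valid` 3, and an empty `still_missing`.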

Tests

tests/cli/test_miner_commands.py:

  • _broadcast_pat_with_retry: retries only no-response validators; an explicit rejection is final (not retried); a persistent no-response is surfaced in the result, not silently dropped.
  • gitt miner ensure: re-broadcasts to exactly the validators missing a PAT and never to the ones that have it; exits non-zero when a validator stays uncovered.
  • --watch: loops the coverage cycle, re-syncs the metagraph between rounds, and exits cleanly on KeyboardInterrupt.
  • Updated the existing post JSON-envelope test so its dendrite mock is axon-aware (a no-response validator stays no-response across retries — previously the mock returned fixed responses regardless of axons).

uv run pytest tests/      -> 945 passed
ruff check / ruff format  -> clean   (vulture clean on changed files)
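
The `--watch` behaviour covered by the third test bullet can be sketched as a loop that re-syncs, runs one coverage cycle, sleeps, and stops quietly on Ctrl-C. The names `watch`, `run_cycle`, and `resync` are hypothetical, and injecting `sleep` is a choice made here so the loop can be tested without waiting; the PR's actual structure may differ.

```python
import time
from typing import Callable


def watch(
    run_cycle: Callable[[], None],
    resync: Callable[[], None],
    interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run coverage cycles until interrupted; return the number of completed rounds."""
    rounds = 0
    try:
        while True:
            resync()      # refresh the validator set so (re)joined validators are seen
            run_cycle()   # one probe-and-repost pass
            rounds += 1
            sleep(interval)
    except KeyboardInterrupt:
        # Ctrl-C ends the loop cleanly instead of surfacing a traceback.
        pass
    return rounds
```

In a test, passing a `sleep` stub that raises `KeyboardInterrupt` after a few calls exercises both the looping and the clean exit.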

Notes

  • Targets test. Miner-side CLI only — no validator or contract changes.
  • Re-running ensure on a schedule is cheap: it sends no PAT to validators that already hold a valid one.

…overage self-heals

A single `gitt miner post` is one-shot and best-effort: a validator briefly
unreachable during the broadcast is silently dropped (no retry, still reported as
success), and once any validator loses its stored PAT (e.g. a restart without
persistent storage) nothing restores coverage — the miner is silently scored 0 by
it until a manual re-post.

- post: retry validators that don't respond (--retries, default 2) so a transient
  blip isn't a silent, permanent coverage gap.
- new `gitt miner ensure`: probe coverage, then re-broadcast the PAT ONLY to
  validators currently missing it (no PAT sent to ones that already have it). Cheap
  to cron; exits non-zero if any reachable validator is still uncovered. --watch
  SECONDS runs it as a self-healing loop (re-syncs the metagraph each round) so
  coverage recovers without external cron.
- factor shared _probe_pat / _broadcast_pat_with_retry helpers; add regression tests.

@xiao-xiao-mao (bot) added the feature (Net-new functionality) label on Jun 16, 2026

@anderdc commented on Jun 16, 2026

Collaborator

Thanks for the detailed writeup and reproductions in #1481 — the diagnosis and telemetry are solid and were what we worked from.

We're going a different direction on the fix, though. This PR makes the recovery reliable but leaves the cause in place: the store still wipes on a single failed read (your own Proof 2), and the remedy here is for every honest miner to keep re-asserting coverage — ensure --watch is effectively a standing miner daemon, which cuts against the "no miner neuron required" design and shifts the operational burden onto miners to paper over validator-side data loss. So the underlying bug stays open and miners are still silently de-scored between re-broadcasts.

We'd rather stop the wipe at the source. #1486 makes the validator PAT store fail closed on an unreadable read instead of overwriting it, which protects every miner on every validator with no miner-side action — and turns your Proof 2 into a regression test. The remaining loss vectors (restart without a persistent ./data, crash loops) are operator config, which we're handling by reaching out to the affected validators directly rather than in code.
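
For readers unfamiliar with the term, "fail closed on an unreadable read" can be illustrated with a small file-backed store. This is only a sketch of the general pattern, not the code in #1486: the JSON file format and the names `StoreUnreadable`, `load_store`, and `save_entry` are assumptions made for the example.

```python
import json
from pathlib import Path


class StoreUnreadable(Exception):
    """Raised when the store file exists but cannot be read or parsed."""


def load_store(path: Path) -> dict:
    if not path.exists():
        return {}  # genuinely empty: nothing has been stored yet
    try:
        data = json.loads(path.read_text())
    except (OSError, ValueError) as exc:  # JSONDecodeError is a ValueError
        # Fail closed: an unreadable file is an error, not an empty store.
        raise StoreUnreadable(str(path)) from exc
    if not isinstance(data, dict):
        raise StoreUnreadable(str(path))
    return data


def save_entry(path: Path, uid: int, record: dict) -> None:
    # If the read fails, this raises before anything is written,
    # so existing entries are never overwritten by a partial view.
    store = load_store(path)
    store[str(uid)] = record
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(store))
    tmp.replace(path)  # swap in the new file only after a full write
```

The failure mode being avoided is treating a failed read as `{}` and then saving, which would replace every stored entry with just the new one.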

Closing.

@anderdc closed this on Jun 16, 2026
@JSONbored deleted the fix/pat-coverage-recovery branch on June 18, 2026 07:42
Development

Successfully merging this pull request may close these issues.

[Bug] gitt miner post coverage silently and permanently erodes after validator restarts — valid miners are de-scored with no signal and no recovery

2 participants