fix: retry quorum-short chunks in cli merkle upload path#105
Conversation
Hermes quick review — PR #105Verdict: one blocker to resolve/check before RC merge. The retry refactor is generally sound for the intended Blocker
This looks like a regression from the old direct path, where a missing proof / non-quorum error was surfaced as Suggested fix: keep non-quorum errors non-retryable, but convert them at the Non-blocking notes
Given the PR is targeting RC, I’d hold merge until the partial-proof behaviour is either fixed or explicitly ruled impossible for this path. |
The CLI `ant file upload` direct merkle path (`upload_waves_merkle`) aborted the entire file the instant one chunk's store returned `Err`, with no retry and no convergence wait. A chunk that was only transiently short of quorum (a few close-group peers' routing tables briefly disagreeing, or momentarily full) would kill a multi-hundred-chunk upload — observed in the PROD-UL-01 failures where a single unstoreable chunk aborted uploads at 29/367 and 163/479. The external-signer path already handled this correctly via `merkle_store_with_retry`. This change makes the CLI path reuse that same machinery so transient quorum shortfalls are retried instead of aborting. - Make `merkle_store_with_retry` and the `MERKLE_STORE_MAX_ATTEMPTS` / `MERKLE_RETRY_BACKOFF` constants `pub(crate)` so the CLI path can call them - Add `failed_addresses` to `MerkleStoreOutcome` and carry each chunk's last quorum-shortfall message through retry rounds, so the CLI path can build a faithful `PartialUpload` - Rewrite `upload_waves_merkle` to drive each wave through `merkle_store_with_retry` (per-wave, preserving the streaming memory bound), aggregate failures across waves, and return `PartialUpload` only after retries are exhausted — never silently succeeding with missing chunks - Pre-partition addresses by proof so chunks left without a proof by a partial `pay_for_merkle_multi_batch` result (a later sub-batch's payment failed) are reported as failed via `PartialUpload` rather than aborting the whole file with a generic error — preserving the resumable partial-upload contract and every already-stored chunk - Convert any residual non-quorum store error at the `upload_waves_merkle` boundary into `PartialUpload` so earlier waves' stored chunks survive for resume - Non-`InsufficientPeers` errors stay non-retryable, unchanged Tests: `cargo test -p ant-core --lib` for the new merkle + file unit tests (failed_addresses contract; partition-by-proof boundary), `cargo clippy --all-targets --all-features -- -D warnings` clean, `cargo fmt --all -- --check` clean. Closes V2-413 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
9c5caad to
415b324
Compare
|
Thanks @dirvine — good catch, the blocker is valid. Fixed in the amended commit:
On the non-blocking note about
|
Hermes follow-up reviewThanks — I re-checked the amended commit ( What changed correctly:
Local checks I ran on the amended commit:
CI at the time of this follow-up: format, clippy, unit tests, build, docs, and security audit are green; E2E and Merkle E2E jobs are still pending. Verdict: my previous blocker is cleared. I have no further code-review blockers from this diff; I’d just wait for the pending E2E/Merkle E2E jobs before RC merge. |
Summary
The CLI
ant file uploaddirect merkle path (upload_waves_merkle) aborted the entire file the instant one chunk's store returnedErr, with no retry and no convergence wait. A chunk that was only transiently short of quorum (a few close-group peers' routing tables briefly disagreeing, or momentarily full) would kill a multi-hundred-chunk upload — observed in the PROD-UL-01 failures (2026-06-01) where a single unstoreable chunk aborted uploads at 29/367 and 163/479.The external-signer path already handled this correctly via
merkle_store_with_retry. This PR makes the CLI path reuse that same machinery (Fix D of the PROD-UL-01 set; client-side, independent of A/C).merkle_store_with_retryand theMERKLE_STORE_MAX_ATTEMPTS/MERKLE_RETRY_BACKOFFconstantspub(crate)so the CLI path can call them.failed_addressestoMerkleStoreOutcomeand carry each chunk's last quorum-shortfall message through retry rounds, so the CLI path can build a faithfulPartialUpload.upload_waves_merkleto drive each wave throughmerkle_store_with_retry(per-wave, preserving the streamingUPLOAD_WAVE_SIZEmemory bound), aggregate failures across waves, and returnPartialUploadonly after retries are exhausted — never silently succeeding with missing chunks.InsufficientPeerserrors (e.g. a missing proof) stay fatal and abort immediately, unchanged.Reuses the existing constants (
MERKLE_STORE_MAX_ATTEMPTS= 4,MERKLE_RETRY_BACKOFF= 30s ±10% jitter).Test plan
cargo test -p ant-core --lib merkle— 40 passed, including 2 new tests covering thefailed_addressescontract (populated on exhaustion with messages; empty on full success).merkle_store_with_retrytests already cover the three issue scenarios: round-1-fail/round-2-success → completes; all-rounds-fail → reported failed with others stored; non-quorum error → immediate abort.cargo clippy --all-targets --all-features -- -D warnings— clean.cargo fmt --all -- --check— clean.Closes V2-413
🤖 Generated with Claude Code