Skip to content

fix: retry quorum-short chunks in cli merkle upload path#105

Merged
jacderida merged 1 commit into
WithAutonomi:rc-2026.5.5from
jacderida:feat-retry_quorum_short_chunks
Jun 2, 2026
Merged

fix: retry quorum-short chunks in cli merkle upload path#105
jacderida merged 1 commit into
WithAutonomi:rc-2026.5.5from
jacderida:feat-retry_quorum_short_chunks

Conversation

@jacderida
Copy link
Copy Markdown
Contributor

Summary

The CLI ant file upload direct merkle path (upload_waves_merkle) aborted the entire file the instant one chunk's store returned Err, with no retry and no convergence wait. A chunk that was only transiently short of quorum (a few close-group peers' routing tables briefly disagreeing, or momentarily full) would kill a multi-hundred-chunk upload — observed in the PROD-UL-01 failures (2026-06-01) where a single unstoreable chunk aborted uploads at 29/367 and 163/479.

The external-signer path already handled this correctly via merkle_store_with_retry. This PR makes the CLI path reuse that same machinery (Fix D of the PROD-UL-01 set; client-side, independent of A/C).

  • Make merkle_store_with_retry and the MERKLE_STORE_MAX_ATTEMPTS / MERKLE_RETRY_BACKOFF constants pub(crate) so the CLI path can call them.
  • Add failed_addresses to MerkleStoreOutcome and carry each chunk's last quorum-shortfall message through retry rounds, so the CLI path can build a faithful PartialUpload.
  • Rewrite upload_waves_merkle to drive each wave through merkle_store_with_retry (per-wave, preserving the streaming UPLOAD_WAVE_SIZE memory bound), aggregate failures across waves, and return PartialUpload only after retries are exhausted — never silently succeeding with missing chunks.
  • Non-InsufficientPeers errors (e.g. a missing proof) stay fatal and abort immediately, unchanged.

Reuses the existing constants (MERKLE_STORE_MAX_ATTEMPTS = 4, MERKLE_RETRY_BACKOFF = 30s ±10% jitter).

Test plan

  • cargo test -p ant-core --lib merkle — 40 passed, including 2 new tests covering the failed_addresses contract (populated on exhaustion with messages; empty on full success).
  • Existing merkle_store_with_retry tests already cover the three issue scenarios: round-1-fail/round-2-success → completes; all-rounds-fail → reported failed with others stored; non-quorum error → immediate abort.
  • cargo clippy --all-targets --all-features -- -D warnings — clean.
  • cargo fmt --all -- --check — clean.

Closes V2-413

🤖 Generated with Claude Code

@dirvine
Copy link
Copy Markdown

dirvine commented Jun 1, 2026

Hermes quick review — PR #105

Verdict: one blocker to resolve/check before RC merge.

The retry refactor is generally sound for the intended InsufficientPeers case, and the focused local test suite passes (cargo test -p ant-core --lib merkle --no-fail-fast: 40 passed). CI is mostly green so far, but the E2E / Merkle E2E jobs are still pending at the time of review.

Blocker

  • upload_waves_merkle now propagates non-quorum failures from merkle_store_with_retry directly via ? (file.rs around the merkle_store_with_retry(...).await? call). That changes the existing partial-upload behaviour for the expected partial-paid-batch/missing-proof path:
    • pay_for_merkle_multi_batch can intentionally return a partial MerkleBatchPaymentResult containing proofs only for already-paid sub-batches if a later sub-batch payment fails.
    • The upload path then still iterates all merkle_plan.to_upload addresses. When it reaches an address without a proof, store_one returns Error::Payment("Missing merkle proof ...").
    • Because that is not InsufficientPeers, merkle_store_with_retry aborts immediately, and upload_waves_merkle returns the generic payment error rather than Error::PartialUpload with the chunks already stored / still failed. Any chunks successfully stored earlier in the current wave also are not added to stored_addresses before the error propagates.

This looks like a regression from the old direct path, where a missing proof / non-quorum error was surfaced as PartialUpload with stored addresses and failed chunk details.

Suggested fix: keep non-quorum errors non-retryable, but convert them at the upload_waves_merkle boundary into the same PartialUpload shape, preserving successes already observed in the current wave. Alternatively, if partial merkle payment results are no longer meant to be supported, the payment path should stop returning partial Ok(MerkleBatchPaymentResult) and abort before storage begins.

Non-blocking notes

  • Direct CLI merkle progress now goes through try_send inside merkle_store_with_retry; the previous direct path awaited progress sends. This can drop ChunkStored events under channel backpressure. Probably acceptable if progress is best-effort, but worth being aware of.
  • The new tests cover failed_addresses, but not the changed upload_waves_merkle boundary itself. A focused test for the partial-proof/non-quorum boundary would make this safer for the RC.

Given the PR is targeting RC, I’d hold merge until the partial-proof behaviour is either fixed or explicitly ruled impossible for this path.

The CLI `ant file upload` direct merkle path (`upload_waves_merkle`) aborted the
entire file the instant one chunk's store returned `Err`, with no retry and no
convergence wait. A chunk that was only transiently short of quorum (a few
close-group peers' routing tables briefly disagreeing, or momentarily full)
would kill a multi-hundred-chunk upload — observed in the PROD-UL-01 failures
where a single unstoreable chunk aborted uploads at 29/367 and 163/479.

The external-signer path already handled this correctly via
`merkle_store_with_retry`. This change makes the CLI path reuse that same
machinery so transient quorum shortfalls are retried instead of aborting.

- Make `merkle_store_with_retry` and the `MERKLE_STORE_MAX_ATTEMPTS` /
  `MERKLE_RETRY_BACKOFF` constants `pub(crate)` so the CLI path can call them
- Add `failed_addresses` to `MerkleStoreOutcome` and carry each chunk's last
  quorum-shortfall message through retry rounds, so the CLI path can build a
  faithful `PartialUpload`
- Rewrite `upload_waves_merkle` to drive each wave through
  `merkle_store_with_retry` (per-wave, preserving the streaming memory bound),
  aggregate failures across waves, and return `PartialUpload` only after retries
  are exhausted — never silently succeeding with missing chunks
- Pre-partition addresses by proof so chunks left without a proof by a partial
  `pay_for_merkle_multi_batch` result (a later sub-batch's payment failed) are
  reported as failed via `PartialUpload` rather than aborting the whole file with
  a generic error — preserving the resumable partial-upload contract and every
  already-stored chunk
- Convert any residual non-quorum store error at the `upload_waves_merkle`
  boundary into `PartialUpload` so earlier waves' stored chunks survive for resume
- Non-`InsufficientPeers` errors stay non-retryable, unchanged

Tests: `cargo test -p ant-core --lib` for the new merkle + file unit tests
(failed_addresses contract; partition-by-proof boundary), `cargo clippy
--all-targets --all-features -- -D warnings` clean, `cargo fmt --all -- --check`
clean.

Closes V2-413

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jacderida jacderida force-pushed the feat-retry_quorum_short_chunks branch from 9c5caad to 415b324 Compare June 1, 2026 23:11
@jacderida
Copy link
Copy Markdown
Contributor Author

Thanks @dirvine — good catch, the blocker is valid. pay_for_merkle_multi_batch does intentionally return a partial MerkleBatchPaymentResult (proofs only for already-paid sub-batches), and my ? propagation turned the resulting "Missing merkle proof" into a generic Error::Payment, dropping the resumable PartialUpload shape the old direct path produced.

Fixed in the amended commit:

  • Pre-partition by proof. Before storing, upload_waves_merkle now splits the addresses into those that have a proof and those that don't (the unpaid chunks from a partial payment). The missing-proof chunks are recorded as failed up front and surfaced via PartialUpload at the end — they never enter the retry helper, so they can't abort the file and no already-stored chunk's progress is lost. This is the precise partial-paid-batch path you flagged, and it now stores every paid chunk rather than aborting on the first unpaid one.
  • Boundary conversion for residual errors. Any other non-quorum store error (e.g. a genuine DHT-lookup failure from close_group_peers) is converted at the upload_waves_merkle boundary into PartialUpload, preserving earlier waves' stored chunks for resume instead of escaping as a generic error. Non-InsufficientPeers errors remain non-retryable, per the issue.
  • Focused tests added for the partition boundary (partition_addresses_by_proof_*) alongside the existing failed_addresses tests.

On the non-blocking note about try_send dropping ChunkStored under backpressure: that lives in the shared merkle_store_with_retry (already used by the external-signer path), so I've left it as-is to keep this PR scoped to the retry fix — progress is best-effort. Happy to follow up separately if you'd prefer it await.

cargo test -p ant-core --lib, cargo clippy --all-targets --all-features -- -D warnings, and cargo fmt --all -- --check all clean.

@dirvine
Copy link
Copy Markdown

dirvine commented Jun 1, 2026

Hermes follow-up review

Thanks — I re-checked the amended commit (415b324). The blocker I raised is addressed.

What changed correctly:

  • upload_waves_merkle now pre-partitions addresses by proof, so chunks omitted from a partial MerkleBatchPaymentResult are recorded as failed up front instead of entering the retry helper and aborting with a generic Error::Payment.
  • The upload then attempts all paid/storable chunks, preserving usable progress and returning Error::PartialUpload at the file boundary when unpaid or permanently quorum-short chunks remain.
  • Residual non-quorum store errors are also converted to PartialUpload at the boundary, preserving earlier completed wave progress.

Local checks I ran on the amended commit:

  • cargo test -p ant-core --lib merkle --no-fail-fast — 40 passed
  • cargo test -p ant-core --lib partition_addresses_by_proof --no-fail-fast — 2 passed

CI at the time of this follow-up: format, clippy, unit tests, build, docs, and security audit are green; E2E and Merkle E2E jobs are still pending.

Verdict: my previous blocker is cleared. I have no further code-review blockers from this diff; I’d just wait for the pending E2E/Merkle E2E jobs before RC merge.

@jacderida jacderida merged commit b08dd19 into WithAutonomi:rc-2026.5.5 Jun 2, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants