Skip to content

fix(payment): gate quote freshness on own quote only, not neighbours'#127

Merged
jacderida merged 1 commit into
mainfrom
fix/quote-freshness-own-quote-only
Jun 5, 2026
Merged

fix(payment): gate quote freshness on own quote only, not neighbours'#127
jacderida merged 1 commit into
mainfrom
fix/quote-freshness-own-quote-only

Conversation

@jacderida
Copy link
Copy Markdown
Collaborator

@jacderida jacderida commented Jun 5, 2026

Problem

The price-staleness gate introduced in 56dd370 checks every quote in the payment bundle against the verifying node's own record count:

for (encoded_peer_id, quote) in &payment.peer_quotes {
    if quote.price < min_acceptable_price { // floor derived from MY records

Fullness across a close group is wildly heterogeneous on a real network. On ant-prod-01 (2026-06-04, PROD-UL-01 upload run) the close group for a public DataMap chunk spanned 47..=1788 records. The client collected 7 quotes and paid 3× the median on-chain (14,520,949,218,750,000 atto — more than 2× even the fullest node's current price), yet:

  • node holding 1788 records: floor 5,271,199,218,750,000 → rejected the honest quote from the 978-record peer (4,840,316,406,250,000)
  • nodes holding 1180 / 1310 records: rejected the honest quote from the 47-record peer (3,908,407,226,562,500) — the 1180-record node by a margin of just 1.05%

All quotes were 10 seconds old — no staleness was involved, only cross-node fullness spread. With a fourth close-group node disk-full, the PUT landed on 3/7 peers (majority is 4): the upload failed after the on-chain payment was burned, and the 500 already-stored body chunks became unreferenced. The same gate also rejects the replication paid-notify path (observed on two further nodes at 1180/1443 records), so the chunk cannot heal past the 3 copies.

Every close group whose fullest member's 75% floor exceeds its emptiest member's price will reject all single-node payments — guaranteed steady-state on prod where freshly joined nodes coexist with established ones.

Fix

A node can only re-derive its own price from its own record count, so its own quote is the only one it can legitimately call stale (the doc comment's stated intent — "a node storing a few replicated records between quoting and verifying" — is exactly the self-staleness case). The gate now:

  • resolves the node's own peer ID from the attached P2PNode (same handle the merkle closeness check uses) and skips every quote in the bundle that isn't ours;
  • fails open with a debug log when no identity source is attached, mirroring the existing missing-record-count posture (production always attaches);
  • rewords the rejection message — it previously claimed "paid X" while actually printing the quote price, not the on-chain payment.

The on-chain median payment binding (completedPayments, 3× median) is unaffected.

Tests

  • test_neighbour_cheap_quote_not_rejected — regression test reproducing the incident shape: own fresh quote at 1788 records alongside honest neighbour quotes at 47 and 978; must pass.
  • test_own_stale_quote_still_rejected_among_neighbours — the gate still bites on a genuinely stale own quote.
  • test_freshness_skipped_without_self_peer_id — fail-open without an identity source.
  • Existing freshness tests updated to mark the quote under test as the node's own (set_peer_id_for_tests, cfg-gated like set_records_stored_for_tests).

All 61 payment::verifier tests pass; clippy clean. (config::tests::test_bootstrap_peers_discover_env_var fails identically on untouched main — environment-sensitive, unrelated.)

Possible follow-up

An additional/alternative gate could compare the verified on-chain amount against the node's current price after single_payment.verify() — that would also fix the "rejected despite aggregate overpayment" semantics. Kept out of this PR to stay surgical.

Companion PR (reporting): WithAutonomi/ant-client#107ant --json file upload emits no JSON when the public DataMap store fails, so harnesses report 0/0 chunks for uploads like this one.

🤖 Generated with Claude Code

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts the quote price “freshness” (anti-staleness) gate in payment::verifier so that a node only evaluates its own quote in a multi-peer payment bundle, instead of incorrectly judging neighbours’ quotes against the verifying node’s local record count. This aligns the freshness logic with what the verifying node can actually re-derive (its own price-from-fullness) and prevents heterogeneous close-group fullness from causing systematic false rejections and failed PUTs after on-chain payment.

Changes:

  • Add a mechanism to resolve the verifying node’s own peer ID (from attached P2PNode, with a test-only override fallback) and use it to select which quote to freshness-check.
  • Update validate_quote_freshness to skip all non-self quotes and improve the rejection message wording.
  • Add/adjust unit tests to cover neighbour-quote scenarios, self-stale rejection, and fail-open behavior when self identity is unavailable.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jacderida jacderida force-pushed the fix/quote-freshness-own-quote-only branch from cf1ba3e to 51f7d55 Compare June 5, 2026 00:55
Copy link
Copy Markdown
Collaborator

@grumbach grumbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the fix end to end (read verify_evm_payment, validate_quote_freshness, the self-ID resolution, the signing/binding path, and the merkle path). This is a clean, correct, well-scoped fix and the diagnosis matches the incident. Approving, with a couple of notes for your call.

What I verified

  • Root cause is real and the fix is the right shape. A close group spanning 47..=1788 records makes a full node's 75% floor exceed an empty node's honest price, so the old gate rejected perfectly fresh neighbour quotes. A node can only re-derive its own price from its own record count, so gating only its own quote is correct. The neighbour's stale-replay risk is genuinely that neighbour's gate to enforce when the PUT reaches it.
  • The self-ID comparison actually matches in production. self_peer_id_bytes() returns *node.peer_id().as_bytes(), and an EncodedPeerId is constructed from exactly those bytes (EncodedPeerId::new(*ant_peer_id.as_bytes()) in this file). Same 32-byte encoding, so encoded_peer_id.as_bytes() != &self_peer_id is sound and won't silently skip our own quote.
  • No bypass from the pre-binding ordering. validate_quote_freshness runs before validate_peer_bindings and signature verification, so the encoded_peer_id label isn't authenticated yet when we pick out "our" quote. I chased the relabeling angle: an attacker could mislabel a forged cheap quote under a neighbour's peer ID to dodge the own-quote gate, but that quote then fails the BLAKE3 peer-id binding + signature (price is signed via bytes_for_signing). Both paths reject, so there's no surviving relabeled quote. The only untrusted-data op freshness performs (derive_records_stored_from_price) is already saturating/panic-proof for exactly this pre-auth reason. Fine as-is.
  • Scope is right. validate_quote_freshness is only called from the single-node path; the merkle/batch path has no analogous own-fullness gate, so nothing to mirror there.

Notes (non-blocking)

  1. Two independent self-identity notions now coexist in the verifier. validate_local_recipient identifies "us" by quote.rewards_address == local_rewards_address; the new gate identifies "us" by P2PNode peer ID. If those two ever disagree for a given node (config mismatch, or one handle attached but not the other), validate_local_recipient could still consider the node a recipient while the freshness gate matches no quote and silently no-ops. It fails open, so it's not a security hole, just a latent inconsistency. Might be worth a one-line comment noting the two are expected to refer to the same node, or a debug log when the bundle contains our rewards address but no quote under our peer ID.

  2. Tiny doc-comment nit. The new doc on validate_quote_freshness is excellent, but it says "this node has no basis to judge it against its own record count" for neighbour quotes, which is the key insight worth keeping front and center. Consider also stating explicitly that a bundle is expected to contain exactly one quote per peer (so the loop matches at most one own quote) so a future reader doesn't wonder about multiple-own-quote handling.

Tests

The three new tests cover the regression shape (own fresh @1788 + honest neighbours @47/@978), own-still-bites, and fail-open-without-self-id. The fail-open test relies on no override being set, which is the right negative. Since these call validate_quote_freshness directly, the pub_key: [0u8; 64] fakes are fine (binding/signature aren't on this path). Coverage matches the change well.

Nice surgical fix. LGTM once you've eyeballed note 1.

The price-staleness gate (56dd370) checked every quote in the payment
bundle against the verifying node's OWN record count. Fullness across a
close group is wildly heterogeneous on a real network — on ant-prod-01
(2026-06-04) a close group spanned 47..=1788 records, so the three
fullest nodes each found the emptiest node's honest, 10-second-old
quote below their 75% floor and rejected the PUT after the client had
already paid 3x the median on-chain (more than 2x even the fullest
node's current price). With one more node disk-full, the DataMap store
landed on 3/7 peers — below the majority of 4 — failing the upload and
burning the payment. The same gate also blocked the replication
paid-notify path, so the chunk could not heal past 3 copies.

A node can only re-derive its own price from its own record count, so
its own quote is the only one it can legitimately call stale. Resolve
the node's peer ID from the attached P2PNode and skip every quote in
the bundle that isn't ours; fail open (with a debug log) when no
identity source is attached, mirroring the missing-record-count
posture. The on-chain median payment binding is unaffected.

Also reword the rejection message: it claimed "paid X" while actually
printing the quote price, not the on-chain payment.

Adds a regression test reproducing the incident shape (own fresh quote
at 1788 records alongside honest neighbour quotes at 47 and 978), a
test that an own stale quote still rejects among expensive neighbours,
and a fail-open test for the missing-identity path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jacderida jacderida force-pushed the fix/quote-freshness-own-quote-only branch from 51f7d55 to 8f69ae1 Compare June 5, 2026 14:35
Copilot AI review requested due to automatic review settings June 5, 2026 14:35
@jacderida
Copy link
Copy Markdown
Collaborator Author

Thanks for the thorough review @grumbach — both notes addressed in the amended commit (51f7d558f69ae1, history rewritten rather than stacked):

Note 1 (two self-identity notions): added a comment in validate_quote_freshness stating that the rewards-address identity (validate_local_recipient) and the peer-ID identity (this gate) are expected to refer to the same node, plus the debug! breadcrumb you suggested: when the bundle contains our rewards address but no quote under our peer ID, the fail-open is now logged instead of silent. That divergence is actually routine in our own fleet (all our nodes share a rewards address, so a PUT reaching a node whose own quote isn't in the bundle passes validate_local_recipient on a sibling's quote) — exactly the kind of path we'd have wanted in the node logs while reconstructing this incident.

Note 2 (at-most-one own quote): added a doc sentence noting that validate_quote_structure rejects duplicate peer IDs and runs before this gate on every path, so the loop matches at most one own quote.

No behaviour change on any rejecting path; clippy (CI flags) and all 61 verifier tests still pass.

🤖 Generated with Claude Code

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread src/payment/verifier.rs
Comment on lines +688 to +691
/// tolerance — on ant-prod-01 a close group spanning 47..=1788 records
/// made the three fullest nodes reject every bundle containing the
/// emptiest node's (perfectly fresh, 10-second-old) quote, failing the
/// PUT after the client had already paid on-chain. The node can only
@jacderida jacderida merged commit f054fbe into main Jun 5, 2026
11 of 12 checks passed
@jacderida jacderida deleted the fix/quote-freshness-own-quote-only branch June 5, 2026 15:14
jacderida added a commit that referenced this pull request Jun 5, 2026
A proof-of-payment presented during replication is a receipt for a sale that
already closed, but the verifier ran the full client-PUT check set against it.
Two of those checks interrogate the present and therefore guarantee false
rejections for replicated records: the own-quote price-freshness gate (record
counts only grow, so every receipt's quoted price eventually drops below the
verifier's live floor) and the local-recipient check (close groups churn, so a
post-churn member receiving a record via replication was never a payee on the
original receipt). The merkle candidate-closeness check has the same shape: it
validates the winner pool against the live DHT, but the pool was sampled from
the DHT of the original sale.

On DEV-01 (2026-06-05) this rejected nearly 100% of replication
proof-of-payment transfers within an hour of launch: 4M+ "PoP verification
error ... stale" rejections at ~300k/hour, records pinned below target
redundancy, close-group record counts diverging 150x (75..=11,231 per
service), and a permanent ~500 MB/s fleet-wide re-offer storm (~25 TB egress
in 16h). The divergence the rejections caused is also what made the client-PUT
freshness gate (fixed for the heterogeneous-neighbour case in #127) keep
firing: the two failure modes fed each other.

Introduce VerificationContext { ClientPut, Replication } and thread it through
verify_payment. Under Replication the verifier skips only the
storer-being-paid-now checks (own-quote freshness, local recipient, merkle
candidate closeness). Every receipt-authenticity check still runs in both
contexts: quote structure, content binding to the exact address, peer-ID/
pub-key bindings, ML-DSA signatures, and the on-chain settlement lookup — a
record cannot be admitted via replication without an authentic, settled
payment for that record.

The verified-XorName cache is context-aware to match: each entry records
whether its verification ran the full client-PUT check set, a
Replication-verified entry satisfies later replication lookups (re-offers of
the same key are routine) but never a later ClientPut fast-path, and a full
ClientPut verification upgrades the entry without ever being downgraded back.
Without this, a replication receipt would let a later proof-less client PUT
bypass the context-gated checks via the cache.

Deliberate trade-off (documented on the enum): skipping the recipient and
closeness checks for replication admits receipts from self-dealing payers who
settle the median payment to their own wallet on-chain. The client-PUT path
still rejects such pools, replication admission still requires responsibility
for the key, and the abuse costs a settled on-chain payment per chunk; closing
it properly belongs in quote issuance / payment policy rather than in the
replication hot path, where the equivalent defence provably destroys the
network's ability to heal.

Call sites: the chunk PUT handler passes ClientPut (behaviour unchanged); the
fresh-offer and paid-notify replication handlers pass Replication.

Test results: payment::verifier 66/66 (5 new context tests: stale own quote
and non-recipient receipts pass the gated checks under Replication, failing at
the later binding/signature stage; content mismatch rejected under both
contexts; duplicate-candidate merkle pool rejected under ClientPut but past
the closeness check under Replication; Replication-verified cache entry does
not satisfy a ClientPut fast-path, upgrades on full verification, never
downgrades), replication 230/230, storage 29/29. cargo clippy --all-targets
clean. Note: config::tests::test_bootstrap_peers_discover_env_var fails on
machines with a real ~/.config/ant/bootstrap_peers.toml — pre-existing on
main, unrelated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants