perf/goldilocks-prover-v1#254
Open
Barnadrot wants to merge 7 commits into
Open
Conversation
…ization Lazy-accumulator trait for deferred-reduction arithmetic (Pornin 2022/274 + BDT 2024/1046 cost models): scalar Goldilocks 192-bit limbs, AVX-512 packed override (L,H,W,K), and vectorized base×ext round-1 kernel. WHIR initial product sumcheck: lazy round-1 keeps evals as 2 implicit base streams (a_i, b_i) with 4 lazy-acc passes binding r0 once per accumulator; fused R2-entry materializes E'' at 2^24 directly from P via 4-term basis comb. Includes equality harness with recording FSProver ensuring transcript bit-identity between eager and lazy paths across multiple sizes and round counts. Measured: run_product_sumcheck 250ms → ~210ms (lazy path), proofs byte-identical. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
C2 per-pair constraint tables for Poseidon8: pre-compute and cache the constraint composition per (lo,hi) pair, avoiding redundant evaluation in rounds where the folded columns are re-read. Bus-only round-0 seeding avoids the full constraint eval at z=0 when the pair's C-table is available. degree_z() Air trait method: decouples the prover's fresh-eval count from the declared degree_air(). Poseidon8 override returns 7 (true max constraint degree; 0 of 106 constraints reach the declared degree 8, validated by the degree_census test on real witness traces). Saves one eval pass per pair per round. Wire format unchanged (messages sized by degree_air; proofs byte- identical). Measured: AIR phase 766ms → 692ms (−72ms Poseidon skip, −39ms C2 tables), e2e −5.21% wall. Proofs byte-identical, recursion −3.14% faster. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…otocol Replace 3 committed execution columns (ADDR_A/B/C) with virtual closed-form expressions of existing committed columns. Introduces deferred-claim bus protocol: the prover sends one composite denominator eval per virtual memory lookup, and the batched AIR sumcheck re-derives the fingerprint from committed column evaluations (eval_bus_data_only constraint). Column reduction: 20 → 17 committed execution columns → 6 fewer GKR evaluations, 6 fewer WHIR opening statements, stacked PCS shrinks. Full protocol stack: LogUp prover/verifier arms for deferred One-buses, verify_execution initial_sum loop extended, recursion circuit (zkDSL) ported with deferred-claim codegen in compilation.rs. Soundness: adversarial bitflip test (95 transcript scalars, all binding); padding row carries mem0 for (0, VALUE_*) bus balance; frozen-verifier rejection is the anticipated tripwire (h9 format vs pre-h9 verifier). Measured: −4.01% e2e wall, −4.14% recursion, −1.40% proof size. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Witness MDS: replace the fully-reducing mds_vec_mul with a u128-accumulator variant that defers modular reduction to the final sum (6 muls × 8 accumulators → one reduce). Original kept as mds_vec_mul_oracle under #[cfg(test)] with 100k-pattern equality test. Packed deferred-aux witness pass: AVX-512 vectorized fill for the auxiliary witness columns (round intermediates), processing 8 rows in parallel with tail-correctness validated on non-aligned row counts. Both are prover-only (witness generation); proofs byte-identical. Measured: witness 175ms → 124ms, e2e −2.05% wall. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…docs Remove iteration/hypothesis/plan_spec references (h9-A, h6', T3', plan §X, etc.) from code comments — they belong in commit messages, not code. Shorten multi-line doc comments to concise one-liners where the code is self- explanatory. Rename h9_deferred_adversarial.rs → deferred_claim_adversarial.rs. Net: −167 comment lines, zero logic changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- product_computation.rs: 3 #[cfg(test)] modules (kernel oracle, recording FSProver equality, lazy-vs-eager transcript) → tests/product_sumcheck_equality.rs - poseidon/mod.rs: witness microbench + deferred_aux_fill + packed_witness → tests/poseidon_witness.rs (mds_u128_equals_oracle stays inline — needs private mds_vec_mul) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PR: perf/goldilocks-prover-v1
Goldilocks prover performance: −11.6% e2e wall-clock, −1.4% proof size
Base:
origin/goldilocks(671e6e7)Branch:
perf/goldilocks-prover-v1Source experiment:
goldilocks-ax42u(6 iterations, 3 keeps + 3 discards)Summary
Four orthogonal prover optimizations from the goldilocks-ax42u autoresearch experiment.
All changes are prover-side except the execution column virtualization (commit 3), which
removes 3 committed columns and introduces a deferred-claim bus protocol — the only
change visible to verifiers. Proofs from the other 3 commits are byte-identical to
baseline.
Cumulative measured improvement (eval_paired.sh, origin/goldilocks vs perf/goldilocks-prover-v1, 4 rounds × 4 warm proofs = 16 samples per side):
Gate details: noise floor 0.14% (< 0.5%), no drift across rounds (−11.67 / −11.55 / −11.91 / −11.41%), steady-state confirmed (250 proofs), fancy-aggregation topology 28.5s.
Commits
1.
perf: lazy-accumulator field vocabulary + WHIR product sumcheck optimizationLazy-accumulator trait for deferred-reduction arithmetic over Goldilocks (Pornin 2022/274,
BDT 2024/1046 cost models). Scalar 192-bit limbs + AVX-512 packed (L,H,W,K) + vectorized
base×ext kernel. Applied to WHIR initial product sumcheck: lazy round-1 with 4
accumulator passes, fused R2-entry via 4-term basis comb.
field.rs,goldilocks.rs,packing.rs,lazy_acc.rs,product_computation.rslazy_acc.rs(338 lines, boundary + randomized), eager/lazy transcript equality via recording FSProver2.
perf: AIR sumcheck C2 constraint tables + Poseidon degree-deficit skipC2 per-pair constraint tables for Poseidon8: cache constraint composition per (lo,hi)
pair, skip redundant evals.
degree_z()Air trait: Poseidon8 true max degree is 7(0/106 constraints reach declared degree 8), so z=8 eval pass removed. Wire format
unchanged (messages sized by
degree_air()).air/lib.rs,air_sumcheck.rs,poseidon/mod.rs,degree_census.rs,c2_bus_seed.rs,c2_table.rsdegree_zis a prover-only optimization. The structural degree argument (S-boxx^7, linear column refs) is validated empirically. No interaction with Johnson bound / WHIR proximity gap — those operate on committed RS codewords, not sumcheck eval point counts.3.
perf: execution address-column virtualization + deferred-claim bus protocolThis is the only protocol-level change. 3 committed execution columns (ADDR_A/B/C)
replaced with virtual closed-form expressions. Memory-bus lookups become deferred-claim
buses: prover sends one composite denominator eval per lookup, AIR sumcheck re-derives
from committed column openings.
execution/air.rs,execution/mod.rs,table_trait.rs,utils.rs,logup.rs,prove_execution.rs,verify_execution.rs,stacked_pcs.rs,compilation.rs,recursion.pyh9_deferred_adversarial.rs, 95 transcript scalars all binding), padding row virtual-address layout testinitial_sum, and the verifier's constraint re-derivation from column openings rejects. Cross-table compensation impossible (each table's constraint eval depends only on its own column evals). Virtual address closed-forms are exact for valid flag values;aux_1is bytecode-bus-constrained to {0,1,2}, preventing non-standard flag values.4.
perf: Poseidon8 witness MDS u128 kernel + packed deferred-aux passWitness-generation speedup. MDS: u128-accumulator variant defers modular reduction to
final sum (original kept as
mds_vec_mul_oracleunder#[cfg(test)]). Deferred-aux:AVX-512 vectorized fill for round intermediates, 8 rows in parallel.
poseidon/mod.rs,poseidon/trace_gen.rsmds_u128_equals_oracle(100k patterns),deferred_aux_fill_equals_scalar(non-aligned row count)Main Portability
Measurements on a cherry picked branch for portable wins show that the wins don't translate to KoalaBear
The proof size reduction alone could be relevant to justify porting to main, I have a branch ready to open a PR.
Security Review
z=8 eval removal.
Investigated whether removing the z=8 evaluation point from Poseidon8's sumcheck
affects Johnson bound security. It does not: sumcheck soundness depends on the degree
bound (Schwartz-Zippel), not the eval point count; the FRI/WHIR proximity gap operates
on committed RS codewords independently; and
degree_air()still returns 8, so proofsare byte-identical. The structural argument (S-box x^7, all column refs linear) is
validated by the
degree_census.rstest. Proximity Prize and Poseidon Initiativefindings were checked — neither is relevant to this scenario.
Deferred-claim soundness (3 attack paths).
initial_sumwithalpha * (logup_c - D_hat), then the AIR sumcheck re-derives thefingerprint from WHIR-opened column evals. A forged D_hat fails the AIR final check.
Cross-table compensation is impossible.
aux_1 is in the bytecode bus (committed column 15), so GKR forces it to match the
compiled bytecode. The compiler only emits aux_1 ∈ {0, 1, 2}.
would silently break. Not exploitable today (structural). Mitigated by degree_census
test; a runtime debug_assert on the top coefficient would strengthen this.
Fingerprint field security.
Verified the
eval_bus_data_onlyfingerprint provides full extension-field security(~192 bits), not base-field. All challenges (
logup_c,logup_alphas) are EF. Columnvalues are committed before challenges are drawn. At the sumcheck verification point,
column MLEs produce EF values.
assert_zero_efuses one alpha power per constraint(invertible in EF). No small-field reduction occurs in the GKR bus balance.
Pre-existing test failures
test_aggregationandtest_prove_poseidon_8at default LOG_N_ROWS=11 fail due to apre-existing
eq_mlepacking assertion. Reproduces on clean main and origin/goldilocks.Both pass at LOG_N_ROWS=14. Not caused by this branch.
Tests
The automated research with Fable built extensive tests to validate the approach. Happy to strip them with an additional commit on top, left them in for review.