perf/goldilocks-prover-v1 by Barnadrot · Pull Request #254 · leanEthereum/leanVM

Barnadrot · 2026-06-14T14:54:38Z

PR: perf/goldilocks-prover-v1

Goldilocks prover performance: −11.6% e2e wall-clock, −1.4% proof size

Base: origin/goldilocks (671e6e7)
Branch: perf/goldilocks-prover-v1
Source experiment: goldilocks-ax42u (6 iterations, 3 keeps + 3 discards)

Summary

Four orthogonal prover optimizations from the goldilocks-ax42u autoresearch experiment.
All changes are prover-side except the execution column virtualization (commit 3), which
removes 3 committed columns and introduces a deferred-claim bus protocol — the only
change visible to verifiers. Proofs from the other 3 commits are byte-identical to
baseline.

Cumulative measured improvement (eval_paired.sh, origin/goldilocks vs perf/goldilocks-prover-v1, 4 rounds × 4 warm proofs = 16 samples per side):

Metric	Baseline	Candidate	Delta
e2e prove_loop wall	4.115s	3.636s	−11.64% (t = −108.2, p ≈ 0)
Proof size	500 KiB	493 KiB	−1.40%
Recursion wall	6.919s	6.363s	−8.04% (faster)

Gate details: noise floor 0.14% (< 0.5%), no drift across rounds (−11.67 / −11.55 / −11.91 / −11.41%), steady-state confirmed (250 proofs), fancy-aggregation topology 28.5s.

Commits

1. `perf: lazy-accumulator field vocabulary + WHIR product sumcheck optimization`

Lazy-accumulator trait for deferred-reduction arithmetic over Goldilocks (Pornin 2022/274,
BDT 2024/1046 cost models). Scalar 192-bit limbs + AVX-512 packed (L,H,W,K) + vectorized
base×ext kernel. Applied to WHIR initial product sumcheck: lazy round-1 with 4
accumulator passes, fused R2-entry via 4-term basis comb.

Files: field.rs, goldilocks.rs, packing.rs, lazy_acc.rs, product_computation.rs
Measured: run_product_sumcheck 250ms → ~210ms, proofs byte-identical
Tests: lazy_acc.rs (338 lines, boundary + randomized), eager/lazy transcript equality via recording FSProver

2. `perf: AIR sumcheck C2 constraint tables + Poseidon degree-deficit skip`

C2 per-pair constraint tables for Poseidon8: cache constraint composition per (lo,hi)
pair, skip redundant evals. degree_z() Air trait: Poseidon8 true max degree is 7
(0/106 constraints reach declared degree 8), so z=8 eval pass removed. Wire format
unchanged (messages sized by degree_air()).

Files: air/lib.rs, air_sumcheck.rs, poseidon/mod.rs, degree_census.rs, c2_bus_seed.rs, c2_table.rs
Measured: AIR phase 766ms → 692ms, e2e −5.21%, proofs byte-identical, recursion −3.14%
Tests: degree census (32 random fold lines, 9-point interpolant top coeff = 0), C2 table equality, bus seed equality
Security note: degree_z is a prover-only optimization. The structural degree argument (S-box x^7, linear column refs) is validated empirically. No interaction with Johnson bound / WHIR proximity gap — those operate on committed RS codewords, not sumcheck eval point counts.

3. `perf: execution address-column virtualization + deferred-claim bus protocol`

This is the only protocol-level change. 3 committed execution columns (ADDR_A/B/C)
replaced with virtual closed-form expressions. Memory-bus lookups become deferred-claim
buses: prover sends one composite denominator eval per lookup, AIR sumcheck re-derives
from committed column openings.

Files: execution/air.rs, execution/mod.rs, table_trait.rs, utils.rs, logup.rs, prove_execution.rs, verify_execution.rs, stacked_pcs.rs, compilation.rs, recursion.py
Measured: e2e −4.01%, recursion −4.14%, proof −1.40%
Column change: 20 → 17 committed execution columns, −6 GKR evals, −6 WHIR statements
Tests: adversarial bitflip test (h9_deferred_adversarial.rs, 95 transcript scalars all binding), padding row virtual-address layout test
Soundness analysis: D_hat (deferred denominator) is grounded by the AIR sumcheck final check — a forged D_hat poisons initial_sum, and the verifier's constraint re-derivation from column openings rejects. Cross-table compensation impossible (each table's constraint eval depends only on its own column evals). Virtual address closed-forms are exact for valid flag values; aux_1 is bytecode-bus-constrained to {0,1,2}, preventing non-standard flag values.

4. `perf: Poseidon8 witness MDS u128 kernel + packed deferred-aux pass`

Witness-generation speedup. MDS: u128-accumulator variant defers modular reduction to
final sum (original kept as mds_vec_mul_oracle under #[cfg(test)]). Deferred-aux:
AVX-512 vectorized fill for round intermediates, 8 rows in parallel.

Files: poseidon/mod.rs, poseidon/trace_gen.rs
Measured: witness 175ms → 124ms, e2e −2.05%, proofs byte-identical
Tests: mds_u128_equals_oracle (100k patterns), deferred_aux_fill_equals_scalar (non-aligned row count)

Main Portability

Measurements on a cherry picked branch for portable wins show that the wins don't translate to KoalaBear

Metric	Baseline	Candidate	Delta
e2e wall	2.068s	2.059s	−0.44% (below 1.0% keep bar)
Proof size	492 KiB	484 KiB	−1.63%
Recursion	3.41s	3.41s	+0.06% (neutral)

The proof size reduction alone could be relevant to justify porting to main, I have a branch ready to open a PR.

Security Review

z=8 eval removal.
Investigated whether removing the z=8 evaluation point from Poseidon8's sumcheck
affects Johnson bound security. It does not: sumcheck soundness depends on the degree
bound (Schwartz-Zippel), not the eval point count; the FRI/WHIR proximity gap operates
on committed RS codewords independently; and degree_air() still returns 8, so proofs
are byte-identical. The structural argument (S-box x^7, all column refs linear) is
validated by the degree_census.rs test. Proximity Prize and Poseidon Initiative
findings were checked — neither is relevant to this scenario.

Deferred-claim soundness (3 attack paths).

Grounding gap: Can a malicious prover forge D_hat? No — the verifier seeds
initial_sum with alpha * (logup_c - D_hat), then the AIR sumcheck re-derives the
fingerprint from WHIR-opened column evals. A forged D_hat fails the AIR final check.
Cross-table compensation is impossible.
Non-standard aux_1: Setting aux_1=3 would give flag_deref=3, polluting addr_b. But
aux_1 is in the bytecode bus (committed column 15), so GKR forces it to match the
compiled bytecode. The compiler only emits aux_1 ∈ {0, 1, 2}.
degree_z undershoot: A future constraint reaching degree 8 without updating degree_z
would silently break. Not exploitable today (structural). Mitigated by degree_census
test; a runtime debug_assert on the top coefficient would strengthen this.

Fingerprint field security.
Verified the eval_bus_data_only fingerprint provides full extension-field security
(~192 bits), not base-field. All challenges (logup_c, logup_alphas) are EF. Column
values are committed before challenges are drawn. At the sumcheck verification point,
column MLEs produce EF values. assert_zero_ef uses one alpha power per constraint
(invertible in EF). No small-field reduction occurs in the GKR bus balance.

Pre-existing test failures

test_aggregation and test_prove_poseidon_8 at default LOG_N_ROWS=11 fail due to a
pre-existing eq_mle packing assertion. Reproduces on clean main and origin/goldilocks.
Both pass at LOG_N_ROWS=14. Not caused by this branch.

Tests

The automated research with Fable built extensive tests to validate the approach. Happy to strip them with an additional commit on top, left them in for review.

…ization Lazy-accumulator trait for deferred-reduction arithmetic (Pornin 2022/274 + BDT 2024/1046 cost models): scalar Goldilocks 192-bit limbs, AVX-512 packed override (L,H,W,K), and vectorized base×ext round-1 kernel. WHIR initial product sumcheck: lazy round-1 keeps evals as 2 implicit base streams (a_i, b_i) with 4 lazy-acc passes binding r0 once per accumulator; fused R2-entry materializes E'' at 2^24 directly from P via 4-term basis comb. Includes equality harness with recording FSProver ensuring transcript bit-identity between eager and lazy paths across multiple sizes and round counts. Measured: run_product_sumcheck 250ms → ~210ms (lazy path), proofs byte-identical. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

C2 per-pair constraint tables for Poseidon8: pre-compute and cache the constraint composition per (lo,hi) pair, avoiding redundant evaluation in rounds where the folded columns are re-read. Bus-only round-0 seeding avoids the full constraint eval at z=0 when the pair's C-table is available. degree_z() Air trait method: decouples the prover's fresh-eval count from the declared degree_air(). Poseidon8 override returns 7 (true max constraint degree; 0 of 106 constraints reach the declared degree 8, validated by the degree_census test on real witness traces). Saves one eval pass per pair per round. Wire format unchanged (messages sized by degree_air; proofs byte- identical). Measured: AIR phase 766ms → 692ms (−72ms Poseidon skip, −39ms C2 tables), e2e −5.21% wall. Proofs byte-identical, recursion −3.14% faster. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…otocol Replace 3 committed execution columns (ADDR_A/B/C) with virtual closed-form expressions of existing committed columns. Introduces deferred-claim bus protocol: the prover sends one composite denominator eval per virtual memory lookup, and the batched AIR sumcheck re-derives the fingerprint from committed column evaluations (eval_bus_data_only constraint). Column reduction: 20 → 17 committed execution columns → 6 fewer GKR evaluations, 6 fewer WHIR opening statements, stacked PCS shrinks. Full protocol stack: LogUp prover/verifier arms for deferred One-buses, verify_execution initial_sum loop extended, recursion circuit (zkDSL) ported with deferred-claim codegen in compilation.rs. Soundness: adversarial bitflip test (95 transcript scalars, all binding); padding row carries mem0 for (0, VALUE_*) bus balance; frozen-verifier rejection is the anticipated tripwire (h9 format vs pre-h9 verifier). Measured: −4.01% e2e wall, −4.14% recursion, −1.40% proof size. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Witness MDS: replace the fully-reducing mds_vec_mul with a u128-accumulator variant that defers modular reduction to the final sum (6 muls × 8 accumulators → one reduce). Original kept as mds_vec_mul_oracle under #[cfg(test)] with 100k-pattern equality test. Packed deferred-aux witness pass: AVX-512 vectorized fill for the auxiliary witness columns (round intermediates), processing 8 rows in parallel with tail-correctness validated on non-aligned row counts. Both are prover-only (witness generation); proofs byte-identical. Measured: witness 175ms → 124ms, e2e −2.05% wall. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…docs Remove iteration/hypothesis/plan_spec references (h9-A, h6', T3', plan §X, etc.) from code comments — they belong in commit messages, not code. Shorten multi-line doc comments to concise one-liners where the code is self- explanatory. Rename h9_deferred_adversarial.rs → deferred_claim_adversarial.rs. Net: −167 comment lines, zero logic changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- product_computation.rs: 3 #[cfg(test)] modules (kernel oracle, recording FSProver equality, lazy-vs-eager transcript) → tests/product_sumcheck_equality.rs - poseidon/mod.rs: witness microbench + deferred_aux_fill + packed_witness → tests/poseidon_witness.rs (mds_u128_equals_oracle stays inline — needs private mds_vec_mul) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Barnadrot and others added 7 commits June 14, 2026 14:15

chore: remove stray quintic_extension/tests.rs (not part of this branch)

c626859

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf/goldilocks-prover-v1#254

perf/goldilocks-prover-v1#254
Barnadrot wants to merge 7 commits into
leanEthereum:goldilocksfrom
Barnadrot:perf/goldilocks-prover-v1

Barnadrot commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Barnadrot commented Jun 14, 2026

PR: perf/goldilocks-prover-v1

Summary

Commits

1. perf: lazy-accumulator field vocabulary + WHIR product sumcheck optimization

2. perf: AIR sumcheck C2 constraint tables + Poseidon degree-deficit skip

3. perf: execution address-column virtualization + deferred-claim bus protocol

4. perf: Poseidon8 witness MDS u128 kernel + packed deferred-aux pass

Main Portability

Security Review

Pre-existing test failures

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. `perf: lazy-accumulator field vocabulary + WHIR product sumcheck optimization`

2. `perf: AIR sumcheck C2 constraint tables + Poseidon degree-deficit skip`

3. `perf: execution address-column virtualization + deferred-claim bus protocol`

4. `perf: Poseidon8 witness MDS u128 kernel + packed deferred-aux pass`