Skip to content

perf/goldilocks-prover-v1#254

Open
Barnadrot wants to merge 7 commits into
leanEthereum:goldilocksfrom
Barnadrot:perf/goldilocks-prover-v1
Open

perf/goldilocks-prover-v1#254
Barnadrot wants to merge 7 commits into
leanEthereum:goldilocksfrom
Barnadrot:perf/goldilocks-prover-v1

Conversation

@Barnadrot

Copy link
Copy Markdown
Contributor

PR: perf/goldilocks-prover-v1

Goldilocks prover performance: −11.6% e2e wall-clock, −1.4% proof size

Base: origin/goldilocks (671e6e7)
Branch: perf/goldilocks-prover-v1
Source experiment: goldilocks-ax42u (6 iterations, 3 keeps + 3 discards)

Summary

Four orthogonal prover optimizations from the goldilocks-ax42u autoresearch experiment.
All changes are prover-side except the execution column virtualization (commit 3), which
removes 3 committed columns and introduces a deferred-claim bus protocol — the only
change visible to verifiers. Proofs from the other 3 commits are byte-identical to
baseline.

Cumulative measured improvement (eval_paired.sh, origin/goldilocks vs perf/goldilocks-prover-v1, 4 rounds × 4 warm proofs = 16 samples per side):

Metric Baseline Candidate Delta
e2e prove_loop wall 4.115s 3.636s −11.64% (t = −108.2, p ≈ 0)
Proof size 500 KiB 493 KiB −1.40%
Recursion wall 6.919s 6.363s −8.04% (faster)

Gate details: noise floor 0.14% (< 0.5%), no drift across rounds (−11.67 / −11.55 / −11.91 / −11.41%), steady-state confirmed (250 proofs), fancy-aggregation topology 28.5s.

Commits

1. perf: lazy-accumulator field vocabulary + WHIR product sumcheck optimization

Lazy-accumulator trait for deferred-reduction arithmetic over Goldilocks (Pornin 2022/274,
BDT 2024/1046 cost models). Scalar 192-bit limbs + AVX-512 packed (L,H,W,K) + vectorized
base×ext kernel. Applied to WHIR initial product sumcheck: lazy round-1 with 4
accumulator passes, fused R2-entry via 4-term basis comb.

  • Files: field.rs, goldilocks.rs, packing.rs, lazy_acc.rs, product_computation.rs
  • Measured: run_product_sumcheck 250ms → ~210ms, proofs byte-identical
  • Tests: lazy_acc.rs (338 lines, boundary + randomized), eager/lazy transcript equality via recording FSProver

2. perf: AIR sumcheck C2 constraint tables + Poseidon degree-deficit skip

C2 per-pair constraint tables for Poseidon8: cache constraint composition per (lo,hi)
pair, skip redundant evals. degree_z() Air trait: Poseidon8 true max degree is 7
(0/106 constraints reach declared degree 8), so z=8 eval pass removed. Wire format
unchanged (messages sized by degree_air()).

  • Files: air/lib.rs, air_sumcheck.rs, poseidon/mod.rs, degree_census.rs, c2_bus_seed.rs, c2_table.rs
  • Measured: AIR phase 766ms → 692ms, e2e −5.21%, proofs byte-identical, recursion −3.14%
  • Tests: degree census (32 random fold lines, 9-point interpolant top coeff = 0), C2 table equality, bus seed equality
  • Security note: degree_z is a prover-only optimization. The structural degree argument (S-box x^7, linear column refs) is validated empirically. No interaction with Johnson bound / WHIR proximity gap — those operate on committed RS codewords, not sumcheck eval point counts.

3. perf: execution address-column virtualization + deferred-claim bus protocol

This is the only protocol-level change. 3 committed execution columns (ADDR_A/B/C)
replaced with virtual closed-form expressions. Memory-bus lookups become deferred-claim
buses: prover sends one composite denominator eval per lookup, AIR sumcheck re-derives
from committed column openings.

  • Files: execution/air.rs, execution/mod.rs, table_trait.rs, utils.rs, logup.rs, prove_execution.rs, verify_execution.rs, stacked_pcs.rs, compilation.rs, recursion.py
  • Measured: e2e −4.01%, recursion −4.14%, proof −1.40%
  • Column change: 20 → 17 committed execution columns, −6 GKR evals, −6 WHIR statements
  • Tests: adversarial bitflip test (h9_deferred_adversarial.rs, 95 transcript scalars all binding), padding row virtual-address layout test
  • Soundness analysis: D_hat (deferred denominator) is grounded by the AIR sumcheck final check — a forged D_hat poisons initial_sum, and the verifier's constraint re-derivation from column openings rejects. Cross-table compensation impossible (each table's constraint eval depends only on its own column evals). Virtual address closed-forms are exact for valid flag values; aux_1 is bytecode-bus-constrained to {0,1,2}, preventing non-standard flag values.

4. perf: Poseidon8 witness MDS u128 kernel + packed deferred-aux pass

Witness-generation speedup. MDS: u128-accumulator variant defers modular reduction to
final sum (original kept as mds_vec_mul_oracle under #[cfg(test)]). Deferred-aux:
AVX-512 vectorized fill for round intermediates, 8 rows in parallel.

  • Files: poseidon/mod.rs, poseidon/trace_gen.rs
  • Measured: witness 175ms → 124ms, e2e −2.05%, proofs byte-identical
  • Tests: mds_u128_equals_oracle (100k patterns), deferred_aux_fill_equals_scalar (non-aligned row count)

Main Portability

Measurements on a cherry picked branch for portable wins show that the wins don't translate to KoalaBear

Metric Baseline Candidate Delta
e2e wall 2.068s 2.059s −0.44% (below 1.0% keep bar)
Proof size 492 KiB 484 KiB −1.63%
Recursion 3.41s 3.41s +0.06% (neutral)

The proof size reduction alone could be relevant to justify porting to main, I have a branch ready to open a PR.

Security Review

z=8 eval removal.
Investigated whether removing the z=8 evaluation point from Poseidon8's sumcheck
affects Johnson bound security. It does not: sumcheck soundness depends on the degree
bound (Schwartz-Zippel), not the eval point count; the FRI/WHIR proximity gap operates
on committed RS codewords independently; and degree_air() still returns 8, so proofs
are byte-identical. The structural argument (S-box x^7, all column refs linear) is
validated by the degree_census.rs test. Proximity Prize and Poseidon Initiative
findings were checked — neither is relevant to this scenario.

Deferred-claim soundness (3 attack paths).

  • Grounding gap: Can a malicious prover forge D_hat? No — the verifier seeds
    initial_sum with alpha * (logup_c - D_hat), then the AIR sumcheck re-derives the
    fingerprint from WHIR-opened column evals. A forged D_hat fails the AIR final check.
    Cross-table compensation is impossible.
  • Non-standard aux_1: Setting aux_1=3 would give flag_deref=3, polluting addr_b. But
    aux_1 is in the bytecode bus (committed column 15), so GKR forces it to match the
    compiled bytecode. The compiler only emits aux_1 ∈ {0, 1, 2}.
  • degree_z undershoot: A future constraint reaching degree 8 without updating degree_z
    would silently break. Not exploitable today (structural). Mitigated by degree_census
    test; a runtime debug_assert on the top coefficient would strengthen this.

Fingerprint field security.
Verified the eval_bus_data_only fingerprint provides full extension-field security
(~192 bits), not base-field. All challenges (logup_c, logup_alphas) are EF. Column
values are committed before challenges are drawn. At the sumcheck verification point,
column MLEs produce EF values. assert_zero_ef uses one alpha power per constraint
(invertible in EF). No small-field reduction occurs in the GKR bus balance.

Pre-existing test failures

test_aggregation and test_prove_poseidon_8 at default LOG_N_ROWS=11 fail due to a
pre-existing eq_mle packing assertion. Reproduces on clean main and origin/goldilocks.
Both pass at LOG_N_ROWS=14. Not caused by this branch.

Tests

The automated research with Fable built extensive tests to validate the approach. Happy to strip them with an additional commit on top, left them in for review.

Barnadrot and others added 7 commits June 14, 2026 14:15
…ization

Lazy-accumulator trait for deferred-reduction arithmetic (Pornin 2022/274 +
BDT 2024/1046 cost models): scalar Goldilocks 192-bit limbs, AVX-512 packed
override (L,H,W,K), and vectorized base×ext round-1 kernel.

WHIR initial product sumcheck: lazy round-1 keeps evals as 2 implicit base
streams (a_i, b_i) with 4 lazy-acc passes binding r0 once per accumulator;
fused R2-entry materializes E'' at 2^24 directly from P via 4-term basis comb.

Includes equality harness with recording FSProver ensuring transcript
bit-identity between eager and lazy paths across multiple sizes and round
counts.

Measured: run_product_sumcheck 250ms → ~210ms (lazy path), proofs byte-identical.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
C2 per-pair constraint tables for Poseidon8: pre-compute and cache the
constraint composition per (lo,hi) pair, avoiding redundant evaluation in
rounds where the folded columns are re-read. Bus-only round-0 seeding avoids
the full constraint eval at z=0 when the pair's C-table is available.

degree_z() Air trait method: decouples the prover's fresh-eval count from the
declared degree_air(). Poseidon8 override returns 7 (true max constraint
degree; 0 of 106 constraints reach the declared degree 8, validated by the
degree_census test on real witness traces). Saves one eval pass per pair per
round. Wire format unchanged (messages sized by degree_air; proofs byte-
identical).

Measured: AIR phase 766ms → 692ms (−72ms Poseidon skip, −39ms C2 tables),
e2e −5.21% wall. Proofs byte-identical, recursion −3.14% faster.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…otocol

Replace 3 committed execution columns (ADDR_A/B/C) with virtual closed-form
expressions of existing committed columns. Introduces deferred-claim bus
protocol: the prover sends one composite denominator eval per virtual memory
lookup, and the batched AIR sumcheck re-derives the fingerprint from committed
column evaluations (eval_bus_data_only constraint).

Column reduction: 20 → 17 committed execution columns → 6 fewer GKR
evaluations, 6 fewer WHIR opening statements, stacked PCS shrinks.

Full protocol stack: LogUp prover/verifier arms for deferred One-buses,
verify_execution initial_sum loop extended, recursion circuit (zkDSL)
ported with deferred-claim codegen in compilation.rs.

Soundness: adversarial bitflip test (95 transcript scalars, all binding);
padding row carries mem0 for (0, VALUE_*) bus balance; frozen-verifier
rejection is the anticipated tripwire (h9 format vs pre-h9 verifier).

Measured: −4.01% e2e wall, −4.14% recursion, −1.40% proof size.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Witness MDS: replace the fully-reducing mds_vec_mul with a u128-accumulator
variant that defers modular reduction to the final sum (6 muls × 8 accumulators
→ one reduce). Original kept as mds_vec_mul_oracle under #[cfg(test)] with
100k-pattern equality test.

Packed deferred-aux witness pass: AVX-512 vectorized fill for the auxiliary
witness columns (round intermediates), processing 8 rows in parallel with
tail-correctness validated on non-aligned row counts.

Both are prover-only (witness generation); proofs byte-identical.

Measured: witness 175ms → 124ms, e2e −2.05% wall.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…docs

Remove iteration/hypothesis/plan_spec references (h9-A, h6', T3', plan §X,
etc.) from code comments — they belong in commit messages, not code. Shorten
multi-line doc comments to concise one-liners where the code is self-
explanatory. Rename h9_deferred_adversarial.rs → deferred_claim_adversarial.rs.

Net: −167 comment lines, zero logic changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- product_computation.rs: 3 #[cfg(test)] modules (kernel oracle, recording
  FSProver equality, lazy-vs-eager transcript) → tests/product_sumcheck_equality.rs
- poseidon/mod.rs: witness microbench + deferred_aux_fill + packed_witness
  → tests/poseidon_witness.rs (mds_u128_equals_oracle stays inline — needs
  private mds_vec_mul)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant