Reduce shard-0 cross-shard writes with streamed witnesses #36

Merged: hero78119 merged 14 commits into ceno from feat/opt_first_shard on May 2, 2026

@hero78119 hero78119 commented Apr 30, 2026

Summary

This PR reduces shard-0 cross-shard external writes by changing the client input path from eager materialization to just-in-time witness streaming.

Main changes:

  • Read client input through a streaming guest reader instead of deserializing the whole ClientExecutorInput up front.
  • Keep trie byte payloads as raw hint slices and materialize them only immediately before use.
  • Stream account, storage-trie, and bytecode witness items in actual access order.
  • Cache only small account values needed during execution instead of carrying the full decoded state trie from shard 0.
  • Delay full parent state-trie byte reads until bundle-state update, right before trie decode and validation.
  • Materialize post-update account/storage witnesses into keyed caches before unordered bundle-state traversal.
  • Document the cross-shard materialization result in cross_shard_opt.md.
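
The just-in-time idea in the list above can be sketched as a reader that hands out borrowed slices in access order and leaves decoding to the caller. This is only an illustration: `HintStream`, `next_raw`, `encode_items`, and the 4-byte little-endian length prefix are hypothetical and are not the PR's `ClientInputReader` or its wire format.

```rust
/// Hypothetical stand-in for a streamed hint source: a flat buffer of
/// length-prefixed items that the guest consumes in access order.
pub struct HintStream<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> HintStream<'a> {
    pub fn new(buf: &'a [u8]) -> Self {
        Self { buf, pos: 0 }
    }

    /// Returns the next item as a borrowed slice. Nothing is copied or
    /// decoded here; the caller materializes the item right before use.
    /// Returns `None` at end of input or if the buffer is truncated.
    pub fn next_raw(&mut self) -> Option<&'a [u8]> {
        let len_end = self.pos.checked_add(4)?;
        let b = self.buf.get(self.pos..len_end)?;
        let len = u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize;
        let end = len_end.checked_add(len)?;
        let item = self.buf.get(len_end..end)?;
        self.pos = end;
        Some(item)
    }
}

/// Host side of the sketch: writes each item with a 4-byte little-endian
/// length prefix (an assumed format, chosen only for this example).
pub fn encode_items(items: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for item in items {
        out.extend_from_slice(&(item.len() as u32).to_le_bytes());
        out.extend_from_slice(item);
    }
    out
}
```

Because the reader never owns or decodes the payloads, an item that is never requested costs nothing beyond its bytes in the hint buffer.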

Implementation Decision

After reviewing the latest benchmark, we decided to revert commit 2b5a2ec06b617def6a0b1a65852493234dfec0a (Reduce cross-shard hint reads for storage tries). The dual-region storage-trie reread experiment reduced ShardRAM pressure, but it added guest work, increased shard count, and regressed E2E time.

The benchmark tables below are intentionally kept as historical data for that experiment. The PR implementation should be evaluated as the streamed-witness refactor without the reverted dual-region storage-trie reread change.

Trust and Verification

Hint data remains untrusted. The streaming refactor keeps the usual verification checks:

  • Parent state trie is decoded and checked against the ancestor header state root.
  • Streamed account witness items are validated against the decoded parent state trie before applying bundle updates.
  • Streamed storage tries are checked against the expected account storage root.
  • Streamed bytecodes are checked against the requested bytecode hash.
  • Final post-execution state root is still checked against the executed block header.
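
The bytecode check in the list above follows a common pattern: hash the untrusted payload and compare it with the value the executor already expects before using the bytes. A minimal sketch, assuming a caller-supplied hash function; `check_hint`, `HintError`, and `toy_digest` are hypothetical names, and `toy_digest` is a placeholder that must not be mistaken for Keccak-256 or for the PR's actual validation code.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum HintError {
    HashMismatch { expected: [u8; 32], actual: [u8; 32] },
}

/// Accepts an untrusted payload only if its digest equals `expected`.
/// The hash function is injected so the sketch stays dependency-free;
/// real code would pass a cryptographic hash here.
pub fn check_hint<H>(hash: H, expected: [u8; 32], payload: &[u8]) -> Result<&[u8], HintError>
where
    H: Fn(&[u8]) -> [u8; 32],
{
    let actual = hash(payload);
    if actual == expected {
        Ok(payload)
    } else {
        Err(HintError::HashMismatch { expected, actual })
    }
}

/// Toy digest for demonstration only. It is NOT collision resistant and
/// is NOT Keccak-256; it exists so the example compiles without crates.
pub fn toy_digest(data: &[u8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (i, b) in data.iter().enumerate() {
        out[i % 32] = out[i % 32].wrapping_mul(31).wrapping_add(*b);
    }
    out
}
```

Returning the payload only on success makes it hard to use the bytes by accident before the check has run.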

CI Benchmark

Compared block 23817600 CI runs:

| Run | Branch / SHA | Result |
| --- | --- | --- |
| Original baseline | ceno / 09edb1b5 | https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25004787999 |
| Initial streamed-witness PR | feat/prover_mle_zero_padding / c4ab85b | https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25167964126 |
| Previous feat/opt_first_shard baseline | feat/opt_first_shard / 65a75752 | https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25168546731 |
| Latest PR | feat/opt_first_shard / f17d2166 | https://github.com/scroll-tech/ceno-reth-benchmark/actions/runs/25240584862 |

End-to-End / Proving

Historical note: the Latest PR column below is the benchmark for f17d2166, before reverting 2b5a2ec06b617def6a0b1a65852493234dfec0a. We keep it in the PR description to document why the dual-region storage-trie reread experiment was reverted. Latest delta/change are computed against the previous feat/opt_first_shard baseline 65a75752.

| Metric | Original baseline | Initial PR | Latest PR | Latest delta vs 65a75752 | Latest change vs 65a75752 |
| --- | --- | --- | --- | --- | --- |
| E2E total time | 81.9s | 74.4s | 82.6s | +7.2s | +9.6% |
| app_prove.inner / app prove | 67.2s | 59.7s | 66.5s | +5.8s | +9.6% |
| preflight-execute / emulator | 10.4s | 10.2s | 11.3s | +1.2s | +11.9% |
| Instructions | - | - | 349,475,672 | +35,677,896 | +11.4% |
| Cycles | - | - | 1,397,902,692 | +142,711,584 | +11.4% |
| Shards | - | - | 17 | +2 | +13.3% |
| App verify | - | - | 3.73s | +0.34s | +10.0% |
| Total witness generation | 39.25s | 38.02s | 43.04s | +0.11s | +0.3% |
| Shard-0 witness generation | 3.91s | 2.94s | 2.91s | -0.08s | -2.7% |
| pcs_opening | 17.892s | 15.252s | 17.183s | +1.811s | +11.8% |
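
For readers reproducing the delta and change columns: each is computed against the 65a75752 run, not against the Original baseline or Initial PR columns. A small sketch of that arithmetic, using two pairs this PR description states explicitly (shard count 15 -> 17 and total ShardRAM instances 3,230,576 -> 2,791,174); the helper names are illustrative.

```rust
/// Signed difference between the latest run and the comparison baseline.
pub fn delta(baseline: u64, latest: u64) -> i64 {
    latest as i64 - baseline as i64
}

/// Relative change in percent, or `None` when the baseline is zero.
pub fn percent_change(baseline: u64, latest: u64) -> Option<f64> {
    if baseline == 0 {
        return None;
    }
    Some(delta(baseline, latest) as f64 / baseline as f64 * 100.0)
}

/// Rounds to one decimal place, matching the precision used in the tables.
pub fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}
```

With these helpers, `round1(percent_change(15, 17).unwrap())` gives the +13.3% reported for shard count. The timing rows appear to be computed from unrounded measurements, so recomputing them from the rounded values shown can differ in the last digit.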

ShardRAM Circuit Instances

| Metric | Original baseline | Initial PR | Latest PR | Latest delta vs 65a75752 | Latest change vs 65a75752 |
| --- | --- | --- | --- | --- | --- |
| Total ShardRAM instances | 6,838,864 | 3,301,398 | 2,791,174 | -439,402 | -13.6% |
| Shard-0 ShardRAM instances | 2,081,547 | 166,979 | 110,355 | -56,659 | -33.9% |
| Shard-1 ShardRAM instances | 747,286 | 185,852 | 66,136 | -119,781 | -64.4% |
| Shard-2 ShardRAM instances | 560,960 | 142,172 | 109,879 | -32,481 | -22.8% |
| Shard-3 ShardRAM instances | 619,185 | 204,592 | 83,973 | -120,584 | -58.9% |
| Max shard ShardRAM instances | 2,081,547 | 471,008 | 334,336 | -82,663 | -19.8% |
| Total shard_ram_assign_instances time | 2.455s | 1.108s | 1.074s | -0.185s | -14.7% |
| Shard-0 shard_ram_assign_instances time | 390ms | 85.3ms | 55.6ms | -42.7ms | -43.4% |

GPU Module Breakdown

| Metric | Original baseline | Initial PR | Latest PR | Latest delta vs 65a75752 | Latest change vs 65a75752 |
| --- | --- | --- | --- | --- | --- |
| commit_traces | 8.075s | 6.898s | 7.577s | +0.674s | +9.8% |
| prove_main_constraints | 24.464s | 20.796s | 24.328s | +0.246s | +1.0% |
| transport_structural_witness | 3.475s | 2.163s | 2.642s | -0.121s | -4.4% |
| build_tower_witness_gpu | 4.711s | 3.133s | 5.566s | +1.151s | +26.1% |
| prove_tower_relation_gpu | 178.197s | 170.119s | 181.423s | +8.835s | +5.1% |
| pcs_opening | - | - | 17.183s | +1.811s | +11.8% |

ShardRAM Distribution

| Shard | Original baseline | Initial PR | Latest PR | Latest delta vs 65a75752 | Latest change vs 65a75752 |
| --- | --- | --- | --- | --- | --- |
| 0 | 2,081,547 | 166,979 | 110,355 | -56,659 | -33.9% |
| 1 | 747,286 | 185,852 | 66,136 | -119,781 | -64.4% |
| 2 | 560,960 | 142,172 | 109,879 | -32,481 | -22.8% |
| 3 | 619,185 | 204,592 | 83,973 | -120,584 | -58.9% |
| 4 | 405,006 | 63,851 | 50,808 | -13,213 | -20.6% |
| 5 | 327,294 | 208,019 | 185,825 | -22,121 | -10.6% |
| 6 | 188,069 | 162,476 | 135,677 | -26,701 | -16.4% |
| 7 | 160,828 | 109,640 | 78,506 | -31,102 | -28.4% |
| 8 | 198,634 | 186,436 | 167,883 | -18,751 | -10.0% |
| 9 | 220,053 | 170,506 | 168,799 | -1,714 | -1.0% |
| 10 | 165,876 | 316,420 | 316,330 | +147 | +0.0% |
| 11 | 201,228 | 402,888 | 334,336 | -49,240 | -12.8% |
| 12 | 443,809 | 471,008 | 134,547 | -282,452 | -67.7% |
| 13 | 386,616 | 377,753 | 167,113 | -214,408 | -56.2% |
| 14 | 132,473 | 132,806 | 188,217 | +56,868 | +43.3% |
| 15 | - | - | 307,921 | new shard | n/a |
| 16 | - | - | 184,869 | new shard | n/a |
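
As a quick consistency check on the Latest PR column, the per-shard counts can be summed and scanned for the maximum; the results should agree with the "Total ShardRAM instances" and "Max shard ShardRAM instances" rows reported earlier. The constant and helper names below are illustrative; the numbers are copied from the Latest PR column.

```rust
/// Latest PR (f17d2166) ShardRAM instances per shard, shards 0 through 16.
pub const LATEST_PR_SHARD_RAM: [u64; 17] = [
    110_355, 66_136, 109_879, 83_973, 50_808, 185_825, 135_677, 78_506, 167_883,
    168_799, 316_330, 334_336, 134_547, 167_113, 188_217, 307_921, 184_869,
];

/// Sum of instances across all shards.
pub fn total_instances(per_shard: &[u64]) -> u64 {
    per_shard.iter().sum()
}

/// Index and size of the largest shard, or `None` for an empty slice.
pub fn max_shard(per_shard: &[u64]) -> Option<(usize, u64)> {
    per_shard
        .iter()
        .copied()
        .enumerate()
        .max_by_key(|&(_, n)| n)
}
```

The sum is 2,791,174 and the largest shard is shard 11 with 334,336 instances, both matching the summary rows.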

Historical f17d2166 result summary:

  • The reverted experiment improved total ShardRAM instances versus the previous feat/opt_first_shard baseline: 3,230,576 -> 2,791,174 (-13.6%).
  • Shard 0 improved further: 167,014 -> 110,355 (-33.9%).
  • Runtime regressed: E2E 75.4s -> 82.6s (+9.6%), instructions/cycles +11.4%, and shard count 15 -> 17.
  • Decision: revert the dual-region storage-trie reread change and keep this benchmark only as rationale/history.

Note: these CI logs do not include CENO_DEBUG_SHARD_RAM=1, so current_shard_external_write is not available from these raw logs. The comparable raw-log metric here is shard_ram_assign_instances n.

Local Validation

Validated locally on block 23587691, shard 0:

  • cargo check -p openvm-client-executor -p openvm-reth-benchmark
  • cargo ceno build --release from bin/ceno-client-eth
  • GPU sanity e2e with CENO_DEBUG_SHARD_RAM=1

Latest local key metrics after reverting 2b5a2ec06b617def6a0b1a65852493234dfec0a:

| Metric | Value |
| --- | --- |
| Instructions | 24,793,946 |
| Cycles | 99,175,788 |
| Shards | 2 |
| Shard cycle boundaries | [4, 60806136, 99175788] |
| Shard-0 shard_ram_assign_instances | 89,594 |
| E2E span | 75.7s |
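
The shard cycle boundaries can be turned into per-shard cycle counts by differencing neighbouring entries. This sketch assumes each shard covers the half-open range between two consecutive boundaries, which the log line does not state explicitly; `shard_spans` is a hypothetical helper.

```rust
/// Cycle count of each shard, assuming shard i spans
/// [boundaries[i], boundaries[i + 1]). Boundaries must be non-decreasing;
/// `saturating_sub` keeps the sketch from panicking on malformed input.
pub fn shard_spans(boundaries: &[u64]) -> Vec<u64> {
    boundaries
        .windows(2)
        .map(|w| w[1].saturating_sub(w[0]))
        .collect()
}
```

Under that assumption, the boundaries `[4, 60806136, 99175788]` give 60,806,132 cycles for shard 0 and 38,369,652 cycles for shard 1.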

Local log: sanity_23587691_shard0_after_revert_local_maxcell6_20260502_100230.log.

Note: the local GPU sanity used the temporary max-cell validation knob required on this machine to avoid host memory pressure; that knob is not included in this PR.

@hero78119 hero78119 changed the title from "Feat/opt first shard" to "Reduce shard-0 cross-shard writes with streamed witnesses" Apr 30, 2026
@hero78119

@copilot please strictly follow https://github.com/forrestchang/andrej-karpathy-skills/blob/main/CLAUDE.md skill and review and refactor this PR and submit changes in a separate PR.

Copilot AI commented May 1, 2026

@hero78119 I've opened a new pull request, #37, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI left a comment

Pull request overview

This PR refactors the client input/witness plumbing to stream witness items in just-in-time access order (instead of eagerly materializing/serializing the full ClientExecutorInput upfront), with the goal of reducing shard-0 cross-shard external writes during proving.

Changes:

  • Introduces a streaming input protocol (ClientInputReader, ClientWitnessInput, trie headers/byte payload streaming) and a new ClientExecutor::execute_from_reader path.
  • Adds witness-access order recording to drive streaming emission order from the host side.
  • Updates the Ceno guest binary and host benchmark harness to use the new streaming input format; adds required dependencies.
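
The witness-access order recording mentioned above can be pictured as a two-step process: a host-side dry run logs each witness key the first time it is touched, and the host then emits the witness payloads in exactly that order. The sketch below is an assumption-laden illustration; `WitnessKey`, `AccessRecorder`, and `emit_in_order` are hypothetical and do not mirror the PR's `execute_recording_witness_order` signature.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical witness identifiers; the u64 payloads stand in for
/// addresses and hashes.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum WitnessKey {
    Account(u64),
    Storage(u64),
    Bytecode(u64),
}

/// Records each key the first time it is accessed during a dry run.
#[derive(Default)]
pub struct AccessRecorder {
    seen: HashSet<WitnessKey>,
    order: Vec<WitnessKey>,
}

impl AccessRecorder {
    pub fn touch(&mut self, key: WitnessKey) {
        // `insert` returns true only for keys not seen before.
        if self.seen.insert(key) {
            self.order.push(key);
        }
    }

    pub fn order(&self) -> &[WitnessKey] {
        &self.order
    }
}

/// Lays out witness payloads in recorded access order. Returns `None`
/// if any recorded key has no payload, so a gap is never silently skipped.
pub fn emit_in_order<'a>(
    order: &[WitnessKey],
    witnesses: &'a HashMap<WitnessKey, Vec<u8>>,
) -> Option<Vec<&'a [u8]>> {
    order
        .iter()
        .map(|key| witnesses.get(key).map(Vec::as_slice))
        .collect()
}
```

Deduplicating on first access keeps the emitted stream aligned with a guest that caches a witness after reading it once.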

Reviewed changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| crates/host-bench/src/lib.rs | Writes streaming-formatted client input/witness data into CenoStdin instead of serializing ClientExecutorInput directly. |
| crates/host-bench/Cargo.toml | Adds reth-trie dependency needed for account RLP extraction. |
| crates/executor/client/src/lib.rs | Adds execute_from_reader and witness-order recording to support streamed witness consumption/emission. |
| crates/executor/client/src/io.rs | Defines streaming input types/reader trait, implements StreamingEthereumState, and refactors WitnessDb to support eager vs streaming providers. |
| crates/executor/client/src/error.rs | Adds TrieWitnessError for streaming witness validation failures. |
| crates/executor/client/Cargo.toml | Adds bytes dependency for streamed trie byte payload representation. |
| bin/ceno-client-eth/src/main.rs | Switches guest input reading from ClientExecutorInput to the streaming reader interface. |
| Cargo.lock | Lockfile updates for new/updated dependencies. |
| bin/ceno-client-eth/Cargo.lock | Lockfile updates for new/updated dependencies. |

Comment thread crates/executor/client/src/io.rs
Comment thread crates/executor/client/src/io.rs
hero78119 and others added 4 commits May 2, 2026 09:18
* Initial plan

* Refactor: simplify execute_recording_witness_order return type and extract build_block_hashes helper

Agent-Logs-Url: https://github.com/scroll-tech/ceno-reth-benchmark/sessions/197ed1b0-bdfa-4cbf-9551-d742665adc4e

Co-authored-by: hero78119 <3962077+hero78119@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: hero78119 <3962077+hero78119@users.noreply.github.com>
@hero78119 hero78119 merged commit 3c03879 into ceno May 2, 2026
5 of 7 checks passed
@hero78119 hero78119 deleted the feat/opt_first_shard branch May 2, 2026 02:32