Reduce shard-0 cross-shard writes with streamed witnesses#36
Reduce shard-0 cross-shard writes with streamed witnesses#36
Conversation
|
@copilot please strictly follow https://github.com/forrestchang/andrej-karpathy-skills/blob/main/CLAUDE.md skill and review and refactor this PR and submit changes in a separate PR. |
|
@hero78119 I've opened a new pull request, #37, to work on those changes. Once the pull request is ready, I'll request review from you. |
There was a problem hiding this comment.
Pull request overview
This PR refactors the client input/witness plumbing to stream witness items in just-in-time access order (instead of eagerly materializing/serializing the full ClientExecutorInput upfront), with the goal of reducing shard-0 cross-shard external writes during proving.
Changes:
- Introduces a streaming input protocol (
ClientInputReader,ClientWitnessInput, trie headers/byte payload streaming) and a newClientExecutor::execute_from_readerpath. - Adds witness-access order recording to drive streaming emission order from the host side.
- Updates the Ceno guest binary and host benchmark harness to use the new streaming input format; adds required dependencies.
Reviewed changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/host-bench/src/lib.rs | Writes streaming-formatted client input/witness data into CenoStdin instead of serializing ClientExecutorInput directly. |
| crates/host-bench/Cargo.toml | Adds reth-trie dependency needed for account RLP extraction. |
| crates/executor/client/src/lib.rs | Adds execute_from_reader and witness-order recording to support streamed witness consumption/emission. |
| crates/executor/client/src/io.rs | Defines streaming input types/reader trait, implements StreamingEthereumState, and refactors WitnessDb to support eager vs streaming providers. |
| crates/executor/client/src/error.rs | Adds TrieWitnessError for streaming witness validation failures. |
| crates/executor/client/Cargo.toml | Adds bytes dependency for streamed trie byte payload representation. |
| bin/ceno-client-eth/src/main.rs | Switches guest input reading from ClientExecutorInput to the streaming reader interface. |
| Cargo.lock | Lockfile updates for new/updated dependencies. |
| bin/ceno-client-eth/Cargo.lock | Lockfile updates for new/updated dependencies. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
* Initial plan * Refactor: simplify execute_recording_witness_order return type and extract build_block_hashes helper Agent-Logs-Url: https://github.com/scroll-tech/ceno-reth-benchmark/sessions/197ed1b0-bdfa-4cbf-9551-d742665adc4e Co-authored-by: hero78119 <3962077+hero78119@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: hero78119 <3962077+hero78119@users.noreply.github.com>
This reverts commit 2b5a2ec.
Summary
This PR reduces shard-0 cross-shard external writes by changing the client input path from eager materialization to just-in-time witness streaming.
Main changes:
ClientExecutorInputup front.cross_shard_opt.md.Implementation Decision
After reviewing the latest benchmark, we decided to revert commit
2b5a2ec06b617def6a0b1a65852493234dfec0a(Reduce cross-shard hint reads for storage tries). The dual-region storage-trie reread experiment reduced ShardRAM pressure, but it added guest work, increased shard count, and regressed E2E time.The benchmark tables below are intentionally kept as historical data for that experiment. The PR implementation should be evaluated as the streamed-witness refactor without the reverted dual-region storage-trie reread change.
Trust and Verification
Hint data remains untrusted. The streaming refactor keeps the usual verification checks:
CI Benchmark
Compared block
23817600CI runs:ceno/09edb1b5feat/prover_mle_zero_padding/c4ab85bfeat/opt_first_shardbaselinefeat/opt_first_shard/65a75752feat/opt_first_shard/f17d2166End-to-End / Proving
Historical note: the
Latest PRcolumn below is the benchmark forf17d2166, before reverting2b5a2ec06b617def6a0b1a65852493234dfec0a. We keep it in the PR description to document why the dual-region storage-trie reread experiment was reverted. Latest delta/change are computed against the previousfeat/opt_first_shardbaseline65a75752.65a7575265a75752app_prove.inner/ app provepreflight-execute/ emulatorpcs_openingShardRAM Circuit Instances
65a7575265a75752shard_ram_assign_instancestimeshard_ram_assign_instancestimeGPU Module Breakdown
65a7575265a75752commit_tracesprove_main_constraintstransport_structural_witnessbuild_tower_witness_gpuprove_tower_relation_gpupcs_openingShardRAM Distribution
65a7575265a75752Historical
f17d2166result summary:feat/opt_first_shardbaseline:3,230,576 -> 2,791,174(-13.6%).167,014 -> 110,355(-33.9%).75.4s -> 82.6s(+9.6%), instructions/cycles+11.4%, and shard count15 -> 17.Note: these CI logs do not include
CENO_DEBUG_SHARD_RAM=1, socurrent_shard_external_writeis not available from these raw logs. The comparable raw-log metric here isshard_ram_assign_instances n.Local Validation
Validated locally on block
23587691, shard0:cargo check -p openvm-client-executor -p openvm-reth-benchmarkcargo ceno build --releasefrombin/ceno-client-ethCENO_DEBUG_SHARD_RAM=1Latest local key metrics after reverting
2b5a2ec06b617def6a0b1a65852493234dfec0a:[4, 60806136, 99175788]shard_ram_assign_instancesLocal log:
sanity_23587691_shard0_after_revert_local_maxcell6_20260502_100230.log.Note: the local GPU sanity used the temporary max-cell validation knob required on this machine to avoid host memory pressure; that knob is not included in this PR.