Use compact GPU tower witnesses #1330

Open

hero78119 wants to merge 30 commits into master from feat/prover_mle_zero_padding
Conversation


@hero78119 hero78119 commented Apr 26, 2026

Summary

This PR makes the GPU prover compact/default-aware for MLE inputs while keeping protocol-facing behavior unchanged. The verifier, PCS dimensions, transcript, and sumcheck domains still use the logical domain; only GPU-resident prover data can stay compact by occupied rows.

Companion CUDA kernel changes: scroll-tech/ceno-gpu#146.

Compact vs logical domains

  • Logical domain is the constraint/sumcheck/PCS domain: num_vars, power-of-two padding, and rotation expansion remain verifier-visible and unchanged.
  • Compact domain is the occupied physical row range kept on GPU. Keccak is the main stress case: each logical syscall instance expands to 32 physical rows, and compact buffers avoid padding those rows to the full logical domain.
  • Missing compact tail entries are represented by tail_default. Most witness tails default to zero; the logup shared numerator can use a default of one.
  • CPU-side structures remain logical for compatibility. GPU proving carries compact resident length plus logical metadata and materializes logical shape only at boundaries that require it.
Logical domain, protocol view:

  rows used by constraints / transcript / PCS / sumcheck
  [ occupied physical rows ][ logical tail padding ........ ]
  <----------------------- 2^num_vars ---------------------->

Compact GPU resident view:

  [ occupied physical rows ] + metadata { logical num_vars, tail_default }

Kernel read rule:

  if index < occupied_len: read compact[index]
  else:                    read tail_default
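The read rule can be modeled on the host as a small Rust sketch. This is illustrative only: the `CompactMle` struct and its field names are stand-ins, not the PR's actual types, and a `u64` stands in for the field element.

```rust
// Hypothetical host-side model of a compact/default-aware MLE buffer;
// struct and field names are illustrative, not the PR's actual types.
struct CompactMle {
    occupied: Vec<u64>, // occupied physical rows kept resident
    num_vars: usize,    // logical domain is 2^num_vars entries
    tail_default: u64,  // value implied for every omitted tail entry
}

impl CompactMle {
    /// The kernel read rule: in-range indices hit the compact buffer,
    /// tail indices return the default without materialized padding.
    fn read(&self, index: usize) -> u64 {
        debug_assert!(index < 1usize << self.num_vars);
        if index < self.occupied.len() {
            self.occupied[index]
        } else {
            self.tail_default
        }
    }
}
```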

Flow modes

  • CPU backend: unchanged logical host MLE/RMM behavior.
  • GPU backend with CENO_GPU_ENABLE_WITGEN=0: CPU witgen still produces host traces; GPU proving extracts/copies occupied rows into compact GPU MLE specs while preserving logical num_vars for constraints and sumcheck.
  • GPU backend with CENO_GPU_ENABLE_WITGEN=1: GPU witgen can feed compact device-backed traces or replay-materialized inputs directly into the same compact proving path.
  • Replay-heavy Keccak/ShardRam: compact tower inputs are materialized for tower, released, then rematerialized for ECC/rotation/main constraints so peak VRAM is lower without changing proof semantics.
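The mode split above can be summarized as a dispatch sketch. Only the CENO_GPU_ENABLE_WITGEN flag is from the PR; `ProverFlow`, `select_flow`, and the parameter names are hypothetical.

```rust
// Hypothetical dispatch mirroring the flow modes; `witgen_flag` stands in
// for the value of the CENO_GPU_ENABLE_WITGEN environment variable.
#[derive(Debug)]
enum ProverFlow {
    CpuLogical,        // CPU backend: unchanged logical host MLE/RMM path
    GpuExtractCompact, // GPU backend, witgen=0: host rows copied into compact specs
    GpuWitgenCompact,  // GPU backend, witgen=1: compact device-backed traces
}

fn select_flow(gpu_backend: bool, witgen_flag: &str) -> ProverFlow {
    match (gpu_backend, witgen_flag) {
        (false, _) => ProverFlow::CpuLogical,
        (true, "1") => ProverFlow::GpuWitgenCompact,
        (true, _) => ProverFlow::GpuExtractCompact,
    }
}
```

All three modes converge on the same shared GPU proving stages; only how the compact specs are produced differs.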
CPU backend, no compact GPU semantics:

  CPU witgen
    -> logical RowMajorMatrix / MLEs
    -> CPU prover stages
    -> PCS / transcript / verifier all see logical domain
GPU backend, CENO_GPU_ENABLE_WITGEN=0:

  CPU witgen
    -> logical host traces / committed PCS data
    -> per-chip GPU extraction
         host logical rows -> compact GPU MLE specs
         keep { occupied_len, logical num_vars, tail_default }
    -> shared GPU proving stages
         tower -> ECC -> rotation -> main constraints -> opening
    -> PCS / transcript / verifier still see logical domain
GPU backend, CENO_GPU_ENABLE_WITGEN=1:

  GPU witgen
    -> compact device-backed traces or replay sources
    -> deferred commit / replay materialization when needed
    -> shared GPU proving stages
         tower -> ECC -> rotation -> main constraints -> opening
    -> PCS / transcript / verifier still see logical domain

  Keccak / ShardRam replay lifetime:

    materialize compact tower input
      -> prove tower
      -> drop tower input
      -> rematerialize for ECC / rotation / main
      -> open committed traces

Proving semantics

  • Sumcheck runs over logical domains. Compact metadata only changes how GPU kernels read resident buffers and defaults for omitted tail entries.
  • Tower build/prove consumes compact product/logup inputs directly and avoids carrying full-domain padded tower inputs through the proof lifetime.
  • Rotation/main GKR use the same logical constraint domains while accepting compact/default-aware GPU MLE inputs where the kernels support it.
  • Scheduler-facing estimates and memtracking distinguish compact resident bytes from logical-domain temporary bytes to avoid both under-booking and double-counting.
Shared GPU proving path:

  compact/default-aware MLE specs
    { ptr, occupied_len, logical num_vars, tail_default }
        |
        +--> tower build/prove
        |      - compact product/logup inputs
        |      - logical sumcheck rounds
        |
        +--> rotation / main GKR
        |      - logical constraint domain
        |      - compact/default-aware reads
        |
        +--> PCS opening
               - verifier-visible logical dimensions unchanged
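Why the logical domain can stay verifier-visible while the buffer stays compact: a constant tail lets full-domain quantities be computed from the occupied rows alone. A minimal sketch (function name illustrative, `u64` standing in for the field element):

```rust
// Sketch: sumcheck still ranges over the logical 2^num_vars domain, but a
// constant tail_default lets the full-domain sum be evaluated from the
// compact buffer without materializing the padding.
fn logical_sum(occupied: &[u64], num_vars: usize, tail_default: u64) -> u64 {
    let logical_len = 1u64 << num_vars;
    let tail_len = logical_len - occupied.len() as u64;
    occupied.iter().sum::<u64>() + tail_default * tail_len
}
```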

Unified paths

  • CPU witgen + GPU proving and GPU witgen + GPU proving now share the same compact chip proof stages: tower, ECC, rotation, main constraints, and PCS opening.
  • Product/logup tower construction is centralized around compact specs, including the scalar-one logup numerator case.
  • Sequential and concurrent chip proving use the same estimator model, with memtracking checks available to catch estimator drift.
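The scalar-one logup numerator case can be illustrated with a reference evaluation that materializes the logical tail explicitly: with a tail default of one, the full-domain product equals the product over the occupied rows alone, so the compact path can skip the tail entirely. Names here are illustrative.

```rust
// Reference: product over the full logical domain with the tail made
// explicit. When tail_default == 1 (the scalar-one logup numerator case)
// the tail contributes nothing, matching the compact evaluation.
fn logical_product_padded(occupied: &[u64], num_vars: usize, tail_default: u64) -> u64 {
    let logical_len = 1usize << num_vars;
    (0..logical_len)
        .map(|i| occupied.get(i).copied().unwrap_or(tail_default))
        .product()
}
```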

Reviewer focus

  • Boundaries between compact resident length and logical num_vars.
  • tail_default handling in sumcheck/tower, especially non-zero logup numerator defaults.
  • Keccak rotated physical rows and ShardRam replay/materialization lifetime.
  • Scheduler estimates for CENO_GPU_ENABLE_WITGEN=0/1 and CENO_CONCURRENT_CHIP_PROVING=0/1.
  • Verifier/protocol parity: this PR should not change proof format or transcript semantics.

Benchmark

Source runs:

Block: 23817600. Per-operation app-prove rows are profile totals across overlapped shard work, so they can exceed wall time; E2E/app-prove rows are the wall-time comparison.

| Metric | Baseline | This PR | Delta | Change |
| --- | --- | --- | --- | --- |
| E2E total time | 81.900s | 80.900s | -1.000s | -1.22% |
| app_prove wall time | 67.200s | 66.300s | -0.900s | -1.34% |
| emulator | 10.400s | 10.500s | +0.100s | +0.96% |
| commit_traces | 8.075s | 8.049s | -0.026s | -0.32% |
| extract_witness_mles | 27.569s | 28.837s | +1.268s | +4.60% |
| transport_structural_witness | 3.475s | 3.058s | -0.417s | -12.00% |
| build_tower_witness_gpu | 4.711s | 3.413s | -1.298s | -27.55% |
| prove_tower_relation_gpu | 178.197s | 188.436s | +10.239s | +5.75% |
| prove_main_constraints | 24.464s | 24.238s | -0.226s | -0.92% |
| pcs_opening | 17.892s | 17.716s | -0.176s | -0.98% |
| CPU/GPU overlap gap | 3.910s | 3.930s | +0.020s | +0.51% |

Peak memory is extracted from concurrent benchmark job logs by taking the max of [gpu device] snapshots. pool_booked is scheduler reservation/estimate, not actual VRAM usage.

| Memory metric | Baseline peak | This PR peak | Drop | Drop % |
| --- | --- | --- | --- | --- |
| cuda_used | 23637.19 MB | 21557.19 MB | 2080.00 MB | 8.80% |
| pool_used | 21792.58 MB | 19267.89 MB | 2524.69 MB | 11.59% |
| pool_reserved | 23136.00 MB | 21056.00 MB | 2080.00 MB | 8.99% |
| pool_booked | 23180.86 MB | 23180.87 MB | -0.01 MB | -0.00% |

Summary: wall time is slightly faster in this run (81.9s -> 80.9s). Peak VRAM is lower (cuda_used: 23637.19 MB -> 21557.19 MB, -2080.00 MB / -8.80%; pool_reserved: 23136.00 MB -> 21056.00 MB, -2080.00 MB / -8.99%). Compact tower build is materially faster (4.711s -> 3.413s), while the overlapped tower proving profile total is higher (178.197s -> 188.436s); because chip proving overlaps across shards, the wall-time result is the primary performance signal.

Validation commands

cargo check --features gpu --package ceno_zkvm --bin e2e
cargo make clippy
CENO_GPU_MEM_TRACKING=1 CENO_CONCURRENT_CHIP_PROVING=0 CENO_GPU_ENABLE_WITGEN=1 cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --features gpu --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
CENO_GPU_MEM_TRACKING=0 CENO_CONCURRENT_CHIP_PROVING=1 CENO_GPU_ENABLE_WITGEN=1 cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --features gpu --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall

@hero78119 hero78119 changed the title from "Feat/prover mle zero padding" to "Use compact GPU tower witnesses" on Apr 26, 2026
@hero78119 hero78119 force-pushed the feat/prover_mle_zero_padding branch from 506a380 to df88dec on April 27, 2026 03:09
@hero78119 hero78119 force-pushed the feat/prover_mle_zero_padding branch from 2147c5d to d1ab71a on April 28, 2026 07:50