Use compact GPU tower witnesses #1330

Open

hero78119 wants to merge 30 commits into master from feat/prover_mle_zero_padding
Conversation


@hero78119 hero78119 commented Apr 26, 2026

Summary

This PR makes the GPU prover compact/default-aware for MLE inputs while keeping protocol-facing behavior unchanged. The verifier, PCS dimensions, transcript, and sumcheck domains still use the logical domain; only GPU-resident prover data can stay compact by occupied rows.

Companion CUDA kernel changes: scroll-tech/ceno-gpu#146.

Compact vs logical domains

  • Logical domain is the constraint/sumcheck/PCS domain: num_vars, power-of-two padding, and rotation expansion remain verifier-visible and unchanged.
  • Compact domain is the occupied physical row range kept on GPU. Keccak is the main stress case: each logical syscall instance expands to 32 physical rows, and compact buffers avoid padding those rows to the full logical domain.
  • Missing compact tail entries are represented by tail_default. Most witness tails default to zero; the logup shared numerator can use a default of one.
  • CPU-side structures remain logical for compatibility. GPU proving carries compact resident length plus logical metadata and materializes logical shape only at boundaries that require it.
Logical domain, protocol view:

  rows used by constraints / transcript / PCS / sumcheck
  [ occupied physical rows ][ logical tail padding ........ ]
  <----------------------- 2^num_vars ---------------------->

Compact GPU resident view:

  [ occupied physical rows ] + metadata { logical num_vars, tail_default }

Kernel read rule:

  if index < occupied_len: read compact[index]
  else:                    read tail_default
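The read rule can be modeled on the host as a small Rust sketch. This is illustrative only: the `CompactMle` struct and its field names are stand-ins, not the PR's actual types, and a `u64` stands in for the field element.

```rust
// Hypothetical host-side model of a compact/default-aware MLE buffer;
// struct and field names are illustrative, not the PR's actual types.
struct CompactMle {
    occupied: Vec<u64>, // occupied physical rows kept resident
    num_vars: usize,    // logical domain is 2^num_vars entries
    tail_default: u64,  // value implied for every omitted tail entry
}

impl CompactMle {
    /// The kernel read rule: in-range indices hit the compact buffer,
    /// tail indices return the default without materialized padding.
    fn read(&self, index: usize) -> u64 {
        debug_assert!(index < 1usize << self.num_vars);
        if index < self.occupied.len() {
            self.occupied[index]
        } else {
            self.tail_default
        }
    }
}
```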

Flow modes

  • CPU backend: unchanged logical host MLE/RMM behavior.
  • GPU backend with CENO_GPU_ENABLE_WITGEN=0: CPU witgen still produces host traces; GPU proving extracts/copies occupied rows into compact GPU MLE specs while preserving logical num_vars for constraints and sumcheck.
  • GPU backend with CENO_GPU_ENABLE_WITGEN=1: GPU witgen can feed compact device-backed traces or replay-materialized inputs directly into the same compact proving path.
  • Replay-heavy Keccak/ShardRam: compact tower inputs are materialized for tower, released, then rematerialized for ECC/rotation/main constraints so peak VRAM is lower without changing proof semantics.
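The mode split above can be summarized as a dispatch sketch. Only the CENO_GPU_ENABLE_WITGEN flag is from the PR; `ProverFlow`, `select_flow`, and the parameter names are hypothetical.

```rust
// Hypothetical dispatch mirroring the flow modes; `witgen_flag` stands in
// for the value of the CENO_GPU_ENABLE_WITGEN environment variable.
#[derive(Debug)]
enum ProverFlow {
    CpuLogical,        // CPU backend: unchanged logical host MLE/RMM path
    GpuExtractCompact, // GPU backend, witgen=0: host rows copied into compact specs
    GpuWitgenCompact,  // GPU backend, witgen=1: compact device-backed traces
}

fn select_flow(gpu_backend: bool, witgen_flag: &str) -> ProverFlow {
    match (gpu_backend, witgen_flag) {
        (false, _) => ProverFlow::CpuLogical,
        (true, "1") => ProverFlow::GpuWitgenCompact,
        (true, _) => ProverFlow::GpuExtractCompact,
    }
}
```

All three modes converge on the same shared GPU proving stages; only how the compact specs are produced differs.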
CPU backend, no compact GPU semantics:

  CPU witgen
    -> logical RowMajorMatrix / MLEs
    -> CPU prover stages
    -> PCS / transcript / verifier all see logical domain
GPU backend, CENO_GPU_ENABLE_WITGEN=0:

  CPU witgen
    -> logical host traces / committed PCS data
    -> per-chip GPU extraction
         host logical rows -> compact GPU MLE specs
         keep { occupied_len, logical num_vars, tail_default }
    -> shared GPU proving stages
         tower -> ECC -> rotation -> main constraints -> opening
    -> PCS / transcript / verifier still see logical domain
GPU backend, CENO_GPU_ENABLE_WITGEN=1:

  GPU witgen
    -> compact device-backed traces or replay sources
    -> deferred commit / replay materialization when needed
    -> shared GPU proving stages
         tower -> ECC -> rotation -> main constraints -> opening
    -> PCS / transcript / verifier still see logical domain

  Keccak / ShardRam replay lifetime:

    materialize compact tower input
      -> prove tower
      -> drop tower input
      -> rematerialize for ECC / rotation / main
      -> open committed traces

Proving semantics

  • Sumcheck runs over logical domains. Compact metadata only changes how GPU kernels read resident buffers and defaults for omitted tail entries.
  • Tower build/prove consumes compact product/logup inputs directly and avoids carrying full-domain padded tower inputs through the proof lifetime.
  • Rotation/main GKR use the same logical constraint domains while accepting compact/default-aware GPU MLE inputs where the kernels support it.
  • Scheduler-facing estimates and memtracking distinguish compact resident bytes from logical-domain temporary bytes to avoid both under-booking and double-counting.
Shared GPU proving path:

  compact/default-aware MLE specs
    { ptr, occupied_len, logical num_vars, tail_default }
        |
        +--> tower build/prove
        |      - compact product/logup inputs
        |      - logical sumcheck rounds
        |
        +--> rotation / main GKR
        |      - logical constraint domain
        |      - compact/default-aware reads
        |
        +--> PCS opening
               - verifier-visible logical dimensions unchanged
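Why the logical domain can stay verifier-visible while the buffer stays compact: a constant tail lets full-domain quantities be computed from the occupied rows alone. A minimal sketch (function name illustrative, `u64` standing in for the field element):

```rust
// Sketch: sumcheck still ranges over the logical 2^num_vars domain, but a
// constant tail_default lets the full-domain sum be evaluated from the
// compact buffer without materializing the padding.
fn logical_sum(occupied: &[u64], num_vars: usize, tail_default: u64) -> u64 {
    let logical_len = 1u64 << num_vars;
    let tail_len = logical_len - occupied.len() as u64;
    occupied.iter().sum::<u64>() + tail_default * tail_len
}
```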

Unified paths

  • CPU witgen + GPU proving and GPU witgen + GPU proving now share the same compact chip proof stages: tower, ECC, rotation, main constraints, and PCS opening.
  • Product/logup tower construction is centralized around compact specs, including the scalar-one logup numerator case.
  • Sequential and concurrent chip proving use the same estimator model, with memtracking checks available to catch estimator drift.
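The scalar-one logup numerator case can be illustrated with a reference evaluation that materializes the logical tail explicitly: with a tail default of one, the full-domain product equals the product over the occupied rows alone, so the compact path can skip the tail entirely. Names here are illustrative.

```rust
// Reference: product over the full logical domain with the tail made
// explicit. When tail_default == 1 (the scalar-one logup numerator case)
// the tail contributes nothing, matching the compact evaluation.
fn logical_product_padded(occupied: &[u64], num_vars: usize, tail_default: u64) -> u64 {
    let logical_len = 1usize << num_vars;
    (0..logical_len)
        .map(|i| occupied.get(i).copied().unwrap_or(tail_default))
        .product()
}
```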

Reviewer focus

  • Boundaries between compact resident length and logical num_vars.
  • tail_default handling in sumcheck/tower, especially non-zero logup numerator defaults.
  • Keccak rotated physical rows and ShardRam replay/materialization lifetime.
  • Scheduler estimates for CENO_GPU_ENABLE_WITGEN=0/1 and CENO_CONCURRENT_CHIP_PROVING=0/1.
  • Verifier/protocol parity: this PR should not change proof format or transcript semantics.

Benchmark

Source runs:

Block: 23817600. Per-operation app-prove rows are profile totals across overlapped shard work, so they can exceed wall time; E2E/app-prove rows are the wall-time comparison.

| Metric | Baseline | This PR | Delta | Change |
| --- | --- | --- | --- | --- |
| E2E total time | 81.900s | 80.900s | -1.000s | -1.22% |
| app_prove wall time | 67.200s | 66.300s | -0.900s | -1.34% |
| emulator | 10.400s | 10.500s | +0.100s | +0.96% |
| commit_traces | 8.075s | 8.049s | -0.026s | -0.32% |
| extract_witness_mles | 27.569s | 28.837s | +1.268s | +4.60% |
| transport_structural_witness | 3.475s | 3.058s | -0.417s | -12.00% |
| build_tower_witness_gpu | 4.711s | 3.413s | -1.298s | -27.55% |
| prove_tower_relation_gpu | 178.197s | 188.436s | +10.239s | +5.75% |
| prove_main_constraints | 24.464s | 24.238s | -0.226s | -0.92% |
| pcs_opening | 17.892s | 17.716s | -0.176s | -0.98% |
| CPU/GPU overlap gap | 3.910s | 3.930s | +0.020s | +0.51% |

Peak memory is extracted from concurrent benchmark job logs by taking the max of [gpu device] snapshots. pool_booked is scheduler reservation/estimate, not actual VRAM usage.

| Memory metric | Baseline peak | This PR peak | Drop | Drop % |
| --- | --- | --- | --- | --- |
| cuda_used | 23637.19 MB | 21557.19 MB | 2080.00 MB | 8.80% |
| pool_used | 21792.58 MB | 19267.89 MB | 2524.69 MB | 11.59% |
| pool_reserved | 23136.00 MB | 21056.00 MB | 2080.00 MB | 8.99% |
| pool_booked | 23180.86 MB | 23180.87 MB | -0.01 MB | -0.00% |

Summary: wall time is slightly faster in this run (81.9s -> 80.9s). Peak VRAM is lower (cuda_used: 23637.19 MB -> 21557.19 MB, -2080.00 MB / -8.80%; pool_reserved: 23136.00 MB -> 21056.00 MB, -2080.00 MB / -8.99%). Compact tower build is materially faster (4.711s -> 3.413s), while the overlapped tower proving profile total is higher (178.197s -> 188.436s); because chip proving overlaps across shards, the wall-time result is the primary performance signal.

Validation commands

cargo check --features gpu --package ceno_zkvm --bin e2e
cargo make clippy
CENO_GPU_MEM_TRACKING=1 CENO_CONCURRENT_CHIP_PROVING=0 CENO_GPU_ENABLE_WITGEN=1 cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --features gpu --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall
CENO_GPU_MEM_TRACKING=0 CENO_CONCURRENT_CHIP_PROVING=1 CENO_GPU_ENABLE_WITGEN=1 cargo run --config net.git-fetch-with-cli=true --release --package ceno_zkvm --features gpu --bin e2e -- --platform=ceno --max-cycle-per-shard=1600 examples/target/riscv32im-ceno-zkvm-elf/release/examples/keccak_syscall

@hero78119 hero78119 changed the title from "Feat/prover mle zero padding" to "Use compact GPU tower witnesses" on Apr 26, 2026
@hero78119 hero78119 force-pushed the feat/prover_mle_zero_padding branch from 506a380 to df88dec on April 27, 2026 03:09
@hero78119 hero78119 force-pushed the feat/prover_mle_zero_padding branch from 2147c5d to d1ab71a on April 28, 2026 07:50