GPU-native Circle STARK prover with end-to-end proof generation and verification on NVIDIA Blackwell (RTX 5090) and Ada Lovelace (RTX 4090). Rust + CUDA.
To our knowledge, the only public Circle STARK prover with a real GPU backend. The upstream stwo reference implementation (Starkware) runs on CPU; VortexSTARK is a from-scratch CUDA implementation of the same construction — trace generation, NTT, Merkle commitment, LogUp interaction, FRI, and proof-of-work all run kernel-side with no host round-trips on the hot path.
Two STARK flavors ship in-tree:
- Fibonacci STARK — the canonical 1-column / 1-transition example, used as the prover/verifier baseline and benchmark target
- Cairo VM STARK — full 34-column trace with 35 transition constraints, prove arbitrary Cairo programs (.casm or fetched from Starknet)
Nine of the prover's GPU kernels are emitted by Forge (a formally-verified systems language with Z3 proof discharge) and run default-on. Total of 0 user-supplied assume() across the forge-emitted surface — every fact in the production prover is SMT-discharged.
Forge (.fg) ──► CUDA C ──► nvcc ──► cubin ──► GPU (in-tree prover)
↑ 9 kernels in production: NTT, NTT-batch, FRI, blake2s,
permute, bit-reverse, gather, barycentric, grind
Open-toolchain build (no NVIDIA compiler) is also supported via the OpenCUDA + OpenPTXas pair — they consume the same Forge-emitted CUDA C / PTX. Cross-stack tooling (run / compare / benchmark / classify any kernel through any backend) lives in forge-workbench.
Source-available under BSL 1.1, converts to Apache 2.0 on 2029-03-20. Non-production use permitted today; commercial licensing available before the conversion date — contact garrick.wagner@gmail.com.
- Fibonacci STARK: Full prove → verify pipeline, 100-bit conjectured security, 18+ tamper-detection tests
- Cairo VM STARK: 34-column trace, 35 transition constraints, verifier independently evaluates all constraints at query points
- LogUp memory consistency: Full cancellation check — memory table committed as explicit proof data (all unique entries), verifier independently checks exec_sum + table_sum == 0
- Pedersen hash: GPU windowed EC scalar multiplication, 37.7M hashes/sec, verified against CPU reference
- Poseidon2 hash: GPU trace generation + NTT, 4.7M hashes/sec at log_n=28 (30 rows/perm, RF=8 RP=22)
- RPO-M31 hash: Circle STARK–native hash (eprint 2024/1635), 3.5M hashes/sec at log_n=28 (14 rows/perm, 24 cols)
- FRI: Circle fold + line folds, GPU-resident decommitment, all fold equations verified
Default build, BLOWUP_BITS=2, 160-bit security, N_QUERIES=80, PoW=26 bits:
| Workload | Scale | Prove | Verify |
|---|---|---|---|
| Fibonacci log_n=24 | 16.8M elements | 214ms | 6.2ms |
| Fibonacci log_n=28 | 268M elements | 1.55s | 8.2ms |
| Fibonacci log_n=29 | 537M elements | 8.9s | 7.8ms |
| Cairo VM log_n=20 | 1M steps | 1.4s | ~0.5s |
| Cairo VM log_n=22 | 4.2M steps | 5.0s | ~1s |
| Cairo VM log_n=24 | 16.8M steps | 22.1s | ~3s |
| Cairo VM log_n=25 | 33.5M steps | 80s | ~5s |
| Cairo VM log_n=26 | 67M steps | runs via chunked-slab quotient kernel (commit 7a7b2b2); requires ≥40 GB host RAM for the parallel cn/hc host copies. Native-Linux wallclock TBD — WSL2's 31 GB host ceiling caps in-house validation at log_n=22 |
— |
| Poseidon2 trace+NTT log_n=28 | 8.9M hashes | 1.92s | — |
| RPO-M31 trace+NTT log_n=28 | 19.2M hashes | 5.51s | — |
| Pedersen GPU batch | 1M hashes | 26.6ms | — |
bench-max-size feature build (cargo build --release --features bench-max-size),
BLOWUP_BITS=1, 80-bit security — used to measure the 1B-element headline, which
requires log_eval ≤ 31 (the order of M31's circle subgroup):
| Workload | Scale | Prove | Verify |
|---|---|---|---|
| Fibonacci log_n=28 | 268M elements | 2.97s | 6.9ms |
| Fibonacci log_n=29 | 537M elements | 8.95s | 7.8ms |
| Fibonacci log_n=30 | 1.07B elements | 19.68s | 7.6ms |
log_n=30 requires ~20 GB of free VRAM and a single 8 GB pinned→device transfer. On WSL2 this works on systems with enough allocated to the VM; native Linux is recommended for the headline measurement.
All Cairo VM numbers include 34 columns, 35 constraints, full LogUp+RC memory table, S_dict LogUp, OODS quotient, and 26-bit proof-of-work. Fibonacci numbers are the standalone single-column prover.
34-column trace, 35 transition constraints. The following are now enforced:
- Operand address verification: dst_addr, op0_addr, op1_addr constrained against register + offset - bias
- JNZ fall-through: dst_inv auxiliary column, fall-through constrained to pc + inst_size when dst=0
- JNZ inverse consistency: dst * dst_inv = 1 enforced when jnz and dst != 0
- Op1 source exclusivity: pairwise products of op1_imm, op1_fp, op1_ap all constrained to zero
- PC update exclusivity: pairwise products of jump_abs, jump_rel, jnz constrained to zero
- Opcode exclusivity: pairwise products of call, ret, assert constrained to zero
- Flag binary: all 15 flags constrained to {0, 1}
- Instruction decomposition: all 63 bits verified — inst_lo + inst_hi ≡ off0 + off1·2^16 + off2·2^32 + flags·2^48 (mod P)
- LogUp memory consistency: memory table committed as explicit proof data; verifier checks exec_sum + table_sum == 0
- Range check argument: all 16-bit offsets verified via LogUp against precomputed table, wired into prover with z_rc challenge
- Merkle hash construction: leaf and internal-node hashes both use standard Blake2s without h[6] personalization, matching stwo's
MerkleHasherLifted::hash_children(seeblake2s_hash_nodeinsrc/channel.rs, which callsblake2s_hashwith domain=0x00, andIV6_NODE = IV6incuda/blake2s.cu). Leaf inputs are sized asmin(n_cols, 16) × 4bytes; internal-node inputs are always 64 bytes — different lengths produce different Blake2st0counters when n_cols < 16. For 16-column commits, leaf-vs-node distinction relies on Blake2s collision resistance and protocol structure (verifier knows the depth at each auth-path step) rather than algebraic personalization - Full ZK: all 34 trace columns blinded via
r · Z_H(x)— GAP-4 closed 2026-03-26
- Felt252 arithmetic: VM operates over M31 (2^31 − 1).
cairo_prove_programnow returnsErr(ProveError::Felt252Overflow)for programs whose bytecode contains values wider than 64 bits, preventing silent misproofs. Values in the range (M31, u64] are still reduced mod M31 without error — proof output is wrong for programs that rely on Stark252 arithmetic in that range. - Starknet syscalls: All 9 syscall selectors fully implemented.
CallContractandLibraryCallexecute registered callees in-process viaHintContext::register_contract— retdata is written back into the caller's response buffer with nesting up to depth 8. Unregistered targets return empty retdata.Deployreturns a deterministic mock address (salt XOR class_hash). All syscall state (events, calls, deployed contracts, L1 messages) recorded inSyscallStateand available after proving. - Dict consistency proofs: Dict read/write execution is fully functional. An execution-side chain consistency check runs at prove time (
ProveError::DictConsistencyViolation). The S_dict step-transition LogUp (C34) links main trace dict columns to an authenticated exec trace; verifier checksdict_link_final == exec_key_new_sum. Soundness holds against a malicious prover for dict-heavy programs. - Bitwise builtin (32-bit sound): The bitwise builtin commits
(x, y, x&y, x^y, x|y)rows into the Fiat-Shamir channel and the verifier recomputes the true bitwise result natively from the authenticated(x, y), rejecting any mismatch. All 32-bit inputs are supported. - Initial register state (boundary constraint): The verifier does not enforce a hard AIR boundary constraint
T_PC[0] == initial_pc. The initial register state (initial_pc,initial_ap) is part of the public input (the statement being proven) and is trusted by the verifier as the starting point.cairo_verifydoes enforce a sanity guard thatinitial_pc < program.len()so a proof cannot claim execution from outside the committed bytecode, but it does not checkinitial_pc/initial_apagainst any caller-supplied expected value. Callers who need to verify that a proof corresponds to a specific initial state should useverify_cairo_statement(&proof, &stmt), which comparesproof.public_inputs.{initial_pc, initial_ap, n_steps, program_hash}againststmtbefore delegating tocairo_verify. This is the standard STARK convention: the verifier checks that the computation starting from the stated initial state satisfies all transition constraints; it does not re-derive the initial state from the program. - Program-hash binding (verifier-enforced; caller still picks the program):
proof.public_inputs.program_hashisBlake2s(bytecode).cairo_verifyrecomputesBlake2s(proof.public_inputs.program)and rejects any internal hash/bytecode inconsistency before Fiat-Shamir replay (closed 2026-04-23 — see AUDIT.md). What the verifier does not do is decide which program the caller intended to run — it accepts whatever bytecode the proof's public input carries. To bind a proof to a specific program, supply the expectedprogram_hashin aCairoStatementand useverify_cairo_statement(&proof, &stmt), which comparesproof.public_inputs.program_hashagainststmt.program_hash(usecompute_program_hash(&bytecode)to derive the expected hash). The remaining caller responsibility is choosing whichCairoStatementto verify against — not recomputing the hash, which the verifier already does.
See SOUNDNESS.md for the full constraint-by-constraint analysis.
- Full-group NTT on subgroup(2^31) via ForwardTwiddleCache
- Fused quotient + circle fold: zero host transfer for quotient data
- FRI arena: MOVE layers into storage, zero clones, GPU-resident decommitment
- Two-phase commitment: trace → Fiat-Shamir → LogUp interaction → quotient → FRI
- GPU LogUp interaction: batch QM31 inverse + parallel prefix sum (16ms at 16.7M steps)
- Pinned DMA: async trace download overlapped with FRI folding at PCIe 5.0 bandwidth
All stwo proof system operations run GPU-native. No host round-trips inside the hot path:
| Operation | Kernel | Notes |
|---|---|---|
| Circle NTT evaluate/interpolate | circle_ntt_stwo.cu |
Cached twiddle tree per coset |
| Merkle leaf hashing (Blake2s) | merkle_leaves_lifted.cu |
Grouped by log_size, chunked ×16 |
| Merkle leaf hashing (Poseidon252) | merkle_poseidon252.cu |
Montgomery-form Fp252 throughout |
| Blake2s PoW grind | grind.cu |
16M nonces/launch, ~1ms/batch |
| Blake2s M31-output grind | grind_m31_output.cu |
Applies M31 reduction before trailing-zero check |
| Poseidon252 PoW grind | grind_poseidon.cu |
GPU Poseidon permute in Montgomery form |
| GKR fix_first_variable (M31→QM31) | gkr.cu |
fold_mle_evals per element |
| GKR fix_first_variable (QM31→QM31) | gkr.cu |
fold_mle_evals per element |
| GKR gen_eq_evals | gkr.cu |
log_k sequential doubling passes |
| GKR next_layer (all 4 variants) | gkr.cu |
Grand product + 3 LogUp variants |
| GKR sum_as_poly (all 4 variants) | gkr.cu |
Parallel block reduction + CPU accumulate |
| lift_and_accumulate | accumulate_lift.cu |
QM31 per-channel, src_idx = (i>>shift<<1)|(i&1) |
| QM31 bit-reverse | bit_reverse_wide.cu |
Out-of-place, thread i → bit_reverse(i) |
| pack_leaves_input | pack_leaves.cu |
4×N → 64×(N/16) gather transpose |
| eval_at_point_by_folding | fri.cu (existing) |
GPU circle+line fold, twiddle cache; CPU for n<1024 |
- Instruction decoder: 15 flags, 3 biased offsets, full Cairo encoding
- VM executor: add, mul, jump, jnz, call, ret, assert_eq (26ns/step fused)
- 34-column trace: registers(3), instruction(2), flags(15), operands(7), offsets(3), dst_inv(1), dict linkage(3)
- 35 transition constraints evaluated on GPU (single CUDA kernel) and independently by verifier
- LogUp memory consistency: permutation argument with execution + table sum cancellation
- Range check argument: 16-bit offset validation via LogUp bus, wired into prover pipeline
- Instruction decomposition: algebraic constraint tying inst_lo/inst_hi to offsets and flags
| Builtin | Status | Throughput (log_n=28) |
|---|---|---|
| Poseidon2 | GPU kernel, proven (RF=8 RP=22, 30 rows/perm) | 4.7M hashes/sec |
| RPO-M31 | GPU kernel, proven (14 rows/perm, 24 cols) | 3.5M hashes/sec |
| Pedersen | GPU kernel, proven (windowed 4-bit EC, Montgomery Jacobian) | 37.7M hashes/sec |
| Bitwise | Trace generation + verifier native recompute against Fiat-Shamir-bound (x, y) — see AUDIT.md §M1 | AND/XOR/OR on full 32-bit inputs |
stark_cli prove 24 1 1 -o proof.bin # Fibonacci STARK
stark_cli prove-file program.casm -o proof.bin # Cairo program
stark_cli prove-starknet --class-hash 0x... # From Starknet mainnet
stark_cli inspect program.casm # Disassemble CASM
stark_cli fetch-block --block 100000 # Starknet block info
stark_cli verify proof.bin # Verify a proof
stark_cli bench 28 # BenchmarkRequires: Rust 1.85+ (stable), CUDA 13.0+, RTX 5090 (SM 12.0) or RTX 4090 (SM 8.9).
cargo build --release
cargo test --release --workspace -- --test-threads=1
# Workspace totals: 396 lib + 49 vortex-cuda-backend + 35 integration = 480 pass, 0 fail, 3 #[ignore]
cargo run --release --bin full_benchmark
cargo run --release --bin gpu_bench # pre-flight checks + per-section GPU telemetryThe default build enables all 9 Forge-emitted kernel paths (forge-ntt, forge-ntt-batch, forge-fri, forge-blake2s, forge-permute, forge-bit-reverse, forge-gather, forge-barycentric, forge-grind). Compare against the hand-written CUDA baseline with cargo test --no-default-features (394 pass).
396 lib + 49 vortex-cuda-backend + 35 integration = 480 total (480 pass, 0 fail, 3 marked #[ignore] for benchmark/live-RPC opt-in) covering: M31/CM31/QM31 field arithmetic, Circle NTT, Merkle tree (commit, auth paths, tiled, SoA4), FRI (fold, circle fold, deterministic), STARK prover + verifier (multiple sizes, tamper detection), Cairo VM (decoder, executor, Fibonacci, constraints, LogUp, range checks, instruction decomposition), Poseidon, Pedersen (Stark252 field, EC ops, GPU vs CPU), Bitwise (memory segment, trace generation, verifier native recompute against Fiat-Shamir-bound (x, y), prove/verify round-trip, tamper detection, 32-bit input acceptance, forged-row rejection), LogUp/RC soundness (memory table commitment, cancellation check, RC counts commitment), OODS quotient formula correctness, GPU constraint eval (bytecode VM, warp-cooperative), GPU leaf hashing (Blake2s, domain separation), CASM loader, Cairo hints (AllocSegment, AllocFelt252Dict, dict entry lifecycle, squash, U256InvModN with 7 comprehensive test vectors), Fiat-Shamir transcript ordering (12 commitment points), property tests (completeness, soundness, random mutations), cross-validation (reference VM comparison for 9 program types).
If you can craft a malformed trace that the verifier accepts, that is a real bug. Open an issue.
- Felt252 arithmetic: bytecode values wider than u64 are rejected with
Felt252Overflow; values in (M31, u64] silently reduce mod M31 inside AIR trace columns. M31 wrap-around replaces Stark252 arithmetic for any value above 2^31 − 1 that flows through the trace - Cross-contract execution:
CallContractandLibraryCallexecute registered callees in-process (HintContext::register_contract). Unregistered targets return empty retdata. External contract resolution (fetching bytecode from Starknet RPC at call time) is not automatic. - Felt252 dict values: AIR-level dict columns are M31. The 2026-04-23 Phase 2 work added a Blake2s-committed
dict_side_table(pointer + 9 M31 limbs each for key/prev/new) and afelt_overlayfor syscall-routed dict writes, so values that originate from full-felt syscall data round-trip bit-exact at 252 bits via the side table. Direct in-program dict writes that bypass the syscall path still reduce mod M31 at the AIR boundary
These should not break. If they do, that's a real soundness bug:
- Honest prover produces proofs that verify for Fibonacci, add, mul, call/ret, and mixed programs
- Verifier independently evaluates all 35 constraints at query points and rejects any mismatch
- Operand addresses verified against register + offset for all three operands (dst, op0, op1)
- JNZ fall-through constrained: dst_inv auxiliary column forces next_pc = pc + inst_size when dst = 0
- Flag exclusivity enforced: op1 source, PC update mode, and opcode are pairwise mutually exclusive
- Tampering any committed value (trace, quotient, FRI, commitment) is detected
- LogUp final-value enforcement: corrupting the memory consistency sum breaks FRI verification
- FRI fold equations: algebraic consistency checked at every query across all layers
- Merkle auth paths: data integrity verified for trace, quotient, OODS quotient, and all FRI layers
- OODS quotient formula: verifier independently recomputes Q(p) from decommitted trace and AIR quotient values and rejects any mismatch
- Fiat-Shamir transcript: any mutation to public inputs, commitments, or challenges cascades into rejection
| Test | What it tampers | Result |
|---|---|---|
test_tamper_flag_binary |
Set flag to non-binary value | REJECTED |
test_tamper_result_computation |
Corrupt res column |
REJECTED |
test_tamper_pc_update |
Corrupt next_pc |
REJECTED |
test_tamper_ap_update |
Corrupt next_ap |
REJECTED |
test_tamper_fp_update |
Corrupt next_fp |
REJECTED |
test_tamper_assert_eq |
Corrupt dst != res |
REJECTED |
test_tamper_logup_final_sum |
Corrupt LogUp sum | REJECTED |
test_tamper_rc_final_sum |
Corrupt range check sum | REJECTED |
test_tamper_memory_table_data |
Corrupt memory table entry | REJECTED |
test_tamper_memory_table_commitment |
Corrupt memory table hash | REJECTED |
test_tamper_logup_cancellation |
Corrupt exec_sum (cancellation fails) | REJECTED |
test_tamper_rc_counts_data |
Corrupt RC multiplicity count | REJECTED |
test_tamper_rc_counts_commitment |
Corrupt RC counts hash | REJECTED |
test_cairo_tampered_program_hash |
Corrupt program hash | REJECTED |
test_cairo_prove_verify_tampered_quotient |
Corrupt quotient value | REJECTED |
test_cairo_prove_verify_tampered_fri |
Corrupt FRI value | REJECTED |
test_tamper_ec_trace |
Corrupt EC trace commitment | REJECTED |
test_soundness_oods_quotient_tamper |
Corrupt OODS quotient decommitment value | REJECTED |
Garrick Wagner — independent GPU systems work. Contact: garrick.wagner@gmail.com.
Business Source License 1.1 (LICENSE). Non-production use permitted. Converts to Apache License 2.0 on 2029-03-20. Commercial licensing: garrick.wagner@gmail.com.