
feat: wire autobahn state sync with naive provider (CON-252)#3304

Draft
wen-coding wants to merge 1 commit into wen/fix_autobahn_restart_no_statesync from wen/autobahn_state_sync

Conversation

Contributor

@wen-coding wen-coding commented Apr 23, 2026

Stacks on top of #3300. Base will change to main once #3300 merges.

Status: draft — initial feedback round. No integration test yet; see "Scope / Follow-ups" below.

Summary

Extends #3300 to cover the new-validator join and disk-wiped-node recovery paths via CometBFT state sync. A joiner state-syncs the app to some height M; runExecute then resumes at M+1, pulling subsequent blocks from giga peers.

Changes

  • node.go — stop force-disabling stateSync in giga mode. Gate on a direct app.Info() check since CometBFT's state.LastBlockHeight never advances under autobahn. postSyncHook is a no-op in giga mode (no block-sync reactor to hand control to). ssReactor wired only when stateSync is actually enabled.
  • statesync/reactor.go — new optional stateProviderFactory param to NewReactor, used in place of the RPC/P2P selection when set. Non-giga callers pass nil; existing behaviour preserved.
  • statesync/giga_stateprovider.go (new) — naive provider: empty AppHash (opts out of pre-verification), minimal Commit, sm.State from GenesisDoc + static committee.
  • statesync/syncer.go — skip the post-restore AppHash comparison when trustedAppHash is empty. Vanilla providers always return non-empty, so their behaviour is unchanged.
  • autobahn/data/state.go — expose public SkipTo wrapping the existing internal skipTo. Used post-state-sync to align data cursors so peer-streamed blocks from M+1 onward insert correctly.
  • p2p/giga_router.go — new state-sync branch in runExecute (last > 0 && NextBlock() <= last): SkipTo(last+1), no PushAppHash. Fresh and plain-restart branches unchanged from #3300 (Support autobahn node restart by skipping CometBFT handshaker, CON-252).
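The runExecute branching above can be sketched as follows. This is a hedged, illustrative stand-in, not the PR's actual code: `state` and `chooseStartup` are hypothetical; only `SkipTo`, `NextBlock`, and the branch conditions (`last > 0 && NextBlock() <= last`) come from the PR text.

```go
package main

// state is a toy stand-in for autobahn's data-layer state.
type state struct{ next int64 }

func (s *state) NextBlock() int64 { return s.next }
func (s *state) SkipTo(n int64)   { s.next = n }

// chooseStartup picks the startup path from the app's last committed height
// (`last`, as reported by app.Info) relative to the data layer's cursor.
func chooseStartup(s *state, last int64) string {
	switch {
	case last == 0:
		return "fresh: InitChain" // brand-new chain, no app state yet
	case s.NextBlock() <= last:
		// State-synced app is ahead of the (empty) data layer: align the
		// cursor just past the restored height; no PushAppHash.
		s.SkipTo(last + 1)
		return "state-sync: SkipTo"
	default:
		// Plain restart: data cursors already match the app height.
		return "restart: PushAppHash"
	}
}
```

The point of the middle branch is that after a state sync only the app has state; the data layer starts empty, so its cursor must be jumped forward before peer-streamed blocks from M+1 can insert correctly.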

New startup path (case E, added to #3300's A–D)

Case | shouldHandshake | stateSync | InitChain | first FinalizeBlock deliverState
--- | --- | --- | --- | ---
E — join/recovery, giga (no local WAL) | false | true | — (after sync, last>0 skips it) | CMS rebuilt from snapshot chunks
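The stateSync gate in case E can be sketched as below. This is illustrative, not the PR's exact node.go code: under autobahn, CometBFT's state.LastBlockHeight never advances, so the decision keys off the app's own Info() height instead. The `infoResult` and `enableStateSync` names are assumptions.

```go
package main

// infoResult mimics the subset of the ABCI Info response that matters here.
type infoResult struct{ LastBlockHeight int64 }

// enableStateSync reports whether state sync should run: only when the
// operator enabled it in config and the app has no committed blocks yet
// (a new joiner or a wiped node). A restarted node with app state takes
// the plain-restart path instead.
func enableStateSync(configEnabled bool, info infoResult) bool {
	return configEnabled && info.LastBlockHeight == 0
}
```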

Trust model — naive for v1, AppQC-bundled for v2

This PR trusts the snapshot producer optimistically: stateProvider.AppHash returns empty, syncer.verifyApp skips the comparison, and a corrupt snapshot wedges the joiner until external restart-with-wipe. The vanilla SyncAny retry loop only helps against honest-but-unavailable peers, not malicious ones — a malicious peer can loop-trap a joiner.

The planned v2 mechanism (tracked by TODO(autobahn-snapshot-proof) in giga_stateprovider.go):

Peers serving a giga snapshot MUST include an AppQC@snapshot_height in snapshot.Metadata. The giga state provider decodes the AppQC, verifies appQC.Verify(committee) (2f+1 committee signatures — cryptographically self-verifiable from a single peer), compares AppQC.AppHash to the snapshot's claimed AppHash, and returns that as the authoritative trustedAppHash. Any corrupt snapshot fails stateProvider.AppHash and triggers vanilla SyncAny's errRejectSnapshot → next snapshot → repeat. No loop-trap, no node restart required.
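The v2 verification flow described above could look roughly like this. Everything here is an illustrative stand-in for the real autobahn types: `appQC`, the `sigs` counter, and `trustedAppHash` are hypothetical; only the contract (quorum-verify the AppQC, compare its AppHash to the snapshot's claim, reject into SyncAny's retry loop otherwise) comes from the PR text.

```go
package main

import (
	"bytes"
	"errors"
)

// errRejectSnapshot stands in for the error that triggers vanilla SyncAny's
// next-snapshot retry instead of wedging the joiner.
var errRejectSnapshot = errors.New("reject snapshot")

// appQC is a toy quorum certificate over an app hash.
type appQC struct {
	AppHash []byte
	sigs    int // stand-in for aggregated committee signatures
}

// verify stands in for appQC.Verify(committee): with 3f+1 validators,
// 2f+1 signatures make the certificate self-verifiable from a single peer.
func (q appQC) verify(f int) bool { return q.sigs >= 2*f+1 }

// trustedAppHash sketches the v2 stateProvider.AppHash contract: a snapshot
// is accepted only if its claimed hash matches a quorum-certified AppQC.
func trustedAppHash(qc appQC, claimedAppHash []byte, f int) ([]byte, error) {
	if !qc.verify(f) {
		return nil, errRejectSnapshot // unverifiable certificate
	}
	if !bytes.Equal(qc.AppHash, claimedAppHash) {
		return nil, errRejectSnapshot // corrupt or lying snapshot
	}
	return qc.AppHash, nil // authoritative trustedAppHash
}
```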

Why not in this PR:

  • Autobahn only retains the latest AppQC in memory (avail.latestAppQC is a single utils.Option, not a queue). Historical AppQCs aren't persisted.
  • AppQC@M forms after the app commits block M (committee votes are asynchronous), so the snapshot taken at Commit(M) doesn't yet have its anchor.
  • v2 needs either (a) a "seal snapshot on AppQC formation" hook between the snapshot manager and avail state, or (b) coordinated delayed snapshot creation. Both are real scope, landing in a follow-up.

Until v2 lands: operators should only point joiners at known-honest peers; a bad snapshot is an ops issue, not a safety issue (the cluster is unaffected, only the joiner is stuck).

Test plan

  • go build ./... clean; gofmt -s -l . clean
  • go test ./internal/statesync/... ./internal/p2p/... ./internal/autobahn/... ./node/... ./internal/consensus/... — all green
  • New unit tests: 5 × TestGigaStateProvider_*, 2 × TestState_SkipTo_*, 2 × TestSyncer_verifyApp_EmptyTrustedAppHashSkipsCheck

Not yet:

  • JoinFromStateSync integration subtest — wipe a node's data dir, statesync.enable = true, restart, verify catchup. snapshot-interval=100 already configured in the docker cluster. Deferred to land together with v2 (snapshot-proof), since integration-testing the naive-trust path without proper verification is of limited value.

Scope / Follow-ups

  • v2 snapshot proof (TODO(autobahn-snapshot-proof)) — bundle AppQC in snapshot.Metadata, verify cryptographically, enable in-loop retry.
  • Integration test (JoinFromStateSync) — lands with v2.
  • TODO(epoch) in giga_stateprovider.go — committee / validator set derivation moves to epoch lookup once autobahn supports dynamic committees.

🤖 Generated with Claude Code

Extends the autobahn restart fix (PR #3300) to cover the "new validator
joining" and "disk-wiped node" recovery paths via CometBFT state sync.

- node.go: stop force-disabling stateSync in giga mode. Gate on a direct
  app.Info() check since CometBFT's state.LastBlockHeight never advances
  under autobahn. postSyncHook is a no-op in giga mode (no block-sync
  reactor to hand control to). ssReactor wired only when stateSync is
  actually enabled.
- statesync/reactor.go: new optional stateProviderFactory param, used
  in place of RPC/P2P selection when set. node.go injects it in giga mode.
- statesync/giga_stateprovider.go (new): naive provider that returns
  empty AppHash (opt out of pre-verification), minimal Commit, and an
  sm.State built from GenesisDoc + static committee. Peers are trusted
  optimistically for this PR — see TODO(autobahn-snapshot-proof).
- statesync/syncer.go: skip the post-restore AppHash check when the
  provider returned an empty trustedAppHash. Vanilla RPC/P2P providers
  always return non-empty, so their behaviour is unchanged.
- autobahn/data/state.go: expose SkipTo public method wrapping internal
  skipTo, used post-state-sync to align data cursors with the app height.
- p2p/giga_router.go: three-way branch in runExecute: fresh (InitChain),
  state-sync restart (SkipTo, no PushAppHash), plain restart (PushAppHash).

Tests: GigaStateProvider unit tests (5), SkipTo unit tests (2), syncer
empty-trustedAppHash skip test (2). Non-giga behaviour preserved.

Follow-up: integration test for JoinFromStateSync (wipe node, restart
with statesync.enable=true, verify catchup) is deferred to a subsequent
PR along with the snapshot-proof mechanism (peers bundling AppQC in
snapshot metadata) that closes the loop on malicious-snapshot retry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented Apr 23, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build | Format | Lint | Breaking | Updated (UTC)
--- | --- | --- | --- | ---
✅ passed | ✅ passed | ✅ passed | ✅ passed | Apr 23, 2026, 3:52 AM


codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 47.05882% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.31%. Comparing base (6dba9fc) to head (9a8854d).
⚠️ Report is 2 commits behind head on wen/fix_autobahn_restart_no_statesync.

Files with missing lines | Patch % | Lines
--- | --- | ---
sei-tendermint/node/node.go | 21.95% | 31 Missing and 1 partial ⚠️
sei-tendermint/internal/p2p/giga_router.go | 12.50% | 13 Missing and 1 partial ⚠️
sei-tendermint/internal/statesync/reactor.go | 11.11% | 7 Missing and 1 partial ⚠️
Additional details and impacted files


@@                           Coverage Diff                           @@
##           wen/fix_autobahn_restart_no_statesync    #3304    +/-   ##
=======================================================================
  Coverage                                  58.31%   58.31%            
=======================================================================
  Files                                       2085     2086     +1     
  Lines                                     209065   209185   +120     
=======================================================================
+ Hits                                      121907   121993    +86     
- Misses                                     78366    78398    +32     
- Partials                                    8792     8794     +2     
Flag | Coverage | Δ
--- | --- | ---
sei-chain-pr | 70.28% <47.05%> | +0.46% ⬆️
sei-db | 69.53% <ø> | +0.17% ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage | Δ
--- | --- | ---
sei-tendermint/internal/autobahn/data/state.go | 75.48% <100.00%> | +0.37% ⬆️
...endermint/internal/statesync/giga_stateprovider.go | 100.00% <100.00%> | ø
sei-tendermint/internal/statesync/syncer.go | 66.42% <100.00%> | ø
sei-tendermint/internal/statesync/reactor.go | 68.98% <11.11%> | -0.62% ⬇️
sei-tendermint/internal/p2p/giga_router.go | 68.60% <12.50%> | +4.84% ⬆️
sei-tendermint/node/node.go | 57.68% <21.95%> | -2.61% ⬇️

... and 2 files with indirect coverage changes


// SkipTo is only valid on a fresh State (just after NewState, before any
// Push*). The caller is responsible for ensuring no concurrent Push* races
// with this call — giga state sync uses it between data.NewState and
// GigaRouter.Run.
func (s *State) SkipTo(n types.GlobalBlockNumber) {
IMO this method should do proper pruning instead of imposing on the caller that State is empty.
