Skip to content

Support autobahn node restart by skipping CometBFT handshaker (CON-252)#3300

Open
wen-coding wants to merge 4 commits intomainfrom
wen/fix_autobahn_restart_no_statesync
Open

Support autobahn node restart by skipping CometBFT handshaker (CON-252)#3300
wen-coding wants to merge 4 commits intomainfrom
wen/fix_autobahn_restart_no_statesync

Conversation

@wen-coding
Copy link
Copy Markdown
Contributor

@wen-coding wen-coding commented Apr 22, 2026

Summary

Enables an autobahn validator to be stopped and restarted. The CometBFT handshaker used to abort because autobahn never writes to the CometBFT block/state stores, so it observed storeHeight=0 < appHeight=NErrAppBlockHeightTooHigh.

  • node.go: shouldHandshake = !stateSync && !gigaEnabled — skip the handshaker in giga mode.
  • giga_router.go (runExecute): now owns InitChain. Calls it on fresh start (last == 0); skips it on restart (last > 0), where BaseApp.FinalizeBlock's existing nil-check rebuilds deliverState from the committed CMS.
  • giga_router_test.go: drop the simulated-handshaker InitChain from TestGigaRouter_FinalizeBlocks; tighten the contract-test doc on TestInitChainCommitThenFinalize.
  • integration_test/autobahn/autobahn_test.go: replace the t.Skip in the Recovery subtest with a real restart test. Adds restartNode helper.

Startup paths

shouldHandshake InitChain by FinalizeBlock's deliverState
A (fresh, no giga) true handshaker InitChain
B (fresh, giga) false runExecute InitChain
C (restart, no giga) true — (appHeight>0) handshaker replay if needed, else existing
D (restart, giga) false — (last>0) BaseApp nil-check fallback from committed CMS

Cases B and D are what change. In B, runExecute takes over the InitChain duty the handshaker used to perform. In D, skipping the handshaker is the whole fix: the CometBFT state/block stores are never populated by autobahn, so reconciling against them is vacuous; the app's committed CMS is the source of truth, and BaseApp.FinalizeBlock's existing if deliverState == nil { setDeliverState(header) } fallback picks it up on the first block post-restart.

Scope / TODO

Covers restart with local disk intact (app.Info().LastBlockHeight = N, autobahn WAL has block ≥ N). Does not cover a new validator joining or a disk-wiped node — that needs app state sync + autobahn WAL sync. Tracked via TODO(autobahn-recovery) in node.go.

Test plan

  • go build ./... clean; gofmt -s -l . clean
  • go test ./internal/p2p/... ./node/... ./internal/consensus/... ./internal/autobahn/... — green
  • make autobahn-integration-test — all 5 subtests pass. Recovery: halted @ 482 → resumed @ 537 in 12.4s after restarting a killed node.

🤖 Generated with Claude Code

In giga mode the CometBFT block/state stores are never written, so on
restart the handshaker observes storeHeight=0 < appHeight=N and aborts
with ErrAppBlockHeightTooHigh. Skip the handshaker entirely when
gigaEnabled (same escape hatch state sync uses) and move InitChain
ownership into GigaRouter.runExecute: call it on fresh start (last==0),
skip on restart (last>0) — BaseApp.FinalizeBlock's existing nil-check
fallback rebuilds deliverState from the committed CMS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wen-coding wen-coding changed the title fix: support autobahn node restart by skipping CometBFT handshaker Support autobahn node restart by skipping CometBFT handshaker (CON-252) Apr 22, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 22, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedApr 23, 2026, 3:27 AM

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 72.72727% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.31%. Comparing base (0ddccd5) to head (1ac271a).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
sei-tendermint/internal/p2p/giga_router.go 83.33% 2 Missing and 1 partial ⚠️
sei-tendermint/node/node.go 25.00% 3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #3300   +/-   ##
=======================================
  Coverage   58.31%   58.31%           
=======================================
  Files        2085     2085           
  Lines      209061   209070    +9     
=======================================
+ Hits       121906   121912    +6     
- Misses      78363    78366    +3     
  Partials     8792     8792           
Flag Coverage Δ
sei-chain-pr 69.86% <72.72%> (?)
sei-db 69.36% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
sei-tendermint/internal/p2p/giga_router.go 64.57% <83.33%> (-0.41%) ⬇️
sei-tendermint/node/node.go 60.28% <25.00%> (-0.38%) ⬇️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

wen-coding and others added 3 commits April 22, 2026 12:54
Replaces the Skip'd Recovery subtest with a real test body that exercises
the restart path fixed in the prior commit: HaltsBeyondMaxFaults leaves
the chain halted, Recovery restarts one killed node, and the test asserts
the chain resumes advancing. Adds a restartNode helper mirroring killNode.

All 5 subtests pass locally against a 4-node docker cluster:
  BlockProduction, BankTransfer, LivenessUnderMaxFaults,
  HaltsBeyondMaxFaults, Recovery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recovery used to depend on HaltsBeyondMaxFaults having killed
maxFaults+1 nodes; it also hardcoded the "last killed" target index
from that test's internal kill order. Move the halted precondition
into Recovery itself — killNode is idempotent (pkill tolerates an
already-dead process), so this works whether the subtest runs alone
or after the prior ones. Add a halt-assertion before restart so the
"chain resumed" check is only meaningful on a truly halted chain.

Also restore the comment on the trailing assertAutobahnEnabled with
the non-obvious *why*: start_sei.sh truncates the log on restart, so
the grep necessarily matches a post-restart GigaRouter init.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ery log assertion

Three comment-only additions surfaced during PR review:

- giga_router.go: note that re-entering the last==0 path after a
  mid-InitChain crash is safe (nothing was committed, so the second
  InitChain behaves as a fresh init).
- autobahn_test.go: document restartNode's precondition — the target
  must not be running, since start_sei.sh unconditionally spawns a new
  seid process.
- autobahn_test.go: fix the Recovery trailing assertAutobahnEnabled
  comment to reflect that the check iterates all running nodes; the
  post-restart fresh-log guarantee applies to the restarted node,
  which is among them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant