Support autobahn node restart by skipping CometBFT handshaker (CON-252)#3300
Open
wen-coding wants to merge 4 commits intomainfrom
Open
Support autobahn node restart by skipping CometBFT handshaker (CON-252)#3300wen-coding wants to merge 4 commits intomainfrom
wen-coding wants to merge 4 commits intomainfrom
Conversation
In giga mode the CometBFT block/state stores are never written, so on restart the handshaker observes storeHeight=0 < appHeight=N and aborts with ErrAppBlockHeightTooHigh. Skip the handshaker entirely when gigaEnabled (same escape hatch state sync uses) and move InitChain ownership into GigaRouter.runExecute: call it on fresh start (last==0), skip on restart (last>0) — BaseApp.FinalizeBlock's existing nil-check fallback rebuilds deliverState from the committed CMS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3300 +/- ##
=======================================
Coverage 58.31% 58.31%
=======================================
Files 2085 2085
Lines 209061 209070 +9
=======================================
+ Hits 121906 121912 +6
- Misses 78363 78366 +3
Partials 8792 8792
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Replaces the Skip'd Recovery subtest with a real test body that exercises the restart path fixed in the prior commit: HaltsBeyondMaxFaults leaves the chain halted, Recovery restarts one killed node, and the test asserts the chain resumes advancing. Adds a restartNode helper mirroring killNode. All 5 subtests pass locally against a 4-node docker cluster: BlockProduction, BankTransfer, LivenessUnderMaxFaults, HaltsBeyondMaxFaults, Recovery. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recovery used to depend on HaltsBeyondMaxFaults having killed maxFaults+1 nodes; it also hardcoded the "last killed" target index from that test's internal kill order. Move the halted precondition into Recovery itself — killNode is idempotent (pkill tolerates an already-dead process), so this works whether the subtest runs alone or after the prior ones. Add a halt-assertion before restart so the "chain resumed" check is only meaningful on a truly halted chain. Also restore the comment on the trailing assertAutobahnEnabled with the non-obvious *why*: start_sei.sh truncates the log on restart, so the grep necessarily matches a post-restart GigaRouter init. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ery log assertion Three comment-only additions surfaced during PR review: - giga_router.go: note that re-entering the last==0 path after a mid-InitChain crash is safe (nothing was committed, so the second InitChain behaves as a fresh init). - autobahn_test.go: document restartNode's precondition — the target must not be running, since start_sei.sh unconditionally spawns a new seid process. - autobahn_test.go: fix the Recovery trailing assertAutobahnEnabled comment to reflect that the check iterates all running nodes; the post-restart fresh-log guarantee applies to the restarted node, which is among them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Enables an autobahn validator to be stopped and restarted. The CometBFT handshaker used to abort because autobahn never writes to the CometBFT block/state stores, so it observed
storeHeight=0 < appHeight=N→ErrAppBlockHeightTooHigh.node.go:shouldHandshake = !stateSync && !gigaEnabled— skip the handshaker in giga mode.giga_router.go(runExecute): now ownsInitChain. Calls it on fresh start (last == 0); skips it on restart (last > 0), whereBaseApp.FinalizeBlock's existing nil-check rebuildsdeliverStatefrom the committed CMS.giga_router_test.go: drop the simulated-handshakerInitChainfromTestGigaRouter_FinalizeBlocks; tighten the contract-test doc onTestInitChainCommitThenFinalize.integration_test/autobahn/autobahn_test.go: replace thet.Skipin the Recovery subtest with a real restart test. AddsrestartNodehelper.Startup paths
BaseAppnil-check fallback from committed CMSCases B and D are what change. In B,
runExecutetakes over theInitChainduty the handshaker used to perform. In D, skipping the handshaker is the whole fix: the CometBFT state/block stores are never populated by autobahn, so reconciling against them is vacuous; the app's committed CMS is the source of truth, andBaseApp.FinalizeBlock's existingif deliverState == nil { setDeliverState(header) }fallback picks it up on the first block post-restart.Scope / TODO
Covers restart with local disk intact (
app.Info().LastBlockHeight = N, autobahn WAL has block ≥ N). Does not cover a new validator joining or a disk-wiped node — that needs app state sync + autobahn WAL sync. Tracked viaTODO(autobahn-recovery)innode.go.Test plan
go build ./...clean;gofmt -s -l .cleango test ./internal/p2p/... ./node/... ./internal/consensus/... ./internal/autobahn/...— greenmake autobahn-integration-test— all 5 subtests pass. Recovery: halted @ 482 → resumed @ 537 in 12.4s after restarting a killed node.🤖 Generated with Claude Code