Skip to content

fix: unify sync-status definition across metric, duties, and reaggregation#438

Merged
pablodeymo merged 4 commits into
mainfrom
fix/dedup-sync-status-heuristic
Jun 16, 2026
Merged

fix: unify sync-status definition across metric, duties, and reaggregation#438
pablodeymo merged 4 commits into
mainfrom
fix/dedup-sync-status-heuristic

Conversation

@MegaRedHand

Copy link
Copy Markdown
Collaborator

🗒️ Description / Motivation

PR #417 added the spec-aligned SyncStatusTracker (4-slot lag threshold, hysteresis, network-stall detection) and pointed lean_node_sync_status at it. But the legacy heuristic in process_block (head_slot >= current_slot, 0-slot tolerance, no hysteresis, no stall awareness) kept writing the same metric on every block import.

The two writers raced on set_node_sync_status:

  • on tick: spec heuristic (4-slot lag)
  • on block import: strict 0-lag heuristic

Whichever ran last won, so the gauge flapped between two contradictory definitions of "synced", and #417's goal of aligning the metric with leanSpec was only half-done.

What Changed

  • Drop the set_node_sync_status write from process_block. SyncStatusTracker (refreshed on each tick via update_sync_status) is now the single source of truth for the metric.
  • process_block no longer computes/returns the synced bool; it returns Result<(), StoreError>.
  • Gate post-import reaggregation on sync_status.duties_allowed() instead of the strict bool, so reaggregation, validator duties, and the metric all share one spec-aligned definition of "in sync".

Behavior

Concern Before After
lean_node_sync_status raced between tick (4-slot) and import (0-slot) tracker only (spec 4-slot + hysteresis + stall)
validator duties gate tracker (#418) tracker (unchanged)
reaggregation gate strict head >= current duties_allowed() (tracker)

Note: the reaggregation gate now reads the last-tick tracker assessment rather than recomputing at import time. During backfill the tracker reports syncing (large head-lag), so reaggregation stays suppressed as intended; the only difference is sub-interval staleness, which is harmless here.

Tests

  • cargo test -p ethlambda-blockchain --lib — 38 passed.
  • cargo clippy -p ethlambda-blockchain --lib -- -D warnings — clean.
  • cargo fmt --all -- --check — clean.

Related

…ation

PR #417 added the spec-aligned SyncStatusTracker (4-slot lag threshold,
hysteresis, network-stall detection) and pointed lean_node_sync_status at
it, but the legacy heuristic in process_block (head_slot >= current_slot,
0-slot tolerance) kept writing the same metric on every block import. The
two writers raced, so the gauge flapped between two different definitions
of "synced".

Drop the metric write from process_block, leaving SyncStatusTracker
(refreshed on tick) as the single source of truth. process_block no longer
needs to compute or return the synced bool.

Gate post-import reaggregation on sync_status.duties_allowed() instead of
the strict bool, so "in sync" has one spec-aligned meaning across the
metric, validator duties, and reaggregation.
@github-actions

Copy link
Copy Markdown

🤖 Kimi Code Review

Overall Assessment: Good architectural refactoring that properly separates block processing from sync status tracking. The change eliminates duplicate sync logic and ensures validator duties and reaggregation use a consistent sync gate.

Issues and Concerns:

1. Potential metrics regression (Line 643-650)
The removal of metrics::set_node_sync_status(status) from process_block is correct for separation of concerns, but verify that SyncStatusTracker::update_sync_status() (referenced in the doc comment at line 634) actually calls metrics::set_node_sync_status(). If not, the node_sync_status metric will freeze at its last value.

2. Race condition on sync status check (Line 782)
After process_block updates the store with a new head slot, self.sync_status.duties_allowed() may return stale data if update_sync_status runs on a different tick/interval. This means:

  • A node might temporarily skip reaggregation for a block that just brought it into sync
  • Or worse, might reaggregate briefly after a block that put it out of sync (until the next tick updates status)

Recommendation: Consider immediately updating the sync status tracker with the new head slot before the duties_allowed() check, or document that this specific race window is acceptable (sub-second delay in reaggregation eligibility is typically fine).

3. Semantic equivalence of sync check (Line 782)
Ensure SyncStatusTracker::duties_allowed() implements logic at least as strict as the removed:

let synced = head_slot >= current_slot;

If duties_allowed() is more permissive (e.g., allows duties within 1-2 slots of head), the node could spam invalid reaggregates during sync. For consensus safety, stricter is acceptable; more lenient is dangerous.

Positive Notes:

  • Line 632: Excellent documentation explaining the architectural boundary
  • Line 767: Cleaner API returning Result<(), StoreError> instead of the coupled Result<bool, StoreError>
  • Line 775: Preserving the clone before consumption is correct for the reaggregation flow

Nit: The doc comment at line 638 references Self::update_sync_status but the method shown in the diff is named process_block. Ensure the naming is consistent with the actual method names in the codebase (likely update_sync_status exists elsewhere, but verify the reference is correct).


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

Copy link
Copy Markdown

🤖 Codex Code Review

  1. crates/blockchain/src/lib.rs:787 introduces a correctness regression in the reaggregation path. self.sync_status.duties_allowed() is only refreshed on ticks (crates/blockchain/src/lib.rs:272, crates/blockchain/src/lib.rs:903), but this check runs on block arrival. If the node is syncing and imports the block that actually catches it up between ticks, the cached sync_status is still syncing, so run_reaggregate_from_block is skipped for that exact block. Reaggregation only happens at import time, so the attestations carried by the catch-up block never get folded into the local aggregated pool or republished by aggregators. That can drop fork-choice weight / gossip coverage for the first post-catch-up block. This path needs a fresh sync evaluation after successful import, not the last tick’s cached state.

I didn’t see other obvious bugs in this small diff once that point is addressed.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@greptile-apps

greptile-apps Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR eliminates a metric-ownership race between process_block and SyncStatusTracker by making the tracker the sole writer of lean_node_sync_status, and unifies the reaggregation sync gate with the same spec-aligned definition used for validator duties.

  • Removes set_node_sync_status and the strict head >= current_slot heuristic from process_block; update_sync_status (called each tick) is now the single path that updates the metric.
  • Changes process_block's return type from Result<bool, StoreError> to Result<(), StoreError> and replaces the post-import if synced reaggregation guard with if self.sync_status.duties_allowed(), so reaggregation, attestations, and proposals all share the same 4-slot-lag + hysteresis + network-stall threshold.

Confidence Score: 5/5

Safe to merge — the change is narrowly scoped to one file, removes a competing metric write, and unifies three previously inconsistent in-sync definitions behind a single spec-aligned tracker.

The removed code path was genuinely harmful (a race on every block import vs. every tick) and the replacement uses the same duties_allowed() predicate already trusted for proposals and attestations. The only behavioral shift in reaggregation is a looser threshold (4-slot lag vs. 0-slot), which the existing tests and the PR description both account for. No pre-existing callers relied on the bool return value beyond the single match arm, and that arm is correctly updated.

No files require special attention; crates/blockchain/src/lib.rs is the sole changed file and the diff is clean.

Important Files Changed

Filename Overview
crates/blockchain/src/lib.rs Removes redundant sync-status write from process_block, simplifies return type to Result<(), StoreError>, and switches reaggregation gate to duties_allowed(); change is coherent and the staleness bound (≤1 tick interval) is acknowledged and benign.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Tick as on_tick
    participant Block as on_receive_block / process_block
    participant Tracker as SyncStatusTracker
    participant Metrics as lean_node_sync_status

    Note over Tick,Metrics: BEFORE (race condition)
    Tick->>Tracker: update(current_slot, head, max_seen)
    Tracker-->>Metrics: set_node_sync_status (4-slot lag)
    Block->>Metrics: set_node_sync_status (0-slot strict)
    Note over Metrics: last writer wins — metric flaps

    Note over Tick,Metrics: AFTER (single source of truth)
    Tick->>Tracker: update(current_slot, head, max_seen)
    Tracker-->>Metrics: set_node_sync_status (4-slot lag + hysteresis)
    Block->>Block: process_block returns Result
    Block->>Tracker: duties_allowed()? (read last-tick state)
    Tracker-->>Block: true / false
    Block->>Block: run_reaggregate_from_block (if allowed)
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Tick as on_tick
    participant Block as on_receive_block / process_block
    participant Tracker as SyncStatusTracker
    participant Metrics as lean_node_sync_status

    Note over Tick,Metrics: BEFORE (race condition)
    Tick->>Tracker: update(current_slot, head, max_seen)
    Tracker-->>Metrics: set_node_sync_status (4-slot lag)
    Block->>Metrics: set_node_sync_status (0-slot strict)
    Note over Metrics: last writer wins — metric flaps

    Note over Tick,Metrics: AFTER (single source of truth)
    Tick->>Tracker: update(current_slot, head, max_seen)
    Tracker-->>Metrics: set_node_sync_status (4-slot lag + hysteresis)
    Block->>Block: process_block returns Result
    Block->>Tracker: duties_allowed()? (read last-tick state)
    Tracker-->>Block: true / false
    Block->>Block: run_reaggregate_from_block (if allowed)
Loading

Reviews (1): Last reviewed commit: "fix: unify sync-status definition across..." | Re-trigger Greptile

Comment thread crates/blockchain/src/lib.rs Outdated
@github-actions

Copy link
Copy Markdown

🤖 Claude Code Review

I now have full context. Here is the review:


PR #438 — fix: unify sync-status definition across metric, duties, and reaggregation

Overview: This PR fixes a genuine race condition where process_block and update_sync_status (called on tick) were both writing to lean_node_sync_status with conflicting definitions of "in sync". The fix removes the write from process_block, simplifies its return type from Result<bool, StoreError> to Result<(), StoreError>, and gates reaggregation on self.sync_status.duties_allowed() rather than a freshly computed strict boolean.


Correctness

The race condition is real and the fix is correct. The two conflicting writers were:

  • Tick path (update_sync_status): 4-slot lag threshold, hysteresis, network-stall detection.
  • Block import path (process_block): strict head_slot >= current_slot (0-slot tolerance, no hysteresis, no stall awareness).

Removing the write from process_block and delegating exclusively to SyncStatusTracker is the right call.

The reaggregation gate change is semantically sound and consistent. Line 352 (attestation production gate) already used self.sync_status.duties_allowed(); this PR aligns reaggregation to the same gate, making "in sync" have a single definition across all three duty paths.

Subtle behavioral difference worth acknowledging

The old if synced at the call site evaluated head_slot >= current_slot at import time — a fresh wall-clock comparison. The new self.sync_status.duties_allowed() reads the last-tick assessment, which can be up to one full slot stale (~4 seconds).

For reaggregation this is harmless in practice (the PR description explicitly notes this), but it's worth confirming one startup edge case:

SyncStatusTracker derives Default, so syncing initializes to false, meaning duties_allowed() returns true before the first tick runs update_sync_status. If blocks arrive in this window during an initial backfill, reaggregation would trigger. The old code would have correctly returned false (head behind current). However:

  • This startup window is very short (sub-second, first tick fires immediately).
  • run_reaggregate_from_block already guards on is_aggregator before publishing gossip (lines 820–822 of the current file), so gossip spam is impossible even if the sync gate is briefly open.
  • This same default-true behavior already existed for the attestation duty gate (line 352), so this is not a regression introduced here.

This is low risk, but it may be worth adding a comment that SyncStatusTracker::default() starts in the "synced" state intentionally (or updating the Default derive to reflect the intended conservative initial state, if the syncing: true start is preferable).

Code quality

  • The diff is minimal (15 additions, 22 deletions) and focused — exactly the right scope.
  • The docstring update on process_block is accurate and useful.
  • The comment at the run_reaggregate_from_block call site is updated correctly and explains the invariant well.
  • Inlining self.store.head_slot() directly into metrics::update_head_slot(...) is fine now that the head_slot variable has no other uses.
  • Removing INTERVALS_PER_SLOT from process_block (since current_slot computation is gone) is correct — assuming it's still imported elsewhere.

Testing

The PR reports 38 existing unit tests pass and clippy/fmt are clean. There are no new tests specifically verifying the single-writer invariant (i.e., that process_block no longer touches lean_node_sync_status). A test asserting that the metric is unchanged after process_block would lock in this fix against regression, but it's not strictly required for a change this small.

Summary

The fix is correct, the implementation is clean, and the motivating race condition is clearly documented. The only non-critical note is the default-true startup state for SyncStatusTracker, which is pre-existing and harmless given the is_aggregator guard in run_reaggregate_from_block. Ready to merge.


Automated review by Claude (Anthropic) · sonnet · custom prompt

Comment thread crates/blockchain/src/lib.rs Outdated
@pablodeymo pablodeymo merged commit 6478c4a into main Jun 16, 2026
2 checks passed
@pablodeymo pablodeymo deleted the fix/dedup-sync-status-heuristic branch June 16, 2026 19:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants