Skip to content

fix: stabilize raft leader failover (election livelock, follower cache staleness, test race)#1373

Open
bplatz wants to merge 2 commits into
mainfrom
fix/raft-failover-flake
Open

fix: stabilize raft leader failover (election livelock, follower cache staleness, test race)#1373
bplatz wants to merge 2 commits into
mainfrom
fix/raft-failover-flake

Conversation

@bplatz

@bplatz bplatz commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Background

CI/CD was intermittently failing on fluree-db-server::raft_multi_node leader_failover_resumes_writes — a transient red that didn't reproduce on a re-run and didn't obviously point at any one change. Reproducing it locally showed a ~60% failure rate with three distinct panics, which turned out to be three independent issues. Two are real product bugs in the consensus layer; the third is a race in the test itself. The flakiness was self-masking: the test usually failed at an earlier stage before it could exercise the most important bug.

Root causes & fixes

1. Election livelock — no new leader after failover (product)

The default election timeout (openraft's stock 150–300 ms) sat below NetworkConfig::rpc_timeout (500 ms) and near connect_timeout (250 ms to a dead-but-not-yet-evicted voter). A candidate re-elected — bumping its term — before its own vote RPCs from the previous round resolved, so every vote landed on a now-stale term and no quorum ever formed. Survivors livelocked: all four climbed to the same term with leader = None and never converged.

Fix: derive the election window from the transport timeouts so the invariant election_timeout_min > rpc_timeout holds by construction (2× rpc_timeout, 750 ms floor, 2× spread → 1000–2000 ms at the defaults). RaftIntegration::bootstrap also warns at startup if election_timeout_min <= rpc_timeout, so an embedder who tunes either config independently and reintroduces the risk is told about it.

2. Committed write invisible on followers (product — the actual CI failure)

After a write replicated and applied on every node (all at the same raft applied index), only the staging node's query reflected it; followers kept serving the prior head. The local cache event listener reconciled the ledger view only on LedgerIndexPublished, never LedgerCommitPublished — so a committed-but-unindexed write left follower caches stale until some later index publish happened to land. The leader stayed correct because its own commit path updates its cache directly.

Fix: reconcile on commit publishes as well as index publishes (matching the listener's own documented contract). While here, the listener now also recovers from broadcast Lagged: rather than only logging, it sweeps the loaded ledgers and reconciles each against the nameservice head, so a dropped event can't strand a ledger's cache.

3. Test race on a transient leadership 503 (test)

The test read the new leader from one node, then immediately wrote through a different surviving follower that hadn't yet learned the new leader — getting a retryable 503 "no leader currently elected". It now waits until every surviving node agrees on the new leader before resuming writes.

Verification

leader_failover_resumes_writes was run in a tight loop (90+ iterations across the iterations of this change) and is now consistently green; the full raft_multi_node and raft_integration suites pass, as do the fluree-db-api unit tests. fmt + clippy clean.

bplatz added 2 commits June 25, 2026 07:28
…e staleness, test race)

Three independent issues surfaced by leader_failover_resumes_writes, which
flaked ~60% locally with three distinct panics:

1. Election livelock (no new leader): default election timeout (150-300ms)
   sat below NetworkConfig::rpc_timeout (500ms) and near connect_timeout
   (250ms). A candidate re-elected before its vote RPCs resolved, so votes
   landed on stale terms and no quorum formed; survivors climbed to the same
   term with no leader. Raise default election timeout to 750-1500ms, safely
   above the RPC timeouts, with a wide spread for desync.

2. Follower cache staleness (committed write invisible on followers): the
   local cache event listener reconciled only on LedgerIndexPublished, never
   LedgerCommitPublished. A committed-but-unindexed write replicated and
   applied on every node, yet only the staging node's cache reflected it.
   Reconcile on commit publishes too (matches the function's documented
   contract).

3. Test race: the test read the new leader from one node then immediately
   wrote through a different surviving follower that had not yet learned the
   leader (retryable 503). Wait for cluster-wide leader agreement first.

leader_failover_resumes_writes now 35/35; full raft_multi_node + raft_integration green.
…ariant

Review follow-ups:

- Cache listener now recovers from broadcast lag. RecvError::Lagged
  previously only logged; a dropped event could be the sole notice a
  ledger gets, leaving its cached view stale until the next event. On
  lag, sweep every currently-loaded ledger and reconcile it against the
  nameservice head (no-op when already current) — restoring the
  catch-up fallback the design assumes. Reconcile logic extracted into
  a shared helper.

- Make the election_timeout > rpc_timeout invariant explicit instead of
  a hardcoded pair. default_raft_config now derives the election window
  from NetworkConfig (2x rpc_timeout, 750ms floor, 2x spread), and
  bootstrap() warns when election_timeout_min <= rpc_timeout so an
  override of either config that reintroduces the livelock risk is
  surfaced at startup.
@bplatz bplatz requested review from aaj3f and zonotope June 25, 2026 11:53

@aaj3f aaj3f left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good & correct fix. Only one detailed note below. At some point it may be valuable to have a kind of firehose bench to assess the conditions under which we start to see Lagged get triggered and to validate the fallback handling. There may be improvements we could make to take pressure off the broadcast channel or to better improve fallback reconciliation to recover from it, but not likely worth doing now

Comment thread fluree-db-api/src/lib.rs
Comment on lines 1235 to 1252
Err(tokio::sync::broadcast::error::RecvError::Lagged(skipped)) => {
// The channel overflowed and dropped events. With
// commit publishes on the bus this is the only
// notice some ledger gets, so a bare log would leave
// its cache stale until the next event happens to
// land. Fall back to a catch-up sweep: re-reconcile
// every currently-loaded ledger against the
// nameservice head (a no-op for any already current).
let aliases = ledger_manager.cached_aliases().await;
tracing::warn!(
skipped,
"local cache event listener lagged behind nameservice events"
ledgers = aliases.len(),
"local cache event listener lagged; sweeping loaded ledgers"
);
for alias in aliases {
reconcile_cached_ledger(&ledger_manager, &alias).await;
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we've observed this Lagged state demonstrably, but my only concern is that, because the reconciliation sweep happens within an inner loop and so isn't helping drain the receiver, if the listener is lagging because reconciles are slow, then it could possibly contribute to a re-trigger of Lagged. Perhaps this could be hardened w/ a future bound or a scope to sweep only ledgers that are provably behind etc? Or we can just make a comment for now and revisit only if we prove this issue can/does arise

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants