fix: stabilize raft leader failover (election livelock, follower cache staleness, test race) by bplatz · Pull Request #1373 · fluree/db

bplatz · 2026-06-25T11:53:39Z

Background

CI/CD was intermittently failing on fluree-db-server::raft_multi_node leader_failover_resumes_writes — a transient red that didn't reproduce on a re-run and didn't obviously point at any one change. Reproducing it locally showed a ~60% failure rate with three distinct panics, which turned out to be three independent issues. Two are real product bugs in the consensus layer; the third is a race in the test itself. The flakiness was self-masking: the test usually failed at an earlier stage before it could exercise the most important bug.

Root causes & fixes

1. Election livelock — no new leader after failover (product)

The default election timeout (openraft's stock 150–300 ms) sat below NetworkConfig::rpc_timeout (500 ms) and near connect_timeout (250 ms to a dead-but-not-yet-evicted voter). A candidate re-elected — bumping its term — before its own vote RPCs from the previous round resolved, so every vote landed on a now-stale term and no quorum ever formed. Survivors livelocked: all four climbed to the same term with leader = None and never converged.

Fix: derive the election window from the transport timeouts so the invariant election_timeout_min > rpc_timeout holds by construction (2× rpc_timeout, 750 ms floor, 2× spread → 1000–2000 ms at the defaults). RaftIntegration::bootstrap also warns at startup if election_timeout_min <= rpc_timeout, so an embedder who tunes either config independently and reintroduces the risk is told about it.

2. Committed write invisible on followers (product — the actual CI failure)

After a write replicated and applied on every node (all at the same raft applied index), only the staging node's query reflected it; followers kept serving the prior head. The local cache event listener reconciled the ledger view only on LedgerIndexPublished, never LedgerCommitPublished — so a committed-but-unindexed write left follower caches stale until some later index publish happened to land. The leader stayed correct because its own commit path updates its cache directly.

Fix: reconcile on commit publishes as well as index publishes (matching the listener's own documented contract). While here, the listener now also recovers from broadcast Lagged: rather than only logging, it sweeps the loaded ledgers and reconciles each against the nameservice head, so a dropped event can't strand a ledger's cache.

3. Test race on a transient leadership 503 (test)

The test read the new leader from one node, then immediately wrote through a different surviving follower that hadn't yet learned the new leader — getting a retryable 503 "no leader currently elected". It now waits until every surviving node agrees on the new leader before resuming writes.

Verification

leader_failover_resumes_writes was run in a tight loop (90+ iterations across the iterations of this change) and is now consistently green; the full raft_multi_node and raft_integration suites pass, as do the fluree-db-api unit tests. fmt + clippy clean.

…e staleness, test race) Three independent issues surfaced by leader_failover_resumes_writes, which flaked ~60% locally with three distinct panics: 1. Election livelock (no new leader): default election timeout (150-300ms) sat below NetworkConfig::rpc_timeout (500ms) and near connect_timeout (250ms). A candidate re-elected before its vote RPCs resolved, so votes landed on stale terms and no quorum formed; survivors climbed to the same term with no leader. Raise default election timeout to 750-1500ms, safely above the RPC timeouts, with a wide spread for desync. 2. Follower cache staleness (committed write invisible on followers): the local cache event listener reconciled only on LedgerIndexPublished, never LedgerCommitPublished. A committed-but-unindexed write replicated and applied on every node, yet only the staging node's cache reflected it. Reconcile on commit publishes too (matches the function's documented contract). 3. Test race: the test read the new leader from one node then immediately wrote through a different surviving follower that had not yet learned the leader (retryable 503). Wait for cluster-wide leader agreement first. leader_failover_resumes_writes now 35/35; full raft_multi_node + raft_integration green.

…ariant Review follow-ups: - Cache listener now recovers from broadcast lag. RecvError::Lagged previously only logged; a dropped event could be the sole notice a ledger gets, leaving its cached view stale until the next event. On lag, sweep every currently-loaded ledger and reconcile it against the nameservice head (no-op when already current) — restoring the catch-up fallback the design assumes. Reconcile logic extracted into a shared helper. - Make the election_timeout > rpc_timeout invariant explicit instead of a hardcoded pair. default_raft_config now derives the election window from NetworkConfig (2x rpc_timeout, 750ms floor, 2x spread), and bootstrap() warns when election_timeout_min <= rpc_timeout so an override of either config that reintroduces the livelock risk is surfaced at startup.

aaj3f

Seems like a good & correct fix. Only one detailed note below. At some point it may be valuable to have a kind of firehose bench to assess the conditions under which we start to see Lagged get triggered and to validate the fallback handling. There may be improvements we could make to take pressure off the broadcast channel or to better improve fallback reconciliation to recover from it, but not likely worth doing now

aaj3f · 2026-06-27T17:52:54Z

                Err(tokio::sync::broadcast::error::RecvError::Lagged(skipped)) => {
+                    // The channel overflowed and dropped events. With
+                    // commit publishes on the bus this is the only
+                    // notice some ledger gets, so a bare log would leave
+                    // its cache stale until the next event happens to
+                    // land. Fall back to a catch-up sweep: re-reconcile
+                    // every currently-loaded ledger against the
+                    // nameservice head (a no-op for any already current).
+                    let aliases = ledger_manager.cached_aliases().await;
                    tracing::warn!(
                        skipped,
-                        "local cache event listener lagged behind nameservice events"
+                        ledgers = aliases.len(),
+                        "local cache event listener lagged; sweeping loaded ledgers"
                    );
+                    for alias in aliases {
+                        reconcile_cached_ledger(&ledger_manager, &alias).await;
+                    }
                }


I don't know if we've observed this Lagged state demonstrably, but my only concern is that, because the reconciliation sweep happens within an inner loop and so isn't helping drain the receiver, if the listener is lagging because reconciles are slow, then it could possibly contribute to a re-trigger of Lagged. Perhaps this could be hardened w/ a future bound or a scope to sweep only ledgers that are provably behind etc? Or we can just make a comment for now and revisit only if we prove this issue can/does arise

bplatz added 2 commits June 25, 2026 07:28

bplatz requested review from aaj3f and zonotope June 25, 2026 11:53

aaj3f approved these changes Jun 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: stabilize raft leader failover (election livelock, follower cache staleness, test race)#1373

fix: stabilize raft leader failover (election livelock, follower cache staleness, test race)#1373
bplatz wants to merge 2 commits into
mainfrom
fix/raft-failover-flake

bplatz commented Jun 25, 2026

Uh oh!

aaj3f left a comment

Uh oh!

aaj3f Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

bplatz commented Jun 25, 2026

Background

Root causes & fixes

1. Election livelock — no new leader after failover (product)

2. Committed write invisible on followers (product — the actual CI failure)

3. Test race on a transient leadership 503 (test)

Verification

Uh oh!

aaj3f left a comment

Choose a reason for hiding this comment

Uh oh!

aaj3f Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants