fix: stabilize raft leader failover (election livelock, follower cache staleness, test race)#1373
fix: stabilize raft leader failover (election livelock, follower cache staleness, test race)#1373bplatz wants to merge 2 commits into
Conversation
…e staleness, test race) Three independent issues surfaced by leader_failover_resumes_writes, which flaked ~60% locally with three distinct panics: 1. Election livelock (no new leader): default election timeout (150-300ms) sat below NetworkConfig::rpc_timeout (500ms) and near connect_timeout (250ms). A candidate re-elected before its vote RPCs resolved, so votes landed on stale terms and no quorum formed; survivors climbed to the same term with no leader. Raise default election timeout to 750-1500ms, safely above the RPC timeouts, with a wide spread for desync. 2. Follower cache staleness (committed write invisible on followers): the local cache event listener reconciled only on LedgerIndexPublished, never LedgerCommitPublished. A committed-but-unindexed write replicated and applied on every node, yet only the staging node's cache reflected it. Reconcile on commit publishes too (matches the function's documented contract). 3. Test race: the test read the new leader from one node then immediately wrote through a different surviving follower that had not yet learned the leader (retryable 503). Wait for cluster-wide leader agreement first. leader_failover_resumes_writes now 35/35; full raft_multi_node + raft_integration green.
…ariant Review follow-ups: - Cache listener now recovers from broadcast lag. RecvError::Lagged previously only logged; a dropped event could be the sole notice a ledger gets, leaving its cached view stale until the next event. On lag, sweep every currently-loaded ledger and reconcile it against the nameservice head (no-op when already current) — restoring the catch-up fallback the design assumes. Reconcile logic extracted into a shared helper. - Make the election_timeout > rpc_timeout invariant explicit instead of a hardcoded pair. default_raft_config now derives the election window from NetworkConfig (2x rpc_timeout, 750ms floor, 2x spread), and bootstrap() warns when election_timeout_min <= rpc_timeout so an override of either config that reintroduces the livelock risk is surfaced at startup.
aaj3f
left a comment
There was a problem hiding this comment.
Seems like a good & correct fix. Only one detailed note below. At some point it may be valuable to have a kind of firehose bench to assess the conditions under which we start to see Lagged get triggered and to validate the fallback handling. There may be improvements we could make to take pressure off the broadcast channel or to better improve fallback reconciliation to recover from it, but not likely worth doing now
| Err(tokio::sync::broadcast::error::RecvError::Lagged(skipped)) => { | ||
| // The channel overflowed and dropped events. With | ||
| // commit publishes on the bus this is the only | ||
| // notice some ledger gets, so a bare log would leave | ||
| // its cache stale until the next event happens to | ||
| // land. Fall back to a catch-up sweep: re-reconcile | ||
| // every currently-loaded ledger against the | ||
| // nameservice head (a no-op for any already current). | ||
| let aliases = ledger_manager.cached_aliases().await; | ||
| tracing::warn!( | ||
| skipped, | ||
| "local cache event listener lagged behind nameservice events" | ||
| ledgers = aliases.len(), | ||
| "local cache event listener lagged; sweeping loaded ledgers" | ||
| ); | ||
| for alias in aliases { | ||
| reconcile_cached_ledger(&ledger_manager, &alias).await; | ||
| } | ||
| } |
There was a problem hiding this comment.
I don't know if we've observed this Lagged state demonstrably, but my only concern is that, because the reconciliation sweep happens within an inner loop and so isn't helping drain the receiver, if the listener is lagging because reconciles are slow, then it could possibly contribute to a re-trigger of Lagged. Perhaps this could be hardened w/ a future bound or a scope to sweep only ledgers that are provably behind etc? Or we can just make a comment for now and revisit only if we prove this issue can/does arise
Background
CI/CD was intermittently failing on
fluree-db-server::raft_multi_node leader_failover_resumes_writes— a transient red that didn't reproduce on a re-run and didn't obviously point at any one change. Reproducing it locally showed a ~60% failure rate with three distinct panics, which turned out to be three independent issues. Two are real product bugs in the consensus layer; the third is a race in the test itself. The flakiness was self-masking: the test usually failed at an earlier stage before it could exercise the most important bug.Root causes & fixes
1. Election livelock — no new leader after failover (product)
The default election timeout (openraft's stock 150–300 ms) sat below
NetworkConfig::rpc_timeout(500 ms) and nearconnect_timeout(250 ms to a dead-but-not-yet-evicted voter). A candidate re-elected — bumping its term — before its own vote RPCs from the previous round resolved, so every vote landed on a now-stale term and no quorum ever formed. Survivors livelocked: all four climbed to the same term withleader = Noneand never converged.Fix: derive the election window from the transport timeouts so the invariant
election_timeout_min > rpc_timeoutholds by construction (2× rpc_timeout, 750 ms floor, 2× spread → 1000–2000 ms at the defaults).RaftIntegration::bootstrapalso warns at startup ifelection_timeout_min <= rpc_timeout, so an embedder who tunes either config independently and reintroduces the risk is told about it.2. Committed write invisible on followers (product — the actual CI failure)
After a write replicated and applied on every node (all at the same raft applied index), only the staging node's query reflected it; followers kept serving the prior head. The local cache event listener reconciled the ledger view only on
LedgerIndexPublished, neverLedgerCommitPublished— so a committed-but-unindexed write left follower caches stale until some later index publish happened to land. The leader stayed correct because its own commit path updates its cache directly.Fix: reconcile on commit publishes as well as index publishes (matching the listener's own documented contract). While here, the listener now also recovers from broadcast
Lagged: rather than only logging, it sweeps the loaded ledgers and reconciles each against the nameservice head, so a dropped event can't strand a ledger's cache.3. Test race on a transient leadership 503 (test)
The test read the new leader from one node, then immediately wrote through a different surviving follower that hadn't yet learned the new leader — getting a retryable
503 "no leader currently elected". It now waits until every surviving node agrees on the new leader before resuming writes.Verification
leader_failover_resumes_writeswas run in a tight loop (90+ iterations across the iterations of this change) and is now consistently green; the fullraft_multi_nodeandraft_integrationsuites pass, as do thefluree-db-apiunit tests. fmt + clippy clean.