[DEV] Improve Failover Restore / Replication Handling#1674
Merged
vazois merged 10 commits into microsoft:dev on Apr 7, 2026
Conversation
Contributor
Pull request overview
This PR addresses cluster failover correctness by preventing replication auto-resync logic from interfering with in-flight CLUSTER FAILOVER, and by adding a recovery path to reset the original primary if a DEFAULT failover aborts after failstopwrites has already modified the primary’s persisted config.
Changes:
- Suppress `ReplicationManager.EnsureReplication` auto-resync while a failover is in an active (non-terminal) state.
- Ensure the primary is reset on failover failure paths after `failstopwrites` was confirmed, and add a DEBUG-only exception injection point to test this.
- Add cluster tests covering the EnsureReplication/failover interaction and the primary-reset recovery behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| libs/cluster/Server/Replication/ReplicationManager.cs | Adds a failover-state guard to avoid EnsureReplication resync/locks during active failover. |
| libs/cluster/Server/Failover/ReplicaFailoverSession.cs | Adds DEBUG exception injection, tracks failover success, and attempts to reset the primary in `finally` on failed failover. |
| libs/common/ExceptionInjectionType.cs | Adds a new DEBUG-only exception injection enum value for failover takeover failure simulation. |
| test/Garnet.test.cluster/ClusterNegativeTests.cs | Adds tests validating failover success during EnsureReplication polling and recovery/reset behavior on takeover failure. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
vazois reviewed Apr 7, 2026
vazois approved these changes Apr 7, 2026
Prevent EnsureReplication from sabotaging in-flight failovers and recover from partial failover failures
Summary
This PR fixes two related bugs in the cluster failover path:
1. `EnsureReplication` races with `CLUSTER FAILOVER`: The `ReplicationManager.EnsureReplication` background poller can issue `CLUSTER REPLICATE` during an in-flight failover, sabotaging the promotion and leaving the cluster in an incoherent state.
2. No recovery when failover fails after `failstopwrites`: If `TakeOverAsPrimary` fails after `PauseWritesAndWaitForSync` has already sent `failstopwrites` to the primary, the primary is left in a modified state (no slots, replica role) with no node to serve those slots.

The Core Problem
During a `CLUSTER FAILOVER DEFAULT`, the replica sends `failstopwrites` to its primary, which immediately and irreversibly modifies the primary's config — making it a replica and reassigning all its slots. The replica then waits for replication offsets to converge before calling `TakeOverAsPrimary` to claim those slots.

`EnsureReplication` runs on a background polling timer inside the Garnet process. When it detects a broken replication session (which can happen during failover), it acquires a `ReadRole` lock and issues `CLUSTER REPLICATE` to re-establish replication — completely unaware that a failover is in progress.

This creates two failure modes:
1. `EnsureReplication`'s `ReadRole` read lock blocks `TakeOverAsPrimary`'s `ClusterFailover` write lock (`TryWriteLock` fails non-blockingly, aborting the failover).
2. `EnsureReplication`'s `CLUSTER REPLICATE` re-establishes replication to the primary the replica is trying to take over from, making offset sync impossible.

Race Condition Sequence
1. `CLUSTER FAILOVER` received → `status = BEGIN_FAILOVER`
2. `PauseWritesAndWaitForSync` → sends `failstopwrites(localNodeId)` to primary
3. Primary handles `failstopwrites` → `TryStopWrites`: gives up all slots, becomes replica of the failover node, bumps epoch, flushes config to disk
4. `status = WAITING_FOR_SYNC` → spinning on `primaryReplicationOffset > ReplicationOffset`
5. The replication session to the primary breaks during the failover (`IsReplicating = false`)
6. `EnsureReplication` fires: `IsReplica`? ✅ `RemoteNodeId == primaryId`? ✅ `IsReplicating`? ❌ → all conditions met
7. `PreventRoleChange()` → `BeginRecovery(ReadRole)` → acquires read lock on `recoverLock`
8. `Task.Run` → issues `CLUSTER REPLICATE` back to primary → re-establishes replication session
9. `PauseWritesAndWaitForSync` returns `false` (timeout) → `BeginAsyncReplicaFailover` returns `false`
10. `TakeOverAsPrimary` never called → replica never claims slots
11. `finally`: `status = NO_FAILOVER`, `lastFailoverStatus = FAILOVER_ABORTED`

Resulting Incoherent Topology
No node can serve reads or writes for the affected slots. The primary's config is persisted to disk, so this state survives restarts.
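To make that end state concrete, here is a small Python model of the aborted failover's outcome. This is a hedged sketch: `NodeConfig`, `fail_stop_writes`, and the node id are invented for illustration and are not Garnet's actual C# types.

```python
from dataclasses import dataclass, field

TOTAL_SLOTS = 16384  # slot count referenced by the tests below

@dataclass
class NodeConfig:
    role: str
    slots: set = field(default_factory=set)

def fail_stop_writes(primary: NodeConfig, failover_node_id: str) -> None:
    """Model of the primary's side of failstopwrites: irreversibly give up
    all slots and demote to replica (Garnet also persists this to disk)."""
    primary.slots.clear()
    primary.role = f"replica-of:{failover_node_id}"

primary = NodeConfig(role="master", slots=set(range(TOTAL_SLOTS)))
replica = NodeConfig(role="replica")

fail_stop_writes(primary, "replica-node-id")
# The failover then aborts before TakeOverAsPrimary runs, so the replica
# never claims the slots the primary just gave up:
owned = len(primary.slots) + len(replica.slots)
assert owned == 0  # all 16384 slots are orphaned
assert primary.role != "master" and replica.role != "master"
```

Because the demotion is persisted before the takeover is known to succeed, the orphaned-slots state survives process restarts, which is why a recovery path is needed.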
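Failure mode 1 above (the poller's read lock defeating the failover's non-blocking write-lock attempt) can be sketched with a minimal reader/writer lock model. This is illustrative Python, not Garnet's C# lock; `try_write_lock` stands in for the non-blocking `TryWriteLock`.

```python
import threading

class SingleWriterMultiReaderLock:
    """Toy single-writer/multi-reader lock: readers (like the
    EnsureReplication poller) hold shared access, and the failover
    path attempts a NON-BLOCKING exclusive acquire."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._readers = 0
        self._writer = False

    def try_read_lock(self) -> bool:
        with self._mutex:
            if self._writer:
                return False
            self._readers += 1
            return True

    def read_unlock(self) -> None:
        with self._mutex:
            self._readers -= 1

    def try_write_lock(self) -> bool:
        # Fails immediately instead of waiting, mirroring TryWriteLock.
        with self._mutex:
            if self._writer or self._readers > 0:
                return False
            self._writer = True
            return True

lock = SingleWriterMultiReaderLock()

# BeginRecovery(ReadRole) takes a read lock...
assert lock.try_read_lock()
# ...so the takeover's non-blocking write-lock attempt fails,
# aborting the failover instead of waiting for the poller.
assert not lock.try_write_lock()
lock.read_unlock()
# Once the poller releases the read lock, takeover could proceed.
assert lock.try_write_lock()
```

The design choice to fail non-blockingly is what turns a transient poller wakeup into a full failover abort, which motivates suppressing the poller instead of queueing behind it.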
Downstream Impact
Without recovery, this incoherent topology can cause:
Changes
1. Suppress `EnsureReplication` during active failover (`ReplicationManager.cs`)

Added an early-exit guard in `EnsureReplication` that checks `failoverManager.lastFailoverStatus`. If the failover is in any active state (`BEGIN_FAILOVER`, `ISSUING_PAUSE_WRITES`, `WAITING_FOR_SYNC`, `FAILOVER_IN_PROGRESS`, `TAKING_OVER_AS_PRIMARY`), `EnsureReplication` returns immediately without acquiring any locks. Terminal states (`FAILOVER_COMPLETED`, `FAILOVER_ABORTED`) are intentionally excluded so `EnsureReplication` resumes normal operation after failover completes.

2. Reset primary on failover failure (`ReplicaFailoverSession.cs`)

Moved the primary reset logic (`failstopwrites([])`) from the `TakeOverAsPrimary` failure path into the `finally` block of `BeginAsyncReplicaFailover`. This ensures the primary is reset for all failure paths after `failstopwrites` was sent — including a timeout in `PauseWritesAndWaitForSync`, exceptions, and `TakeOverAsPrimary` failure. Previously, the reset was only attempted when `TakeOverAsPrimary` explicitly returned `false`, which was unreachable when the failover failed during the sync wait phase.

3. Exception injection for testing (`ExceptionInjectionType.cs`, `ReplicaFailoverSession.cs`)

Added a `Failover_Fail_TakeOverAsPrimary` exception injection point (DEBUG only) to simulate `TakeOverAsPrimary` failure after `PauseWritesAndWaitForSync` succeeds, enabling deterministic testing of the recovery path.

Tests

- `ClusterFailoverResetsPrimaryOnTakeOverFailure` (new, DEBUG): Verifies that when `TakeOverAsPrimary` fails after `failstopwrites` was sent, the primary is reset back to its original state — owns all 16384 slots, has `master` role, and data remains accessible.
- `ClusterFailoverSucceedsDuringEnsureReplication` (new): Verifies that a `CLUSTER FAILOVER` completes successfully while `EnsureReplication` is actively polling (1-second interval).
- `ClusterEnsureReplicationWorksAfterFailover` (new): Verifies that `EnsureReplication` resumes normal operation after failover completes — the guard only suppresses during active failover states.
- `ClusterFailoverDuringRecovery` (existing, unmodified): Uses `TAKEOVER`, which skips `PauseWritesAndWaitForSync`, so the new reset path doesn't trigger. Unaffected by these changes.
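The active-state guard from change 1 reduces to a set-membership check on the failover status. A hedged Python sketch follows; the enum values are copied from the description above, while the function name and shape are invented (the real code is C# in `ReplicationManager.cs`).

```python
from enum import Enum, auto

class FailoverStatus(Enum):
    NO_FAILOVER = auto()
    BEGIN_FAILOVER = auto()
    ISSUING_PAUSE_WRITES = auto()
    WAITING_FOR_SYNC = auto()
    FAILOVER_IN_PROGRESS = auto()
    TAKING_OVER_AS_PRIMARY = auto()
    FAILOVER_COMPLETED = auto()
    FAILOVER_ABORTED = auto()

# Active (non-terminal) states suppress the poller. Terminal states do
# not, so EnsureReplication resumes whether the failover succeeded or
# aborted.
ACTIVE_STATES = {
    FailoverStatus.BEGIN_FAILOVER,
    FailoverStatus.ISSUING_PAUSE_WRITES,
    FailoverStatus.WAITING_FOR_SYNC,
    FailoverStatus.FAILOVER_IN_PROGRESS,
    FailoverStatus.TAKING_OVER_AS_PRIMARY,
}

def ensure_replication_should_run(last_failover_status: FailoverStatus) -> bool:
    """Early-exit guard: skip resync/locks while a failover is in flight."""
    return last_failover_status not in ACTIVE_STATES

assert not ensure_replication_should_run(FailoverStatus.WAITING_FOR_SYNC)
assert ensure_replication_should_run(FailoverStatus.FAILOVER_ABORTED)
assert ensure_replication_should_run(FailoverStatus.FAILOVER_COMPLETED)
```

Checking the status before taking any lock is what prevents the `ReadRole`/`TryWriteLock` conflict described earlier.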
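Change 2's move of the reset into the `finally` block can be illustrated with a small control-flow sketch. Everything here is hypothetical Python standing in for the C# logic in `BeginAsyncReplicaFailover`; the `actions` log stands in for commands sent to the primary.

```python
def begin_async_replica_failover(pause_and_sync, take_over, actions) -> bool:
    """Sketch of the failover session: reset the primary in finally for
    ANY failure once failstopwrites has gone out."""
    failstopwrites_sent = False
    success = False
    try:
        actions.append("failstopwrites(localNodeId)")  # demotes the primary
        failstopwrites_sent = True
        if not pause_and_sync():   # timeout during the sync wait phase
            return False
        success = take_over()      # may fail (or raise, in DEBUG injection)
        return success
    finally:
        # New behavior: previously the reset ran only when take_over
        # explicitly returned False, missing the timeout/exception paths.
        if failstopwrites_sent and not success:
            actions.append("failstopwrites([])")  # reset the primary

# Timeout during sync wait: reset now happens (previously missed).
log = []
assert begin_async_replica_failover(lambda: False, lambda: True, log) is False
assert log[-1] == "failstopwrites([])"

# TakeOverAsPrimary failure: reset happens.
log = []
assert begin_async_replica_failover(lambda: True, lambda: False, log) is False
assert log[-1] == "failstopwrites([])"

# Successful failover: no reset is issued.
log = []
assert begin_async_replica_failover(lambda: True, lambda: True, log) is True
assert log == ["failstopwrites(localNodeId)"]
```

Because `finally` also runs when `take_over` raises, the same reset covers the DEBUG exception-injection path exercised by `ClusterFailoverResetsPrimaryOnTakeOverFailure`.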
ClusterFailoverResetsPrimaryOnTakeOverFailure(new, DEBUG): Verifies that whenTakeOverAsPrimaryfails afterfailstopwriteswas sent, the primary is reset back to its original state — owns all 16384 slots, hasmasterrole, and data remains accessible.ClusterFailoverSucceedsDuringEnsureReplication(new): Verifies that aCLUSTER FAILOVERcompletes successfully whenEnsureReplicationis actively polling (1-second interval).ClusterEnsureReplicationWorksAfterFailover(new): Verifies thatEnsureReplicationresumes normal operation after failover completes — the guard only suppresses during active failover states.ClusterFailoverDuringRecovery(existing, unmodified): UsesTAKEOVERwhich skipsPauseWritesAndWaitForSync, so the new reset path doesn't trigger. Unaffected by these changes.