
[DEV] Improve Failover Restore / Replication Handling#1674

Merged
vazois merged 10 commits into microsoft:dev from kevinmichaelbowersox:users/kbowersox/prevent-replicate-on-failover-dev
Apr 7, 2026

Conversation

@kevinmichaelbowersox
Member

Prevent EnsureReplication from sabotaging in-flight failovers and recover from partial failover failures

Summary

This PR fixes two related bugs in the cluster failover path:

  1. EnsureReplication races with CLUSTER FAILOVER: The ReplicationManager.EnsureReplication background poller can issue CLUSTER REPLICATE during an in-flight failover, sabotaging the promotion and leaving the cluster in an incoherent state.

  2. No recovery when failover fails after failstopwrites: If TakeOverAsPrimary fails after PauseWritesAndWaitForSync has already sent failstopwrites to the primary, the primary is left in a modified state (no slots, replica role) with no node to serve those slots.

The Core Problem

During a CLUSTER FAILOVER DEFAULT, the replica sends failstopwrites to its primary, which immediately and irreversibly modifies the primary's config — making it a replica and reassigning all its slots. The replica then waits for replication offsets to converge before calling TakeOverAsPrimary to claim those slots.
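To make the ordering concrete, here is a minimal Python model of the DEFAULT failover sequence described above. The names (Node, receive_failstopwrites, begin_replica_failover) are illustrative stand-ins for the C# implementation, not Garnet's actual API, and for simplicity failstopwrites is sent at the start of the sync wait.

```python
import time

class Node:
    """Toy cluster node tracking only role, slot ownership, and offset."""
    def __init__(self, node_id, slots):
        self.node_id = node_id
        self.role = "master" if slots else "replica"
        self.slots = set(slots)
        self.replication_offset = 0
        self.replicating_from = None

    def receive_failstopwrites(self, new_primary_id):
        # Irreversible on receipt: drop all slots and demote immediately.
        self.slots.clear()
        self.role = "replica"
        self.replicating_from = new_primary_id

def begin_replica_failover(primary, replica, timeout_s=1.0):
    # Step 1: the primary's config is modified the moment this lands.
    primary.receive_failstopwrites(replica.node_id)
    # Step 2: wait for replication offsets to converge before taking over.
    deadline = time.monotonic() + timeout_s
    while replica.replication_offset < primary.replication_offset:
        if time.monotonic() >= deadline:
            return False  # timeout: primary already gave up its slots
        time.sleep(0.01)
    # Step 3: only now does the replica claim the slots (TakeOverAsPrimary).
    replica.role, replica.slots = "master", set(range(16384))
    return True
```

The key hazard is visible in the timeout path: the primary is modified in step 1, but nothing restores it if step 3 is never reached.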

EnsureReplication runs on a background polling timer inside the Garnet process. When it detects a broken replication session (which can happen during failover), it acquires a ReadRole lock and issues CLUSTER REPLICATE to re-establish replication — completely unaware that a failover is in progress.

This creates two failure modes:

  • Lock contention: EnsureReplication's ReadRole read lock blocks TakeOverAsPrimary's ClusterFailover write lock (the non-blocking TryWriteLock attempt fails immediately, aborting the failover)
  • State corruption: EnsureReplication's CLUSTER REPLICATE re-establishes replication to the very primary the replica is trying to take over from, making offset sync impossible
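The lock-contention failure mode can be modeled in a few lines. The SingleWriterMultiReaderLock below is a toy stand-in for the lock used in Garnet (write acquisition is elided; try_write_lock only reports whether it would succeed), so this shows the shape of the race, not the real implementation.

```python
import threading

class SingleWriterMultiReaderLock:
    """Toy single-writer/multi-reader lock: readers counted under a mutex."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._readers = 0

    def read_lock(self):
        with self._mutex:
            self._readers += 1

    def read_unlock(self):
        with self._mutex:
            self._readers -= 1

    def try_write_lock(self):
        # Non-blocking, like TryWriteLock: fails at once if any reader holds it.
        with self._mutex:
            return self._readers == 0

recover_lock = SingleWriterMultiReaderLock()
recover_lock.read_lock()                         # EnsureReplication: BeginRecovery(ReadRole)
assert recover_lock.try_write_lock() is False    # TakeOverAsPrimary: failover aborts
recover_lock.read_unlock()                       # only after the resync completes
assert recover_lock.try_write_lock() is True
```

Because the write-lock attempt is non-blocking by design, any reader held across the takeover window is enough to abort the failover.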

Race Condition Sequence

| Step | Time | Failover Task (replica) | EnsureReplication (replica) | Primary |
| --- | --- | --- | --- | --- |
| 1 | T+0s | CLUSTER FAILOVER received → status = BEGIN_FAILOVER | | |
| 2 | T+0s | PauseWritesAndWaitForSync → sends failstopwrites(localNodeId) to primary | | Receives failstopwrites; TryStopWrites: gives up all slots, becomes replica of the failover node, bumps epoch, flushes config to disk |
| 3 | T+0s | status = WAITING_FOR_SYNC → spinning on primaryReplicationOffset > ReplicationOffset | | |
| 4 | T+Ns | | Detects broken replication session (IsReplicating = false) | |
| 5 | T+Ns | (still spinning on offset sync) | Checks: IsReplica? ✅ RemoteNodeId == primaryId? ✅ IsReplicating? ❌ → all conditions met | |
| 6 | T+Ns | | Calls PreventRoleChange() → BeginRecovery(ReadRole) → acquires read lock on recoverLock | |
| 7 | T+Ns | | Spawns Task.Run → issues CLUSTER REPLICATE back to primary → re-establishes replication session | |
| 8 | T+Ns | Offset sync never converges (active replication makes offsets a moving target) OR failover times out | Holds read lock across entire replication sync | |
| 9 | T+5m | PauseWritesAndWaitForSync returns false (timeout) → BeginAsyncReplicaFailover returns false | | |
| 10 | | TakeOverAsPrimary never called → replica never claims slots | | |
| 11 | | finally: status = NO_FAILOVER, lastFailoverStatus = FAILOVER_ABORTED | | |

Resulting Incoherent Topology

| Node | Role | Slots | Epoch |
| --- | --- | --- | --- |
| Primary | master (or confused) | none — gave them up at step 2 | 153 |
| Replica | replica of Primary | has slots in config (as belonging to Primary's worker entry) | 153 |

No node can serve reads or writes for the affected slots. The primary's config is persisted to disk, so this state survives restarts.

Downstream Impact

Without recovery, this incoherent topology can cause:

  • Slot unavailability: No primary node owns the affected slots — all reads/writes to keys in those slots fail

Changes

1. Suppress EnsureReplication during active failover (ReplicationManager.cs)

Added an early-exit guard in EnsureReplication that checks failoverManager.lastFailoverStatus. If the failover is in any active state (BEGIN_FAILOVER, ISSUING_PAUSE_WRITES, WAITING_FOR_SYNC, FAILOVER_IN_PROGRESS, TAKING_OVER_AS_PRIMARY), EnsureReplication returns immediately without acquiring any locks. Terminal states (FAILOVER_COMPLETED, FAILOVER_ABORTED) are intentionally excluded so EnsureReplication resumes normal operation after failover completes.
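The guard reduces to a state-set membership check before any lock is taken. The sketch below models it in Python; ensure_replication and the state names mirror the PR text but are illustrative stand-ins, not Garnet's actual types.

```python
# Active (non-terminal) failover states during which auto-resync is suppressed.
ACTIVE_FAILOVER_STATES = frozenset({
    "BEGIN_FAILOVER",
    "ISSUING_PAUSE_WRITES",
    "WAITING_FOR_SYNC",
    "FAILOVER_IN_PROGRESS",
    "TAKING_OVER_AS_PRIMARY",
})

def ensure_replication(last_failover_status, resync):
    # Early exit: never touch recoverLock or issue CLUSTER REPLICATE while
    # a failover is in flight.
    if last_failover_status in ACTIVE_FAILOVER_STATES:
        return "suppressed"
    # Terminal states (FAILOVER_COMPLETED, FAILOVER_ABORTED) fall through,
    # so normal auto-resync resumes once the failover has finished.
    return resync()
```

Checking the status before acquiring any lock is what avoids both failure modes at once: no read lock is held against TakeOverAsPrimary, and no competing CLUSTER REPLICATE is issued.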

2. Reset primary on failover failure (ReplicaFailoverSession.cs)

Moved the primary reset logic (failstopwrites([])) from the TakeOverAsPrimary failure path into the finally block of BeginAsyncReplicaFailover. This ensures the primary is reset for all failure paths after failstopwrites was sent — including timeout in PauseWritesAndWaitForSync, exceptions, and TakeOverAsPrimary failure. Previously, the reset was only attempted when TakeOverAsPrimary explicitly returned false, which was unreachable when the failover failed during the sync wait phase.
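The control-flow change can be sketched as follows. All names are illustrative stand-ins for ReplicaFailoverSession, and for brevity the sketch treats failstopwrites as sent at the start of the sync wait.

```python
def begin_async_replica_failover(pause_writes_and_wait_for_sync,
                                 take_over_as_primary,
                                 reset_primary):
    failstopwrites_sent = False
    success = False
    try:
        failstopwrites_sent = True          # failstopwrites issued to the primary
        if not pause_writes_and_wait_for_sync():
            return False                    # sync-wait timeout path
        success = take_over_as_primary()
        return success
    finally:
        # Runs on every exit: timeout, exception, or takeover failure.
        # Previously the reset lived only on the explicit takeover-failure
        # branch, which the timeout path never reached.
        if failstopwrites_sent and not success:
            reset_primary()                 # failstopwrites([]) restores the primary
```

The finally block is what makes the reset unconditional with respect to how the failover failed, which is the substance of this change.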

3. Exception injection for testing (ExceptionInjectionType.cs, ReplicaFailoverSession.cs)

Added Failover_Fail_TakeOverAsPrimary exception injection point (DEBUG only) to simulate TakeOverAsPrimary failure after PauseWritesAndWaitForSync succeeds, enabling deterministic testing of the recovery path.
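A minimal model of such an injection point, for readers unfamiliar with the pattern. The registry and function names here are illustrative, not Garnet's actual exception-injection helper.

```python
DEBUG = True          # stands in for a DEBUG-only compilation condition
_enabled = set()

def enable(injection_type):
    """Called by a test to arm a deterministic failure."""
    _enabled.add(injection_type)

def trigger(injection_type):
    # No-op in release builds; in DEBUG, raise if a test armed this point.
    if DEBUG and injection_type in _enabled:
        raise RuntimeError(f"injected failure: {injection_type}")

def take_over_as_primary():
    # Placed after the sync wait succeeds, so tests can exercise the
    # recovery path deterministically.
    trigger("Failover_Fail_TakeOverAsPrimary")
    return True
```

Armed from a test, the hook makes TakeOverAsPrimary fail on demand, which is otherwise a hard-to-reproduce timing condition.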

Tests

  • ClusterFailoverResetsPrimaryOnTakeOverFailure (new, DEBUG): Verifies that when TakeOverAsPrimary fails after failstopwrites was sent, the primary is reset back to its original state — owns all 16384 slots, has master role, and data remains accessible.
  • ClusterFailoverSucceedsDuringEnsureReplication (new): Verifies that a CLUSTER FAILOVER completes successfully when EnsureReplication is actively polling (1-second interval).
  • ClusterEnsureReplicationWorksAfterFailover (new): Verifies that EnsureReplication resumes normal operation after failover completes — the guard only suppresses during active failover states.
  • ClusterFailoverDuringRecovery (existing, unmodified): Uses TAKEOVER which skips PauseWritesAndWaitForSync, so the new reset path doesn't trigger. Unaffected by these changes.

Copilot AI review requested due to automatic review settings April 6, 2026 23:39
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses cluster failover correctness by preventing replication auto-resync logic from interfering with in-flight CLUSTER FAILOVER, and by adding a recovery path to reset the original primary if a DEFAULT failover aborts after failstopwrites has already modified the primary’s persisted config.

Changes:

  • Suppress ReplicationManager.EnsureReplication auto-resync while a failover is in an active (non-terminal) state.
  • Ensure the primary is reset on failover failure paths after failstopwrites was confirmed, and add a DEBUG-only exception injection point to test this.
  • Add cluster tests covering the EnsureReplication/failover interaction and the primary-reset recovery behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| libs/cluster/Server/Replication/ReplicationManager.cs | Adds a failover-state guard to avoid EnsureReplication resync/locks during active failover. |
| libs/cluster/Server/Failover/ReplicaFailoverSession.cs | Adds DEBUG exception injection, tracks failover success, and attempts to reset the primary in finally on failed failover. |
| libs/common/ExceptionInjectionType.cs | Adds a new DEBUG-only exception injection enum value for failover takeover failure simulation. |
| test/Garnet.test.cluster/ClusterNegativeTests.cs | Adds tests validating failover success during EnsureReplication polling and recovery/reset behavior on takeover failure. |


Kevin Bowersox and others added 2 commits April 6, 2026 17:03
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@vazois vazois merged commit 27e700b into microsoft:dev Apr 7, 2026
16 of 23 checks passed
