
[DEV] Improve Failover Restore / Replication Handling#1674

Merged
vazois merged 10 commits into microsoft:dev from kevinmichaelbowersox:users/kbowersox/prevent-replicate-on-failover-dev
Apr 7, 2026

Conversation

@kevinmichaelbowersox
Member

Prevent EnsureReplication from sabotaging in-flight failovers and recover from partial failover failures

Summary

This PR fixes two related bugs in the cluster failover path:

  1. EnsureReplication races with CLUSTER FAILOVER: The ReplicationManager.EnsureReplication background poller can issue CLUSTER REPLICATE during an in-flight failover, sabotaging the promotion and leaving the cluster in an incoherent state.

  2. No recovery when failover fails after failstopwrites: If TakeOverAsPrimary fails after PauseWritesAndWaitForSync has already sent failstopwrites to the primary, the primary is left in a modified state (no slots, replica role) with no node to serve those slots.

The Core Problem

During a CLUSTER FAILOVER DEFAULT, the replica sends failstopwrites to its primary, which immediately and irreversibly modifies the primary's config — making it a replica and reassigning all its slots. The replica then waits for replication offsets to converge before calling TakeOverAsPrimary to claim those slots.
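To make the ordering concrete, here is a minimal Python model of the DEFAULT failover sequence described above. The names (Node, receive_failstopwrites, begin_replica_failover) are illustrative stand-ins for the C# implementation, not Garnet's actual API, and for simplicity failstopwrites is sent at the start of the sync wait.

```python
import time

class Node:
    """Toy cluster node tracking only role, slot ownership, and offset."""
    def __init__(self, node_id, slots):
        self.node_id = node_id
        self.role = "master" if slots else "replica"
        self.slots = set(slots)
        self.replication_offset = 0
        self.replicating_from = None

    def receive_failstopwrites(self, new_primary_id):
        # Irreversible on receipt: drop all slots and demote immediately.
        self.slots.clear()
        self.role = "replica"
        self.replicating_from = new_primary_id

def begin_replica_failover(primary, replica, timeout_s=1.0):
    # Step 1: the primary's config is modified the moment this lands.
    primary.receive_failstopwrites(replica.node_id)
    # Step 2: wait for replication offsets to converge before taking over.
    deadline = time.monotonic() + timeout_s
    while replica.replication_offset < primary.replication_offset:
        if time.monotonic() >= deadline:
            return False  # timeout: primary already gave up its slots
        time.sleep(0.01)
    # Step 3: only now does the replica claim the slots (TakeOverAsPrimary).
    replica.role, replica.slots = "master", set(range(16384))
    return True
```

The key hazard is visible in the timeout path: the primary is modified in step 1, but nothing restores it if step 3 is never reached.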

EnsureReplication runs on a background polling timer inside the Garnet process. When it detects a broken replication session (which can happen during failover), it acquires a ReadRole lock and issues CLUSTER REPLICATE to re-establish replication — completely unaware that a failover is in progress.

This creates two failure modes:

  • Lock contention: EnsureReplication's ReadRole read lock blocks TakeOverAsPrimary's ClusterFailover write lock (the non-blocking TryWriteLock attempt fails immediately, aborting the failover)
  • State corruption: EnsureReplication's CLUSTER REPLICATE re-establishes replication to the very primary the replica is trying to take over from, making offset sync impossible
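The lock-contention failure mode can be modeled in a few lines. The SingleWriterMultiReaderLock below is a toy stand-in for the lock used in Garnet (write acquisition is elided; try_write_lock only reports whether it would succeed), so this shows the shape of the race, not the real implementation.

```python
import threading

class SingleWriterMultiReaderLock:
    """Toy single-writer/multi-reader lock: readers counted under a mutex."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._readers = 0

    def read_lock(self):
        with self._mutex:
            self._readers += 1

    def read_unlock(self):
        with self._mutex:
            self._readers -= 1

    def try_write_lock(self):
        # Non-blocking, like TryWriteLock: fails at once if any reader holds it.
        with self._mutex:
            return self._readers == 0

recover_lock = SingleWriterMultiReaderLock()
recover_lock.read_lock()                         # EnsureReplication: BeginRecovery(ReadRole)
assert recover_lock.try_write_lock() is False    # TakeOverAsPrimary: failover aborts
recover_lock.read_unlock()                       # only after the resync completes
assert recover_lock.try_write_lock() is True
```

Because the write-lock attempt is non-blocking by design, any reader held across the takeover window is enough to abort the failover.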

Race Condition Sequence

| Step | Time | Failover Task (replica) | EnsureReplication (replica) | Primary |
| --- | --- | --- | --- | --- |
| 1 | T+0s | CLUSTER FAILOVER received → status = BEGIN_FAILOVER | | |
| 2 | T+0s | PauseWritesAndWaitForSync → sends failstopwrites(localNodeId) to primary | | Receives failstopwrites; TryStopWrites: gives up all slots, becomes replica of the failover node, bumps epoch, flushes config to disk |
| 3 | T+0s | status = WAITING_FOR_SYNC → spinning on primaryReplicationOffset > ReplicationOffset | | |
| 4 | T+Ns | | Detects broken replication session (IsReplicating = false) | |
| 5 | T+Ns | (still spinning on offset sync) | Checks: IsReplica? ✅ RemoteNodeId == primaryId? ✅ IsReplicating? ❌ → all conditions met | |
| 6 | T+Ns | | Calls PreventRoleChange() → BeginRecovery(ReadRole) → acquires read lock on recoverLock | |
| 7 | T+Ns | | Spawns Task.Run → issues CLUSTER REPLICATE back to primary → re-establishes replication session | |
| 8 | T+Ns | Offset sync never converges (active replication makes offsets a moving target) OR failover times out | Holds read lock across entire replication sync | |
| 9 | T+5m | PauseWritesAndWaitForSync returns false (timeout) → BeginAsyncReplicaFailover returns false | | |
| 10 | | TakeOverAsPrimary never called → replica never claims slots | | |
| 11 | | finally: status = NO_FAILOVER, lastFailoverStatus = FAILOVER_ABORTED | | |

Resulting Incoherent Topology

| Node | Role | Slots | Epoch |
| --- | --- | --- | --- |
| Primary | master (or confused) | none — gave them up at step 2 | 153 |
| Replica | replica of Primary | has slots in config (as belonging to Primary's worker entry) | 153 |

No node can serve reads or writes for the affected slots. The primary's config is persisted to disk, so this state survives restarts.

Downstream Impact

Without recovery, this incoherent topology can cause:

  • Slot unavailability: No primary node owns the affected slots — all reads/writes to keys in those slots fail

Changes

1. Suppress EnsureReplication during active failover (ReplicationManager.cs)

Added an early-exit guard in EnsureReplication that checks failoverManager.lastFailoverStatus. If the failover is in any active state (BEGIN_FAILOVER, ISSUING_PAUSE_WRITES, WAITING_FOR_SYNC, FAILOVER_IN_PROGRESS, TAKING_OVER_AS_PRIMARY), EnsureReplication returns immediately without acquiring any locks. Terminal states (FAILOVER_COMPLETED, FAILOVER_ABORTED) are intentionally excluded so EnsureReplication resumes normal operation after failover completes.
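The guard reduces to a state-set membership check before any lock is taken. The sketch below models it in Python; ensure_replication and the state names mirror the PR text but are illustrative stand-ins, not Garnet's actual types.

```python
# Active (non-terminal) failover states during which auto-resync is suppressed.
ACTIVE_FAILOVER_STATES = frozenset({
    "BEGIN_FAILOVER",
    "ISSUING_PAUSE_WRITES",
    "WAITING_FOR_SYNC",
    "FAILOVER_IN_PROGRESS",
    "TAKING_OVER_AS_PRIMARY",
})

def ensure_replication(last_failover_status, resync):
    # Early exit: never touch recoverLock or issue CLUSTER REPLICATE while
    # a failover is in flight.
    if last_failover_status in ACTIVE_FAILOVER_STATES:
        return "suppressed"
    # Terminal states (FAILOVER_COMPLETED, FAILOVER_ABORTED) fall through,
    # so normal auto-resync resumes once the failover has finished.
    return resync()
```

Checking the status before acquiring any lock is what avoids both failure modes at once: no read lock is held against TakeOverAsPrimary, and no competing CLUSTER REPLICATE is issued.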

2. Reset primary on failover failure (ReplicaFailoverSession.cs)

Moved the primary reset logic (failstopwrites([])) from the TakeOverAsPrimary failure path into the finally block of BeginAsyncReplicaFailover. This ensures the primary is reset for all failure paths after failstopwrites was sent — including timeout in PauseWritesAndWaitForSync, exceptions, and TakeOverAsPrimary failure. Previously, the reset was only attempted when TakeOverAsPrimary explicitly returned false, which was unreachable when the failover failed during the sync wait phase.
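The control-flow change can be sketched as follows. All names are illustrative stand-ins for ReplicaFailoverSession, and for brevity the sketch treats failstopwrites as sent at the start of the sync wait.

```python
def begin_async_replica_failover(pause_writes_and_wait_for_sync,
                                 take_over_as_primary,
                                 reset_primary):
    failstopwrites_sent = False
    success = False
    try:
        failstopwrites_sent = True          # failstopwrites issued to the primary
        if not pause_writes_and_wait_for_sync():
            return False                    # sync-wait timeout path
        success = take_over_as_primary()
        return success
    finally:
        # Runs on every exit: timeout, exception, or takeover failure.
        # Previously the reset lived only on the explicit takeover-failure
        # branch, which the timeout path never reached.
        if failstopwrites_sent and not success:
            reset_primary()                 # failstopwrites([]) restores the primary
```

The finally block is what makes the reset unconditional with respect to how the failover failed, which is the substance of this change.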

3. Exception injection for testing (ExceptionInjectionType.cs, ReplicaFailoverSession.cs)

Added Failover_Fail_TakeOverAsPrimary exception injection point (DEBUG only) to simulate TakeOverAsPrimary failure after PauseWritesAndWaitForSync succeeds, enabling deterministic testing of the recovery path.
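A minimal model of such an injection point, for readers unfamiliar with the pattern. The registry and function names here are illustrative, not Garnet's actual exception-injection helper.

```python
DEBUG = True          # stands in for a DEBUG-only compilation condition
_enabled = set()

def enable(injection_type):
    """Called by a test to arm a deterministic failure."""
    _enabled.add(injection_type)

def trigger(injection_type):
    # No-op in release builds; in DEBUG, raise if a test armed this point.
    if DEBUG and injection_type in _enabled:
        raise RuntimeError(f"injected failure: {injection_type}")

def take_over_as_primary():
    # Placed after the sync wait succeeds, so tests can exercise the
    # recovery path deterministically.
    trigger("Failover_Fail_TakeOverAsPrimary")
    return True
```

Armed from a test, the hook makes TakeOverAsPrimary fail on demand, which is otherwise a hard-to-reproduce timing condition.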

Tests

  • ClusterFailoverResetsPrimaryOnTakeOverFailure (new, DEBUG): Verifies that when TakeOverAsPrimary fails after failstopwrites was sent, the primary is reset back to its original state — owns all 16384 slots, has master role, and data remains accessible.
  • ClusterFailoverSucceedsDuringEnsureReplication (new): Verifies that a CLUSTER FAILOVER completes successfully when EnsureReplication is actively polling (1-second interval).
  • ClusterEnsureReplicationWorksAfterFailover (new): Verifies that EnsureReplication resumes normal operation after failover completes — the guard only suppresses during active failover states.
  • ClusterFailoverDuringRecovery (existing, unmodified): Uses TAKEOVER which skips PauseWritesAndWaitForSync, so the new reset path doesn't trigger. Unaffected by these changes.

Copilot AI review requested due to automatic review settings April 6, 2026 23:39
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses cluster failover correctness by preventing replication auto-resync logic from interfering with in-flight CLUSTER FAILOVER, and by adding a recovery path to reset the original primary if a DEFAULT failover aborts after failstopwrites has already modified the primary’s persisted config.

Changes:

  • Suppress ReplicationManager.EnsureReplication auto-resync while a failover is in an active (non-terminal) state.
  • Ensure the primary is reset on failover failure paths after failstopwrites was confirmed, and add a DEBUG-only exception injection point to test this.
  • Add cluster tests covering the EnsureReplication/failover interaction and the primary-reset recovery behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| libs/cluster/Server/Replication/ReplicationManager.cs | Adds a failover-state guard to avoid EnsureReplication resync/locks during active failover. |
| libs/cluster/Server/Failover/ReplicaFailoverSession.cs | Adds DEBUG exception injection, tracks failover success, and attempts to reset the primary in finally on failed failover. |
| libs/common/ExceptionInjectionType.cs | Adds a new DEBUG-only exception injection enum value for failover takeover failure simulation. |
| test/Garnet.test.cluster/ClusterNegativeTests.cs | Adds tests validating failover success during EnsureReplication polling and recovery/reset behavior on takeover failure. |


Kevin Bowersox and others added 2 commits April 6, 2026 17:03
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@vazois vazois merged commit 27e700b into microsoft:dev Apr 7, 2026
16 of 23 checks passed
