Skip to content

Fix AOF persistence and WATCH for collection-emptying RMW ops [main]#1678

Open
badrishc wants to merge 1 commit intomainfrom
badrishc/fix-aof-object-mutation-persistence-main
Open

Fix AOF persistence and WATCH for collection-emptying RMW ops [main]#1678
badrishc wants to merge 1 commit intomainfrom
badrishc/fix-aof-object-mutation-persistence-main

Conversation

@badrishc
Copy link
Copy Markdown
Collaborator

@badrishc badrishc commented Apr 7, 2026

Bug 1: Collection-emptying RMW operations (LPOP, ZREM, HDEL, SREM that empty a collection) were not logged to AOF. InPlaceUpdaterWorker and PostCopyUpdater returned false before setting NeedAofLog when HasRemoveKey was true, so PostRMWOperation never wrote the AOF entry. Data reappeared after restart.

Bug 2: InPlaceUpdaterWorker's HasRemoveKey path did not call IncrementVersion on the watch version map, so WATCH/MULTI/EXEC transactions did not detect the key modification and incorrectly committed.

Bug 3: AofProcessor.RecoverReplay disposed respServerSession in its finally block, but in multi-DB recovery it ran once per database, causing double-dispose and NullReferenceException in GarnetLatencyMetricsSession.Return(). Moved dispose to AofProcessor.Dispose() which runs exactly once.

Fixes #1675

Bug 1: Collection-emptying RMW operations (LPOP, ZREM, HDEL, SREM that
empty a collection) were not logged to AOF. InPlaceUpdaterWorker and
PostCopyUpdater returned false before setting NeedAofLog when
HasRemoveKey was true, so PostRMWOperation never wrote the AOF entry.
Data reappeared after restart.

Bug 2: InPlaceUpdaterWorker's HasRemoveKey path did not call
IncrementVersion on the watch version map, so WATCH/MULTI/EXEC
transactions did not detect the key modification and incorrectly
committed.

Bug 3: AofProcessor.RecoverReplay disposed respServerSession in its
finally block, but in multi-DB recovery it ran once per database,
causing double-dispose and NullReferenceException in
GarnetLatencyMetricsSession.Return(). Moved dispose to
AofProcessor.Dispose() which runs exactly once.

Fixes #1675

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 7, 2026 19:19
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes AOF persistence and WATCH version tracking for collection-emptying RMW operations, and corrects disposal behavior during multi-DB AOF recovery to avoid double-dispose failures.

Changes:

  • Ensure collection-emptying RMW ops set NeedAofLog so AOF records are written and replay correctly.
  • Increment WATCH version on key-removal paths so WATCH/MULTI/EXEC properly aborts when keys are emptied/deleted.
  • Move replay-session disposal into AofProcessor.Dispose() to prevent double-dispose during multi-DB recovery; add regression tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
test/Garnet.test/TransactionTests.cs Adds WATCH regression test for list deletion via LPOP.
test/Garnet.test/RespAofTests.cs Adds AOF recovery tests for deletion/emptying behavior across list, zset, hash, set.
test/Garnet.test/MultiDatabaseTests.cs Adds multi-DB AOF recovery regression test for emptying object collections.
libs/server/Storage/Functions/ObjectStore/RMWMethods.cs Sets AOF logging and WATCH version increment on key-removal/emptying paths.
libs/server/AOF/AofProcessor.cs Adjusts disposal to avoid double-dispose in multi-DB recovery.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


// EXEC should fail because the watched key was modified by LPOP
response = lightClientRequest.SendCommand("EXEC");
expectedResponse = "*-1";
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test only asserts the prefix *-1, which can mask protocol/format issues (e.g., missing CRLF or accidental longer matches). Prefer asserting the full RESP response for a null multi-bulk (*-1\\r\\n) to make the regression check stricter and unambiguous.

Suggested change
expectedResponse = "*-1";
expectedResponse = "*-1\r\n";

Copilot uses AI. Check for mistakes.
Comment on lines +810 to +813
server.Store.CommitAOF(true);
server.Dispose(false);
server = TestUtils.CreateGarnetServer(TestUtils.MethodTestDir, tryRecover: true, enableAOF: true);
server.Start();
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same AOF commit + server restart + recover sequence is duplicated across multiple new tests in this file. Consider extracting a small helper (e.g., RestartServerWithAofRecovery()) to reduce repetition and keep future changes to restart/recovery logic consistent.

Copilot uses AI. Check for mistakes.
Comment on lines +148 to +149
if (!rmwInfo.RecordInfo.Modified)
functionsState.watchVersionMap.IncrementVersion(rmwInfo.KeyHash);
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WATCH version increment is currently conditional on !rmwInfo.RecordInfo.Modified. If this removal/emptying path can be reached with RecordInfo.Modified == true, WATCH may still fail to observe the key deletion (since the mutation is a modification regardless). If the intent is to avoid double-increment, consider a mechanism that guarantees exactly-once increment per operation (or add an in-code comment explaining why the conditional is safe here).

Suggested change
if (!rmwInfo.RecordInfo.Modified)
functionsState.watchVersionMap.IncrementVersion(rmwInfo.KeyHash);
functionsState.watchVersionMap.IncrementVersion(rmwInfo.KeyHash);

Copilot uses AI. Check for mistakes.
Comment on lines 88 to +92
activeVectorManager?.WaitForVectorOperationsToComplete();
activeVectorManager?.ShutdownReplayTasks();

var databaseSessionsSnapshot = respServerSession.GetDatabaseSessionsSnapshot();
foreach (var dbSession in databaseSessionsSnapshot)
{
dbSession.StorageSession.basicContext.Session?.Dispose();
dbSession.StorageSession.objectStoreBasicContext.Session?.Dispose();
}
aofReplayCoordinator.Dispose();
respServerSession.Dispose();
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR is addressing double-dispose issues, it would be safer if AofProcessor.Dispose() were explicitly idempotent (e.g., guard with a disposed flag / nulling references after disposal). That helps prevent regressions if Dispose() is invoked multiple times by shutdown paths.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AOF inconsistency: DELETE persists correctly but LPOP/ZREM/HDEL are not durable (data reappears after restart)

2 participants