Fix AOF persistence and WATCH for collection-emptying RMW ops [main]#1678
Fix AOF persistence and WATCH for collection-emptying RMW ops [main]#1678
Conversation
Bug 1: Collection-emptying RMW operations (LPOP, ZREM, HDEL, SREM that empty a collection) were not logged to AOF. InPlaceUpdaterWorker and PostCopyUpdater returned false before setting NeedAofLog when HasRemoveKey was true, so PostRMWOperation never wrote the AOF entry. Data reappeared after restart. Bug 2: InPlaceUpdaterWorker's HasRemoveKey path did not call IncrementVersion on the watch version map, so WATCH/MULTI/EXEC transactions did not detect the key modification and incorrectly committed. Bug 3: AofProcessor.RecoverReplay disposed respServerSession in its finally block, but in multi-DB recovery it ran once per database, causing double-dispose and NullReferenceException in GarnetLatencyMetricsSession.Return(). Moved dispose to AofProcessor.Dispose() which runs exactly once. Fixes #1675 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes AOF persistence and WATCH version tracking for collection-emptying RMW operations, and corrects disposal behavior during multi-DB AOF recovery to avoid double-dispose failures.
Changes:
- Ensure collection-emptying RMW ops set
NeedAofLogso AOF records are written and replay correctly. - Increment WATCH version on key-removal paths so
WATCH/MULTI/EXECproperly aborts when keys are emptied/deleted. - Move replay-session disposal into
AofProcessor.Dispose()to prevent double-dispose during multi-DB recovery; add regression tests.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| test/Garnet.test/TransactionTests.cs | Adds WATCH regression test for list deletion via LPOP. |
| test/Garnet.test/RespAofTests.cs | Adds AOF recovery tests for deletion/emptying behavior across list, zset, hash, set. |
| test/Garnet.test/MultiDatabaseTests.cs | Adds multi-DB AOF recovery regression test for emptying object collections. |
| libs/server/Storage/Functions/ObjectStore/RMWMethods.cs | Sets AOF logging and WATCH version increment on key-removal/emptying paths. |
| libs/server/AOF/AofProcessor.cs | Adjusts disposal to avoid double-dispose in multi-DB recovery. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| // EXEC should fail because the watched key was modified by LPOP | ||
| response = lightClientRequest.SendCommand("EXEC"); | ||
| expectedResponse = "*-1"; |
There was a problem hiding this comment.
The test only asserts the prefix *-1, which can mask protocol/format issues (e.g., missing CRLF or accidental longer matches). Prefer asserting the full RESP response for a null multi-bulk (*-1\\r\\n) to make the regression check stricter and unambiguous.
| expectedResponse = "*-1"; | |
| expectedResponse = "*-1\r\n"; |
| server.Store.CommitAOF(true); | ||
| server.Dispose(false); | ||
| server = TestUtils.CreateGarnetServer(TestUtils.MethodTestDir, tryRecover: true, enableAOF: true); | ||
| server.Start(); |
There was a problem hiding this comment.
The same AOF commit + server restart + recover sequence is duplicated across multiple new tests in this file. Consider extracting a small helper (e.g., RestartServerWithAofRecovery()) to reduce repetition and keep future changes to restart/recovery logic consistent.
| if (!rmwInfo.RecordInfo.Modified) | ||
| functionsState.watchVersionMap.IncrementVersion(rmwInfo.KeyHash); |
There was a problem hiding this comment.
The WATCH version increment is currently conditional on !rmwInfo.RecordInfo.Modified. If this removal/emptying path can be reached with RecordInfo.Modified == true, WATCH may still fail to observe the key deletion (since the mutation is a modification regardless). If the intent is to avoid double-increment, consider a mechanism that guarantees exactly-once increment per operation (or add an in-code comment explaining why the conditional is safe here).
| if (!rmwInfo.RecordInfo.Modified) | |
| functionsState.watchVersionMap.IncrementVersion(rmwInfo.KeyHash); | |
| functionsState.watchVersionMap.IncrementVersion(rmwInfo.KeyHash); |
| activeVectorManager?.WaitForVectorOperationsToComplete(); | ||
| activeVectorManager?.ShutdownReplayTasks(); | ||
|
|
||
| var databaseSessionsSnapshot = respServerSession.GetDatabaseSessionsSnapshot(); | ||
| foreach (var dbSession in databaseSessionsSnapshot) | ||
| { | ||
| dbSession.StorageSession.basicContext.Session?.Dispose(); | ||
| dbSession.StorageSession.objectStoreBasicContext.Session?.Dispose(); | ||
| } | ||
| aofReplayCoordinator.Dispose(); | ||
| respServerSession.Dispose(); |
There was a problem hiding this comment.
Since this PR is addressing double-dispose issues, it would be safer if AofProcessor.Dispose() were explicitly idempotent (e.g., guard with a disposed flag / nulling references after disposal). That helps prevent regressions if Dispose() is invoked multiple times by shutdown paths.
Bug 1: Collection-emptying RMW operations (LPOP, ZREM, HDEL, SREM that empty a collection) were not logged to AOF. InPlaceUpdaterWorker and PostCopyUpdater returned false before setting NeedAofLog when HasRemoveKey was true, so PostRMWOperation never wrote the AOF entry. Data reappeared after restart.
Bug 2: InPlaceUpdaterWorker's HasRemoveKey path did not call IncrementVersion on the watch version map, so WATCH/MULTI/EXEC transactions did not detect the key modification and incorrectly committed.
Bug 3: AofProcessor.RecoverReplay disposed respServerSession in its finally block, but in multi-DB recovery it ran once per database, causing double-dispose and NullReferenceException in GarnetLatencyMetricsSession.Return(). Moved dispose to AofProcessor.Dispose() which runs exactly once.
Fixes #1675