Conversation
Warning errors indicate data consistency issues that won't resolve on retry, so attempting to retry them is futile and causes unnecessary delays. This change detects warning errors early and aborts immediately instead of retrying.
@meiji163 Here's one last (hopefully!) follow-up on the changes to

btw, CI passes on my fork: ggilder#6
Pull request overview
This PR fixes an abort/retry interaction in gh-ost where enabling --panic-on-warnings could lead to long retry loops (including exponential backoff) instead of failing fast once a warning/abort occurs.
Changes:
- Add abort/context-cancellation checks in migrator retry loops.
- Skip retries for “warnings detected …” errors by immediately triggering the panic/abort path.
- Add unit + integration tests to cover cancellation and warning retry-skipping behavior, and update a localtest expected failure string.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| localtests/panic-on-warnings-update-pk-with-duplicate-on-new-unique-index/expect_failure | Updates expected error output to match the new fast-fail warning behavior. |
| go/logic/migrator.go | Adds context-cancellation checks in retry loops and short-circuits retries for warning errors. |
| go/logic/migrator_test.go | Adds unit tests for retry cancellation and warning handling, plus an end-to-end integration test for the high-retry/exponential-backoff scenario. |
```go
			RetrySleepFn(1 * time.Second)
		}
		// Check for abort/context cancellation before each retry
		if abortErr := this.checkAbort(); abortErr != nil {
			return abortErr
```
checkAbort() is called only after the backoff sleep. If the migration context has already been cancelled/aborted, this loop will still sleep the full interval before returning, which can significantly delay shutdown (especially in exponential backoff scenarios). Consider checking checkAbort() before sleeping, and/or making the sleep itself context-aware (e.g., select on ctx.Done() vs a timer) so cancellation interrupts the wait.
```go
		if strings.Contains(err.Error(), "warnings detected") {
			if len(notFatalHint) == 0 {
				_ = base.SendWithContext(this.migrationContext.GetContext(), this.migrationContext.PanicAbort, err)
			}
			return err
```
The unrecoverable-warning detection relies on strings.Contains(err.Error(), "warnings detected"), which is brittle (string changes/wrapping can break it and unrelated errors could match). A more robust approach would be to return a dedicated error type or sentinel from the warning-checking code (e.g., in Applier.executeBatchWithWarningChecking) and detect it here via errors.Is/As (or a helper like isWarningsDetected(err)).
```go
			RetrySleepFn(time.Duration(interval) * time.Second)
		}
		// Check for abort/context cancellation before each retry
		if abortErr := this.checkAbort(); abortErr != nil {
			return abortErr
```
Same as retryOperation(): the abort check happens after the backoff sleep, so a cancelled context can still incur a long wait (up to ExponentialBackoffMaxInterval) before returning. Consider checking checkAbort() before sleeping and/or using a context-aware wait to allow immediate cancellation.
```go
		if strings.Contains(err.Error(), "warnings detected") {
			if len(notFatalHint) == 0 {
				_ = base.SendWithContext(this.migrationContext.GetContext(), this.migrationContext.PanicAbort, err)
			}
			return err
```
Same brittle warning detection as the non-exponential retry: matching on the error string is fragile. Prefer a typed/sentinel warning error (created where the warning is detected) and check via errors.Is/As (or a shared helper) to keep this behavior stable across message changes.
Nice catch, thanks!
Related issue: #1635
Description

When `--panic-on-warnings` is enabled and a SQL warning is detected during migration (e.g., a duplicate key error during binlog event application), gh-ost enters a retry loop instead of cleanly aborting. This causes migrations to hang (potentially for a long time depending on exponential backoff settings) rather than failing fast with a clear error message.

Root Cause
The issue involves two separate problems in the retry logic:
1. Missing context cancellation checks: The retry loops in `retryOperation()` and `retryOperationWithExponentialBackoff()` did not check for context cancellation at the start of each iteration. When abort was triggered (cancelling the context), retry loops would continue because the abort signal wasn't detected. These retries can never succeed after an abort, because the migration goroutines will already have aborted.
2. Unnecessary retries for unrecoverable errors: Warning errors from duplicate key violations represent data consistency issues that won't resolve on retry. The underlying data hasn't changed, so retrying the same operation 60 times (the default `MaxRetries`) just wastes time before triggering abort. With high retry counts (especially with exponential backoff), this could delay abort by minutes.
Solution
This PR implements two complementary fixes:

1. Abort/context-cancellation checks: The retry loops now check for abort/context cancellation before each retry, so a cancelled migration stops retrying promptly instead of continuing through the remaining attempts.
2. Skip retries for warning errors: When a warning is detected, the error is sent to `PanicAbort` immediately without retrying. This triggers a fast failure path since the underlying data issue won't be resolved by retrying the same operation.

Testing
Added unit tests covering:

- `TestMigratorRetryAbortsOnContextCancellation` - Verifies `retryOperation()` detects context cancellation
- `TestMigratorRetryWithExponentialBackoffAbortsOnContextCancellation` - Verifies the exponential backoff version detects cancellation
- `TestMigratorRetrySkipsRetriesForWarnings` - Verifies warnings skip retries in `retryOperation()`
- `TestMigratorRetryWithExponentialBackoffSkipsRetriesForWarnings` - Verifies warnings skip retries in the exponential backoff version
- `TestPanicOnWarningsDuplicateDuringCutoverWithHighRetries` - Full end-to-end test with high retry count (30) and exponential backoff

All existing retry tests continue to pass, confirming normal retry behavior is unchanged.
`script/cibuild` returns with no formatting errors, build errors, or unit test errors.