
Fix abort/retry interaction #1655

Merged
meiji163 merged 4 commits into github:master from ggilder:fix-abort-retry-interaction
Apr 2, 2026

Conversation

Contributor

@ggilder ggilder commented Apr 2, 2026

Related issue: #1635

Description

When --panic-on-warnings is enabled and a SQL warning is detected during migration (e.g., duplicate key error during binlog event application), gh-ost enters a retry loop instead of cleanly aborting. This causes migrations to hang (potentially for a long time depending on exponential backoff settings) rather than failing fast with a clear error message.

Root Cause

The issue involves two separate problems in the retry logic:

  1. Missing context cancellation checks: The retry loops in retryOperation() and retryOperationWithExponentialBackoff() did not check for context cancellation at the start of each iteration. When abort was triggered (cancelling the context), retry loops would continue because the abort signal wasn't detected.

  2. Unnecessary retries for unrecoverable errors: Warning errors from duplicate key violations represent data consistency issues that won't resolve on retry. The underlying data hasn't changed, so retrying the same operation 60 times (default MaxRetries) just wastes time before triggering abort. With high retry counts (especially with exponential backoff), this could delay abort by minutes.

These retries can never succeed after an abort, because the migration goroutines will already have aborted.

Solution

This PR implements two complementary fixes:

  1. Add context cancellation checks to retry loops
  2. Skip retries for warning errors

When a warning is detected, the error is sent to PanicAbort immediately without retrying. This triggers a fast failure path since the underlying data issue won't be resolved by retrying the same operation.

Testing

Added unit tests covering:

  • TestMigratorRetryAbortsOnContextCancellation - Verifies retryOperation() detects context cancellation
  • TestMigratorRetryWithExponentialBackoffAbortsOnContextCancellation - Verifies exponential backoff version detects cancellation
  • TestMigratorRetrySkipsRetriesForWarnings - Verifies warnings skip retries in retryOperation()
  • TestMigratorRetryWithExponentialBackoffSkipsRetriesForWarnings - Verifies warnings skip retries in exponential backoff version
  • Integration test TestPanicOnWarningsDuplicateDuringCutoverWithHighRetries - Full end-to-end test with high retry count (30) and exponential backoff

All existing retry tests continue to pass, confirming normal retry behavior is unchanged.

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

ggilder added 4 commits April 2, 2026 10:26
Warning errors indicate data consistency issues that won't resolve on retry, so
attempting to retry them is futile and causes unnecessary delays. This change
detects warning errors early and aborts immediately instead of retrying.
Copilot AI review requested due to automatic review settings April 2, 2026 18:57
Contributor Author

ggilder commented Apr 2, 2026

@meiji163 Here's one last (hopefully!) follow-up on the changes to panic-on-warnings — discovered a problematic interaction with retries. Let me know if you have any questions/concerns, thanks!


ggilder commented Apr 2, 2026

btw, CI passes on my fork: ggilder#6


Copilot AI left a comment


Pull request overview

This PR fixes an abort/retry interaction in gh-ost where enabling --panic-on-warnings could lead to long retry loops (including exponential backoff) instead of failing fast once a warning/abort occurs.

Changes:

  • Add abort/context-cancellation checks in migrator retry loops.
  • Skip retries for “warnings detected …” errors by immediately triggering the panic/abort path.
  • Add unit + integration tests to cover cancellation and warning retry-skipping behavior, and update a localtest expected failure string.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File descriptions:

  • localtests/panic-on-warnings-update-pk-with-duplicate-on-new-unique-index/expect_failure: Updates expected error output to match the new fast-fail warning behavior.
  • go/logic/migrator.go: Adds context-cancellation checks in retry loops and short-circuits retries for warning errors.
  • go/logic/migrator_test.go: Adds unit tests for retry cancellation and warning handling, plus an end-to-end integration test for the high-retry/exponential-backoff scenario.


Comment on lines 161 to +165
			RetrySleepFn(1 * time.Second)
		}
		// Check for abort/context cancellation before each retry
		if abortErr := this.checkAbort(); abortErr != nil {
			return abortErr

Copilot AI Apr 2, 2026


checkAbort() is called only after the backoff sleep. If the migration context has already been cancelled/aborted, this loop will still sleep the full interval before returning, which can significantly delay shutdown (especially in exponential backoff scenarios). Consider checking checkAbort() before sleeping, and/or making the sleep itself context-aware (e.g., select on ctx.Done() vs a timer) so cancellation interrupts the wait.

Comment on lines +172 to +176
		if strings.Contains(err.Error(), "warnings detected") {
			if len(notFatalHint) == 0 {
				_ = base.SendWithContext(this.migrationContext.GetContext(), this.migrationContext.PanicAbort, err)
			}
			return err

Copilot AI Apr 2, 2026


The unrecoverable-warning detection relies on strings.Contains(err.Error(), "warnings detected"), which is brittle (string changes/wrapping can break it and unrelated errors could match). A more robust approach would be to return a dedicated error type or sentinel from the warning-checking code (e.g., in Applier.executeBatchWithWarningChecking) and detect it here via errors.Is/As (or a helper like isWarningsDetected(err)).

Comment on lines 202 to +206
			RetrySleepFn(time.Duration(interval) * time.Second)
		}
		// Check for abort/context cancellation before each retry
		if abortErr := this.checkAbort(); abortErr != nil {
			return abortErr

Copilot AI Apr 2, 2026


Same as retryOperation(): the abort check happens after the backoff sleep, so a cancelled context can still incur a long wait (up to ExponentialBackoffMaxInterval) before returning. Consider checking checkAbort() before sleeping and/or using a context-aware wait to allow immediate cancellation.

Comment on lines +213 to +217
		if strings.Contains(err.Error(), "warnings detected") {
			if len(notFatalHint) == 0 {
				_ = base.SendWithContext(this.migrationContext.GetContext(), this.migrationContext.PanicAbort, err)
			}
			return err

Copilot AI Apr 2, 2026


Same brittle warning detection as the non-exponential retry: matching on the error string is fragile. Prefer a typed/sentinel warning error (created where the warning is detected) and check via errors.Is/As (or a shared helper) to keep this behavior stable across message changes.

@meiji163 meiji163 merged commit 8bc63f0 into github:master Apr 2, 2026
12 of 13 checks passed

meiji163 commented Apr 2, 2026

Nice catch, thanks!
