Conversation
Warning errors indicate data consistency issues that won't resolve on retry, so attempting to retry them is futile and causes unnecessary delays. This change detects warning errors early and aborts immediately instead of retrying.
@meiji163 Here's one last (hopefully!) follow-up on the changes to

btw, CI passes on my fork: ggilder#6
Pull request overview
This PR fixes an abort/retry interaction in gh-ost where enabling --panic-on-warnings could lead to long retry loops (including exponential backoff) instead of failing fast once a warning/abort occurs.
Changes:
- Add abort/context-cancellation checks in migrator retry loops.
- Skip retries for “warnings detected …” errors by immediately triggering the panic/abort path.
- Add unit + integration tests to cover cancellation and warning retry-skipping behavior, and update a localtest expected failure string.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| localtests/panic-on-warnings-update-pk-with-duplicate-on-new-unique-index/expect_failure | Updates expected error output to match the new fast-fail warning behavior. |
| go/logic/migrator.go | Adds context-cancellation checks in retry loops and short-circuits retries for warning errors. |
| go/logic/migrator_test.go | Adds unit tests for retry cancellation and warning handling, plus an end-to-end integration test for the high-retry/exponential-backoff scenario. |
```go
			RetrySleepFn(1 * time.Second)
		}
		// Check for abort/context cancellation before each retry
		if abortErr := this.checkAbort(); abortErr != nil {
			return abortErr
```
checkAbort() is called only after the backoff sleep. If the migration context has already been cancelled/aborted, this loop will still sleep the full interval before returning, which can significantly delay shutdown (especially in exponential backoff scenarios). Consider checking checkAbort() before sleeping, and/or making the sleep itself context-aware (e.g., select on ctx.Done() vs a timer) so cancellation interrupts the wait.
```go
		if strings.Contains(err.Error(), "warnings detected") {
			if len(notFatalHint) == 0 {
				_ = base.SendWithContext(this.migrationContext.GetContext(), this.migrationContext.PanicAbort, err)
			}
			return err
```
The unrecoverable-warning detection relies on strings.Contains(err.Error(), "warnings detected"), which is brittle (string changes/wrapping can break it and unrelated errors could match). A more robust approach would be to return a dedicated error type or sentinel from the warning-checking code (e.g., in Applier.executeBatchWithWarningChecking) and detect it here via errors.Is/As (or a helper like isWarningsDetected(err)).
```go
			RetrySleepFn(time.Duration(interval) * time.Second)
		}
		// Check for abort/context cancellation before each retry
		if abortErr := this.checkAbort(); abortErr != nil {
			return abortErr
```
Same as retryOperation(): the abort check happens after the backoff sleep, so a cancelled context can still incur a long wait (up to ExponentialBackoffMaxInterval) before returning. Consider checking checkAbort() before sleeping and/or using a context-aware wait to allow immediate cancellation.
```go
		if strings.Contains(err.Error(), "warnings detected") {
			if len(notFatalHint) == 0 {
				_ = base.SendWithContext(this.migrationContext.GetContext(), this.migrationContext.PanicAbort, err)
			}
			return err
```
Same brittle warning detection as the non-exponential retry: matching on the error string is fragile. Prefer a typed/sentinel warning error (created where the warning is detected) and check via errors.Is/As (or a shared helper) to keep this behavior stable across message changes.
Nice catch, thanks!
Related issue: #1635
Description

When `--panic-on-warnings` is enabled and a SQL warning is detected during migration (e.g., a duplicate key error during binlog event application), gh-ost enters a retry loop instead of cleanly aborting. This causes migrations to hang (potentially for a long time depending on exponential backoff settings) rather than failing fast with a clear error message.

Root Cause
The issue involves two separate problems in the retry logic:
1. Missing context cancellation checks: The retry loops in `retryOperation()` and `retryOperationWithExponentialBackoff()` did not check for context cancellation at the start of each iteration. When abort was triggered (cancelling the context), retry loops would continue because the abort signal wasn't detected. These retries can never succeed after an abort, because the migration goroutines will already have aborted.
2. Unnecessary retries for unrecoverable errors: Warning errors from duplicate key violations represent data consistency issues that won't resolve on retry. The underlying data hasn't changed, so retrying the same operation 60 times (the default `MaxRetries`) just wastes time before triggering abort. With high retry counts (especially with exponential backoff), this could delay abort by minutes.
Solution
This PR implements two complementary fixes:

1. Abort/context-cancellation checks: The retry loops now check for abort/context cancellation before each retry, so a cancelled migration stops retrying promptly instead of continuing through the remaining attempts.
2. Skip retries for warning errors: When a warning is detected, the error is sent to `PanicAbort` immediately without retrying. This triggers a fast failure path since the underlying data issue won't be resolved by retrying the same operation.

Testing
Added unit tests covering:

- `TestMigratorRetryAbortsOnContextCancellation` - Verifies `retryOperation()` detects context cancellation
- `TestMigratorRetryWithExponentialBackoffAbortsOnContextCancellation` - Verifies the exponential backoff version detects cancellation
- `TestMigratorRetrySkipsRetriesForWarnings` - Verifies warnings skip retries in `retryOperation()`
- `TestMigratorRetryWithExponentialBackoffSkipsRetriesForWarnings` - Verifies warnings skip retries in the exponential backoff version
- `TestPanicOnWarningsDuplicateDuringCutoverWithHighRetries` - Full end-to-end test with high retry count (30) and exponential backoff

All existing retry tests continue to pass, confirming normal retry behavior is unchanged.
`script/cibuild` returns with no formatting errors, build errors, or unit test errors.