Fix StatefulSet recreate failure handling #1997

Open: realyota wants to merge 1 commit into Altinity:0.27.1 from realyota:fix-sts-recreate-stuck-delete

realyota commented May 28, 2026

Fix for #1995

Test was written by Codex.

Important items to consider before making a Pull Request

Please check the items this PR complies with:

  • [X] All commits in the PR are squashed. More info
  • [X] The PR is made into the dedicated next-release branch, not into the master branch [1]. More info
  • [X] The PR is signed. More info

--

[1] If you feel your PR does not affect any Go code or any testable functionality (for example, the PR contains only docs or supplementary materials), the PR can be made into the master branch, but this has to be confirmed by a project maintainer.

Bug: when the restart fallback scaled a host StatefulSet down, a timed-out StatefulSet delete was ignored, and the create failure that followed was converted into ErrCRUDRecreate, which the create path also ignored. The reconcile could therefore report success while the old StatefulSet was still deleting and the replacement pod had not been created.

Fix: abort recreate when StatefulSet deletion fails, treat create API failures as aborts, and defensively abort if ErrCRUDRecreate reaches the create completion path.

Tests: add focused statefulset reconciler unit coverage for create errors, unexpected recreate actions in the create path, and delete failures during recreate.

realyota commented May 28, 2026

Interesting: the same problem is fixed in #1993, which treats the cause. Neither PR fixes all of the problems. #1997 is more defensive (it catches a wider family of failures, not just a 409 on scale-down), but it does not cover the early return in doDeleteStatefulSet when the scale-to-0 (Update) fails, in which case Delete never happens. Probably the best way forward is to merge #1993 and then rebase #1997 on top of it.
