PRODENG-3442: fall back to forced swarm dissolution when uninstall-ucp times out #628
Open
james-nesbitt wants to merge 1 commit into
The uninstall-ucp bootstrapper deploys ucp-uninstall-agent as a global Swarm service, then waits (~2 min hardcoded) for every node to report back. On large or mixed-OS clusters with cold image caches this deadline is missed, causing Reset() to fail even though the infrastructure will be torn down by terraform destroy anyway.

Observed in CI:
  smoke-modern (MKE 3.9.2, 7 nodes): all nodes missed the deadline
  smoke-windows (MKE 3.8.8, Win2025): Win2025 node missed the deadline

MKE itself documents the recovery path when this happens:
  1. Remove the stuck ucp-uninstall-agent service.
  2. Force every node to leave the swarm.

Implement that as an automatic fallback inside UninstallMKE.Run():
  - isUninstallTimeout() detects the specific 'Uninstalling UCP took too long' fatal line that Bootstrap surfaces from the MKE container.
  - dissolveSwarm() removes the stuck service (best-effort), then forces all non-leader nodes to leave in parallel, then forces the leader to leave last. Per-node failures are logged as warnings so that a single unresponsive host does not block the rest.

Other uninstall-ucp errors (connection failures, image pull errors, etc.) are still returned as hard failures unchanged.
Jira: https://mirantis.jira.com/browse/PRODENG-3442
Problem
The `uninstall-ucp` bootstrapper deploys `ucp-uninstall-agent` as a global Swarm service and waits ~2 minutes (hardcoded) for every node to acknowledge completion. On large clusters or hosts with cold image caches (fresh CI runners) the deadline is missed, causing `Reset()` to fail even though the cluster and infrastructure are otherwise healthy.
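Observed in CI:
- `smoke-modern` (MKE 3.9.2, 7 nodes): all nodes missed the deadline
- `smoke-windows` (MKE 3.8.8, Win2025): the Win2025 node missed the deadline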
The timeout is internal to the MKE container; there is no `--timeout` flag on `uninstall-ucp`.
Fix
MKE documents the manual recovery path for this exact error:
1. Remove the stuck `ucp-uninstall-agent` service.
2. Force every node to leave the swarm.
`UninstallMKE.Run()` now detects the specific `"Uninstalling UCP took too long"` error and automatically executes that recovery:
1. Removes the `ucp-uninstall-agent` / `ucp-uninstall-agent-win` services from the leader (best-effort).
2. Forces all non-leader nodes to leave the swarm in parallel.
3. Forces the leader to leave last.
All other `uninstall-ucp` errors continue to propagate as hard failures unchanged. The `UninstallMCR` phase that follows handles MCR cleanup on each host regardless of how the swarm was dissolved.
Changes
- `pkg/product/mke/phase/uninstall_mke.go`: `isUninstallTimeout()` detector + `dissolveSwarm()` fallback
- `pkg/product/mke/phase/uninstall_mke_test.go`: unit tests for `isUninstallTimeout`
Testing
Smoke test validation pending on the `smoke-test-refactor` branch (PR #627), where `Reset()` is currently marked non-fatal as a workaround. Once this fix is merged, that workaround can be reverted.