PRODENG-3442: fall back to forced swarm dissolution when uninstall-ucp times out by james-nesbitt · Pull Request #628 · Mirantis/launchpad

james-nesbitt · 2026-05-11T17:31:57Z

Jira: https://mirantis.jira.com/browse/PRODENG-3442

Problem

The `uninstall-ucp` bootstrapper deploys `ucp-uninstall-agent` as a global Swarm service and waits ~2 minutes (hardcoded) for every node to acknowledge completion. On large clusters or hosts with cold image caches (fresh CI runners) the deadline is missed, causing `Reset()` to fail even though the cluster and infrastructure are otherwise healthy.

Observed in CI:

- smoke-modern (MKE 3.9.2, 7 nodes): all nodes missed the deadline
- smoke-windows (MKE 3.8.8, Win2025 worker): the Win2025 worker missed the deadline
- smoke-legacy (MKE 3.8.8, 6 Linux nodes): passes cleanly, which confirms the size/platform dependency

The timeout is internal to the MKE container; there is no --timeout flag on uninstall-ucp.

Fix

MKE documents the manual recovery path for this exact error:

"Remove the ucp-uninstall-agent and ucp-uninstall-agent-win services from a swarm manager, then force each node to leave the swarm."

UninstallMKE.Run() now detects the specific "Uninstalling UCP took too long" error and automatically executes that recovery:

1. Remove the stuck ucp-uninstall-agent / ucp-uninstall-agent-win services from the leader (best-effort).
2. Force all non-leader nodes to leave the swarm in parallel (per-node failures logged as warnings, not fatal).
3. Force the leader to leave last (hard failure if this fails).

All other uninstall-ucp errors continue to propagate as hard failures unchanged. The UninstallMCR phase that follows handles MCR cleanup on each host regardless of how the swarm was dissolved.
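The recovery ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `swarmHost` interface, `dissolveSwarmSketch`, and `fakeHost` are hypothetical names, and the exact commands the real `dissolveSwarm()` issues are not shown in this description, so the standard `docker service rm` and `docker swarm leave --force` commands are assumed.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync"
)

// swarmHost is a hypothetical abstraction over a cluster host; it is not
// launchpad's actual host type.
type swarmHost interface {
	Name() string
	// Exec runs a command on the host and returns an error on failure.
	Exec(cmd string) error
}

// dissolveSwarmSketch follows the three recovery steps: remove the stuck
// services, force non-leaders out in parallel, then force the leader out.
func dissolveSwarmSketch(leader swarmHost, others []swarmHost) error {
	// Step 1: best-effort removal of the stuck agent services (assumed command).
	for _, svc := range []string{"ucp-uninstall-agent", "ucp-uninstall-agent-win"} {
		if err := leader.Exec("docker service rm " + svc); err != nil {
			log.Printf("warning: could not remove service %s: %v", svc, err)
		}
	}

	// Step 2: non-leaders leave in parallel; a failure is only a warning.
	var wg sync.WaitGroup
	for _, h := range others {
		wg.Add(1)
		go func(h swarmHost) {
			defer wg.Done()
			if err := h.Exec("docker swarm leave --force"); err != nil {
				log.Printf("warning: %s failed to leave swarm: %v", h.Name(), err)
			}
		}(h)
	}
	wg.Wait()

	// Step 3: the leader leaves last; a failure here is returned to the caller.
	if err := leader.Exec("docker swarm leave --force"); err != nil {
		return fmt.Errorf("leader %s failed to leave swarm: %w", leader.Name(), err)
	}
	return nil
}

// fakeHost is a test double that records commands and can simulate a
// failing "swarm leave".
type fakeHost struct {
	name      string
	failLeave bool
	cmds      []string
}

func (f *fakeHost) Name() string { return f.name }

func (f *fakeHost) Exec(cmd string) error {
	f.cmds = append(f.cmds, cmd)
	if f.failLeave && cmd == "docker swarm leave --force" {
		return errors.New("simulated failure")
	}
	return nil
}

func main() {
	leader := &fakeHost{name: "manager-1"}
	workers := []swarmHost{
		&fakeHost{name: "worker-1"},
		&fakeHost{name: "worker-2", failLeave: true}, // logged, not fatal
	}
	if err := dissolveSwarmSketch(leader, workers); err != nil {
		log.Fatalf("swarm dissolution failed: %v", err)
	}
	fmt.Println("swarm dissolved; leader commands:", leader.cmds)
}
```

Waiting on the `sync.WaitGroup` before touching the leader is what keeps the leader last: if the manager left first, the remaining nodes would be attached to a swarm with no manager while they are still being asked to leave.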

Changes

pkg/product/mke/phase/uninstall_mke.go — isUninstallTimeout() detector + dissolveSwarm() fallback
pkg/product/mke/phase/uninstall_mke_test.go — unit tests for isUninstallTimeout

Testing

go test ./pkg/product/mke/phase/...

Smoke test validation pending on the smoke-test-refactor branch (PR #627), where Reset() is currently marked non-fatal as a workaround. Once this fix is merged, that workaround can be reverted.

The uninstall-ucp bootstrapper deploys ucp-uninstall-agent as a global Swarm service, then waits (~2 min hardcoded) for every node to report back. On large or mixed-OS clusters with cold image caches this deadline is missed, causing Reset() to fail even though the infrastructure will be torn down by terraform destroy anyway.

Observed in CI:

- smoke-modern (MKE 3.9.2, 7 nodes): all nodes missed the deadline
- smoke-windows (MKE 3.8.8, Win2025): Win2025 node missed the deadline

MKE itself documents the recovery path when this happens:

1. Remove the stuck ucp-uninstall-agent service.
2. Force every node to leave the swarm.

Implement that as an automatic fallback inside UninstallMKE.Run():

- isUninstallTimeout() detects the specific 'Uninstalling UCP took too long' fatal line that Bootstrap surfaces from the MKE container.
- dissolveSwarm() removes the stuck service (best-effort), then forces all non-leader nodes to leave in parallel, then forces the leader to leave last. Per-node failures are logged as warnings so that a single unresponsive host does not block the rest.

Other uninstall-ucp errors (connection failures, image pull errors, etc.) are still returned as hard failures unchanged.

james-nesbitt added the smoke-test label ("Run all smoke tests") on May 11, 2026

james-nesbitt wants to merge 1 commit into main from PRODENG-3442-reset-swarm-dissolution-fallback
