PRODENG-3442: fall back to forced swarm dissolution when uninstall-ucp times out #628

Open
james-nesbitt wants to merge 1 commit into main from PRODENG-3442-reset-swarm-dissolution-fallback

Conversation

@james-nesbitt
Collaborator

Jira: https://mirantis.jira.com/browse/PRODENG-3442

Problem

The `uninstall-ucp` bootstrapper deploys `ucp-uninstall-agent` as a global Swarm service and waits ~2 minutes (hardcoded) for every node to acknowledge completion. On large clusters or hosts with cold image caches (fresh CI runners) the deadline is missed, causing `Reset()` to fail even though the cluster and infrastructure are otherwise healthy.

Observed in CI:

  • smoke-modern (MKE 3.9.2, 7 nodes): all nodes missed the deadline
  • smoke-windows (MKE 3.8.8, Win2025 worker): the Win2025 worker missed the deadline
  • smoke-legacy (MKE 3.8.8, 6 Linux nodes): passes cleanly — confirms size/platform dependency

The timeout is internal to the MKE container; there is no --timeout flag on uninstall-ucp.

Fix

MKE documents the manual recovery path for this exact error:

"Remove the ucp-uninstall-agent and ucp-uninstall-agent-win services from a swarm manager, then force each node to leave the swarm."

UninstallMKE.Run() now detects the specific "Uninstalling UCP took too long" error and automatically executes that recovery:

  1. Remove the stuck ucp-uninstall-agent / ucp-uninstall-agent-win services from the leader (best-effort).
  2. Force all non-leader nodes to leave the swarm in parallel (per-node failures logged as warnings, not fatal).
  3. Force the leader to leave last (hard failure if this fails).
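
The ordering of the three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `node` interface, `fakeNode`, and `recorder` are hypothetical stand-ins for launchpad's host types, and removing each service with its own `docker service rm` call is an assumption made so that a missing Windows agent service does not affect removal of the Linux one.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync"
)

// node is a hypothetical stand-in for a host that can run shell commands.
type node interface {
	Name() string
	Exec(cmd string) error
}

const leaveCmd = "docker swarm leave --force"

// dissolveSwarm sketches the fallback: remove stuck services (best-effort),
// force non-leaders out in parallel (warnings only), then force the leader
// out last (hard failure).
func dissolveSwarm(leader node, others []node) error {
	// Step 1: best-effort removal of the stuck agent services.
	for _, svc := range []string{"ucp-uninstall-agent", "ucp-uninstall-agent-win"} {
		if err := leader.Exec("docker service rm " + svc); err != nil {
			log.Printf("warning: removing service %s via %s: %v", svc, leader.Name(), err)
		}
	}

	// Step 2: non-leader nodes leave in parallel; failures are only logged.
	var wg sync.WaitGroup
	for _, n := range others {
		wg.Add(1)
		go func(n node) {
			defer wg.Done()
			if err := n.Exec(leaveCmd); err != nil {
				log.Printf("warning: %s failed to leave swarm: %v", n.Name(), err)
			}
		}(n)
	}
	wg.Wait()

	// Step 3: the leader leaves last; a failure here is returned to the caller.
	if err := leader.Exec(leaveCmd); err != nil {
		return fmt.Errorf("leader %s failed to leave swarm: %w", leader.Name(), err)
	}
	return nil
}

// recorder collects executed commands so the ordering can be inspected.
type recorder struct {
	mu    sync.Mutex
	calls []string
}

// fakeNode is a test double; failLeave simulates an unresponsive host.
type fakeNode struct {
	name      string
	failLeave bool
	rec       *recorder
}

func (f *fakeNode) Name() string { return f.name }

func (f *fakeNode) Exec(cmd string) error {
	f.rec.mu.Lock()
	f.rec.calls = append(f.rec.calls, f.name+": "+cmd)
	f.rec.mu.Unlock()
	if f.failLeave && cmd == leaveCmd {
		return errors.New("simulated failure")
	}
	return nil
}

func main() {
	rec := &recorder{}
	leader := &fakeNode{name: "manager-1", rec: rec}
	workers := []node{
		&fakeNode{name: "worker-1", failLeave: true, rec: rec},
		&fakeNode{name: "worker-2", rec: rec},
	}
	fmt.Println("dissolve error:", dissolveSwarm(leader, workers))
	fmt.Println("last command:", rec.calls[len(rec.calls)-1])
}
```

Waiting on the `sync.WaitGroup` before touching the leader is what guarantees the leader leaves last, even though the non-leader nodes run concurrently.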

All other uninstall-ucp errors continue to propagate as hard failures unchanged. The UninstallMCR phase that follows handles MCR cleanup on each host regardless of how the swarm was dissolved.

Changes

  • pkg/product/mke/phase/uninstall_mke.go: isUninstallTimeout() detector + dissolveSwarm() fallback
  • pkg/product/mke/phase/uninstall_mke_test.go: unit tests for isUninstallTimeout

Testing

go test ./pkg/product/mke/phase/...

Smoke test validation pending on the smoke-test-refactor branch (PR #627), where Reset() is currently marked non-fatal as a workaround. Once this fix is merged, that workaround can be reverted.

The uninstall-ucp bootstrapper deploys ucp-uninstall-agent as a global
Swarm service, then waits (~2 min hardcoded) for every node to report
back. On large or mixed-OS clusters with cold image caches this deadline
is missed, causing Reset() to fail even though the infrastructure will
be torn down by terraform destroy anyway.

Observed in CI:
  smoke-modern (MKE 3.9.2, 7 nodes): all nodes missed the deadline
  smoke-windows (MKE 3.8.8, Win2025): Win2025 node missed the deadline

MKE itself documents the recovery path when this happens:
  1. Remove the stuck ucp-uninstall-agent service.
  2. Force every node to leave the swarm.

Implement that as an automatic fallback inside UninstallMKE.Run():
- isUninstallTimeout() detects the specific 'Uninstalling UCP took too
  long' fatal line that Bootstrap surfaces from the MKE container.
- dissolveSwarm() removes the stuck service (best-effort), then forces
  all non-leader nodes to leave in parallel, then forces the leader to
  leave last. Per-node failures are logged as warnings so that a single
  unresponsive host does not block the rest.

Other uninstall-ucp errors (connection failures, image pull errors, etc.)
are still returned as hard failures unchanged.
@james-nesbitt james-nesbitt added the smoke-test Run all smoke tests label May 11, 2026