Skip to content

fix: auto-heal Completed CHIs with sustained-NotReady hosts#1998

Open
dashashutosh80 wants to merge 1 commit into
Altinity:0.27.1from
dashashutosh80:fix/heal-completed-chi-stuck-not-ready
Open

fix: auto-heal Completed CHIs with sustained-NotReady hosts#1998
dashashutosh80 wants to merge 1 commit into
Altinity:0.27.1from
dashashutosh80:fix/heal-completed-chi-stuck-not-ready

Conversation

@dashashutosh80
Copy link
Copy Markdown
Contributor

@dashashutosh80 dashashutosh80 commented May 29, 2026

Important items to consider before making a Pull Request

Please check items PR complies to:

  • All commits in the PR are squashed. More info
  • The PR is made into dedicated next-release branch, not into master branch1. More info
  • The PR is signed. More info

Summary

Heal Completed CHIs whose host pod silently regresses to Ready=False and stays there. Two independent gaps left the operator deaf to this scenario and produced multi-hour stalls of otherwise-healthy clusters until something external (e.g. a manual operator-deployment restart) bumped the reconcile loop. This change closes both gaps behind a config gate that is on by default and debounced by a 5-minute sustain threshold.

Reproduced on Kubernetes 1.30 (kind) by injecting a 65% bidirectional drop on the ClickHouse pod's 8123/tcp. Pre-fix operator (0.24.2, 0.26.3, vanilla 0.27.1) leaves the CHI Completed with a Ready=False pod indefinitely. Post-fix operator emits a StuckHostRecoveryTriggered event within ~10s of the pod going NotReady, waits out the threshold (debounced; transient flaps are absorbed), emits HostStuckNotReady, and rolls the StatefulSet. End-to-end self-heal in ~10 minutes.

Background

A ClickHouse pod's readiness probe (http://:http/ping) can fail for reasons that do not also trip the liveness probe: intermittent network drops on the data-plane port, a stuck thread pool, slow merges, a long-running mutation. In those situations:

  • The kubelet correctly stops routing traffic to the pod (kube-proxy removes it from EndpointSlice).
  • The CHI's metadata.generation does not change, so isGenerationTheSame() short-circuits any naive resync.
  • The operator wakes up on pod events but its ReconcileUpdate arm has no handler for the Ready→NotReady transition on a Completed CHI.
  • Even if a reconcile did run, shouldForceRestartHost's switch has no case for "pod is Ready=False" — only for CrashLoopBackOff, IP changes, rolling-update markers, unknown version + crushed. Default arm returns false.

The result is a Completed CHI with a Ready=False pod that the operator will not act on until an external trigger fires. Pre-fix log/event timelines from the field show no operator activity against the affected CHI for the full duration of the regression, then immediate recovery the moment the operator pod itself is restarted.

Changes

1. Signal — wake the operator on Ready→NotReady for Completed CHIs

pkg/controller/chi/worker-pod-retry.go adds recoverCompletedReconcileOnPodNotReady, the symmetric counterpart of the existing recoverAbortedReconcileOnPodReady:

  • Fires only on a Ready=True → Ready=False transition (mutually exclusive with the existing Aborted path, so the two handlers can both run on every pod-update with no double-fire).
  • Guards on parent CHI status (Completed, not deleting).
  • Schedules a delayed re-enqueue via time.AfterFunc for threshold + extraDelay (defaults: 5m + 2s), so the decision-side gets to evaluate sustain rather than fire on flaps.
  • Uses cmd_queue.ReconcileAdd (not ReconcileUpdate) so the eventual reconcile bypasses isGenerationTheSame().
  • Each transition rearms its own timer (time.AfterFunc does not coalesce by key — intentionally distinct from workqueue.AddAfter semantics, which would silently drop a flap-then-stick scenario).

Wired into processReconcilePod (worker-boilerplate.go) alongside the existing Aborted-recovery call.

2. Decision — restart Ready=False-past-threshold hosts in shouldForceRestartHost

pkg/controller/chi/worker.go adds a new case to shouldForceRestartHost's switch, gated on the new config option and a sustain check:

  • chop.Config().ShouldRecoverCompletedOnPodNotReady() (default true)
  • w.isPodSustainedNotReady(ctx, host, threshold)

The case is placed before the default: return false arm. All pre-existing cases are untouched.

pkg/controller/chi/worker-status-helpers.go adds isPodSustainedNotReady (kube-fetch wrapper) plus its pure inner predicate podIsSustainedNotReady. The predicate reads Pod.Status.Conditions[PodReady].LastTransitionTime because that is the same signal kube-proxy uses to remove a pod from EndpointSlice — keeping the operator's view of "broken" aligned with k8s's. ConditionUnknown is treated as not-ready. API fetch failures default to false (a transient apiserver hiccup must never cause a restart).

Both call sites (signal and decision) additionally short-circuit when the pod is in a kubelet-driven failure mode (podIsInKubeletFailureMode): ImagePullBackOff, ErrImagePull, InvalidImageName, CrashLoopBackOff, CreateContainerError, RunContainerError, ContainerCannotRun, CreateContainerConfigError, init-container equivalents, or Pod.Status.Phase == Pending. In those states kubelet/scheduler is already handling the failure; an operator-driven StatefulSet rollout would race them without value.

The signal-side handler also skips pods with a non-zero DeletionTimestamp — a Ready=True → Ready=False flip during graceful pod termination is normal lifecycle bookkeeping, not a host regression.

3. Config — reconcile.recovery.from.completed scope

pkg/apis/clickhouse.altinity.com/v1/type_configuration_chop.go:

  • New struct OperatorConfigReconcileRecoveryCompletedScope with two fields:
    • onPodNotReadyretry (default) or none.
    • onPodNotReadyThreshold — duration string, default 5m when unset/empty/unparseable.
  • Two accessors with defaults baked in (ShouldRecoverCompletedOnPodNotReady, CompletedOnPodNotReadyThreshold).

A separate struct from OperatorConfigReconcileRecoveryScope because the Completed scope reacts to the inverse transition and carries the extra OnPodNotReadyThreshold debounce knob, which has no analogue in the Aborted scope.

Default-on rather than opt-in: pre-fix behaviour is a silent stall on a sustained regression; the cost of doing nothing by default is outage, the cost of acting by default is at most one StatefulSet restart per host per threshold window. Operators that disagree can set onPodNotReady: none.

4. Observability — events

pkg/controller/common/announcer/event-emitter.go adds two EventReason constants:

  • StuckHostRecoveryTriggered — operator noticed and scheduled work.
  • HostStuckNotReady — operator decided to restart.

Split deliberately so a single transient flap that clears below threshold leaves only the first event behind, whereas a real outage leaves both. Visible to operators via kubectl describe chi.

No new Prometheus counters are introduced — events on the CHI provide the observability hook without expanding the operator's /metrics surface.

Runtime flow

For a pod that goes Ready=False while its CHI is Completed:

  1. Kubelet PATCHes the pod's PodReady condition. Operator's pod informer enqueues a ReconcilePod (ReconcileUpdate).
  2. processReconcilePod runs recoverCompletedReconcileOnPodNotReady. It confirms the transition, fetches the parent CHI, confirms Status=Completed, computes a delay (threshold − elapsed + 2s, floored at 5s), emits StuckHostRecoveryTriggered, and arms a time.AfterFunc.
  3. Each subsequent Ready→NotReady (e.g. after a kubelet-driven container restart re-trips readiness) rearms an independent timer.
  4. When a timer fires, it enqueues a ReconcileAdd for the CHI. The normal reconcile runs and reaches shouldForceRestartHost.
  5. The new case calls isPodSustainedNotReady. If the live pod's PodReady=False has held for ≥ threshold, the case returns true, emits HostStuckNotReady, and the existing StatefulSet-rollout path restarts the host. Otherwise the switch falls through to the default return false — debouncing absorbed the flap.

Config example

reconcile:
  recovery:
    from:
      completed:
        onPodNotReady: retry          # or "none" to disable
        onPodNotReadyThreshold: 5m    # any time.ParseDuration string

Both fields are optional; omitting them keeps the default-on behaviour with a 5-minute sustain.

Tests

  • pkg/controller/chi/worker-pod-retry_test.go — new table-driven tests for isPodReadyToNotReadyTransition, shouldTriggerStuckHostRecovery, and stuckHostScheduleDelay (including the operator-restart "elapsed already exceeds threshold" edge case).
  • pkg/controller/chi/worker-status-helpers_test.go (new file) — table-driven tests for podIsSustainedNotReady. New file follows the project's per-file pairing convention for _test.go; worker-status-helpers.go had no test file before because there were no pure predicates in it.
  • pkg/apis/clickhouse.altinity.com/v1/type_configuration_chop_recovery_test.go — new tests for ShouldRecoverCompletedOnPodNotReady and CompletedOnPodNotReadyThreshold, covering nil/empty/unparseable/zero/negative/valid/case-insensitive inputs.

End-to-end validation on kind

Repro setup: kind 1.30, one ClickHouse pod, one keeper, operator built from this branch (localhost/altinity-clickhouse-operator:fix-stuck-not-ready).

Inject failure (run on the kind node):

iptables -I OUTPUT -p tcp -d $POD_IP --dport 8123 -m statistic --mode random --probability 0.65 -j DROP
iptables -I INPUT  -p tcp -s $POD_IP --sport 8123 -m statistic --mode random --probability 0.65 -j DROP

Observed operator log lines (annotated):

15:22:24  recoverCompletedReconcileOnPodNotReady → "Stuck-host recovery scheduled ... re-enqueue in 5m1s (threshold 5m0s)"
15:26:15  recoverCompletedReconcileOnPodNotReady → rearmed after kubelet-driven container restart
15:27:29  shouldForceRestartHost                  → "Host force restart is not required"   (debouncing — NotReady only 1m14s)
15:31:20  shouldForceRestartHost                  → "Host pod has been Ready=False past threshold 5m0s — force restart"
15:31:21  reconcileHostStatefulSet                → "Reconcile host STS force restart: 0-0"

Final state: pod recreated (new IP, iptables rule no longer matches), Ready=True, container RestartCount=0, CHI Status=Completed. Self-heal end-to-end in ~10 minutes against a fault that pre-fix operators sat on for 26 hours in production.

Same kubectl describe chi events:

Info  StuckHostRecoveryTriggered  Stuck-host recovery scheduled: pod chi-clickhouse-c1-0-0-0 became NotReady while CHI ... is Completed; re-enqueue in 5m1s (threshold 5m0s)
Info  HostStuckNotReady           Host pod has been Ready=False past threshold 5m0s — force restart. Host: 0-0

Backwards compatibility

  • New config keys default to the safety-net behaviour; no manifest change required to benefit, and existing manifests without the new keys behave the same as a freshly-defaulted operator.
  • No CRD changes.
  • No public Go API removals; only additions (two EventReason constants, two OperatorConfig accessors, one new config struct).
  • The RecoveryActionNone godoc was generalised from "CHI stays Aborted" to "CHI stays in its current state" to reflect its reuse for the Completed scope. Value is unchanged.

Operator opt-out

reconcile:
  recovery:
    from:
      completed:
        onPodNotReady: none

Restores the pre-fix behaviour for the signal and the decision side: recoverCompletedReconcileOnPodNotReady returns immediately and shouldForceRestartHost's new case predicate evaluates false.

Risk assessment

  • Restart storm risk — gated by isPodSustainedNotReady, which requires the pod to be Ready=False for at least threshold (default 5m). A pod that briefly flaps below threshold never reaches the restart path. The decision-side predicate is re-evaluated at fire-time against the live pod, not the pod state captured when the timer was armed; this is intentional and is what makes the 15:27:29 reconcile in the validation trace correctly not restart the host.
  • Stateful-DB restart risk — the only restart action taken is the existing reconcileHostStatefulSet path with forceRestart=true. No new restart primitive is introduced. A host that is sustained Ready=False for ≥5m is already not receiving traffic via kube-proxy, so the restart is recovering throughput, not interrupting it.
  • Apiserver-load risk — the kube fetch in isPodSustainedNotReady runs once per reconcile (which itself is debounced by threshold), not once per pod-event. The time.AfterFunc rearming on flaps does not add apiserver calls until the timer fires.
  • False positives on rollouts/upgrades — a fresh pod has no PodReady condition (or Status=True once Ready) so podIsSustainedNotReady returns false during normal rollout windows.
  • Legitimate Ready=False from the application — ClickHouse can report Ready=False while in replication recovery, schema sync, or other read-only states the application owns. After the threshold elapses this PR will restart the host, which resets that recovery. Operators running heavy replication workloads should tune onPodNotReadyThreshold upward (e.g. 30m) or set onPodNotReady: none for those CHIs.

Scope and known limitations

  • CHI onlypkg/controller/chk/ (ClickHouseKeeper) has the analogous signal/decision gaps in worker-boilerplate.go and shouldForceRestartHost. Out of scope for this PR; can be addressed as a separate change against the same threshold/config infrastructure if requested.
  • Operator-restart recovery gap — if the operator process restarts while a time.AfterFunc timer is armed, the timer is lost. The next Ready→NotReady transition rearms the recovery, but if the pod sits steadily Ready=False across the operator restart, the signal-side handler does not fire. Pre-fix behaviour is identical (worse, actually — pre-fix the pod stays stuck forever); a follow-up could plug processReconcilePod's ReconcileAdd arm to schedule recovery on the initial-list sweep too.

Resolves

This PR aims to resolve #1994

@dashashutosh80 dashashutosh80 force-pushed the fix/heal-completed-chi-stuck-not-ready branch 2 times, most recently from d227fb6 to cd6ee1c Compare May 29, 2026 08:24
Pre-fix, a ClickHouse pod that regressed to Ready=False while its CHI
was Status=Completed could stay stuck indefinitely: the CHI generation
never changes, the operator's ReconcileUpdate path had no handler for
Ready→NotReady on Completed CHIs, and shouldForceRestartHost had no
case for sustained Ready=False. Field observation: 26-hour stalls of
otherwise-healthy clusters until an external trigger (operator pod
restart) bumped the reconcile loop.
Close both gaps:
  Signal side - worker-pod-retry.go adds
  recoverCompletedReconcileOnPodNotReady, wired into processReconcilePod
  alongside the existing recoverAbortedReconcileOnPodReady. Schedules a
  delayed re-enqueue (time.AfterFunc) so the decision-side gets to
  evaluate sustain rather than fire on every flap.
  Decision side - worker.go adds a new case to shouldForceRestartHost
  gated on ShouldRecoverCompletedOnPodNotReady() and a freshly-fetched
  isPoustainedNotReady() check that reads
  Pod.Status.Conditions[PodReady].LastTransitionTime - the same signal
  kube-proxy uses for EndpointSlice membership.
Config: reconcile.recovery.from.completed.{onPodNotReady,
onPodNotReadyThreshold}, default retry / 5m. Default-on; the cost of
not acting by default is multi-hour outage, the cost of acting is at
most one StatefulSet restart per host per threshold window.
Observability: two new EventReasons (StuckHostRecoveryTriggered,
HostStuckNotReady), split so a debounced flap leaves only the schedule
event behind while a real outage leaves both. No new Prometheus
counters - events on the CHI provide the observability hook without
expanding the operator's /metrics surface.
Both signal and decision sides short-circuit when the pod is in a
kubelet-driven failure mode (ImagePullBackOff, CrashLoopBackOff,
Pending, etc.) so the operator does not race kubelet on its own
recovery. The signal side also skips pods with a non-zero
DeletionTimestamp (graceful shutdown is not a regression).
Validated end-to-end on kind 1.30 with a 65% bidirectional iptables
drop on the pod's 8123/tcp: self-heal in ~10 min vs. indefinite stall
pre-fix. New unit tests for each new function (pure predicates and
accessors), all table-driven.

Signed-off-by: dashashutosh80 <dashashutosh80@gmail.com>
@dashashutosh80 dashashutosh80 force-pushed the fix/heal-completed-chi-stuck-not-ready branch from cd6ee1c to d8717b3 Compare May 29, 2026 11:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant