fix: auto-heal Completed CHIs with sustained-NotReady hosts#1998
Open
dashashutosh80 wants to merge 1 commit into
Open
fix: auto-heal Completed CHIs with sustained-NotReady hosts#1998dashashutosh80 wants to merge 1 commit into
dashashutosh80 wants to merge 1 commit into
Conversation
d227fb6 to
cd6ee1c
Compare
Pre-fix, a ClickHouse pod that regressed to Ready=False while its CHI
was Status=Completed could stay stuck indefinitely: the CHI generation
never changes, the operator's ReconcileUpdate path had no handler for
Ready→NotReady on Completed CHIs, and shouldForceRestartHost had no
case for sustained Ready=False. Field observation: 26-hour stalls of
otherwise-healthy clusters until an external trigger (operator pod
restart) bumped the reconcile loop.
Close both gaps:
Signal side - worker-pod-retry.go adds
recoverCompletedReconcileOnPodNotReady, wired into processReconcilePod
alongside the existing recoverAbortedReconcileOnPodReady. Schedules a
delayed re-enqueue (time.AfterFunc) so the decision-side gets to
evaluate sustain rather than fire on every flap.
Decision side - worker.go adds a new case to shouldForceRestartHost
gated on ShouldRecoverCompletedOnPodNotReady() and a freshly-fetched
isPoustainedNotReady() check that reads
Pod.Status.Conditions[PodReady].LastTransitionTime - the same signal
kube-proxy uses for EndpointSlice membership.
Config: reconcile.recovery.from.completed.{onPodNotReady,
onPodNotReadyThreshold}, default retry / 5m. Default-on; the cost of
not acting by default is multi-hour outage, the cost of acting is at
most one StatefulSet restart per host per threshold window.
Observability: two new EventReasons (StuckHostRecoveryTriggered,
HostStuckNotReady), split so a debounced flap leaves only the schedule
event behind while a real outage leaves both. No new Prometheus
counters - events on the CHI provide the observability hook without
expanding the operator's /metrics surface.
Both signal and decision sides short-circuit when the pod is in a
kubelet-driven failure mode (ImagePullBackOff, CrashLoopBackOff,
Pending, etc.) so the operator does not race kubelet on its own
recovery. The signal side also skips pods with a non-zero
DeletionTimestamp (graceful shutdown is not a regression).
Validated end-to-end on kind 1.30 with a 65% bidirectional iptables
drop on the pod's 8123/tcp: self-heal in ~10 min vs. indefinite stall
pre-fix. New unit tests for each new function (pure predicates and
accessors), all table-driven.
Signed-off-by: dashashutosh80 <dashashutosh80@gmail.com>
cd6ee1c to
d8717b3
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Important items to consider before making a Pull Request
Please check items PR complies to:
next-releasebranch, not intomasterbranch1. More infoSummary
Heal
CompletedCHIs whose host pod silently regresses toReady=Falseand stays there. Two independent gaps left the operator deaf to this scenario and produced multi-hour stalls of otherwise-healthy clusters until something external (e.g. a manual operator-deployment restart) bumped the reconcile loop. This change closes both gaps behind a config gate that is on by default and debounced by a 5-minute sustain threshold.Reproduced on Kubernetes 1.30 (kind) by injecting a 65% bidirectional drop on the ClickHouse pod's
8123/tcp. Pre-fix operator (0.24.2,0.26.3, vanilla0.27.1) leaves the CHICompletedwith aReady=Falsepod indefinitely. Post-fix operator emits aStuckHostRecoveryTriggeredevent within ~10s of the pod going NotReady, waits out the threshold (debounced; transient flaps are absorbed), emitsHostStuckNotReady, and rolls the StatefulSet. End-to-end self-heal in ~10 minutes.Background
A ClickHouse pod's readiness probe (
http://:http/ping) can fail for reasons that do not also trip the liveness probe: intermittent network drops on the data-plane port, a stuck thread pool, slow merges, a long-running mutation. In those situations:metadata.generationdoes not change, soisGenerationTheSame()short-circuits any naive resync.ReconcileUpdatearm has no handler for theReady→NotReadytransition on aCompletedCHI.shouldForceRestartHost's switch has no case for "pod isReady=False" — only forCrashLoopBackOff, IP changes, rolling-update markers, unknown version + crushed. Default arm returnsfalse.The result is a
CompletedCHI with aReady=Falsepod that the operator will not act on until an external trigger fires. Pre-fix log/event timelines from the field show no operator activity against the affected CHI for the full duration of the regression, then immediate recovery the moment the operator pod itself is restarted.Changes
1. Signal — wake the operator on
Ready→NotReadyforCompletedCHIspkg/controller/chi/worker-pod-retry.goaddsrecoverCompletedReconcileOnPodNotReady, the symmetric counterpart of the existingrecoverAbortedReconcileOnPodReady:Ready=True → Ready=Falsetransition (mutually exclusive with the existing Aborted path, so the two handlers can both run on every pod-update with no double-fire).Completed, not deleting).time.AfterFuncforthreshold + extraDelay(defaults:5m + 2s), so the decision-side gets to evaluate sustain rather than fire on flaps.cmd_queue.ReconcileAdd(notReconcileUpdate) so the eventual reconcile bypassesisGenerationTheSame().time.AfterFuncdoes not coalesce by key — intentionally distinct fromworkqueue.AddAftersemantics, which would silently drop a flap-then-stick scenario).Wired into
processReconcilePod(worker-boilerplate.go) alongside the existing Aborted-recovery call.2. Decision — restart
Ready=False-past-threshold hosts inshouldForceRestartHostpkg/controller/chi/worker.goadds a newcasetoshouldForceRestartHost's switch, gated on the new config option and a sustain check:chop.Config().ShouldRecoverCompletedOnPodNotReady()(defaulttrue)w.isPodSustainedNotReady(ctx, host, threshold)The case is placed before the
default: return falsearm. All pre-existing cases are untouched.pkg/controller/chi/worker-status-helpers.goaddsisPodSustainedNotReady(kube-fetch wrapper) plus its pure inner predicatepodIsSustainedNotReady. The predicate readsPod.Status.Conditions[PodReady].LastTransitionTimebecause that is the same signal kube-proxy uses to remove a pod from EndpointSlice — keeping the operator's view of "broken" aligned with k8s's.ConditionUnknownis treated as not-ready. API fetch failures default tofalse(a transient apiserver hiccup must never cause a restart).Both call sites (signal and decision) additionally short-circuit when the pod is in a kubelet-driven failure mode (
podIsInKubeletFailureMode):ImagePullBackOff,ErrImagePull,InvalidImageName,CrashLoopBackOff,CreateContainerError,RunContainerError,ContainerCannotRun,CreateContainerConfigError, init-container equivalents, orPod.Status.Phase == Pending. In those states kubelet/scheduler is already handling the failure; an operator-driven StatefulSet rollout would race them without value.The signal-side handler also skips pods with a non-zero
DeletionTimestamp— aReady=True → Ready=Falseflip during graceful pod termination is normal lifecycle bookkeeping, not a host regression.3. Config —
reconcile.recovery.from.completedscopepkg/apis/clickhouse.altinity.com/v1/type_configuration_chop.go:OperatorConfigReconcileRecoveryCompletedScopewith two fields:onPodNotReady—retry(default) ornone.onPodNotReadyThreshold— duration string, default5mwhen unset/empty/unparseable.ShouldRecoverCompletedOnPodNotReady,CompletedOnPodNotReadyThreshold).A separate struct from
OperatorConfigReconcileRecoveryScopebecause theCompletedscope reacts to the inverse transition and carries the extraOnPodNotReadyThresholddebounce knob, which has no analogue in theAbortedscope.Default-on rather than opt-in: pre-fix behaviour is a silent stall on a sustained regression; the cost of doing nothing by default is outage, the cost of acting by default is at most one StatefulSet restart per host per threshold window. Operators that disagree can set
onPodNotReady: none.4. Observability — events
pkg/controller/common/announcer/event-emitter.goadds twoEventReasonconstants:StuckHostRecoveryTriggered— operator noticed and scheduled work.HostStuckNotReady— operator decided to restart.Split deliberately so a single transient flap that clears below threshold leaves only the first event behind, whereas a real outage leaves both. Visible to operators via
kubectl describe chi.No new Prometheus counters are introduced — events on the CHI provide the observability hook without expanding the operator's
/metricssurface.Runtime flow
For a pod that goes
Ready=Falsewhile its CHI isCompleted:PodReadycondition. Operator's pod informer enqueues aReconcilePod(ReconcileUpdate).processReconcilePodrunsrecoverCompletedReconcileOnPodNotReady. It confirms the transition, fetches the parent CHI, confirmsStatus=Completed, computes a delay (threshold − elapsed + 2s, floored at5s), emitsStuckHostRecoveryTriggered, and arms atime.AfterFunc.Ready→NotReady(e.g. after a kubelet-driven container restart re-trips readiness) rearms an independent timer.ReconcileAddfor the CHI. The normal reconcile runs and reachesshouldForceRestartHost.isPodSustainedNotReady. If the live pod'sPodReady=Falsehas held for ≥ threshold, the case returnstrue, emitsHostStuckNotReady, and the existing StatefulSet-rollout path restarts the host. Otherwise the switch falls through to the defaultreturn false— debouncing absorbed the flap.Config example
Both fields are optional; omitting them keeps the default-on behaviour with a 5-minute sustain.
Tests
pkg/controller/chi/worker-pod-retry_test.go— new table-driven tests forisPodReadyToNotReadyTransition,shouldTriggerStuckHostRecovery, andstuckHostScheduleDelay(including the operator-restart "elapsed already exceeds threshold" edge case).pkg/controller/chi/worker-status-helpers_test.go(new file) — table-driven tests forpodIsSustainedNotReady. New file follows the project's per-file pairing convention for_test.go;worker-status-helpers.gohad no test file before because there were no pure predicates in it.pkg/apis/clickhouse.altinity.com/v1/type_configuration_chop_recovery_test.go— new tests forShouldRecoverCompletedOnPodNotReadyandCompletedOnPodNotReadyThreshold, covering nil/empty/unparseable/zero/negative/valid/case-insensitive inputs.End-to-end validation on kind
Repro setup: kind 1.30, one ClickHouse pod, one keeper, operator built from this branch (
localhost/altinity-clickhouse-operator:fix-stuck-not-ready).Inject failure (run on the kind node):
Observed operator log lines (annotated):
Final state: pod recreated (new IP, iptables rule no longer matches),
Ready=True, containerRestartCount=0, CHIStatus=Completed. Self-heal end-to-end in ~10 minutes against a fault that pre-fix operators sat on for 26 hours in production.Same
kubectl describe chievents:Backwards compatibility
EventReasonconstants, twoOperatorConfigaccessors, one new config struct).RecoveryActionNonegodoc was generalised from "CHI stays Aborted" to "CHI stays in its current state" to reflect its reuse for theCompletedscope. Value is unchanged.Operator opt-out
Restores the pre-fix behaviour for the signal and the decision side:
recoverCompletedReconcileOnPodNotReadyreturns immediately andshouldForceRestartHost's new case predicate evaluatesfalse.Risk assessment
isPodSustainedNotReady, which requires the pod to beReady=Falsefor at leastthreshold(default 5m). A pod that briefly flaps below threshold never reaches the restart path. The decision-side predicate is re-evaluated at fire-time against the live pod, not the pod state captured when the timer was armed; this is intentional and is what makes the 15:27:29 reconcile in the validation trace correctly not restart the host.reconcileHostStatefulSetpath withforceRestart=true. No new restart primitive is introduced. A host that is sustainedReady=Falsefor ≥5m is already not receiving traffic via kube-proxy, so the restart is recovering throughput, not interrupting it.isPodSustainedNotReadyruns once per reconcile (which itself is debounced bythreshold), not once per pod-event. Thetime.AfterFuncrearming on flaps does not add apiserver calls until the timer fires.PodReadycondition (orStatus=Trueonce Ready) sopodIsSustainedNotReadyreturnsfalseduring normal rollout windows.Ready=Falsefrom the application — ClickHouse can reportReady=Falsewhile in replication recovery, schema sync, or other read-only states the application owns. After the threshold elapses this PR will restart the host, which resets that recovery. Operators running heavy replication workloads should tuneonPodNotReadyThresholdupward (e.g.30m) or setonPodNotReady: nonefor those CHIs.Scope and known limitations
pkg/controller/chk/(ClickHouseKeeper) has the analogous signal/decision gaps inworker-boilerplate.goandshouldForceRestartHost. Out of scope for this PR; can be addressed as a separate change against the same threshold/config infrastructure if requested.time.AfterFunctimer is armed, the timer is lost. The nextReady→NotReadytransition rearms the recovery, but if the pod sits steadilyReady=Falseacross the operator restart, the signal-side handler does not fire. Pre-fix behaviour is identical (worse, actually — pre-fix the pod stays stuck forever); a follow-up could plugprocessReconcilePod'sReconcileAddarm to schedule recovery on the initial-list sweep too.Resolves
This PR aims to resolve #1994