design(#127): re-apply sidecar mark-ready on pod replacement#128
Merged
design(#127): re-apply sidecar mark-ready on pod replacement#128
Conversation
Extend buildRunningPlan with a second drift detector for sidecar readiness. When the sidecar returns 503 on an otherwise-Running node (after pod replacement from chaos, drain, OOM, reschedule), spawn a one-task MarkReady plan using the existing plan machinery. Probe lives in the reconciler; planner reads a SidecarReady condition. Idiomatic to the image-drift path already implemented in buildRunningPlan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
bdchatham
added a commit
that referenced
this pull request
Apr 23, 2026
Addresses specialist review on #129: 1. Required: emit MarkReadyReapplied event when the mark-ready plan is spawned (design #128 §"New event"). Gated by a small helper that classifies single-task MarkReady plans. 2. Recommended: add a reconciler-level test asserting the probe is skipped when Phase=Initializing. 3. Integration test now also asserts the MarkReadyReapplied event is emitted during the pod-replacement round-trip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdchatham
added a commit
that referenced
this pull request
Apr 23, 2026
* feat: sidecar mark-ready re-apply on pod replacement Closes #127. Implements design #128. Reconciler probes the sidecar's Healthz on each Running-phase reconcile and stamps a SidecarReady condition. The planner extends buildRunningPlan with a second drift branch: when SidecarReady is False/NotReady, it returns a one-task MarkReady plan. Image drift still wins conflict because its plan already ends with MarkReady. - api/v1alpha1: new ConditionSidecarReady type - internal/task: Healthz added to SidecarClient interface - internal/controller/node: probeSidecarHealth method + sidecar_health_probes counter + transition events (SidecarReadinessLost/Restored) - internal/planner: buildMarkReadyPlan, sidecarNeedsReapproval, buildRunningPlan now returns a MarkReady plan when drift is observed - cmd: wire BuildSidecarClient field on SeiNodeReconciler Tests cover all probe outcomes, planner branches (including image-wins), and a full pod-replacement round-trip integration test. All pre-existing tests pass unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: emit MarkReadyReapplied event + add Initializing-skipped test Addresses specialist review on #129: 1. Required: emit MarkReadyReapplied event when the mark-ready plan is spawned (design #128 §"New event"). Gated by a small helper that classifies single-task MarkReady plans. 2. Recommended: add a reconciler-level test asserting the probe is skipped when Phase=Initializing. 3. Integration test now also asserts the MarkReadyReapplied event is emitted during the pod-replacement round-trip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: move sidecar probe into planner; drop MarkReadyReapplied event Addresses review feedback on #129: 1. Probe lives in the planner now, invoked from ResolvePlan. Callers pass a task.SidecarClient; nil skips the probe (tests, Initializing). Drift detection + mitigation are now co-located in the planner — the reconciler just wires the client through and emits condition- transition events. 2. MarkReadyReapplied event dropped. SidecarReadinessLost already signals the same condition transition, and the plan itself is visible in status.plan. Also moved `sei.controller.seinode.sidecar_health_probes` counter from the node-controller metrics package to the planner package so it sits alongside the probe that emits it. Probe unit tests moved to internal/planner/sidecar_probe_test.go. Reconciler tests now exercise the phase-gating end-to-end through Reconcile rather than calling a (now-removed) reconciler probe method. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: NodeResolver holds sidecar factory; diff-at-end replaces statusDirty Iteration 3 on #129. Addresses review feedback on where sidecar concerns should live and simplifies the reconciler's status-patching. Changes: 1. planner.NodeResolver struct holds BuildSidecarClient at construction. ResolvePlan is now a method on it. Reconciler becomes sidecar-agnostic — no method-signature pollution with task.SidecarClient, no client construction in the reconcile loop. 2. Reconciler replaces per-branch `statusDirty = true` flags with a single diff-at-end: if !equality.Semantic.DeepEqual(before.Status, node.Status) { r.Status().Patch(ctx, node, statusBase) } MergeFromWithOptimisticLock preserved to avoid lost-update races with executor-side patches. Any future drift check that mutates status is automatically covered — no flag-tracking contract. 3. Adds seinode_sidecar_ready observable gauge (0|1 per SeiNode) so PrometheusRule alerts can fire without kube-state-metrics needing CustomResourceStateMetrics configured for our CRD. Events (SidecarReadinessLost, SidecarReadinessRestored) stay in the reconciler and fire based on a prev/current condition snapshot around the ResolvePlan call. MarkReadyReapplied event remains dropped (condition + plan in status.plan already carry the same info). Both the kubernetes-specialist and platform-engineer validated this shape before implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: trim breadcrumb comments in production code paths Per review feedback: prefer self-explanatory code in production paths; keep comments only for non-obvious WHYs (invariants, subtle ordering). Tests left untouched. - planner.go: drop narrative docstrings on NodeResolver, ResolvePlan, buildRunningPlan, sidecarNeedsReapproval, buildMarkReadyPlan. The "skip probe during Initializing" and "image drift wins" comments stay because they encode non-obvious invariants. - sidecar_probe.go: drop the "Three-way signal" breadcrumb that described what the switch below already shows. - metrics.go: drop field docstrings on sidecarHealthProbes and the sidecarReadyTracker. - controller.go: drop narrative comments around reconcilePeers, ResolvePlan, executor branches, and diff-at-end. Keep the "requeue immediately so the plan is persisted before execution" note — that's an ordering invariant that isn't obvious from the code. - seinode_types.go: trim ConditionSidecarReady docstring to one line, matching the style of the neighbor constant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address lint errors (ginkgolinter, gofmt, modernize) - node_update_test.go: Expect(plan.Tasks).To(HaveLen(4)) instead of Expect(len(plan.Tasks)).To(Equal(4)) — ginkgolinter prefers HaveLen. - sidecar_probe_test.go: gofmt import ordering (google/uuid before onsi/gomega). - sidecar_probe_integration_test.go: `for range 5` instead of the classic C-style loop — modernize linter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(metrics): struct key instead of "namespace/name" string concat The map key for the sidecar-ready tracker was a concatenated "namespace/name" string that we split back apart in the Observe callback. Struct key does the same thing without the dance. Namespace and name remain attributes on the emitted metric. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address lint unparam + user review feedback - peers.go: drop unused (bool, error) return from reconcilePeers — the dirty signal is no longer needed since diff-at-end handles status detection. Kept the conditional write (only reassign if changed) so a nil → [] replacement doesn't create spurious diffs. - sidecar_probe_test.go: rename findCond → findSidecarReady and drop the unparam type parameter. - peers.go: remove breadcrumb docstring per review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Design-only PR. Addresses #127 — when a seid pod is replaced (chaos, drain, OOM, reschedule) the new sidecar returns 503 from
/v0/healthz, blocking seid startup forever until an operator manually POSTs mark-ready via port-forward.The design extends the existing
buildRunningPlandrift detector ininternal/planner/planner.gowith a second form of drift (sidecar readiness), emitting a one-task MarkReady plan whenHealthz() == 503. This is idiomatic to the image-drift pattern already implemented on the same code path.Why this shape
buildRunningPlanis the canonical steady-state-drift extension point and its own doc-string says so.ResolvePlan); planner reads a newSidecarReadycondition. Same shape asreconcilePeers/ObserveImage.CurrentImage.FailedPhaseon the plan — identical tobuildNodeUpdatePlan.ResolvePlansemantics, no new machinery.Options considered
Option A (mark-ready is its own plan) chosen. Option B (inline the HTTP POST in the reconciler, bypass the plan system) rejected because it breaks the single-writer invariant from
inplace-update-strategy.mdand skips plan-level observability. Option C (hybrid probe+plan) collapses to Option A in this codebase —buildRunningPlanis already the inline probe hook.Rollout posture
Steps 1-3 (condition, interface widening, reconciler probe with metric) are shippable as status-only observation; step 4 enables auto-remediation. Splitting is optional — full fix is small.
Not in this PR
SidecarReadinessLostalert (goes in platform repo substrate overlay, referenced in the design).Test plan
Closes #127 once implementation lands.
🤖 Generated with Claude Code