Skip to content

design(#127): re-apply sidecar mark-ready on pod replacement#128

Merged
bdchatham merged 1 commit intomainfrom
fix/127-sidecar-mark-ready-re-apply
Apr 23, 2026
Merged

design(#127): re-apply sidecar mark-ready on pod replacement#128
bdchatham merged 1 commit intomainfrom
fix/127-sidecar-mark-ready-re-apply

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

Summary

Design-only PR. Addresses #127 — when a seid pod is replaced (chaos, drain, OOM, reschedule) the new sidecar returns 503 from /v0/healthz, blocking seid startup forever until an operator manually POSTs mark-ready via port-forward.

The design extends the existing buildRunningPlan drift detector in internal/planner/planner.go with a second form of drift (sidecar readiness), emitting a one-task MarkReady plan when Healthz() == 503. This is idiomatic to the image-drift pattern already implemented on the same code path.

Why this shape

  • buildRunningPlan is the canonical steady-state-drift extension point and its own doc-string says so.
  • Probe lives in the reconciler (before ResolvePlan); planner reads a new SidecarReady condition. Same shape as reconcilePeers / ObserveImage.CurrentImage.
  • One-task plan preserves the plan-based observability contract (metrics, conditions, events, failed-task tracking) without any new reconcile loop.
  • Failure retries automatically via empty FailedPhase on the plan — identical to buildNodeUpdatePlan.
  • Conflict policy: image drift wins (it already ends with MarkReady). If an image-update plan is running, new mark-ready plans are deferred until the current plan terminates — existing ResolvePlan semantics, no new machinery.

Options considered

Option A (mark-ready is its own plan) chosen. Option B (inline the HTTP POST in the reconciler, bypass the plan system) rejected because it breaks the single-writer invariant from inplace-update-strategy.md and skips plan-level observability. Option C (hybrid probe+plan) collapses to Option A in this codebase — buildRunningPlan is already the inline probe hook.

Rollout posture

Steps 1-3 (condition, interface widening, reconciler probe with metric) are shippable as status-only observation; step 4 enables auto-remediation. Splitting is optional — full fix is small.

Not in this PR

  • Implementation (follow-up).
  • Platform-side PrometheusRule for SidecarReadinessLost alert (goes in platform repo substrate overlay, referenced in the design).
  • Sidecar self-detection of existing chain data (future sidecar release, orthogonal).

Test plan

  • Design doc reviewed for operational ergonomics by platform-engineer
  • Design doc reviewed by controller maintainers for pattern fit
  • Implementation PR follows (unit tests + integration round-trip test + chaos re-run on release-6-5)

Closes #127 once implementation lands.

🤖 Generated with Claude Code

Extend buildRunningPlan with a second drift detector for sidecar
readiness. When the sidecar returns 503 on an otherwise-Running node
(after pod replacement from chaos, drain, OOM, reschedule), spawn a
one-task MarkReady plan using the existing plan machinery. Probe lives
in the reconciler; planner reads a SidecarReady condition. Idiomatic
to the image-drift path already implemented in buildRunningPlan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham merged commit f9bc594 into main Apr 23, 2026
2 checks passed
bdchatham added a commit that referenced this pull request Apr 23, 2026
Addresses specialist review on #129:
1. Required: emit MarkReadyReapplied event when the mark-ready plan is
   spawned (design #128 §"New event"). Gated by a small helper that
   classifies single-task MarkReady plans.
2. Recommended: add a reconciler-level test asserting the probe is
   skipped when Phase=Initializing.
3. Integration test now also asserts the MarkReadyReapplied event is
   emitted during the pod-replacement round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdchatham added a commit that referenced this pull request Apr 23, 2026
* feat: sidecar mark-ready re-apply on pod replacement

Closes #127.
Implements design #128.

Reconciler probes the sidecar's Healthz on each Running-phase reconcile
and stamps a SidecarReady condition. The planner extends buildRunningPlan
with a second drift branch: when SidecarReady is False/NotReady, it
returns a one-task MarkReady plan. Image drift still wins conflict because
its plan already ends with MarkReady.

- api/v1alpha1: new ConditionSidecarReady type
- internal/task: Healthz added to SidecarClient interface
- internal/controller/node: probeSidecarHealth method + sidecar_health_probes
  counter + transition events (SidecarReadinessLost/Restored)
- internal/planner: buildMarkReadyPlan, sidecarNeedsReapproval,
  buildRunningPlan now returns a MarkReady plan when drift is observed
- cmd: wire BuildSidecarClient field on SeiNodeReconciler

Tests cover all probe outcomes, planner branches (including image-wins),
and a full pod-replacement round-trip integration test. All
pre-existing tests pass unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: emit MarkReadyReapplied event + add Initializing-skipped test

Addresses specialist review on #129:
1. Required: emit MarkReadyReapplied event when the mark-ready plan is
   spawned (design #128 §"New event"). Gated by a small helper that
   classifies single-task MarkReady plans.
2. Recommended: add a reconciler-level test asserting the probe is
   skipped when Phase=Initializing.
3. Integration test now also asserts the MarkReadyReapplied event is
   emitted during the pod-replacement round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: move sidecar probe into planner; drop MarkReadyReapplied event

Addresses review feedback on #129:

1. Probe lives in the planner now, invoked from ResolvePlan. Callers
   pass a task.SidecarClient; nil skips the probe (tests, Initializing).
   Drift detection + mitigation are now co-located in the planner — the
   reconciler just wires the client through and emits condition-
   transition events.

2. MarkReadyReapplied event dropped. SidecarReadinessLost already signals
   the same condition transition, and the plan itself is visible in
   status.plan.

Also moved `sei.controller.seinode.sidecar_health_probes` counter from
the node-controller metrics package to the planner package so it sits
alongside the probe that emits it.

Probe unit tests moved to internal/planner/sidecar_probe_test.go.
Reconciler tests now exercise the phase-gating end-to-end through
Reconcile rather than calling a (now-removed) reconciler probe method.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: NodeResolver holds sidecar factory; diff-at-end replaces statusDirty

Iteration 3 on #129. Addresses review feedback on where sidecar
concerns should live and simplifies the reconciler's status-patching.

Changes:

1. planner.NodeResolver struct holds BuildSidecarClient at construction.
   ResolvePlan is now a method on it. Reconciler becomes sidecar-agnostic
   — no method-signature pollution with task.SidecarClient, no client
   construction in the reconcile loop.

2. Reconciler replaces per-branch `statusDirty = true` flags with a
   single diff-at-end:
     if !equality.Semantic.DeepEqual(before.Status, node.Status) {
         r.Status().Patch(ctx, node, statusBase)
     }
   MergeFromWithOptimisticLock preserved to avoid lost-update races with
   executor-side patches. Any future drift check that mutates status is
   automatically covered — no flag-tracking contract.

3. Adds seinode_sidecar_ready observable gauge (0|1 per SeiNode) so
   PrometheusRule alerts can fire without kube-state-metrics needing
   CustomResourceStateMetrics configured for our CRD.

Events (SidecarReadinessLost, SidecarReadinessRestored) stay in the
reconciler and fire based on a prev/current condition snapshot around
the ResolvePlan call. MarkReadyReapplied event remains dropped
(condition + plan in status.plan already carry the same info).

Both the kubernetes-specialist and platform-engineer validated this
shape before implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: trim breadcrumb comments in production code paths

Per review feedback: prefer self-explanatory code in production paths;
keep comments only for non-obvious WHYs (invariants, subtle ordering).
Tests left untouched.

- planner.go: drop narrative docstrings on NodeResolver, ResolvePlan,
  buildRunningPlan, sidecarNeedsReapproval, buildMarkReadyPlan. The
  "skip probe during Initializing" and "image drift wins" comments
  stay because they encode non-obvious invariants.
- sidecar_probe.go: drop the "Three-way signal" breadcrumb that
  described what the switch below already shows.
- metrics.go: drop field docstrings on sidecarHealthProbes and the
  sidecarReadyTracker.
- controller.go: drop narrative comments around reconcilePeers,
  ResolvePlan, executor branches, and diff-at-end. Keep the
  "requeue immediately so the plan is persisted before execution"
  note — that's an ordering invariant that isn't obvious from the code.
- seinode_types.go: trim ConditionSidecarReady docstring to one line,
  matching the style of the neighbor constant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address lint errors (ginkgolinter, gofmt, modernize)

- node_update_test.go: Expect(plan.Tasks).To(HaveLen(4)) instead of
  Expect(len(plan.Tasks)).To(Equal(4)) — ginkgolinter prefers HaveLen.
- sidecar_probe_test.go: gofmt import ordering (google/uuid before
  onsi/gomega).
- sidecar_probe_integration_test.go: `for range 5` instead of the
  classic C-style loop — modernize linter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(metrics): struct key instead of "namespace/name" string concat

The map key for the sidecar-ready tracker was a concatenated
"namespace/name" string that we split back apart in the Observe
callback. Struct key does the same thing without the dance. Namespace
and name remain attributes on the emitted metric.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address lint unparam + user review feedback

- peers.go: drop unused (bool, error) return from reconcilePeers — the
  dirty signal is no longer needed since diff-at-end handles status
  detection. Kept the conditional write (only reassign if changed) so
  a nil → [] replacement doesn't create spurious diffs.
- sidecar_probe_test.go: rename findCond → findSidecarReady and drop
  the unparam type parameter.
- peers.go: remove breadcrumb docstring per review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sei-sidecar mark-ready is not re-applied on pod replacement — stuck validators require manual intervention

1 participant