Validate TWD spec via CRD CEL rules at apply time (fixes #62)#293
Validate TWD spec via CRD CEL rules at apply time (fixes #62)#293
Conversation
Move TemporalWorkerDeployment spec validation from reconcile time to apply time by embedding CEL validation rules in the CRD schema. Previously validation only fired when the optional TWD webhook was enabled; now the API server enforces it universally on every create/update. Rules added to worker_types.go (regenerated in the CRD manifest): - metadata.name length <= 63 (root object) - Progressive strategy requires non-empty steps (RolloutStrategy) - rampPercentage strictly increasing across steps (RolloutStrategy) - gate: input and inputFrom are mutually exclusive (RolloutStrategy) - gate.inputFrom: exactly one of configMapKeyRef/secretKeyRef (RolloutStrategy) - pauseDuration >= 30s per step (RolloutStep) The ValidateCreate() call and 5-minute requeue in the reconciler are removed — the CRD schema now guarantees the spec is valid before the reconciler sees it. The webhook Go code is kept for defence-in-depth when webhook.enabled=true. Closes #62 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two issues with the initial CEL rules surfaced in k8s 1.27 envtest: 1. gate.input is x-kubernetes-preserve-unknown-fields: true — CEL cannot reference it structurally, so has(self.gate.input) failed to compile. Removed the gate input/inputFrom mutual-exclusivity rule; the webhook Go code still enforces it when enabled. 2. Any list traversal on self.steps at the RolloutStrategy level (map, isSorted) hits the schema-level CEL cost budget in k8s 1.27 because the cost estimator does not propagate maxItems from the steps array schema up to the parent object's rule. Removed the rampPercentage ordering rule from the CRD; the webhook Go code still enforces it when enabled. Remaining CRD rules (all O(1)): name length, Progressive requires steps, pauseDuration >= 30s per step, gate.inputFrom exclusivity. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| // +kubebuilder:validation:XValidation:rule="!has(self.gate) || !(has(self.gate.input) && has(self.gate.inputFrom))",message="only one of input or inputFrom may be set" | ||
| // +kubebuilder:validation:XValidation:rule="!has(self.gate) || !has(self.gate.inputFrom) || (has(self.gate.inputFrom.configMapKeyRef) != has(self.gate.inputFrom.secretKeyRef))",message="exactly one of configMapKeyRef or secretKeyRef must be set" |
There was a problem hiding this comment.
would it make more sense to put these validation expressions on GateWorkflowConfig itself instead of RolloutStrategy?
There was a problem hiding this comment.
I actually ended up removing these and validating them elsewhere because:
Two issues with the initial CEL rules surfaced in k8s 1.27 envtest:
gate.input is x-kubernetes-preserve-unknown-fields: true — CEL cannot reference it structurally, so has(self.gate.input) failed to compile. Removed the gate input/inputFrom mutual-exclusivity rule; the webhook Go code still enforces it when enabled.
Any list traversal on self.steps at the RolloutStrategy level (map, isSorted) hits the schema-level CEL cost budget in k8s 1.27 because the cost estimator does not propagate maxItems from the steps array schema up to the parent object's rule. Removed the rampPercentage ordering rule from the CRD; the webhook Go code still enforces it when enabled.
Remaining CRD rules (all O(1)): name length, Progressive requires steps, pauseDuration >= 30s per step, gate.inputFrom exclusivity.
The CRD CEL rules cannot express rampPercentage ordering or gate input/inputFrom exclusivity (list traversal hits the schema cost budget; gate.input is x-kubernetes-preserve-unknown-fields). These checks live only in the webhook Go code, which is optional. Without a fallback, users who deploy without the webhook and submit an invalid spec (e.g. decreasing rampPercentage) would silently get no rollout progress and no signal — a dangerous failure mode since a decreasing ramp could cause auto-upgrade workflows to roll back. Re-add ValidateCreate() in the reconciler as a fallback: - On failure, emit a Warning event and set ConditionProgressing=False / ConditionReady=False with reason InvalidSpec. - Return early with a 5-minute requeue so the invalid spec is never acted on (no plan generated, no Temporal API calls). - Add ReasonInvalidSpec constant to worker_types.go. - Add TestReconcile_InvalidSpec_EmitsEventAndSetsCondition to cover the new path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Returning ctrl.Result{Requeue: true, RequeueAfter: 5m} is unnecessary:
the controller watches TWD objects, so any spec correction the user
applies triggers an immediate reconciliation automatically. Remove the
timer and return ctrl.Result{}, nil instead.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The CRD schema enforces name length, Progressive-requires-steps, pauseDuration >= 30s, and gate.inputFrom exclusivity at admission time. Keeping them in the webhook Go code was dead redundancy. validateRolloutStrategy now only checks the two constraints the CRD cannot enforce (gate.input is x-kubernetes-preserve-unknown-fields, rampPercentage ordering requires list traversal that exceeds the k8s 1.27 CEL cost budget): - rampPercentage strictly increasing across steps - gate.input and gate.inputFrom mutually exclusive Removed the corresponding test cases from the webhook unit tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests hit a real kube-apiserver via envtest (the existing webhook suite infrastructure) so they verify that the x-kubernetes-validations blocks in the generated CRD are syntactically correct and semantically enforced by the API server — independently of the webhook Go code. Covers all four CRD-level rules: - name length > 63 → rejected - Progressive strategy with no steps → rejected - Progressive step with pauseDuration < 30s → rejected - gate.inputFrom with both configMapKeyRef and secretKeyRef → rejected - valid TWD → accepted Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| "sigs.k8s.io/controller-runtime/pkg/webhook/admission" | ||
| ) | ||
|
|
||
| const ( |
There was a problem hiding this comment.
not needed anymore since this is handled through CRD CEL rules
| "valid temporal worker deployment": { | ||
| obj: testhelpers.MakeTWDWithName("valid-worker", ""), | ||
| }, | ||
| "temporal worker deployment with name too long": { |
There was a problem hiding this comment.
covered by api/v1alpha1/temporalworkerdeployment_cel_validation_test.go (same with other removed webhook tests)
…traint Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Fixes #62 — validates
TemporalWorkerDeploymentspec at apply time instead of reconcile time.Approach
Embed CEL validation rules directly in the CRD schema (
x-kubernetes-validations) so the API server enforces them universally at admission time — no webhook infrastructure required. The optional TWD webhook remains to provide apply-time feedback to those who enable it.CRD CEL rules added (
worker_types.go→ regenerated CRD manifest)metadata.name≤ 63 charsTemporalWorkerDeploymentrootRolloutStrategyrampPercentagesteps capped at 20stepsfield (maxItems)pauseDuration≥ 30s per stepRolloutStepgate.inputFrom: exactly one ofconfigMapKeyRef/secretKeyRefRolloutStrategyTwo constraints from the original webhook could not be expressed as CRD CEL rules in k8s 1.27 (cost budget limit for list traversal;
gate.inputisx-kubernetes-preserve-unknown-fieldsand invisible to CEL):rampPercentagestrictly increasing across stepsgate.inputandgate.inputFrommutually exclusiveFallback for webhook-only constraints
The reconciler calls
ValidateCreate()as a fallback for the two constraints the CRD cannot enforce. On failure it emits aWarningevent and setsConditionProgressing=False/ConditionReady=Falsewith reasonInvalidSpec, then returns without requeueing (the watch re-triggers when the user corrects the spec). AddsReasonInvalidSpecconstant.Webhook cleanup
validateRolloutStrategyin the webhook now only checks the two constraints not in the CRD. All checks that duplicated CRD rules were removed.Testing
TestReconcile_InvalidSpec_EmitsEventAndSetsConditionverifies the event/condition pathtemporalworkerdeployment_cel_validation_test.go): hit a real kube-apiserver to verify each CRD CEL rule is syntactically valid and semantically correct, independently of the webhookTest plan
KUBEBUILDER_ASSETS=... go test ./...passes locallystrategy: Progressiveand no steps — API server rejects immediatelystrategy: Progressiveand > 20 steps — API server rejects immediatelypauseDuration < 30s— API server rejects immediatelyrampPercentage(webhook disabled) — reconciler setsInvalidSpeccondition and emits Warning event, does not act on the spec🤖 Generated with Claude Code