Skip to content

Validate TWD spec via CRD CEL rules at apply time (fixes #62)#293

Open
carlydf wants to merge 8 commits intomainfrom
cdf/crd-cel-validation-issue-62
Open

Validate TWD spec via CRD CEL rules at apply time (fixes #62)#293
carlydf wants to merge 8 commits intomainfrom
cdf/crd-cel-validation-issue-62

Conversation

@carlydf
Copy link
Copy Markdown
Collaborator

@carlydf carlydf commented Apr 23, 2026

Summary

Fixes #62 — validates TemporalWorkerDeployment spec at apply time instead of reconcile time.

Approach

Embed CEL validation rules directly in the CRD schema (x-kubernetes-validations) so the API server enforces them universally at admission time — no webhook infrastructure required. The optional TWD webhook remains to provide apply-time feedback to those who enable it.

CRD CEL rules added (worker_types.go → regenerated CRD manifest)

Rule Placed on Enforced by
metadata.name ≤ 63 chars TemporalWorkerDeployment root CRD
Progressive strategy requires steps RolloutStrategy CRD
rampPercentage steps capped at 20 steps field (maxItems) CRD
pauseDuration ≥ 30s per step RolloutStep CRD
gate.inputFrom: exactly one of configMapKeyRef/secretKeyRef RolloutStrategy CRD

Two constraints from the original webhook could not be expressed as CRD CEL rules in k8s 1.27 (cost budget limit for list traversal; gate.input is x-kubernetes-preserve-unknown-fields and invisible to CEL):

  • rampPercentage strictly increasing across steps
  • gate.input and gate.inputFrom mutually exclusive

Fallback for webhook-only constraints

The reconciler calls ValidateCreate() as a fallback for the two constraints the CRD cannot enforce. On failure it emits a Warning event and sets ConditionProgressing=False / ConditionReady=False with reason InvalidSpec, then returns without requeueing (the watch re-triggers when the user corrects the spec). Adds ReasonInvalidSpec constant.

Webhook cleanup

validateRolloutStrategy in the webhook now only checks the two constraints not in the CRD. All checks that duplicated CRD rules were removed.

Testing

  • Unit tests: updated to match narrowed webhook scope (removed cases now enforced by CRD)
  • Reconciler unit test: TestReconcile_InvalidSpec_EmitsEventAndSetsCondition verifies the event/condition path
  • envtest integration tests (temporalworkerdeployment_cel_validation_test.go): hit a real kube-apiserver to verify each CRD CEL rule is syntactically valid and semantically correct, independently of the webhook

Test plan

  • KUBEBUILDER_ASSETS=... go test ./... passes locally
  • Apply a TWD with name > 63 chars — API server rejects immediately
  • Apply a TWD with strategy: Progressive and no steps — API server rejects immediately
  • Apply a TWD with strategy: Progressive and > 20 steps — API server rejects immediately
  • Apply a TWD with a step pauseDuration < 30s — API server rejects immediately
  • Apply a TWD with decreasing rampPercentage (webhook disabled) — reconciler sets InvalidSpec condition and emits Warning event, does not act on the spec

🤖 Generated with Claude Code

Move TemporalWorkerDeployment spec validation from reconcile time to
apply time by embedding CEL validation rules in the CRD schema.
Previously validation only fired when the optional TWD webhook was
enabled; now the API server enforces it universally on every create/update.

Rules added to worker_types.go (regenerated in the CRD manifest):
- metadata.name length <= 63 (root object)
- Progressive strategy requires non-empty steps (RolloutStrategy)
- rampPercentage strictly increasing across steps (RolloutStrategy)
- gate: input and inputFrom are mutually exclusive (RolloutStrategy)
- gate.inputFrom: exactly one of configMapKeyRef/secretKeyRef (RolloutStrategy)
- pauseDuration >= 30s per step (RolloutStep)

The ValidateCreate() call and 5-minute requeue in the reconciler are
removed — the CRD schema now guarantees the spec is valid before the
reconciler sees it. The webhook Go code is kept for defence-in-depth
when webhook.enabled=true.

Closes #62

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@carlydf carlydf requested review from a team and jlegrone as code owners April 23, 2026 17:28
Two issues with the initial CEL rules surfaced in k8s 1.27 envtest:

1. gate.input is x-kubernetes-preserve-unknown-fields: true — CEL
   cannot reference it structurally, so has(self.gate.input) failed
   to compile. Removed the gate input/inputFrom mutual-exclusivity
   rule; the webhook Go code still enforces it when enabled.

2. Any list traversal on self.steps at the RolloutStrategy level
   (map, isSorted) hits the schema-level CEL cost budget in k8s 1.27
   because the cost estimator does not propagate maxItems from the
   steps array schema up to the parent object's rule. Removed the
   rampPercentage ordering rule from the CRD; the webhook Go code
   still enforces it when enabled.

Remaining CRD rules (all O(1)): name length, Progressive requires
steps, pauseDuration >= 30s per step, gate.inputFrom exclusivity.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread api/v1alpha1/worker_types.go Outdated
Comment on lines +384 to +385
// +kubebuilder:validation:XValidation:rule="!has(self.gate) || !(has(self.gate.input) && has(self.gate.inputFrom))",message="only one of input or inputFrom may be set"
// +kubebuilder:validation:XValidation:rule="!has(self.gate) || !has(self.gate.inputFrom) || (has(self.gate.inputFrom.configMapKeyRef) != has(self.gate.inputFrom.secretKeyRef))",message="exactly one of configMapKeyRef or secretKeyRef must be set"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make more sense to put these validation expressions on GateWorkflowConfig itself instead of RolloutStrategy?

Copy link
Copy Markdown
Collaborator Author

@carlydf carlydf Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually ended up removing these and validating them elsewhere because:

Two issues with the initial CEL rules surfaced in k8s 1.27 envtest:

  1. gate.input is x-kubernetes-preserve-unknown-fields: true — CEL cannot reference it structurally, so has(self.gate.input) failed to compile. Removed the gate input/inputFrom mutual-exclusivity rule; the webhook Go code still enforces it when enabled.

  2. Any list traversal on self.steps at the RolloutStrategy level (map, isSorted) hits the schema-level CEL cost budget in k8s 1.27 because the cost estimator does not propagate maxItems from the steps array schema up to the parent object's rule. Removed the rampPercentage ordering rule from the CRD; the webhook Go code still enforces it when enabled.

Remaining CRD rules (all O(1)): name length, Progressive requires steps, pauseDuration >= 30s per step, gate.inputFrom exclusivity.

carlydf and others added 2 commits April 23, 2026 10:52
The CRD CEL rules cannot express rampPercentage ordering or gate
input/inputFrom exclusivity (list traversal hits the schema cost
budget; gate.input is x-kubernetes-preserve-unknown-fields). These
checks live only in the webhook Go code, which is optional.

Without a fallback, users who deploy without the webhook and submit
an invalid spec (e.g. decreasing rampPercentage) would silently get
no rollout progress and no signal — a dangerous failure mode since
a decreasing ramp could cause auto-upgrade workflows to roll back.

Re-add ValidateCreate() in the reconciler as a fallback:
- On failure, emit a Warning event and set ConditionProgressing=False /
  ConditionReady=False with reason InvalidSpec.
- Return early with a 5-minute requeue so the invalid spec is never
  acted on (no plan generated, no Temporal API calls).
- Add ReasonInvalidSpec constant to worker_types.go.
- Add TestReconcile_InvalidSpec_EmitsEventAndSetsCondition to cover
  the new path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Returning ctrl.Result{Requeue: true, RequeueAfter: 5m} is unnecessary:
the controller watches TWD objects, so any spec correction the user
applies triggers an immediate reconciliation automatically. Remove the
timer and return ctrl.Result{}, nil instead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@carlydf carlydf marked this pull request as draft April 23, 2026 18:02
carlydf and others added 2 commits April 23, 2026 11:10
The CRD schema enforces name length, Progressive-requires-steps,
pauseDuration >= 30s, and gate.inputFrom exclusivity at admission
time. Keeping them in the webhook Go code was dead redundancy.

validateRolloutStrategy now only checks the two constraints the CRD
cannot enforce (gate.input is x-kubernetes-preserve-unknown-fields,
rampPercentage ordering requires list traversal that exceeds the k8s
1.27 CEL cost budget):
  - rampPercentage strictly increasing across steps
  - gate.input and gate.inputFrom mutually exclusive

Removed the corresponding test cases from the webhook unit tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests hit a real kube-apiserver via envtest (the existing webhook suite
infrastructure) so they verify that the x-kubernetes-validations blocks
in the generated CRD are syntactically correct and semantically enforced
by the API server — independently of the webhook Go code.

Covers all four CRD-level rules:
  - name length > 63 → rejected
  - Progressive strategy with no steps → rejected
  - Progressive step with pauseDuration < 30s → rejected
  - gate.inputFrom with both configMapKeyRef and secretKeyRef → rejected
  - valid TWD → accepted

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

const (
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed anymore since this is handled through CRD CEL rules

"valid temporal worker deployment": {
obj: testhelpers.MakeTWDWithName("valid-worker", ""),
},
"temporal worker deployment with name too long": {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

covered by api/v1alpha1/temporalworkerdeployment_cel_validation_test.go (same with other removed webhook tests)

carlydf and others added 2 commits April 23, 2026 11:27
…traint

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@carlydf carlydf marked this pull request as ready for review April 23, 2026 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Validate TemporalWorkerDeployment spec at apply time instead of reconcile time

2 participants