OCPBUGS-84321: ClusterAPIMachineManagement: capi-controllers: excessive restarts of during cluster installation#614
Conversation
…during cluster installation Move capi-controllers Deployment from CVO to operator The capi-controllers Deployment was deployed by CVO as a static manifest, starting simultaneously with the installer. Since the installer needs time to install provider CRDs, capi-controllers would crash repeatedly until they became available. Move the Deployment to the operator's manifests-gen pipeline at installOrder 30, where probe-gated sequencing ensures all CRDs are Established before the Deployment is created. Add a tombstone so CVO deletes its old copy on upgrade. Assisted-by: Claude (Opus 4.6)
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-84321, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
WalkthroughThis PR updates Changescapi-controllers manifest migration
Estimated code review effort: 2 (Simple) | ~15 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
capi-operator-manifests/default/manifests.yaml (1)
1032-1032: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick winConsider setting
automountServiceAccountToken: falseif the API token isn't needed by this pod, or confirm it is required.
serviceAccountName: capi-controllersis set butautomountServiceAccountTokenisn't specified, defaulting to token mounting.As per path instructions, "
**/*.{yaml,yml}: ... automountServiceAccountToken: false unless needed".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@capi-operator-manifests/default/manifests.yaml` at line 1032, The pod spec using serviceAccountName capi-controllers is implicitly mounting a service account token, which should be disabled unless this workload needs API access. Update the manifest where the pod spec is defined to set automountServiceAccountToken to false if the token is not required, or otherwise confirm and keep the default behavior; use the serviceAccountName setting to locate the relevant pod template.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@capi-operator-manifests/default/manifests.yaml`:
- Around line 990-993: The container resource configuration only sets
resources.requests and is missing matching resources.limits for cpu and memory.
Update the container specs in manifests.yaml, including both affected resource
blocks, to add explicit limits alongside the existing requests so each container
has bounded resource usage.
- Around line 970-1027: Add livenessProbe and readinessProbe to both containers
in the manifest: the capi-controllers and machine-api-migration container specs
already expose dedicated healthz ports (healthz-o on 9440 and healthz-m on
9441), so wire the probes to those endpoints using the existing container names
and ports. Ensure each probe is defined alongside the current ports/args for the
corresponding container and points to the appropriate health check path on the
matching healthz port.
- Around line 970-1027: Both container specs in the manifest lack a
securityContext, so they still run with root-capable defaults; add a
securityContext to the capi-controllers and machine-api-migration container
definitions. Set runAsNonRoot and readOnlyRootFilesystem, disable
allowPrivilegeEscalation, and drop ALL capabilities (only add back anything
explicitly required). Use the existing container blocks with names
capi-controllers and machine-api-migration to place the same hardened settings
on both.
- Around line 948-1049: The capi-controllers Deployment manifest is missing
required hardening settings, so update the source manifest for the
capi-controllers Deployment to include a pod/container securityContext, add
resource limits alongside the existing requests, and define liveness and
readiness probes for both containers where appropriate. Make sure these
additions are placed in the same Deployment spec that defines the
capi-controllers and machine-api-migration containers so the rendered bundle
stays in sync.
---
Nitpick comments:
In `@capi-operator-manifests/default/manifests.yaml`:
- Line 1032: The pod spec using serviceAccountName capi-controllers is
implicitly mounting a service account token, which should be disabled unless
this workload needs API access. Update the manifest where the pod spec is
defined to set automountServiceAccountToken to false if the token is not
required, or otherwise confirm and keep the default behavior; use the
serviceAccountName setting to locate the relevant pod template.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: c6e49887-bf0f-40ca-be91-6bde2079827a
📒 Files selected for processing (6)
Makefileadmission-policies/default/capi-controllers-deployment.yamladmission-policies/default/kustomization.yamlcapi-operator-manifests/default/manifests.yamlcapi-operator-manifests/default/metadata.yamlmanifests/0000_30_cluster-api_00_tombstones-5.0.yaml
💤 Files with no reviewable changes (1)
- admission-policies/default/capi-controllers-deployment.yaml
| containers: | ||
| - args: | ||
| - --diagnostics-address=:8443 | ||
| command: | ||
| - /capi-controllers | ||
| env: | ||
| - name: RELEASE_VERSION | ||
| value: 0.0.1-snapshot | ||
| image: registry.ci.openshift.org/openshift:cluster-capi-operator | ||
| name: capi-controllers | ||
| ports: | ||
| - containerPort: 9443 | ||
| name: webhook-server | ||
| protocol: TCP | ||
| - containerPort: 8443 | ||
| name: diagnostics-o | ||
| protocol: TCP | ||
| - containerPort: 9440 | ||
| name: healthz-o | ||
| protocol: TCP | ||
| resources: | ||
| requests: | ||
| cpu: 10m | ||
| memory: 50Mi | ||
| terminationMessagePolicy: FallbackToLogsOnError | ||
| volumeMounts: | ||
| - mountPath: /tmp/k8s-webhook-server/serving-certs | ||
| name: cert | ||
| readOnly: true | ||
| - mountPath: /tmp/k8s-metrics-server/serving-certs | ||
| name: metrics-cert | ||
| readOnly: true | ||
| - args: | ||
| - --diagnostics-address=:8442 | ||
| - --health-addr=:9441 | ||
| command: | ||
| - /machine-api-migration | ||
| env: | ||
| - name: RELEASE_VERSION | ||
| value: 0.0.1-snapshot | ||
| image: registry.ci.openshift.org/openshift:cluster-capi-operator | ||
| name: machine-api-migration | ||
| ports: | ||
| - containerPort: 8442 | ||
| name: diagnostics-m | ||
| protocol: TCP | ||
| - containerPort: 9441 | ||
| name: healthz-m | ||
| protocol: TCP | ||
| resources: | ||
| requests: | ||
| cpu: 10m | ||
| memory: 50Mi | ||
| terminationMessagePolicy: FallbackToLogsOnError | ||
| volumeMounts: | ||
| - mountPath: /tmp/k8s-metrics-server/serving-certs | ||
| name: metrics-cert | ||
| readOnly: true |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Add liveness and readiness probes.
Neither container defines livenessProbe/readinessProbe, despite both exposing dedicated healthz-* ports (9440, 9441) presumably intended for this purpose.
As per path instructions, "**/*.{yaml,yml}: ... Liveness + readiness probes defined".
🧰 Tools
🪛 Trivy (0.69.3)
[error] 978-1003: Root file system is not read-only
Container 'machine-api-migration' of Deployment 'capi-controllers' should set 'securityContext.readOnlyRootFilesystem' to true
Rule: KSV-0014
(IaC/Kubernetes)
[error] 978-1003: Default security context configured
container capi-controllers in openshift-cluster-api namespace is using the default security context
Rule: KSV-0118
(IaC/Kubernetes)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@capi-operator-manifests/default/manifests.yaml` around lines 970 - 1027, Add
livenessProbe and readinessProbe to both containers in the manifest: the
capi-controllers and machine-api-migration container specs already expose
dedicated healthz ports (healthz-o on 9440 and healthz-m on 9441), so wire the
probes to those endpoints using the existing container names and ports. Ensure
each probe is defined alongside the current ports/args for the corresponding
container and points to the appropriate health check path on the matching
healthz port.
Source: Path instructions
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Add securityContext (runAsNonRoot, readOnlyRootFilesystem, allowPrivilegeEscalation, dropped capabilities) to both containers.
Neither capi-controllers nor machine-api-migration containers define a securityContext, so they run with the platform default (root-capable, writable root filesystem, privilege escalation allowed). Static analysis flags this (Checkov CKV_K8S_20/CKV_K8S_23, Trivy KSV-0014/KSV-0118).
🔒 Proposed securityContext additions
containers:
- args:
- --diagnostics-address=:8443
command:
- /capi-controllers
+ securityContext:
+ allowPrivilegeEscalation: false
+ readOnlyRootFilesystem: true
+ runAsNonRoot: true
+ capabilities:
+ drop:
+ - ALL
env:Apply the equivalent block to the machine-api-migration container as well.
As per path instructions, "**/*.{yaml,yml}: If this is a Kubernetes/OpenShift manifest or Helm template: securityContext: runAsNonRoot, readOnlyRootFilesystem, allowPrivilegeEscalation: false; Drop ALL capabilities, add only what is required".
🧰 Tools
🪛 Trivy (0.69.3)
[error] 978-1003: Root file system is not read-only
Container 'machine-api-migration' of Deployment 'capi-controllers' should set 'securityContext.readOnlyRootFilesystem' to true
Rule: KSV-0014
(IaC/Kubernetes)
[error] 978-1003: Default security context configured
container capi-controllers in openshift-cluster-api namespace is using the default security context
Rule: KSV-0118
(IaC/Kubernetes)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@capi-operator-manifests/default/manifests.yaml` around lines 970 - 1027, Both
container specs in the manifest lack a securityContext, so they still run with
root-capable defaults; add a securityContext to the capi-controllers and
machine-api-migration container definitions. Set runAsNonRoot and
readOnlyRootFilesystem, disable allowPrivilegeEscalation, and drop ALL
capabilities (only add back anything explicitly required). Use the existing
container blocks with names capi-controllers and machine-api-migration to place
the same hardened settings on both.
Source: Path instructions
There was a problem hiding this comment.
Probes: The healthz ports (9440, 9441) are declared in the Deployment manifest but neither binary actually configures a health endpoint — there's no HealthProbeBindAddress in cmd/capi-controllers/main.go or cmd/machine-api-migration/main.go. Adding probes against those ports would cause the pods to fail health checks and crash-loop. The original Deployment on main has no probes for the same reason.
securityContext: Same as the other comment — the original Deployment on main has no securityContext. This PR moves the Deployment between deployment mechanisms without changing its spec. Hardening would be a separate effort.
There was a problem hiding this comment.
@sg00dwin, thanks for the clarification — that makes sense on both counts. Since the binaries don't wire up HealthProbeBindAddress for those healthz-* ports, adding probes now would indeed cause crash-looping, and since this PR is purely a deployment-mechanism migration (CVO static manifest → operator manifests-gen) without any spec changes, carrying over the existing (lack of) securityContext is consistent with keeping the diff scoped to ordering only. I'll withdraw both suggestions.
✏️ Learnings added
Learnt from: sg00dwin
Repo: openshift/cluster-capi-operator PR: 614
File: capi-operator-manifests/default/manifests.yaml:970-1027
Timestamp: 2026-07-02T21:04:22.725Z
Learning: In the openshift/cluster-capi-operator repository, the `capi-controllers` Deployment (in capi-operator-manifests/default/manifests.yaml) exposes `healthz-o` (port 9440) and `healthz-m` (port 9441) container ports, but neither `cmd/capi-controllers/main.go` nor `cmd/machine-api-migration/main.go` configures a `HealthProbeBindAddress`, so these ports do not serve actual health endpoints. Do not suggest adding livenessProbe/readinessProbe against these ports unless the binaries are updated to implement them. Additionally, this Deployment intentionally has no securityContext defined, matching the pre-existing static manifest deployed by CVO on main; PRs that migrate the Deployment between deployment mechanisms (e.g., from CVO static manifests to the operator's manifests-gen pipeline) without changing the Deployment spec should not be flagged for missing securityContext hardening, as that would be a separate effort.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
|
/jira refresh |
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-84321, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test unit |
LoadAdmissionPolicyProfiles() deserialized every object from manifests.yaml without type filtering. Adding the capi-controllers Deployment to that file broke 68 admission policy tests. Restrict the loader to ValidatingAdmissionPolicy and ValidatingAdmissionPolicyBinding types only. Assisted-by: Claude (Opus 4.6)
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-84321, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@sg00dwin: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
Problem
During cluster installation, CVO deploys the capi-controllers Deployment simultaneously with the capi-operator installer. The installer needs time to install provider CRDs (e.g.
AWSCluster,Cluster), but capi-controllers starts immediately and crashes because those CRDs don't exist yet. Both containers (capi-controllersandmachine-api-migration) exit withos.Exit(1)and restart ~10 times before the CRDs are eventually available.Solution
The operator's Boxcutter installer uses probe-gated sequencing — components at a higher installOrder only deploy after earlier components are confirmed ready. By moving the Deployment to installOrder 30 (after core CRDs at 10 and providers at 20), the controllers only start once all required CRDs exist.
No changes to controller logic. The fix is purely deployment ordering.
Changes
admission-policies/default/capi-controllers-deployment.yamladmission-policies/default/kustomization.yamlMakefile--self-image-reffor runtime image substitutioncapi-operator-manifests/default/manifests.yamlcapi-operator-manifests/default/metadata.yamlmanifests/0000_30_cluster-api_17_deployment.yamlmanifests/0000_30_cluster-api_00_tombstones-5.0.yamlUpgrade behavior
On upgrade from a release where CVO owns the Deployment:
Brief pod downtime during ownership transfer is expected and acceptable.
Assisted-by: Claude (Opus 4.6)
Summary by CodeRabbit
New Features
capi-controllerscontroller Deployment to the default install manifests.Bug Fixes
capi-controllersdeployment admission policy and aligning controller metadata/behavior for install and uninstall consistency.