Skip to content

OCPBUGS-84321: ClusterAPIMachineManagement: capi-controllers: excessive restarts of during cluster installation#614

Open
sg00dwin wants to merge 2 commits into
openshift:mainfrom
sg00dwin:OCPBUGS-84321-capi-controllers-restart-fix
Open

OCPBUGS-84321: ClusterAPIMachineManagement: capi-controllers: excessive restarts of during cluster installation#614
sg00dwin wants to merge 2 commits into
openshift:mainfrom
sg00dwin:OCPBUGS-84321-capi-controllers-restart-fix

Conversation

@sg00dwin

@sg00dwin sg00dwin commented Jul 2, 2026

Copy link
Copy Markdown
Member

Summary

  • Move the capi-controllers Deployment from CVO static manifests to the operator's manifests-gen pipeline at installOrder 30
  • This ensures all provider CRDs are installed and confirmed Established before capi-controllers starts, eliminating ~10 crash-restart cycles during cluster installation
  • Add a CVO tombstone to delete the old Deployment on upgrade

Problem

During cluster installation, CVO deploys the capi-controllers Deployment simultaneously with the capi-operator installer. The installer needs time to install provider CRDs (e.g. AWSCluster, Cluster), but capi-controllers starts immediately and crashes because those CRDs don't exist yet. Both containers (capi-controllers and machine-api-migration) exit with os.Exit(1) and restart ~10 times before the CRDs are eventually available.

Solution

The operator's Boxcutter installer uses probe-gated sequencing — components at a higher installOrder only deploy after earlier components are confirmed ready. By moving the Deployment to installOrder 30 (after core CRDs at 10 and providers at 20), the controllers only start once all required CRDs exist.

No changes to controller logic. The fix is purely deployment ordering.

Changes

File Change
admission-policies/default/capi-controllers-deployment.yaml New — Deployment YAML as manifests-gen kustomize input
admission-policies/default/kustomization.yaml Added the new resource
Makefile Added --self-image-ref for runtime image substitution
capi-operator-manifests/default/manifests.yaml Regenerated — now includes the Deployment
capi-operator-manifests/default/metadata.yaml Regenerated — now includes selfImageRef
manifests/0000_30_cluster-api_17_deployment.yaml Deleted — CVO no longer manages this
manifests/0000_30_cluster-api_00_tombstones-5.0.yaml New — CVO tombstone for upgrade cleanup

Upgrade behavior

On upgrade from a release where CVO owns the Deployment:

  1. CVO processes the tombstone and deletes its old copy
  2. The operator creates the new operator-managed copy at installOrder 30

Brief pod downtime during ownership transfer is expected and acceptable.


Assisted-by: Claude (Opus 4.6)

Summary by CodeRabbit

  • New Features

    • Added the capi-controllers controller Deployment to the default install manifests.
    • Updated manifest generation so the controller Deployment uses the correct self image reference and is packaged with the release.
  • Bug Fixes

    • Improved the controller-related release/admission configuration by updating the capi-controllers deployment admission policy and aligning controller metadata/behavior for install and uninstall consistency.

…during cluster installation

Move capi-controllers Deployment from CVO to operator

The capi-controllers Deployment was deployed by CVO as a static manifest,
starting simultaneously with the installer. Since the installer needs time
to install provider CRDs, capi-controllers would crash repeatedly until
they became available.

Move the Deployment to the operator's manifests-gen pipeline at
installOrder 30, where probe-gated sequencing ensures all CRDs are
Established before the Deployment is created. Add a tombstone so CVO
deletes its old copy on upgrade.

Assisted-by: Claude (Opus 4.6)
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jul 2, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@sg00dwin: This pull request references Jira Issue OCPBUGS-84321, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • Move the capi-controllers Deployment from CVO static manifests to the operator's manifests-gen pipeline at installOrder 30
  • This ensures all provider CRDs are installed and confirmed Established before capi-controllers starts, eliminating ~10 crash-restart cycles during cluster installation
  • Add a CVO tombstone to delete the old Deployment on upgrade

Problem

During cluster installation, CVO deploys the capi-controllers Deployment simultaneously with the capi-operator installer. The installer needs time to install provider CRDs (e.g. AWSCluster, Cluster), but capi-controllers starts immediately and crashes because those CRDs don't exist yet. Both containers (capi-controllers and machine-api-migration) exit with os.Exit(1) and restart ~10 times before the CRDs are eventually available.

Solution

The operator's Boxcutter installer uses probe-gated sequencing — components at a higher installOrder only deploy after earlier components are confirmed ready. By moving the Deployment to installOrder 30 (after core CRDs at 10 and providers at 20), the controllers only start once all required CRDs exist.

No changes to controller logic. The fix is purely deployment ordering.

Changes

File Change
admission-policies/default/capi-controllers-deployment.yaml New — Deployment YAML as manifests-gen kustomize input
admission-policies/default/kustomization.yaml Added the new resource
Makefile Added --self-image-ref for runtime image substitution
capi-operator-manifests/default/manifests.yaml Regenerated — now includes the Deployment
capi-operator-manifests/default/metadata.yaml Regenerated — now includes selfImageRef
manifests/0000_30_cluster-api_17_deployment.yaml Deleted — CVO no longer manages this
manifests/0000_30_cluster-api_00_tombstones-5.0.yaml New — CVO tombstone for upgrade cleanup

Upgrade behavior

On upgrade from a release where CVO owns the Deployment:

  1. CVO processes the tombstone and deletes its old copy
  2. The operator creates the new operator-managed copy at installOrder 30

Brief pod downtime during ownership transfer is expected and acceptable.


Assisted-by: Claude (Opus 4.6)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5c2e12cd-0b52-461d-b772-cd246f638e51

📥 Commits

Reviewing files that changed from the base of the PR and between 2cd7b9f and 20cd738.

📒 Files selected for processing (1)
  • pkg/admissionpolicy/testutils/util.go

Walkthrough

This PR updates capi-controllers manifest generation to use a self image reference, adds the generated deployment and tombstone manifests, and narrows admission-policy profile loading to validating policy objects.

Changes

capi-controllers manifest migration

Layer / File(s) Summary
Self image wiring
Makefile, capi-operator-manifests/default/metadata.yaml
manifests-gen now receives --self-image-ref, and default metadata sets selfImageRef to the cluster-capi-operator image.
Generated deployment and tombstone
capi-operator-manifests/default/manifests.yaml, manifests/0000_30_cluster-api_00_tombstones-5.0.yaml
A generated capi-controllers Deployment is appended, and a tombstone manifest defines the migrated Deployment with release annotations.
Admission policy manifest support
admission-policies/default/capi-controllers-deployment.yaml, admission-policies/default/kustomization.yaml, pkg/admissionpolicy/testutils/util.go
The deployment manifest is added to the admission-policy kustomization, its annotations are removed, and profile loading now keeps only validating policy and binding objects.

Estimated code review effort: 2 (Simple) | ~15 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Topology-Aware Scheduling Compatibility ⚠️ Warning New capi-controllers manifests hardcode a control-plane nodeSelector and master/control-plane tolerations, with no ControlPlaneTopology gating; this can fail on HyperShift. Gate scheduling by ControlPlaneTopology or exclude external/hosted clusters; avoid unconditional control-plane selectors in the generated manifests.
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title matches the main fix: reducing capi-controllers restarts during cluster installation.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed No Ginkgo test titles were added or modified; the only changed file is a helper util with no It/Describe/Context/When declarations.
Test Structure And Quality ✅ Passed No Ginkgo test blocks were changed; only a test helper’s object filtering was updated, so the listed test-quality checks aren’t applicable here.
Microshift Test Compatibility ✅ Passed No new Ginkgo e2e tests were added; the PR diff is manifests plus a test utility helper, with no It/Describe/Context/When additions or MicroShift-sensitive test code.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo e2e tests were added; the only code change is a test utility parser, so there’s no SNO-specific test assumption to flag.
Ote Binary Stdout Contract ✅ Passed The only changed Go file just narrows decoded object types; it adds no main/init/TestMain/BeforeSuite/RunSpecs code or stdout writes.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e tests were added; the only code change is a test-helper filter update, and the manifest changes are non-test assets.
No-Weak-Crypto ✅ Passed HEAD only changes admission policy loader filtering; scans found no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB or secret-comparison code in the patch.
Container-Privileges ✅ Passed No privileged, hostPID/Network/IPC, allowPrivilegeEscalation, SYS_ADMIN, or root-run settings were added in the touched manifests.
No-Sensitive-Data-In-Logs ✅ Passed The diff only filters admission-policy objects; I գտnot see any new logging/print paths or sensitive-value emission in the touched files.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@openshift-ci

openshift-ci Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign racheljpg for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
capi-operator-manifests/default/manifests.yaml (1)

1032-1032: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Consider setting automountServiceAccountToken: false if the API token isn't needed by this pod, or confirm it is required.

serviceAccountName: capi-controllers is set but automountServiceAccountToken isn't specified, defaulting to token mounting.

As per path instructions, "**/*.{yaml,yml}: ... automountServiceAccountToken: false unless needed".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@capi-operator-manifests/default/manifests.yaml` at line 1032, The pod spec
using serviceAccountName capi-controllers is implicitly mounting a service
account token, which should be disabled unless this workload needs API access.
Update the manifest where the pod spec is defined to set
automountServiceAccountToken to false if the token is not required, or otherwise
confirm and keep the default behavior; use the serviceAccountName setting to
locate the relevant pod template.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@capi-operator-manifests/default/manifests.yaml`:
- Around line 990-993: The container resource configuration only sets
resources.requests and is missing matching resources.limits for cpu and memory.
Update the container specs in manifests.yaml, including both affected resource
blocks, to add explicit limits alongside the existing requests so each container
has bounded resource usage.
- Around line 970-1027: Add livenessProbe and readinessProbe to both containers
in the manifest: the capi-controllers and machine-api-migration container specs
already expose dedicated healthz ports (healthz-o on 9440 and healthz-m on
9441), so wire the probes to those endpoints using the existing container names
and ports. Ensure each probe is defined alongside the current ports/args for the
corresponding container and points to the appropriate health check path on the
matching healthz port.
- Around line 970-1027: Both container specs in the manifest lack a
securityContext, so they still run with root-capable defaults; add a
securityContext to the capi-controllers and machine-api-migration container
definitions. Set runAsNonRoot and readOnlyRootFilesystem, disable
allowPrivilegeEscalation, and drop ALL capabilities (only add back anything
explicitly required). Use the existing container blocks with names
capi-controllers and machine-api-migration to place the same hardened settings
on both.
- Around line 948-1049: The capi-controllers Deployment manifest is missing
required hardening settings, so update the source manifest for the
capi-controllers Deployment to include a pod/container securityContext, add
resource limits alongside the existing requests, and define liveness and
readiness probes for both containers where appropriate. Make sure these
additions are placed in the same Deployment spec that defines the
capi-controllers and machine-api-migration containers so the rendered bundle
stays in sync.

---

Nitpick comments:
In `@capi-operator-manifests/default/manifests.yaml`:
- Line 1032: The pod spec using serviceAccountName capi-controllers is
implicitly mounting a service account token, which should be disabled unless
this workload needs API access. Update the manifest where the pod spec is
defined to set automountServiceAccountToken to false if the token is not
required, or otherwise confirm and keep the default behavior; use the
serviceAccountName setting to locate the relevant pod template.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: c6e49887-bf0f-40ca-be91-6bde2079827a

📥 Commits

Reviewing files that changed from the base of the PR and between 3dffe24 and 2cd7b9f.

📒 Files selected for processing (6)
  • Makefile
  • admission-policies/default/capi-controllers-deployment.yaml
  • admission-policies/default/kustomization.yaml
  • capi-operator-manifests/default/manifests.yaml
  • capi-operator-manifests/default/metadata.yaml
  • manifests/0000_30_cluster-api_00_tombstones-5.0.yaml
💤 Files with no reviewable changes (1)
  • admission-policies/default/capi-controllers-deployment.yaml

Comment thread capi-operator-manifests/default/manifests.yaml
Comment on lines +970 to +1027
containers:
- args:
- --diagnostics-address=:8443
command:
- /capi-controllers
env:
- name: RELEASE_VERSION
value: 0.0.1-snapshot
image: registry.ci.openshift.org/openshift:cluster-capi-operator
name: capi-controllers
ports:
- containerPort: 9443
name: webhook-server
protocol: TCP
- containerPort: 8443
name: diagnostics-o
protocol: TCP
- containerPort: 9440
name: healthz-o
protocol: TCP
resources:
requests:
cpu: 10m
memory: 50Mi
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /tmp/k8s-webhook-server/serving-certs
name: cert
readOnly: true
- mountPath: /tmp/k8s-metrics-server/serving-certs
name: metrics-cert
readOnly: true
- args:
- --diagnostics-address=:8442
- --health-addr=:9441
command:
- /machine-api-migration
env:
- name: RELEASE_VERSION
value: 0.0.1-snapshot
image: registry.ci.openshift.org/openshift:cluster-capi-operator
name: machine-api-migration
ports:
- containerPort: 8442
name: diagnostics-m
protocol: TCP
- containerPort: 9441
name: healthz-m
protocol: TCP
resources:
requests:
cpu: 10m
memory: 50Mi
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /tmp/k8s-metrics-server/serving-certs
name: metrics-cert
readOnly: true

@coderabbitai coderabbitai Bot Jul 2, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Add liveness and readiness probes.

Neither container defines livenessProbe/readinessProbe, despite both exposing dedicated healthz-* ports (9440, 9441) presumably intended for this purpose.

As per path instructions, "**/*.{yaml,yml}: ... Liveness + readiness probes defined".

🧰 Tools
🪛 Trivy (0.69.3)

[error] 978-1003: Root file system is not read-only

Container 'machine-api-migration' of Deployment 'capi-controllers' should set 'securityContext.readOnlyRootFilesystem' to true

Rule: KSV-0014

Learn more

(IaC/Kubernetes)


[error] 978-1003: Default security context configured

container capi-controllers in openshift-cluster-api namespace is using the default security context

Rule: KSV-0118

Learn more

(IaC/Kubernetes)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@capi-operator-manifests/default/manifests.yaml` around lines 970 - 1027, Add
livenessProbe and readinessProbe to both containers in the manifest: the
capi-controllers and machine-api-migration container specs already expose
dedicated healthz ports (healthz-o on 9440 and healthz-m on 9441), so wire the
probes to those endpoints using the existing container names and ports. Ensure
each probe is defined alongside the current ports/args for the corresponding
container and points to the appropriate health check path on the matching
healthz port.

Source: Path instructions


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Add securityContext (runAsNonRoot, readOnlyRootFilesystem, allowPrivilegeEscalation, dropped capabilities) to both containers.

Neither capi-controllers nor machine-api-migration containers define a securityContext, so they run with the platform default (root-capable, writable root filesystem, privilege escalation allowed). Static analysis flags this (Checkov CKV_K8S_20/CKV_K8S_23, Trivy KSV-0014/KSV-0118).

🔒 Proposed securityContext additions
       containers:
       - args:
         - --diagnostics-address=:8443
         command:
         - /capi-controllers
+        securityContext:
+          allowPrivilegeEscalation: false
+          readOnlyRootFilesystem: true
+          runAsNonRoot: true
+          capabilities:
+            drop:
+            - ALL
         env:

Apply the equivalent block to the machine-api-migration container as well.

As per path instructions, "**/*.{yaml,yml}: If this is a Kubernetes/OpenShift manifest or Helm template: securityContext: runAsNonRoot, readOnlyRootFilesystem, allowPrivilegeEscalation: false; Drop ALL capabilities, add only what is required".

🧰 Tools
🪛 Trivy (0.69.3)

[error] 978-1003: Root file system is not read-only

Container 'machine-api-migration' of Deployment 'capi-controllers' should set 'securityContext.readOnlyRootFilesystem' to true

Rule: KSV-0014

Learn more

(IaC/Kubernetes)


[error] 978-1003: Default security context configured

container capi-controllers in openshift-cluster-api namespace is using the default security context

Rule: KSV-0118

Learn more

(IaC/Kubernetes)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@capi-operator-manifests/default/manifests.yaml` around lines 970 - 1027, Both
container specs in the manifest lack a securityContext, so they still run with
root-capable defaults; add a securityContext to the capi-controllers and
machine-api-migration container definitions. Set runAsNonRoot and
readOnlyRootFilesystem, disable allowPrivilegeEscalation, and drop ALL
capabilities (only add back anything explicitly required). Use the existing
container blocks with names capi-controllers and machine-api-migration to place
the same hardened settings on both.

Source: Path instructions

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probes: The healthz ports (9440, 9441) are declared in the Deployment manifest but neither binary actually configures a health endpoint — there's no HealthProbeBindAddress in cmd/capi-controllers/main.go or cmd/machine-api-migration/main.go. Adding probes against those ports would cause the pods to fail health checks and crash-loop. The original Deployment on main has no probes for the same reason.

securityContext: Same as the other comment — the original Deployment on main has no securityContext. This PR moves the Deployment between deployment mechanisms without changing its spec. Hardening would be a separate effort.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sg00dwin, thanks for the clarification — that makes sense on both counts. Since the binaries don't wire up HealthProbeBindAddress for those healthz-* ports, adding probes now would indeed cause crash-looping, and since this PR is purely a deployment-mechanism migration (CVO static manifest → operator manifests-gen) without any spec changes, carrying over the existing (lack of) securityContext is consistent with keeping the diff scoped to ordering only. I'll withdraw both suggestions.


✏️ Learnings added
Learnt from: sg00dwin
Repo: openshift/cluster-capi-operator PR: 614
File: capi-operator-manifests/default/manifests.yaml:970-1027
Timestamp: 2026-07-02T21:04:22.725Z
Learning: In the openshift/cluster-capi-operator repository, the `capi-controllers` Deployment (in capi-operator-manifests/default/manifests.yaml) exposes `healthz-o` (port 9440) and `healthz-m` (port 9441) container ports, but neither `cmd/capi-controllers/main.go` nor `cmd/machine-api-migration/main.go` configures a `HealthProbeBindAddress`, so these ports do not serve actual health endpoints. Do not suggest adding livenessProbe/readinessProbe against these ports unless the binaries are updated to implement them. Additionally, this Deployment intentionally has no securityContext defined, matching the pre-existing static manifest deployed by CVO on main; PRs that migrate the Deployment between deployment mechanisms (e.g., from CVO static manifests to the operator's manifests-gen pipeline) without changing the Deployment spec should not be flagged for missing securityContext hardening, as that would be a separate effort.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment thread capi-operator-manifests/default/manifests.yaml
@sg00dwin

sg00dwin commented Jul 2, 2026

Copy link
Copy Markdown
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jul 2, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@sg00dwin: This pull request references Jira Issue OCPBUGS-84321, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jul 2, 2026
@sg00dwin

sg00dwin commented Jul 3, 2026

Copy link
Copy Markdown
Member Author

/test unit

LoadAdmissionPolicyProfiles() deserialized every object from
manifests.yaml without type filtering. Adding the capi-controllers
Deployment to that file broke 68 admission policy tests. Restrict
the loader to ValidatingAdmissionPolicy and ValidatingAdmissionPolicyBinding types only.

Assisted-by: Claude (Opus 4.6)
@openshift-ci-robot

Copy link
Copy Markdown

@sg00dwin: This pull request references Jira Issue OCPBUGS-84321, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary

  • Move the capi-controllers Deployment from CVO static manifests to the operator's manifests-gen pipeline at installOrder 30
  • This ensures all provider CRDs are installed and confirmed Established before capi-controllers starts, eliminating ~10 crash-restart cycles during cluster installation
  • Add a CVO tombstone to delete the old Deployment on upgrade

Problem

During cluster installation, CVO deploys the capi-controllers Deployment simultaneously with the capi-operator installer. The installer needs time to install provider CRDs (e.g. AWSCluster, Cluster), but capi-controllers starts immediately and crashes because those CRDs don't exist yet. Both containers (capi-controllers and machine-api-migration) exit with os.Exit(1) and restart ~10 times before the CRDs are eventually available.

Solution

The operator's Boxcutter installer uses probe-gated sequencing — components at a higher installOrder only deploy after earlier components are confirmed ready. By moving the Deployment to installOrder 30 (after core CRDs at 10 and providers at 20), the controllers only start once all required CRDs exist.

No changes to controller logic. The fix is purely deployment ordering.

Changes

File Change
admission-policies/default/capi-controllers-deployment.yaml New — Deployment YAML as manifests-gen kustomize input
admission-policies/default/kustomization.yaml Added the new resource
Makefile Added --self-image-ref for runtime image substitution
capi-operator-manifests/default/manifests.yaml Regenerated — now includes the Deployment
capi-operator-manifests/default/metadata.yaml Regenerated — now includes selfImageRef
manifests/0000_30_cluster-api_17_deployment.yaml Deleted — CVO no longer manages this
manifests/0000_30_cluster-api_00_tombstones-5.0.yaml New — CVO tombstone for upgrade cleanup

Upgrade behavior

On upgrade from a release where CVO owns the Deployment:

  1. CVO processes the tombstone and deletes its old copy
  2. The operator creates the new operator-managed copy at installOrder 30

Brief pod downtime during ownership transfer is expected and acceptable.


Assisted-by: Claude (Opus 4.6)

Summary by CodeRabbit

  • New Features

  • Added the capi-controllers controller Deployment to the default install manifests.

  • Updated manifest generation so the controller Deployment uses the correct self image reference and is packaged with the release.

  • Bug Fixes

  • Improved the controller-related release/admission configuration by updating the capi-controllers deployment admission policy and aligning controller metadata/behavior for install and uninstall consistency.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

@sg00dwin: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants