Skip to content

STOR-3030: feat(operator): detect GCP Dedicated and use hyperdisk-balanced StorageClass#573

Open
jsafrane wants to merge 1 commit into
openshift:mainfrom
jsafrane:fix-STOR-3030
Open

STOR-3030: feat(operator): detect GCP Dedicated and use hyperdisk-balanced StorageClass#573
jsafrane wants to merge 1 commit into
openshift:mainfrom
jsafrane:fix-STOR-3030

Conversation

@jsafrane

@jsafrane jsafrane commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

At operator startup, fetch the Infrastructure CR (with retry for up to 1 minute during early cluster bootstrap) and check if the GCP region starts with "u-". If so, the cluster runs on GCP Dedicated where only hyperdisk-balanced disks are supported — create only the hyperdisk-balanced StorageClass as default.

On regular GCP, the existing standard-csi and ssd-csi StorageClasses are used as before.

…geClass

At operator startup, fetch the Infrastructure CR (with retry for up to
1 minute during early cluster bootstrap) and check if the GCP region
starts with "u-". If so, the cluster runs on GCP Dedicated where only
hyperdisk-balanced disks are supported — create only the
hyperdisk-balanced StorageClass as default.

On regular GCP, the existing standard-csi and ssd-csi StorageClasses
are used as before.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot

openshift-ci-robot commented Jul 3, 2026

Copy link
Copy Markdown

@jsafrane: This pull request references STOR-3030 which is a valid jira issue.

Details

In response to this:

At operator startup, fetch the Infrastructure CR (with retry for up to 1 minute during early cluster bootstrap) and check if the GCP region starts with "u-". If so, the cluster runs on GCP Dedicated where only hyperdisk-balanced disks are supported — create only the hyperdisk-balanced StorageClass as default.

On regular GCP, the existing standard-csi and ssd-csi StorageClasses are used as before.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 3, 2026
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

This change modifies the GCP PD CSI driver operator's startup logic to select StorageClass manifest files based on the cluster's GCP region. A new helper function retrieves the Infrastructure custom resource, polling for up to one minute, and inspects the region to determine whether it is a GCP Dedicated region (prefixed with "u-"). Dedicated regions receive a hyperdisk-balanced-only StorageClass, while regular regions receive standard and SSD StorageClasses. Startup now propagates errors if the Infrastructure CR cannot be retrieved. New unit tests validate the selection logic across various region configurations and error handling when the Infrastructure CR is absent.

Changes

Area Change
Operator startup Added region-aware StorageClass file selection via new helper and wired it into controller setup with error propagation
Tests Added tests covering region-based selection scenarios and missing Infrastructure CR error handling

Sequence Diagram(s)

Included in the hidden review stack artifact above.

Related Issues: None referenced in the provided context.

Related PRs: None referenced in the provided context.

Suggested labels: area/gcp-pd-csi-driver, ok-to-test

Suggested reviewers: None determinable from the provided context.

Poem
A rabbit hops through region names,
Seeking "u-" among the claims,
Hyperdisk for dedicated ground,
SSD and standard all around,
One minute waits, then off it bounds. 🐇

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: detecting GCP Dedicated and switching to hyperdisk-balanced StorageClass.
Description check ✅ Passed The description accurately describes the startup retry logic and region-based StorageClass selection in the changeset.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Added tests use static t.Run names only; no dynamic values, timestamps, or generated identifiers appear in titles.
Test Structure And Quality ✅ Passed PASS: The added tests are table-driven, isolated, use fake clients, bound execution with context timeouts, and include clear failure messages; no cleanup or Ginkgo waits are needed.
Microshift Test Compatibility ✅ Passed The added tests are plain Go unit tests (t.Run), not Ginkgo e2e tests, and no MicroShift-unsupported OpenShift APIs/features or guards are involved.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PASS: The added tests are plain testing unit tests, not Ginkgo e2e tests, and they only exercise config-client logic with no node/topology assumptions.
Topology-Aware Scheduling Compatibility ✅ Passed Only storage-class selection at startup changed; no new node selectors, affinities, replica math, tolerations, or spread/PDB constraints were added.
Ote Binary Stdout Contract ✅ Passed No stdout writes appear in main/init/TestMain/suite setup; the PR only adds klog logging inside RunOperator/helper code, which the check excludes.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e tests were added; the only test file is a Go unit test with no IPv4-only or external connectivity assumptions.
No-Weak-Crypto ✅ Passed Changed files only add StorageClass selection logic/tests; no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, custom crypto, or secret comparisons were introduced.
Container-Privileges ✅ Passed No privileged/root/hostNetwork/hostIPC/allowPrivilegeEscalation/SYS_ADMIN settings were added in the changed files; only Go startup logic and tests changed.
No-Sensitive-Data-In-Logs ✅ Passed New logs only emit the GCP region and generic Infrastructure lookup status; no passwords, tokens, PII, or customer data are logged.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@openshift-ci

openshift-ci Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested a review from mpatlasov July 3, 2026 11:36
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 3, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
legacy/gcp-pd-csi-driver-operator/pkg/operator/starter.go (1)

251-267: 🩺 Stability & Availability | 🔵 Trivial

Cumulative startup latency now up to ~2 minutes.

This adds up to 1 more minute of blocking retry on top of the existing FeatureGate wait timeout (case <-time.After(1 * time.Minute)), so worst-case operator startup latency roughly doubles. Confirm the operator's liveness/readiness probe initialDelaySeconds/failureThreshold tolerate this before the pod is restarted mid-bootstrap.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@legacy/gcp-pd-csi-driver-operator/pkg/operator/starter.go` around lines 251 -
267, The new blocking retry in starter.go can extend operator bootstrap by about
another minute on top of the existing FeatureGate wait path, so review the
startup flow around getStorageClassFiles and
csiControllerSet.WithStorageClassController to avoid exceeding probe tolerance.
Either shorten or remove the extra blocking wait, or update the operator’s
liveness/readiness probe settings (initialDelaySeconds and failureThreshold) so
the pod is not restarted while startup is still in progress.
legacy/gcp-pd-csi-driver-operator/pkg/operator/storageclass_test.go (1)

119-129: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low value

Consider shortening the context timeout for the missing-Infrastructure test.

Since the Infrastructure CR never appears in this test, the retry loop inside getStorageClassFiles will keep polling until the context deadline is reached, meaning this test will consistently take the full 10 seconds to complete. A much shorter timeout (e.g. 1-2s) would likely still exercise at least one poll iteration and let the test fail fast, keeping the suite quick.

♻️ Suggested fix
 func TestGetStorageClassFilesNoInfrastructure(t *testing.T) {
 	configClient := fakeconfig.NewSimpleClientset()
-	// Use a short timeout so the retry loop finishes quickly in tests.
-	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+	// Use a short timeout so the retry loop finishes quickly in tests.
+	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
 	defer cancel()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@legacy/gcp-pd-csi-driver-operator/pkg/operator/storageclass_test.go` around
lines 119 - 129, The missing-Infrastructure test in
TestGetStorageClassFilesNoInfrastructure uses a much longer context timeout than
needed, so it waits for the full retry period before failing. Shorten the
timeout in the test’s context.WithTimeout setup to a small value that still
allows at least one poll in getStorageClassFiles, and keep the existing
assertion that an error is returned. This change should be made in the
TestGetStorageClassFilesNoInfrastructure function only.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@legacy/gcp-pd-csi-driver-operator/pkg/operator/starter.go`:
- Around line 251-267: The new blocking retry in starter.go can extend operator
bootstrap by about another minute on top of the existing FeatureGate wait path,
so review the startup flow around getStorageClassFiles and
csiControllerSet.WithStorageClassController to avoid exceeding probe tolerance.
Either shorten or remove the extra blocking wait, or update the operator’s
liveness/readiness probe settings (initialDelaySeconds and failureThreshold) so
the pod is not restarted while startup is still in progress.

In `@legacy/gcp-pd-csi-driver-operator/pkg/operator/storageclass_test.go`:
- Around line 119-129: The missing-Infrastructure test in
TestGetStorageClassFilesNoInfrastructure uses a much longer context timeout than
needed, so it waits for the full retry period before failing. Shorten the
timeout in the test’s context.WithTimeout setup to a small value that still
allows at least one poll in getStorageClassFiles, and keep the existing
assertion that an error is returned. This change should be made in the
TestGetStorageClassFilesNoInfrastructure function only.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 49560999-d65b-4560-8fc9-5e90f7e2c29c

📥 Commits

Reviewing files that changed from the base of the PR and between 2fc1e7d and 723e2b4.

📒 Files selected for processing (2)
  • legacy/gcp-pd-csi-driver-operator/pkg/operator/starter.go
  • legacy/gcp-pd-csi-driver-operator/pkg/operator/storageclass_test.go

@openshift-ci

openshift-ci Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

@jsafrane: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-csi 723e2b4 link true /test e2e-gcp-csi
ci/prow/aws-efs-operator-e2e 723e2b4 link true /test aws-efs-operator-e2e
ci/prow/e2e-azure 723e2b4 link true /test e2e-azure
ci/prow/e2e-gcp-csi-manual-oidc 723e2b4 link false /test e2e-gcp-csi-manual-oidc
ci/prow/e2e-aws-csi 723e2b4 link true /test e2e-aws-csi
ci/prow/e2e-aws-ovn-upgrade 723e2b4 link true /test e2e-aws-ovn-upgrade

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants