STOR-3030: feat(operator): detect GCP Dedicated and use hyperdisk-balanced StorageClass#573
STOR-3030: feat(operator): detect GCP Dedicated and use hyperdisk-balanced StorageClass#573jsafrane wants to merge 1 commit into
Conversation
…geClass At operator startup, fetch the Infrastructure CR (with retry for up to 1 minute during early cluster bootstrap) and check if the GCP region starts with "u-". If so, the cluster runs on GCP Dedicated where only hyperdisk-balanced disks are supported — create only the hyperdisk-balanced StorageClass as default. On regular GCP, the existing standard-csi and ssd-csi StorageClasses are used as before. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@jsafrane: This pull request references STOR-3030 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
📝 WalkthroughWalkthroughThis change modifies the GCP PD CSI driver operator's startup logic to select StorageClass manifest files based on the cluster's GCP region. A new helper function retrieves the Infrastructure custom resource, polling for up to one minute, and inspects the region to determine whether it is a GCP Dedicated region (prefixed with "u-"). Dedicated regions receive a hyperdisk-balanced-only StorageClass, while regular regions receive standard and SSD StorageClasses. Startup now propagates errors if the Infrastructure CR cannot be retrieved. New unit tests validate the selection logic across various region configurations and error handling when the Infrastructure CR is absent. Changes
Sequence Diagram(s)Included in the hidden review stack artifact above. Related Issues: None referenced in the provided context. Related PRs: None referenced in the provided context. Suggested labels: area/gcp-pd-csi-driver, ok-to-test Suggested reviewers: None determinable from the provided context. Poem 🚥 Pre-merge checks | ✅ 15✅ Passed checks (15 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jsafrane The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
legacy/gcp-pd-csi-driver-operator/pkg/operator/starter.go (1)
251-267: 🩺 Stability & Availability | 🔵 TrivialCumulative startup latency now up to ~2 minutes.
This adds up to 1 more minute of blocking retry on top of the existing FeatureGate wait timeout (
case <-time.After(1 * time.Minute)), so worst-case operator startup latency roughly doubles. Confirm the operator's liveness/readiness probeinitialDelaySeconds/failureThresholdtolerate this before the pod is restarted mid-bootstrap.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@legacy/gcp-pd-csi-driver-operator/pkg/operator/starter.go` around lines 251 - 267, The new blocking retry in starter.go can extend operator bootstrap by about another minute on top of the existing FeatureGate wait path, so review the startup flow around getStorageClassFiles and csiControllerSet.WithStorageClassController to avoid exceeding probe tolerance. Either shorten or remove the extra blocking wait, or update the operator’s liveness/readiness probe settings (initialDelaySeconds and failureThreshold) so the pod is not restarted while startup is still in progress.legacy/gcp-pd-csi-driver-operator/pkg/operator/storageclass_test.go (1)
119-129: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low valueConsider shortening the context timeout for the missing-Infrastructure test.
Since the Infrastructure CR never appears in this test, the retry loop inside
getStorageClassFileswill keep polling until the context deadline is reached, meaning this test will consistently take the full 10 seconds to complete. A much shorter timeout (e.g. 1-2s) would likely still exercise at least one poll iteration and let the test fail fast, keeping the suite quick.♻️ Suggested fix
func TestGetStorageClassFilesNoInfrastructure(t *testing.T) { configClient := fakeconfig.NewSimpleClientset() - // Use a short timeout so the retry loop finishes quickly in tests. - ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + // Use a short timeout so the retry loop finishes quickly in tests. + ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second) defer cancel()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@legacy/gcp-pd-csi-driver-operator/pkg/operator/storageclass_test.go` around lines 119 - 129, The missing-Infrastructure test in TestGetStorageClassFilesNoInfrastructure uses a much longer context timeout than needed, so it waits for the full retry period before failing. Shorten the timeout in the test’s context.WithTimeout setup to a small value that still allows at least one poll in getStorageClassFiles, and keep the existing assertion that an error is returned. This change should be made in the TestGetStorageClassFilesNoInfrastructure function only.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@legacy/gcp-pd-csi-driver-operator/pkg/operator/starter.go`:
- Around line 251-267: The new blocking retry in starter.go can extend operator
bootstrap by about another minute on top of the existing FeatureGate wait path,
so review the startup flow around getStorageClassFiles and
csiControllerSet.WithStorageClassController to avoid exceeding probe tolerance.
Either shorten or remove the extra blocking wait, or update the operator’s
liveness/readiness probe settings (initialDelaySeconds and failureThreshold) so
the pod is not restarted while startup is still in progress.
In `@legacy/gcp-pd-csi-driver-operator/pkg/operator/storageclass_test.go`:
- Around line 119-129: The missing-Infrastructure test in
TestGetStorageClassFilesNoInfrastructure uses a much longer context timeout than
needed, so it waits for the full retry period before failing. Shorten the
timeout in the test’s context.WithTimeout setup to a small value that still
allows at least one poll in getStorageClassFiles, and keep the existing
assertion that an error is returned. This change should be made in the
TestGetStorageClassFilesNoInfrastructure function only.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 49560999-d65b-4560-8fc9-5e90f7e2c29c
📒 Files selected for processing (2)
legacy/gcp-pd-csi-driver-operator/pkg/operator/starter.golegacy/gcp-pd-csi-driver-operator/pkg/operator/storageclass_test.go
|
@jsafrane: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
At operator startup, fetch the Infrastructure CR (with retry for up to 1 minute during early cluster bootstrap) and check if the GCP region starts with "u-". If so, the cluster runs on GCP Dedicated where only hyperdisk-balanced disks are supported — create only the hyperdisk-balanced StorageClass as default.
On regular GCP, the existing standard-csi and ssd-csi StorageClasses are used as before.