Skip to content

feat: SeiNode import existing PVC + ensure-data-pvc create-path fix#113

Open
bdchatham wants to merge 1 commit intomainfrom
feat/seinode-import-existing-pvc
Open

feat: SeiNode import existing PVC + ensure-data-pvc create-path fix#113
bdchatham wants to merge 1 commit intomainfrom
feat/seinode-import-existing-pvc

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

Closes #105. Fixes #104 as part of the create-path refactor.

Implements the merged design:

  • Direction: `docs/design-seinode-import-volume.md` (PR #106)
  • LLD: `docs/design-seinode-import-volume-lld.md` (PR #112)

Summary

New optional spec field `spec.dataVolume.import.pvcName` lets a SeiNode adopt a pre-existing PVC rather than provisioning one. The controller validates the referenced PVC against the seven requirements from the LLD and never mutates it. Transient failures loop at the executor's existing poll interval; unrecoverable failures mark the plan Failed with a specific `Reason`. Status surfaced via a new `ImportPVCReady` Condition on SeiNode.

The `ensure-data-pvc` create path is refactored to Get-then-Create, treating an existing non-controlled PVC as a terminal error rather than silently binding to it (fixes #104).

What changed

CRD / API types (`api/v1alpha1/seinode_types.go`, +generated)

  • New `DataVolumeSpec` with optional `Import *DataVolumeImport`
  • `DataVolumeImport.PVCName` — immutable, DNS1123-validated, 1-253 chars
  • `ConditionImportPVCReady` constant
  • Ten `ReasonImport*` constants documented as a public alerting contract

Task (`internal/task/ensure_pvc.go`)

  • `Execute()` branches on presence of `import.pvcName`
  • `executeCreate` — Get-then-Create, ownership-checked. Existing PVC controlled by this SeiNode = Complete (crash recovery). Existing PVC NOT controlled = Terminal (ensure-data-pvc marks Complete on AlreadyExists against a terminating PVC (PVC-race bug) #104 fix).
  • `executeImport` — seven-check validation with transient/terminal classification per LLD §3
  • Transient failures write `": "` into `PlannedTask.Error` for the planner to read
  • Uses executor's existing `TaskPollInterval` — no custom backoff

Planner (`internal/planner/planner.go`)

  • New `ReconcileImportPVCCondition` mirroring the existing `NodeUpdateInProgress` pattern
  • Sets/removes `ImportPVCReady` based on current task state
  • Called from the reconciler between plan execution and status flush

Finalizer (`internal/controller/node/controller.go`)

  • `deleteNodeDataPVC` skips deletion when `import.pvcName` is set
  • Operator owns imported storage lifecycle

RBAC

  • `persistentvolumes: get/list/watch` — required for PV consistency checks (requirements 6-7). No write verbs.

Tests

  • 18 unit tests in `internal/task/ensure_pvc_test.go`: create-path happy path, create-path crash-recovery idempotence, create-path non-controlled-conflict (ensure-data-pvc marks Complete on AlreadyExists against a terminating PVC (PVC-race bug) #104 guard), import happy path, each of seven validation requirements in transient + terminal classification, late-PVC convergence, poll-cadence regression guard.
  • 5 integration tests in `internal/controller/node/import_pvc_test.go`:
    • `TestController_Import_Happy_ReachesRunning`
    • `TestController_Import_LatePVC_Converges`
    • `TestController_Import_TerminalFailure_MarksFailed`
    • `TestController_Import_Deletion_PreservesPVC`
    • `TestController_NoImport_Deletion_DeletesPVC` (regression guard)
  • All existing tests continue to pass.

Deliberately deferred

  • Metrics and Events from LLD §8 — three counters (`importPVCValidationTotal`, `importPVCTerminalTotal`, `importPVCTimeToValid`) and two Events (`PVCImportValidated`, `PVCImportFailed`). The feature is functionally complete without them. Keeping this PR focused on correctness + refactor; observability lands in a small follow-up PR once we've shipped this and can instrument real usage.

Deviations from the LLD (flagged)

Two small implementation decisions worth naming:

  1. `statusDirty` flip for condition-only mutations. The reconciler's existing `statusDirty` flag gates the status flush. To ensure a condition-only change gets persisted, added a 3-line compare helper (`conditionsSnapshot` + `conditionsEqual`) that flips `statusDirty` when the condition list mutated. This is implementation plumbing not called out in the LLD but follows directly from "condition is set before flush — must trigger flush when it changes."

  2. Transient API errors on Get. When `Get` returns an unexpected non-`NotFound` error (networking blip), classified as transient with `PVCNotFound` / `UnderlyingPVMissing` reason. The reason is slightly imprecise but behavior is correct (retry). Alternative would be to return the error and let the executor retry at a higher layer. Easy to flip if reviewers prefer.

Test plan

  • `make test` — all pass (existing + 23 new)
  • `make lint` — `0 issues`
  • `make generate manifests` — everything regenerated, committed
  • No unintended file modifications
  • Post-merge: deploy to a dev cluster, verify archive SeiNode with `import.pvcName` referencing a pre-populated PVC reaches Running
  • Post-merge: verify existing archives without the field continue to behave exactly as before (backward compat)

Manual migration context

This PR is what would let us do today's pacific-1 archive import as a first-class feature rather than as a workaround. The manual trick we used yesterday (pre-create PVC with matching name, rely on the `AlreadyExists` bug) becomes `spec.dataVolume.import.pvcName: data-archive-0-0` and the controller handles it properly.

🤖 Generated with Claude Code

Implements the direction + LLD merged in #106 and #112.

New spec field spec.dataVolume.import.pvcName (optional, immutable,
DNS1123-validated) lets a SeiNode adopt a pre-existing PVC instead
of provisioning one. The controller validates the referenced PVC
against seven requirements (exists, not terminating, Bound, RWO,
capacity >= mode default, underlying PV capacity consistent, PV
not Failed) and never mutates it. Transient failures loop at the
executor's TaskPollInterval; unrecoverable failures mark the plan
Failed with a specific Reason.

Tracked via the new ImportPVCReady Condition on SeiNode status.
Reason strings are treated as a public alerting contract — renames
are breaking changes.

ensure-data-pvc now uses Get-then-Create on the create path. If
a PVC with the expected name exists and is NOT controlled by this
SeiNode, the task fails terminally (this is the #104 fix). If it
IS controlled, treat as Complete (crash-recovery idempotence).

Finalizer (deleteNodeDataPVC) skips PVC deletion when import is set
— operator owns the storage lifecycle for imported volumes.

RBAC adds get/list/watch on persistentvolumes for the PV
consistency checks; no write verbs.

Tests: 18 unit tests in internal/task/ensure_pvc_test.go covering
create-path happy/crash-recovery/non-controlled-conflict and each
of the seven import-path validation requirements in both transient
and terminal classification. 5 integration tests in
internal/controller/node/import_pvc_test.go covering end-to-end
import reaches-Running, late-PVC-converges, terminal-marks-Failed,
deletion-preserves-imported-PVC, and a regression guard for the
non-import deletion path.

Deferred to a follow-up: metrics and events from LLD §8. The feature
is functionally complete without them; separate PR to keep this one
focused.

Closes #105.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant