fix(controller): handle long rule names in bootstrap annotation keys#224

Open
vishnukothakapu wants to merge 1 commit into
kubernetes-sigs:mainfrom
vishnukothakapu:fix-annotation-key-length

Conversation

@vishnukothakapu

@vishnukothakapu vishnukothakapu commented May 7, 2026

Description

This PR fixes a bug where NodeReadinessRule resources with long names (longer than 43 characters) caused the controller to fail when patching Node annotations. Kubernetes strictly limits the name part of an annotation key to 63 characters. Since our key pattern was readiness.k8s.io/bootstrap-completed-<rule-name>, long rule names resulted in invalid annotation keys.
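The 43-character threshold follows from simple arithmetic: the fixed `bootstrap-completed-` part already takes 20 of the 63 characters allowed for the name part of the key. A minimal sketch of that check, assuming hypothetical constant and function names that are not taken from the controller's code:

```go
package main

import (
	"fmt"
	"strings"
)

const (
	// Fixed part of the legacy key name, as described in the PR.
	// The constant names here are hypothetical.
	legacyKeyNamePrefix = "bootstrap-completed-"
	// Kubernetes limit for the name part of an annotation key.
	maxKeyNameLen = 63
)

// legacyKeyNameFits reports whether the name part of the legacy key
// (everything after "readiness.k8s.io/") stays within the limit.
// len counts bytes, which equals characters for ASCII rule names.
func legacyKeyNameFits(ruleName string) bool {
	return len(legacyKeyNamePrefix)+len(ruleName) <= maxKeyNameLen
}

func main() {
	fmt.Println(len(legacyKeyNamePrefix))                   // 20
	fmt.Println(legacyKeyNameFits(strings.Repeat("a", 43))) // true  (20 + 43 = 63)
	fmt.Println(legacyKeyNameFits(strings.Repeat("a", 44))) // false (20 + 44 = 64)
}
```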

This PR implements a stable, UID-based key design:

  • Key format: readiness.k8s.io/bootstrap-completed-<ruleUID>. Because rule.GetUID() returns the UID assigned by the API server (a 36-character RFC 4122 UUID), the name part of the key is always 20 + 36 = 56 characters. This stays under the 63-character limit without requiring custom hashing.
  • Value format: {"rule": "<name>"}. The value is stored as a JSON payload containing the original rule name to preserve human readability for debugging.
  • Migration: The controller seamlessly reads legacy readiness.k8s.io/bootstrap-completed-<name> keys and idempotently migrates them to the new UID-based format during node reconciliation.
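The key and value formats above could be sketched roughly as follows. The function names bootstrapAnnotationKey and bootstrapAnnotationValue appear in the Testing section, but their signatures, the constant name, and the payload struct here are assumptions for illustration, not the PR's actual code. Note that `json.Marshal` emits the payload without the space shown in the description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bootstrapKeyPrefix is a hypothetical constant name for the fixed key prefix.
const bootstrapKeyPrefix = "readiness.k8s.io/bootstrap-completed-"

// bootstrapValue mirrors the {"rule": "<name>"} payload (assumed struct).
type bootstrapValue struct {
	Rule string `json:"rule"`
}

// bootstrapAnnotationKey builds the UID-based annotation key.
func bootstrapAnnotationKey(ruleUID string) string {
	return bootstrapKeyPrefix + ruleUID
}

// bootstrapAnnotationValue encodes the rule name as a small JSON payload.
func bootstrapAnnotationValue(ruleName string) (string, error) {
	b, err := json.Marshal(bootstrapValue{Rule: ruleName})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Example UID; real UIDs come from metadata.uid on the rule object.
	key := bootstrapAnnotationKey("3f2c1a9e-7b4d-4c1e-9a55-0d6e8f1b2c3d")
	val, err := bootstrapAnnotationValue("my-rule")
	if err != nil {
		panic(err)
	}
	fmt.Println(key)
	fmt.Println(val) // {"rule":"my-rule"}
}
```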

This approach also cleanly addresses the stale annotation issue raised in #247: when a rule is deleted and recreated with the same name, it receives a fresh UID from the API server, ensuring it correctly tracks a new lifecycle state without being bypassed by a stale annotation key.
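To illustrate the delete-and-recreate case, here is a small sketch in which a node still carries the annotation written for the old rule object. The helper isBootstrapCompleted and the example UIDs are hypothetical; the point is only that a lookup keyed by the new UID does not see the stale entry.

```go
package main

import "fmt"

const bootstrapKeyPrefix = "readiness.k8s.io/bootstrap-completed-"

// isBootstrapCompleted (hypothetical helper) checks whether the node carries
// a completion annotation for this specific rule object, identified by UID.
func isBootstrapCompleted(nodeAnnotations map[string]string, ruleUID string) bool {
	_, ok := nodeAnnotations[bootstrapKeyPrefix+ruleUID]
	return ok
}

func main() {
	oldUID := "11111111-1111-4111-8111-111111111111" // rule before deletion
	newUID := "22222222-2222-4222-8222-222222222222" // same name, recreated

	node := map[string]string{
		bootstrapKeyPrefix + oldUID: `{"rule":"gpu-ready"}`,
	}

	fmt.Println(isBootstrapCompleted(node, oldUID)) // true
	fmt.Println(isBootstrapCompleted(node, newUID)) // false: the recreated rule is evaluated again
}
```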

Related Issue

Fixes #223, #247, #248

Type of Change

/kind bug

Testing

  • Added internal/controller/helper_unit_test.go: Unit tests for bootstrapAnnotationKey, bootstrapAnnotationValue, and legacy migration helper functions.
  • Updated internal/controller/node_controller_reproduction_test.go: Reproduction test that confirms the controller successfully uses UID-based annotations for rules with very long names, and explicitly tests legacy annotation migration idempotency.
  • Verified with go build ./... and go vet ./internal/controller/....

Checklist

  • make test passes
  • make lint passes

Does this PR introduce a user-facing change?

Yes. Existing readiness.k8s.io/bootstrap-completed-<rule-name> annotations will be automatically migrated to readiness.k8s.io/bootstrap-completed-<ruleUID> keys on the node, with a small JSON payload representing the rule name.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label May 7, 2026
@netlify

netlify Bot commented May 7, 2026

Deploy Preview for node-readiness-controller canceled.

Name Link
🔨 Latest commit 1b45c95
🔍 Latest deploy log https://app.netlify.com/projects/node-readiness-controller/deploys/6a450254e0b6b8000833524f

@k8s-ci-robot k8s-ci-robot requested a review from mrunalp May 7, 2026 08:45
@k8s-ci-robot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vishnukothakapu
Once this PR has been reviewed and has the lgtm label, please assign ajaysundark for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from tallclair May 7, 2026 08:45
@k8s-ci-robot

Contributor

Welcome @vishnukothakapu!

It looks like this is your first PR to kubernetes-sigs/node-readiness-controller 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/node-readiness-controller has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 7, 2026
@k8s-ci-robot

Contributor

Hi @vishnukothakapu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 7, 2026
Comment thread internal/controller/helper.go Outdated
Comment thread internal/controller/helper.go Outdated
@vishnukothakapu vishnukothakapu force-pushed the fix-annotation-key-length branch from 538cca4 to d42de51 Compare May 7, 2026 13:08
@ajaysundark ajaysundark self-requested a review May 9, 2026 13:02
@ajaysundark

Contributor

Thanks for catching this. My only concern is that it takes away human observability of when a bootstrap rule is done. :/

Comment thread internal/controller/helper.go Outdated
@ajaysundark

Contributor

The annotation restrictions are <dns-prefix: 253>/<name: 63>; metadata.name can also be 253 chars in length.

Hashing the rule name is one option. A couple of alternatives to consider:

  1. Restricting the length of the rule name to 63 characters (we could also move "bootstrap-completed" into the value?).
  2. Alternatively, and more safely, using a standard key like "readiness.k8s.io/rule-status" with a JSON payload for the value. The value doesn't seem to have a comparable length restriction, so it would allow us to capture the full rule name inside the payload.

We need to evaluate the pros/cons on the implementation. @vishnukothakapu Do you want to evaluate the alternatives and propose a plan here?
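As a rough illustration of the second alternative, a single `readiness.k8s.io/rule-status` annotation could hold a JSON object keyed by rule name. The payload schema (the ruleStatus struct and its field) and the helper markCompleted are assumptions made up for this sketch; the thread does not specify them.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Single, fixed annotation key suggested in the comment above.
const ruleStatusKey = "readiness.k8s.io/rule-status"

// ruleStatus is a hypothetical per-rule record; the real schema is undecided.
type ruleStatus struct {
	BootstrapCompleted bool `json:"bootstrapCompleted"`
}

// markCompleted decodes the existing payload (if any), records the rule as
// completed, and writes the re-encoded payload back. annotations must be non-nil.
func markCompleted(annotations map[string]string, ruleName string) error {
	statuses := map[string]ruleStatus{}
	if raw, ok := annotations[ruleStatusKey]; ok && raw != "" {
		if err := json.Unmarshal([]byte(raw), &statuses); err != nil {
			return fmt.Errorf("decode %s: %w", ruleStatusKey, err)
		}
		if statuses == nil { // a JSON "null" payload decodes to a nil map
			statuses = map[string]ruleStatus{}
		}
	}
	statuses[ruleName] = ruleStatus{BootstrapCompleted: true}
	b, err := json.Marshal(statuses)
	if err != nil {
		return err
	}
	annotations[ruleStatusKey] = string(b)
	return nil
}

func main() {
	annotations := map[string]string{}
	if err := markCompleted(annotations, "a-very-long-rule-name-that-would-not-fit-into-an-annotation-key"); err != nil {
		panic(err)
	}
	fmt.Println(annotations[ruleStatusKey])
}
```

Because every rule shares one value, each update is a read-modify-write of the whole payload, which is the trade-off weighed later in this thread.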

@vishnukothakapu

Author

> Thanks for catching this. My only concern is that it takes away human observability of when a bootstrap rule is done. :/

Good point. I agree the full hash reduces readability during debugging. I’ll explore the hybrid approach with a readable prefix + short hash and compare it with the other alternatives discussed.

> The annotation restrictions are <dns-prefix: 253>/<name: 63>; metadata.name can also be 253 chars in length.
>
> Hashing the rule name is one option. A couple of alternatives to consider:
>
>   1. Restricting the length of the rule name to 63 characters (we could also move "bootstrap-completed" into the value?).
>   2. Alternatively, and more safely, using a standard key like "readiness.k8s.io/rule-status" with a JSON payload for the value. The value doesn't seem to have a comparable length restriction, so it would allow us to capture the full rule name inside the payload.
>
> We need to evaluate the pros/cons on the implementation. @vishnukothakapu Do you want to evaluate the alternatives and propose a plan here?

Thanks @ajaysundark, these are good points. I’ll evaluate the tradeoffs between the current hashing approach, the hybrid readable-prefix approach, and the single annotation JSON payload design, then propose a direction based on readability, implementation complexity, and backward compatibility.

@ajaysundark

Contributor

/assign @vishnukothakapu

@vishnukothakapu

Author

Hi @AvineshTripathi & @ajaysundark,
Thanks for the suggestion! I decided to go with the hybrid approach (truncated-name + short-hash) instead of the JSON payload approach because it preserves Kubernetes native selector support while still avoiding annotation length issues.
This keeps the annotations readable for debugging, ensures uniqueness with a deterministic hash, and stays compatible with the current architecture and existing workflows. I have updated the implementation and adjusted the related unit and integration tests accordingly.
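For reference, a hybrid "truncated name + short hash" key could look roughly like the sketch below. The intermediate implementation is not shown in this conversation, so the hash function (SHA-256), the 8-character hash length, and all names here are assumptions chosen for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

const (
	keyNamePrefix = "bootstrap-completed-"
	maxKeyNameLen = 63 // limit for the name part of an annotation key
	hashLen       = 8  // hex characters kept from the hash (assumed)
)

// hybridKeyName returns the name part of the key. Short rule names are used
// verbatim; long ones are truncated and suffixed with a deterministic hash of
// the full name so that distinct long names remain distinguishable.
func hybridKeyName(ruleName string) string {
	if len(keyNamePrefix)+len(ruleName) <= maxKeyNameLen {
		return keyNamePrefix + ruleName
	}
	sum := sha256.Sum256([]byte(ruleName))
	hash := hex.EncodeToString(sum[:])[:hashLen]
	// Room left for the readable part: 63 - 20 - 1 (separator) - 8 = 34.
	keep := maxKeyNameLen - len(keyNamePrefix) - 1 - hashLen
	return keyNamePrefix + ruleName[:keep] + "-" + hash
}

func main() {
	fmt.Println(hybridKeyName("gpu-ready"))
	long := "network-plugin-initialization-and-device-driver-readiness-rule"
	key := hybridKeyName(long)
	fmt.Println(key, len(key))
}
```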

@ajaysundark

Contributor

> instead of the JSON payload approach because it preserves Kubernetes native selector support

Could you clarify your thoughts further on this? What are the downsides of using a JSON payload? This is also how kubectl saves last-applied configurations in objects today - ref: https://kubernetes.io/docs/tasks/manage-kubernetes-objects/declarative-config/#how-to-create-objects

@AvineshTripathi / @Karthik-K-N I think fixing this short-term with a hash-based approach for length immunity doesn't feel right. A more reliable long-term solution would be to maintain the rule status inside a JSON payload to track individual rule evaluation data. It would also address concerns such as #247.

@vishnukothakapu

Author

Hi @ajaysundark, thanks for the clarification, that makes sense. I can see how the structured JSON payload approach becomes more reliable long-term, especially with concerns like stale rule state and rule recreation handling from #247.

My initial preference for the hybrid approach was mainly to keep the fix minimal and avoid larger behavioral changes in this PR. But I agree the stable key + structured payload model feels more extensible and better suited for tracking rule lifecycle/state going forward.

@vishnukothakapu

Author

> > instead of the JSON payload approach because it preserves Kubernetes native selector support
>
> Could you clarify your thoughts further on this? What are the downsides of using a JSON payload? This is also how kubectl saves last-applied configurations in objects today - ref: https://kubernetes.io/docs/tasks/manage-kubernetes-objects/declarative-config/#how-to-create-objects
>
> @AvineshTripathi / @Karthik-K-N I think fixing this short-term with a hash-based approach for length immunity doesn't feel right. A more reliable long-term solution would be to maintain the rule status inside a JSON payload to track individual rule evaluation data. It would also address concerns such as #247.

@ajaysundark, the JSON payload approach is the right long-term design. It eliminates the key length issue at the root, preserves full observability (complete rule name in the value), and cleanly resolves #247: a deleted and recreated rule would no longer be blocked by a stale annotation. The main considerations are:

  1. handling read-modify-write races via RetryOnConflict (the same pattern already used in addTaintBySpec), and
  2. documenting existing per-rule annotations as no-op leftovers in the changelog.

Happy to update this PR to implement this if the direction is agreed.
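The read-modify-write race mentioned in the first consideration can be illustrated with a standard-library-only simulation. Everything below (fakeNode, fakeStore, the local retryOnConflict helper) is a hypothetical stand-in that mimics optimistic concurrency via a resource version; it is not client-go's `retry.RetryOnConflict` and not the controller's addTaintBySpec code.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict: resourceVersion changed")

// fakeNode and fakeStore simulate an object guarded by optimistic concurrency.
type fakeNode struct {
	ResourceVersion int
	Annotations     map[string]string
}

type fakeStore struct{ node fakeNode }

// Get returns a deep copy, like reading a fresh object from the API server.
func (s *fakeStore) Get() fakeNode {
	ann := make(map[string]string, len(s.node.Annotations))
	for k, v := range s.node.Annotations {
		ann[k] = v
	}
	return fakeNode{ResourceVersion: s.node.ResourceVersion, Annotations: ann}
}

// Update rejects writes that were based on an outdated resource version.
func (s *fakeStore) Update(n fakeNode) error {
	if n.ResourceVersion != s.node.ResourceVersion {
		return errConflict
	}
	n.ResourceVersion++
	s.node = n
	return nil
}

// retryOnConflict re-runs fn while it fails with a conflict, up to attempts times.
func retryOnConflict(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

func main() {
	store := &fakeStore{node: fakeNode{ResourceVersion: 1, Annotations: map[string]string{}}}
	calls := 0
	err := retryOnConflict(3, func() error {
		calls++
		n := store.Get() // read
		if calls == 1 {
			// Simulate another worker writing between our read and our write.
			other := store.Get()
			other.Annotations["other"] = "x"
			_ = store.Update(other)
		}
		n.Annotations["readiness.k8s.io/rule-status"] = `{"my-rule":{"bootstrapCompleted":true}}` // modify
		return store.Update(n) // write
	})
	fmt.Println(err, calls, len(store.node.Annotations)) // <nil> 2 2
}
```

The first attempt fails because another writer bumped the version; the second attempt re-reads the object, so the concurrent change is preserved rather than overwritten.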

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 21, 2026
@vishnukothakapu vishnukothakapu force-pushed the fix-annotation-key-length branch from 261f03d to 4f850ec Compare May 22, 2026 10:14
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 22, 2026
@vishnukothakapu vishnukothakapu force-pushed the fix-annotation-key-length branch from 66f3eeb to 86fbd4d Compare May 22, 2026 10:41
@ajaysundark

Contributor

We discussed this bug further in the meeting today.

A quick summary:

We agreed that

readiness.k8s.io/bootstrap-completed-<ruleUID> = <>

or better, to overcome the 'opaque keys' argument:

readiness.k8s.io/bootstrap-completed-<ruleUID> = {
    "rule": "<name>", // for readability
}

may be the preferred design, for a few reasons:

  1. A single JSON state that multiple rule workers update via read-modify-write may not work well at scale, where parallel workers can cause cascading failures.
  2. Annotations are subject to a 256 KiB total size limit per object. This isn't an immediate restriction for NRC, as we don't expect a large number of rules, but it is a valid concern.

Unique-key creation:

  1. Each NodeReadinessRule object already has a unique key assigned to it by the Kubernetes API server via metadata.uid, so a new hashing mechanism is unnecessary. The UID (an RFC 4122 UUID generated by the API machinery) is 36 characters long, immutable for the object's lifetime, and unique within the cluster. So rule.GetUID() is a stable, fixed-length suffix option here.
  2. This also addresses the case where a rule gets deleted and recreated with the same name, as each object creation gets a fresh UID from the API server, which is the behavior we want (xref "Bootstrap-only rule deletion leaves stale completion annotations, causing evaluation bypass on rule recreation" #247).

Migration path from "legacy" keys to UID based keys:

  1. On reconcile after the fix is implemented, the controller reads any legacy readiness.k8s.io/bootstrap-completed-<name> keys on the node, writes the equivalent UID-key record, and deletes the legacy key. This is an idempotent, once-per-node action.
  2. Rules whose names were too long would never have succeeded in writing the legacy key, as the patch would always have failed, so no migration plan is needed for them.
  3. New nodes will simply start setting the key in the new UID format.
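The migration steps above can be sketched as a pure function over a node's annotation map. The function name migrateLegacyAnnotation, its signature, and the constant are assumptions for illustration; the actual controller code would additionally need to patch the Node object through the API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const keyPrefix = "readiness.k8s.io/bootstrap-completed-"

// migrateLegacyAnnotation (hypothetical helper) moves a legacy name-based key
// to the UID-based key. It reports whether the map changed, so calling it
// again after a successful migration is a no-op. annotations must be non-nil
// when a legacy key is present.
func migrateLegacyAnnotation(annotations map[string]string, ruleName, ruleUID string) (bool, error) {
	legacyKey := keyPrefix + ruleName
	uidKey := keyPrefix + ruleUID
	if legacyKey == uidKey {
		return false, nil // degenerate case: nothing to move
	}
	if _, ok := annotations[legacyKey]; !ok {
		return false, nil // no legacy key, or already migrated
	}
	value, err := json.Marshal(map[string]string{"rule": ruleName})
	if err != nil {
		return false, err
	}
	annotations[uidKey] = string(value) // write the UID-key record
	delete(annotations, legacyKey)      // drop the legacy key
	return true, nil
}

func main() {
	node := map[string]string{keyPrefix + "gpu-ready": "true"}
	uid := "3f2c1a9e-7b4d-4c1e-9a55-0d6e8f1b2c3d" // example UID

	changed, err := migrateLegacyAnnotation(node, "gpu-ready", uid)
	fmt.Println(changed, err) // true <nil>

	changed, err = migrateLegacyAnnotation(node, "gpu-ready", uid)
	fmt.Println(changed, err) // false <nil>

	fmt.Println(node[keyPrefix+uid]) // {"rule":"gpu-ready"}
}
```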

cc @AvineshTripathi

@ajaysundark

Contributor

@vishnukothakapu are you still available to take this fix?

@ajaysundark ajaysundark added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jun 24, 2026
@vishnukothakapu

Author

Thanks @ajaysundark, I'm still available and happy to take this forward.

I agree with the UID-based annotation approach. Using rule.GetUID() avoids key length issues, eliminates the need for hashing, and correctly handles the delete/recreate scenario from #247.

I'll rework the implementation to use UID-based keys, add the legacy annotation migration logic, update the tests, and push an update to this PR soon.

@vishnukothakapu vishnukothakapu force-pushed the fix-annotation-key-length branch from 6041d35 to 946480d Compare June 30, 2026 14:10
@kubernetes-prow kubernetes-prow Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 30, 2026
@vishnukothakapu

Author

Hi @ajaysundark, @AvineshTripathi, and @Karthik-K-N,

I've pushed the updated implementation based on the design discussed above.

Tests are updated and passing. Ready for your review!

@vishnukothakapu vishnukothakapu force-pushed the fix-annotation-key-length branch from 946480d to 2ff507d Compare June 30, 2026 14:26
@ajaysundark

Contributor

/ok-to-test

@kubernetes-prow kubernetes-prow Bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 30, 2026
@vishnukothakapu vishnukothakapu force-pushed the fix-annotation-key-length branch from 1bed1e4 to 1776f2a Compare June 30, 2026 18:51
@vishnukothakapu vishnukothakapu force-pushed the fix-annotation-key-length branch from 1776f2a to 1b45c95 Compare July 1, 2026 12:04
Labels

  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • kind/bug (Categorizes issue or PR as related to a bug.)
  • ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
  • priority/critical-urgent (Highest priority. Must be actively worked on as someone's top priority right now.)
  • size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] Annotation key length limit exceeded for long NodeReadinessRule names

5 participants