feat: Add selector and bootstrap observability metrics #286
Signed-off-by: Rawad Hossain <rawad.hossain00@gmail.com>
Description
This PR adds three new metrics. Two metrics are fully implemented, while the third is left with TODOs pending discussion.
What each metric does
- `node_readiness_selector_matched_nodes_total`: Tracks how many nodes currently match a rule's spec. If a rule's `NodeSelector` matches no nodes, the controller performs no work and produces no other signal. This metric makes those misconfigurations immediately visible.
- `node_readiness_bootstrap_completion_errors_total`: Counts failures writing the bootstrap completion annotation. If this write fails, the node continues to be re-evaluated even though bootstrap completed. This metric makes those failures visible.
- `node_readiness_bootstrap_nrc_duration_seconds`: Measures only the time NRC itself held a node, from the first taint until bootstrap completion. Excludes pre-NRC boot time. It is registered, but the recording logic is deferred pending discussion on the timestamp anchor.
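To make the selector metric concrete, here is a minimal sketch of the count it would report for one rule. It is not the code in this PR: the `Node` type, `matches`, and `countMatchedNodes` are hypothetical names, and the selector is simplified to exact key/value label matching.

```go
package main

import "fmt"

// Node is a hypothetical, simplified stand-in for a Kubernetes node:
// only the name and labels matter for selector matching.
type Node struct {
	Name   string
	Labels map[string]string
}

// matches reports whether every key/value pair in selector is present in labels.
// Assumption: exact-match semantics only; an empty selector matches every node.
func matches(selector, labels map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// countMatchedNodes returns the value a per-rule gauge would be set to.
func countMatchedNodes(selector map[string]string, nodes []Node) int {
	n := 0
	for _, node := range nodes {
		if matches(selector, node.Labels) {
			n++
		}
	}
	return n
}

func main() {
	nodes := []Node{
		{Name: "gpu-0", Labels: map[string]string{"pool": "gpu"}},
		{Name: "gpu-1", Labels: map[string]string{"pool": "gpu"}},
		{Name: "cpu-0", Labels: map[string]string{"pool": "cpu"}},
	}
	fmt.Println(countMatchedNodes(map[string]string{"pool": "gpu"}, nodes))      // 2
	fmt.Println(countMatchedNodes(map[string]string{"pool": "gpu-typo"}, nodes)) // 0
}
```

A value of 0 for a rule is the case the metric is meant to surface: a selector with a typo matches nothing, so the controller would otherwise stay silent.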
The two approaches I see are:
- Option A: Add `readiness.k8s.io/taint-applied-<rule>` to `node.ObjectMeta` when the taint is first applied, and record the duration from that timestamp to bootstrap completion.
- Option B: Use `lastTransitionTime` as the start timestamp. This needs a `Status().Patch()` call and `nodes/status` write permissions.

I left TODOs so the implementation can be completed once we agree on the approach.
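For discussion, Option A's recording step could look roughly like the sketch below. It makes assumptions the PR has not settled: the timestamp is stored as an annotation value in RFC 3339 format, and `taintAppliedKey`, `nrcHoldDuration`, and `mustParse` are hypothetical helper names. The returned duration is what would be observed into `node_readiness_bootstrap_nrc_duration_seconds`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoAnchor signals that the start timestamp was never written.
var errNoAnchor = errors.New("taint-applied timestamp not found")

// taintAppliedKey builds the per-rule key proposed in Option A.
func taintAppliedKey(rule string) string {
	return "readiness.k8s.io/taint-applied-" + rule
}

// mustParse is a small demo helper that parses an RFC 3339 timestamp or panics.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// nrcHoldDuration returns how long NRC held the node for the given rule.
// Assumption: the start time is an RFC 3339 annotation value.
func nrcHoldDuration(annotations map[string]string, rule string, completedAt time.Time) (time.Duration, error) {
	raw, ok := annotations[taintAppliedKey(rule)]
	if !ok {
		return 0, errNoAnchor
	}
	start, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return 0, fmt.Errorf("parse %q: %w", raw, err)
	}
	if completedAt.Before(start) {
		return 0, errors.New("completion precedes taint application")
	}
	return completedAt.Sub(start), nil
}

func main() {
	ann := map[string]string{taintAppliedKey("gpu-rule"): "2024-01-01T10:00:00Z"}
	d, err := nrcHoldDuration(ann, "gpu-rule", mustParse("2024-01-01T10:02:30Z"))
	fmt.Println(d.Seconds(), err) // 150 <nil>
}
```

Returning an error when the anchor is missing, instead of recording zero, keeps nodes that were tainted before the upgrade from skewing the histogram.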
Related to Issue #182
Type of Change
/kind feature
Testing
Checklist
- `make test` passes
- `make lint` passes