store the pod and machine snapshot metrics on pending pods that never got scheduled too #5155

Open
droslean wants to merge 1 commit into openshift:main from droslean:pod-gracefull

Conversation

@droslean
Member

@droslean droslean commented May 5, 2026

/cc @openshift/test-platform

Summary by CodeRabbit

  • Bug Fixes
    • Improved error handling for step pods that fail during the initialization phase, ensuring that metrics are recorded and system state is updated during cleanup.

… got scheduled too

Signed-off-by: Nikolaos Moraitis <nmoraiti@redhat.com>
@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci Bot requested a review from a team May 5, 2026 13:56
@coderabbitai

coderabbitai Bot commented May 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a4c9197a-1c5b-4eb8-871b-94cb20499b04

📥 Commits

Reviewing files that changed from the base of the PR and between 483a4a2 and cd33d01.

📒 Files selected for processing (1)
  • pkg/steps/multi_stage/run.go

📝 Walkthrough

When a step pod fails while still in the Pending phase, the cleanup path now records pod lifecycle metrics and machine snapshot metrics before deleting the pod. The two metric recording calls are added to the error handling block of runPod in pkg/steps/multi_stage/run.go.
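
The ordering described above (record metrics first, then delete the pending pod) can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: only the method names StorePodLifecycleMetrics and StoreMachinesSnapshot come from the review summary, while their signatures, the pod, metricsRecorder, podDeleter, and callLog types, the cleanupFailedPod helper, and the behavior for non-pending pods are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// podPhase mirrors the Kubernetes pod phase strings used in this sketch.
type podPhase string

const (
	phasePending podPhase = "Pending"
	phaseRunning podPhase = "Running"
)

// pod is a simplified stand-in for a Kubernetes Pod object (assumption).
type pod struct {
	Name  string
	Phase podPhase
}

// metricsRecorder is hypothetical: the method names appear in the review
// summary, but these signatures are assumed.
type metricsRecorder interface {
	StorePodLifecycleMetrics(p pod)
	StoreMachinesSnapshot()
}

// podDeleter is a hypothetical abstraction over the cluster client.
type podDeleter interface {
	Delete(name string) error
}

// errPodFailed is a sample error used to exercise the error path.
var errPodFailed = errors.New("pod failed")

// cleanupFailedPod sketches the error path: if the pod never left Pending,
// record both metrics first and only then delete the pod. The original run
// error is always returned to the caller.
func cleanupFailedPod(p pod, runErr error, m metricsRecorder, d podDeleter) error {
	if runErr == nil {
		return nil
	}
	if p.Phase != phasePending {
		// Assumption: non-pending pods are handled elsewhere.
		return runErr
	}
	m.StorePodLifecycleMetrics(p)
	m.StoreMachinesSnapshot()
	if err := d.Delete(p.Name); err != nil {
		return fmt.Errorf("%w (also failed to delete pending pod %s: %v)", runErr, p.Name, err)
	}
	return runErr
}

// callLog is an in-memory fake that records calls in the order they happen.
type callLog struct{ calls []string }

func (c *callLog) StorePodLifecycleMetrics(p pod) {
	c.calls = append(c.calls, "lifecycle:"+p.Name)
}

func (c *callLog) StoreMachinesSnapshot() {
	c.calls = append(c.calls, "snapshot")
}

func (c *callLog) Delete(name string) error {
	c.calls = append(c.calls, "delete:"+name)
	return nil
}
```

Calling cleanupFailedPod with a Pending pod and a single callLog passed as both recorder and deleter leaves the log with the lifecycle entry, then the snapshot entry, then the delete entry, which makes the "metrics before deletion" ordering easy to inspect.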

Changes

Pod Lifecycle Metrics on Failure

| Layer / File(s) | Summary |
| --- | --- |
| Metrics Recording (pkg/steps/multi_stage/run.go) | Pod lifecycle and machine snapshot metrics are captured via StorePodLifecycleMetrics() and StoreMachinesSnapshot() before pending pod deletion in the error path of runPod. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 13 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Coverage For New Features | ⚠️ Warning | New metrics calls (StorePodLifecycleMetrics, StoreMachinesSnapshot) added to pending pod error handling, but test only verifies pod deletion, not metrics calls. | Add test assertions to verify StorePodLifecycleMetrics and StoreMachinesSnapshot are called when handling pending pod errors. |
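
One way to approach the suggested assertions is a spy that counts the metric calls, sketched below. Everything here is hypothetical: spyRecorder, handlePendingPodError, and assertPendingPodHandled are illustrative names, the method signatures are assumed, and the repository's actual test helpers may look quite different. In a real _test.go file the same checks would normally report through testing.T rather than return an error.

```go
package main

import "fmt"

// spyRecorder is a hypothetical test double that counts metric calls and
// remembers which pods were deleted.
type spyRecorder struct {
	lifecycleCalls int
	snapshotCalls  int
	deleted        []string
}

// The signatures below are assumptions; only the method names
// StorePodLifecycleMetrics and StoreMachinesSnapshot come from the review.
func (s *spyRecorder) StorePodLifecycleMetrics(podName string) { s.lifecycleCalls++ }
func (s *spyRecorder) StoreMachinesSnapshot()                  { s.snapshotCalls++ }
func (s *spyRecorder) DeletePod(podName string)                { s.deleted = append(s.deleted, podName) }

// handlePendingPodError is a stand-in for the code under test: it records
// both metrics and then deletes the pending pod.
func handlePendingPodError(s *spyRecorder, podName string) {
	s.StorePodLifecycleMetrics(podName)
	s.StoreMachinesSnapshot()
	s.DeletePod(podName)
}

// assertPendingPodHandled returns an error describing the first unmet
// expectation, or nil when metrics and deletion were all observed once.
func assertPendingPodHandled(s *spyRecorder, podName string) error {
	if s.lifecycleCalls != 1 {
		return fmt.Errorf("StorePodLifecycleMetrics called %d times, want 1", s.lifecycleCalls)
	}
	if s.snapshotCalls != 1 {
		return fmt.Errorf("StoreMachinesSnapshot called %d times, want 1", s.snapshotCalls)
	}
	if len(s.deleted) != 1 || s.deleted[0] != podName {
		return fmt.Errorf("deleted pods = %v, want [%s]", s.deleted, podName)
	}
	return nil
}
```

Checking the call counts alongside the deletion covers the gap named in the warning, where the existing test verifies only that the pod was deleted.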
✅ Passed checks (13 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly matches the changeset: adding pod and machine snapshot metrics recording for pending pods that never got scheduled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Go Error Handling | ✅ Passed | New metric calls (StorePodLifecycleMetrics, StoreMachinesSnapshot) return void. Error handling is idiomatic: errors wrapped with fmt.Errorf %w, proper nil checks, no ignored errors, no panic. |
| Stable And Deterministic Test Names | ✅ Passed | The PR only modifies pkg/steps/multi_stage/run.go, adding metrics recording code. The repository does not use Ginkgo for testing and contains no Ginkgo test names. The custom check is not applicable. |
| Test Structure And Quality | ✅ Passed | Repository uses traditional Go testing, not Ginkgo. Custom check targets Ginkgo patterns (It/Describe/BeforeEach blocks). Not applicable. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added in this PR. Changes only modify pkg/steps/multi_stage/run.go to record pod metrics. Check not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR modifies infrastructure code, not e2e tests. Check applies only to new Ginkgo tests. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR adds metrics collection for pending pods but introduces no scheduling constraints. Change affects observability code only, not deployment manifests, operator configs, or scheduling rules. |
| Ote Binary Stdout Contract | ✅ Passed | Added metrics recording calls don't write to stdout - they record to internal event channel. Code is in normal runPod() method, not process-level. ci-operator is not an OTE binary. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | Check not applicable. PR modifies infrastructure code (pkg/steps/multi_stage/run.go), not Ginkgo e2e tests. No new test code with IPv4 assumptions or external connectivity is added. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@droslean
Member Author

droslean commented May 5, 2026

/override ci/prow/e2e

@droslean
Member Author

droslean commented May 5, 2026

/override ci/prow/images

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 5, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepsm007, droslean

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 5, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

@droslean: Overrode contexts on behalf of droslean: ci/prow/e2e

Details

In response to this:

/override ci/prow/e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

@droslean: Overrode contexts on behalf of droslean: ci/prow/images

Details

In response to this:

/override ci/prow/images


@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 483a4a2 and 2 for PR HEAD cd33d01 in total

@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

@droslean: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/breaking-changes | cd33d01 | link | false | /test breaking-changes |
| ci/prow/integration | cd33d01 | link | true | /test integration |
| ci/prow/validate-vendor | cd33d01 | link | true | /test validate-vendor |
| ci/prow/frontend-checks | cd33d01 | link | true | /test frontend-checks |
| ci/prow/checkconfig | cd33d01 | link | true | /test checkconfig |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • lgtm: Indicates that a PR is ready to be merged.
