Skip to content

feat(guardrails): Update benchmark job; add latency analysis step#359

Open
JashG wants to merge 9 commits into
mainfrom
jgulabrai/guardrails-benchmark-analysis
Open

feat(guardrails): Update benchmark job; add latency analysis step#359
JashG wants to merge 9 commits into
mainfrom
jgulabrai/guardrails-benchmark-analysis

Conversation

@JashG

@JashG JashG commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

guardrails-benchmark job

Updates the guardrails-benchmark CI step:

  • Run two variants of the benchmark: one without Guardrails, and one with Guardrails (content safety input and output checks). This enables checking overhead added by the nemo-guardrails plugin.

guardrails-benchmark-analyze job

Updates the guardrails-benchmark-analyze CI step:

  • Outputs three tables:
    • Measured Latencies: Total e2e latency for both benchmarks.
    • Platform Overhead: Latency added by NMP (i.e. latencies without counting the mock LLM calls).
    • Guardrails Overhead vs. Baseline: Latency added by Guardrails compared to pre-defined baselines. These baselines are the average of several runs in CI. I added a "reasonable +/- threshold" to account for potential latency deviations.
  • The job fails if the Guardrails overhead exceeds the reasonable threshold vs. the baseline. Currently, this job won't block the pipeline.

Benchmark harness

  • run.py now accepts --variant {with-guardrails, without-guardrails, all}. all runs both variants sequentially against the same NMP, writing them to the same run directory under aiperf_results/<variant>/.
  • seeding.py creates a second control VirtualModel (no-guardrails-vm) alongside the existing guarded one.
  • aiperf_runner.py accepts a model_ref so each variant's sweep targets the correct VM.
  • New analyze.py (stdlib-only) reads both variants from a run dir and prints three tables: measured latencies, platform overhead (mock-LLM time subtracted), and the baseline check.

Summary by CodeRabbit

  • New Features
    • Added variant-based guardrails benchmarks (“with guardrails” vs “without guardrails”) with separate run artifacts.
    • Introduced a post-run benchmark analyzer to compare latency deltas and estimate platform overhead, with optional strict baseline checks.
    • Added deterministic mock LLM configurations to improve repeatable benchmark runs.
  • Bug Fixes
    • Improved benchmark/CI consistency with deterministic run identifiers and stricter regression gating.
  • Documentation
    • Updated benchmark READMEs with the new variant workflow and mock LLM setup details.
  • Tests
    • Updated unit tests for the new “without guardrails” control virtual model and expected outputs.

@github-actions github-actions Bot added the feat label Jun 16, 2026
@github-actions

github-actions Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor
Suite Lines Covered Line Rate Branch Rate
Unit Tests 21929/28733 76.3% 61.0%
Integration Tests N/A N/A N/A

@JashG JashG force-pushed the jgulabrai/guardrails-benchmark-analysis branch from 6115c89 to f78ffa8 Compare June 22, 2026 14:12
@JashG JashG marked this pull request as ready for review June 23, 2026 14:10
@JashG JashG requested review from a team as code owners June 23, 2026 14:10
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6f1d0554-4045-4d8d-b136-9e9276a27bba

📥 Commits

Reviewing files that changed from the base of the PR and between b3c9b97 and db48258.

📒 Files selected for processing (2)
  • .github/actions/changes/action.yaml
  • .github/workflows/ci.yaml
✅ Files skipped from review due to trivial changes (1)
  • .github/actions/changes/action.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/ci.yaml

📝 Walkthrough

Walkthrough

Adds guardrails benchmark change detection, dual benchmark variants, post-run analysis, and supporting benchmark configs, seeding, tests, workflow, and docs.

Changes

Guardrails benchmark workflow

Layer / File(s) Summary
Change detection and benchmark trigger
.github/actions/changes/action.yaml, .github/workflows/ci.yaml
Adds a guardrails-benchmark change flag, extends workflow outputs, and uses it to gate the benchmark matrix and analysis jobs.
Variant constants and control VM
plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/{constants.py,seeding.py}, plugins/nemo-guardrails/tests/unit/benchmarks/test_seeding.py
Adds variant constants and a no-guardrails virtual model; seeding returns both VM names and tests assert the new control VM and artifact.
Run paths and mock LLM configs
plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/paths.py, plugins/nemo-guardrails/benchmarks/configs/mock_llm/*, plugins/nemo-guardrails/benchmarks/configs/mock_llm/README.md
Run path helpers add mock LLM env inputs and per-variant paths; mock LLM env files and their README define deterministic benchmark settings.
Variant benchmark execution
plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/{run.py,aiperf_runner.py}
The harness validates local configs, selects a variant-specific VM reference, prepares per-variant runtime configs, runs each variant, and aggregates outcomes.
Latency comparison and baseline checks
plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/analyze.py, plugins/nemo-guardrails/benchmarks/README.md
The analyzer loads per-concurrency results, compares both variants, renders measured and overhead tables, checks delta p50 against baseline tolerances, and the README documents the new analysis flow.
CI analysis job and benchmark docs
.github/workflows/ci.yaml, plugins/nemo-guardrails/benchmarks/README.md
The CI workflow adds the matrix benchmark job, the follow-up analysis job, and updated status dependencies; the README documents the new layout and regression gate flow.

Possibly related PRs

Suggested reviewers

  • gabwow
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title matches the main change: updating the guardrails benchmark job and adding latency analysis.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jgulabrai/guardrails-benchmark-analysis

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/ci.yaml (1)

1306-1334: 🩺 Stability & Availability | 🟠 Major

Add guardrails-benchmark and guardrails-benchmark-analyze to ci-status dependencies.

The ci-status job (lines 1306-1334) omits both benchmark jobs. If ci-status is the required gate and benchmark jobs fail (when triggered by changes or dispatch), those failures are missed.

Suggested patch
  ci-status:
    name: CI status
    needs:
      - changes
+     - guardrails-benchmark
+     - guardrails-benchmark-analyze
      - actionlint
      - docker-bake-graph
      - build-cpu-smoke-images
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci.yaml around lines 1306 - 1334, The ci-status job is
missing guardrails-benchmark and guardrails-benchmark-analyze from its needs
dependencies list. Add both of these job names to the needs array under the
ci-status job (after the other existing job names like web-studio-e2e and
opa-policy-test) to ensure benchmark job failures are reflected in the overall
CI status gate.
🧹 Nitpick comments (1)
plugins/nemo-guardrails/benchmarks/configs/mock_llm/README.md (1)

1-22: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Restructure this page to one Diataxis type with required sections.

This page is currently descriptive but missing required doc scaffolding. Convert it into a single quadrant (likely REFERENCE), add a top prerequisites block, add concrete CLI/Python examples in a tab-set, and end with a “Next Steps” section linking to related benchmark docs.
As per coding guidelines, Each documentation page should fit ONE Diataxis quadrant, Always list prerequisites at the top, Provide both Python SDK and CLI examples in tab-sets, and Include 'Next Steps' section at the end.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-guardrails/benchmarks/configs/mock_llm/README.md` around lines 1
- 22, Restructure the Mock LLM configurations README to follow Diataxis
REFERENCE format by adding a prerequisites section at the top that lists
required dependencies and setup steps, reorganizing the existing descriptive
content into a properly structured REFERENCE page, adding a new section with
concrete CLI and Python SDK examples in tab-set format demonstrating how to use
and configure the app-llm.env and content-safety-llm.env files, and ending with
a "Next Steps" section that links to related benchmark documentation and guides
for running benchmarks with these configurations.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/ci.yaml:
- Around line 1142-1147: The analyze job's if condition prevents it from running
when the guardrails-benchmark job fails, even though the job handles missing
artifacts gracefully with continue-on-error. Modify the if condition that
currently checks for !cancelled() and the workflow_dispatch or changes.outputs
conditions to use always() instead, so the analyze job runs regardless of the
guardrails-benchmark job's status and can process partial artifacts when
available.
- Around line 1151-1154: The checkout action for nemo-platform is missing the
persist-credentials security setting. Add `persist-credentials: false` to the
`with:` section of the "Checkout nemo-platform" step to prevent the GitHub
Actions authentication token from being stored in the Git configuration, since
this job only needs to download and upload artifacts without performing Git
operations.

In `@plugins/nemo-guardrails/benchmarks/README.md`:
- Around line 227-229: The worked example in the benchmark README for c=16
tolerance incorrectly states the tolerance as 300 ms when the actual value used
in analyze.py is 200 ms. Update lines 227-229 to replace the 300 ms tolerance
value with the correct 200 ms value from analyze.py line 46, and recalculate the
pass/fail example thresholds accordingly so that the example accurately
demonstrates how the 200 ms tolerance determines whether a performance run
passes or fails based on the delta_p50 values.

In `@plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/seeding.py`:
- Around line 181-194: The control VM creation in the seeding flow can reuse an
existing virtual model unchanged when create(..., exist_ok=True) returns it, so
the empty request_middleware and response_middleware settings may be ignored.
Update the no_guardrails_vm handling in the seeding logic by either calling
client.inference.virtual_models.update() after creation to explicitly clear
middleware, or deleting any existing NO_GUARDRAILS_VM_NAME before recreating it,
so the control VM always stays middleware-free.

---

Outside diff comments:
In @.github/workflows/ci.yaml:
- Around line 1306-1334: The ci-status job is missing guardrails-benchmark and
guardrails-benchmark-analyze from its needs dependencies list. Add both of these
job names to the needs array under the ci-status job (after the other existing
job names like web-studio-e2e and opa-policy-test) to ensure benchmark job
failures are reflected in the overall CI status gate.

---

Nitpick comments:
In `@plugins/nemo-guardrails/benchmarks/configs/mock_llm/README.md`:
- Around line 1-22: Restructure the Mock LLM configurations README to follow
Diataxis REFERENCE format by adding a prerequisites section at the top that
lists required dependencies and setup steps, reorganizing the existing
descriptive content into a properly structured REFERENCE page, adding a new
section with concrete CLI and Python SDK examples in tab-set format
demonstrating how to use and configure the app-llm.env and
content-safety-llm.env files, and ending with a "Next Steps" section that links
to related benchmark documentation and guides for running benchmarks with these
configurations.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1cd5064a-464d-418f-a424-914014ecf570

📥 Commits

Reviewing files that changed from the base of the PR and between c48f52c and 98f5d45.

📒 Files selected for processing (13)
  • .github/actions/changes/action.yaml
  • .github/workflows/ci.yaml
  • plugins/nemo-guardrails/benchmarks/README.md
  • plugins/nemo-guardrails/benchmarks/configs/mock_llm/README.md
  • plugins/nemo-guardrails/benchmarks/configs/mock_llm/app-llm.env
  • plugins/nemo-guardrails/benchmarks/configs/mock_llm/content-safety-llm.env
  • plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/aiperf_runner.py
  • plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/analyze.py
  • plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/constants.py
  • plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/paths.py
  • plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/run.py
  • plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/seeding.py
  • plugins/nemo-guardrails/tests/unit/benchmarks/test_seeding.py

Comment thread .github/workflows/ci.yaml
Comment thread .github/workflows/ci.yaml
Comment thread plugins/nemo-guardrails/benchmarks/README.md Outdated
@JashG JashG force-pushed the jgulabrai/guardrails-benchmark-analysis branch from ad12028 to dcaf12f Compare June 23, 2026 14:36
@JashG JashG requested review from albcui and gabwow June 23, 2026 14:37
@JashG JashG changed the title feat(guardrails): Benchmark analysis feat(guardrails): Update benchmark job; add latency analysis step Jun 23, 2026
Comment thread .github/workflows/ci.yaml

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add guardrails-benchmark/guardrails-benchmark-analyze to this list to gate CI? Is the reasoning that we haven't established tight baselines yet?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this only runs right now on changes to services/guardrails and plugins/nemo-guardrails, we should add it to the list. Updated.

```

Why only delta_p50 (and not absolute with-guardrails p50)? delta_p50
isolates the middleware's contribution — shared CI runner noise cancels

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shared CI runner noise cancels ...

This is not quite true because in CI, the two jobs could run on separate runners (machines). If runnerA happens to be slow and B fast, the delta is inflated (and vice versa). So the inter-runner noise is actually not cancelled.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, updated

the numbers.

Worked example: at c=16 the override is 200 ms, so a run with observed
delta_p50 = 1689 (diff +199 from baseline 1390) passes; observed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the math is off... 1390 + 199 = 1589

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh thanks, fixed.

Comment thread .github/workflows/ci.yaml Outdated
Comment thread plugins/nemo-guardrails/benchmarks/README.md Outdated
@JashG JashG force-pushed the jgulabrai/guardrails-benchmark-analysis branch from 08025e7 to b3c9b97 Compare June 29, 2026 22:01
JashG added 9 commits June 29, 2026 18:06
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
@JashG JashG force-pushed the jgulabrai/guardrails-benchmark-analysis branch from b3c9b97 to db48258 Compare June 29, 2026 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants