feat(guardrails): Update benchmark job; add latency analysis step by JashG · Pull Request #359 · NVIDIA-NeMo/nemo-platform

JashG · 2026-06-16T21:09:20Z

Summary

guardrails-benchmark job

Updates the guardrails-benchmark CI step:

Run two variants of the benchmark: one without Guardrails, and one with Guardrails (content safety input and output checks). This enables checking overhead added by the nemo-guardrails plugin.

guardrails-benchmark-analyze job

Updates the guardrails-benchmark-analyze CI step:

Outputs three tables:
- Measured Latencies: Total e2e latency for both benchmarks.
- Platform Overhead: Latency added by NMP (i.e. latencies without counting the mock LLM calls).
- Guardrails Overhead vs. Baseline: Latency added by Guardrails compared to pre-defined baselines. These baselines are the average of several runs in CI. I added a "reasonable +/- threshold" to account for potential latency deviations.
The job fails if the Guardrails overhead exceeds the reasonable threshold vs. the baseline. Currently, this job won't block the pipeline.

Benchmark harness

run.py now accepts --variant {with-guardrails, without-guardrails, all}. all runs both variants sequentially against the same NMP, writing them to the same run directory under aiperf_results/<variant>/.
seeding.py creates a second control VirtualModel (no-guardrails-vm) alongside the existing guarded one.
aiperf_runner.py accepts a model_ref so each variant's sweep targets the correct VM.
New analyze.py (stdlib-only) reads both variants from a run dir and prints three tables: measured latencies, platform overhead (mock-LLM time subtracted), and the baseline check.

Summary by CodeRabbit

New Features
- Added variant-based guardrails benchmarks (“with guardrails” vs “without guardrails”) with separate run artifacts.
- Introduced a post-run benchmark analyzer to compare latency deltas and estimate platform overhead, with optional strict baseline checks.
- Added deterministic mock LLM configurations to improve repeatable benchmark runs.
Bug Fixes
- Improved benchmark/CI consistency with deterministic run identifiers and stricter regression gating.
Documentation
- Updated benchmark READMEs with the new variant workflow and mock LLM setup details.
Tests
- Updated unit tests for the new “without guardrails” control virtual model and expected outputs.

github-actions · 2026-06-16T21:20:16Z

Suite	Lines Covered	Line Rate	Branch Rate
Unit Tests	21929/28733	76.3%	61.0%
Integration Tests	N/A	N/A	N/A

coderabbitai · 2026-06-23T14:20:29Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6f1d0554-4045-4d8d-b136-9e9276a27bba

📥 Commits

Reviewing files that changed from the base of the PR and between b3c9b97 and db48258.

📒 Files selected for processing (2)

.github/actions/changes/action.yaml
.github/workflows/ci.yaml

✅ Files skipped from review due to trivial changes (1)

.github/actions/changes/action.yaml

🚧 Files skipped from review as they are similar to previous changes (1)

.github/workflows/ci.yaml

📝 Walkthrough

Walkthrough

Adds guardrails benchmark change detection, dual benchmark variants, post-run analysis, and supporting benchmark configs, seeding, tests, workflow, and docs.

Changes

Guardrails benchmark workflow

Layer / File(s)	Summary
Change detection and benchmark trigger `.github/actions/changes/action.yaml`, `.github/workflows/ci.yaml`	Adds a `guardrails-benchmark` change flag, extends workflow outputs, and uses it to gate the benchmark matrix and analysis jobs.
Variant constants and control VM `plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/{constants.py,seeding.py}`, `plugins/nemo-guardrails/tests/unit/benchmarks/test_seeding.py`	Adds variant constants and a no-guardrails virtual model; seeding returns both VM names and tests assert the new control VM and artifact.
Run paths and mock LLM configs `plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/paths.py`, `plugins/nemo-guardrails/benchmarks/configs/mock_llm/*`, `plugins/nemo-guardrails/benchmarks/configs/mock_llm/README.md`	Run path helpers add mock LLM env inputs and per-variant paths; mock LLM env files and their README define deterministic benchmark settings.
Variant benchmark execution `plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/{run.py,aiperf_runner.py}`	The harness validates local configs, selects a variant-specific VM reference, prepares per-variant runtime configs, runs each variant, and aggregates outcomes.
Latency comparison and baseline checks `plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/analyze.py`, `plugins/nemo-guardrails/benchmarks/README.md`	The analyzer loads per-concurrency results, compares both variants, renders measured and overhead tables, checks delta p50 against baseline tolerances, and the README documents the new analysis flow.
CI analysis job and benchmark docs `.github/workflows/ci.yaml`, `plugins/nemo-guardrails/benchmarks/README.md`	The CI workflow adds the matrix benchmark job, the follow-up analysis job, and updated status dependencies; the README documents the new layout and regression gate flow.

Possibly related PRs

NVIDIA-NeMo/nemo-platform#80: Introduced the guardrails benchmark harness that this PR extends with dual variants and analysis.

Suggested reviewers

gabwow

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title matches the main change: updating the guardrails benchmark job and adding latency analysis.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch jgulabrai/guardrails-benchmark-analysis

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.github/workflows/ci.yaml (1)
1306-1334: 🩺 Stability & Availability | 🟠 Major

Add guardrails-benchmark and guardrails-benchmark-analyze to ci-status dependencies.

The ci-status job (lines 1306-1334) omits both benchmark jobs. If ci-status is the required gate and benchmark jobs fail (when triggered by changes or dispatch), those failures are missed.
Suggested patch
  ci-status:
    name: CI status
    needs:
      - changes
+     - guardrails-benchmark
+     - guardrails-benchmark-analyze
      - actionlint
      - docker-bake-graph
      - build-cpu-smoke-images
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci.yaml around lines 1306 - 1334, The ci-status job is
missing guardrails-benchmark and guardrails-benchmark-analyze from its needs
dependencies list. Add both of these job names to the needs array under the
ci-status job (after the other existing job names like web-studio-e2e and
opa-policy-test) to ensure benchmark job failures are reflected in the overall
CI status gate.

🧹 Nitpick comments (1)

plugins/nemo-guardrails/benchmarks/configs/mock_llm/README.md (1)
1-22: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Restructure this page to one Diataxis type with required sections.

This page is currently descriptive but missing required doc scaffolding. Convert it into a single quadrant (likely REFERENCE), add a top prerequisites block, add concrete CLI/Python examples in a tab-set, and end with a “Next Steps” section linking to related benchmark docs.
As per coding guidelines, Each documentation page should fit ONE Diataxis quadrant, Always list prerequisites at the top, Provide both Python SDK and CLI examples in tab-sets, and Include 'Next Steps' section at the end.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-guardrails/benchmarks/configs/mock_llm/README.md` around lines 1
- 22, Restructure the Mock LLM configurations README to follow Diataxis
REFERENCE format by adding a prerequisites section at the top that lists
required dependencies and setup steps, reorganizing the existing descriptive
content into a properly structured REFERENCE page, adding a new section with
concrete CLI and Python SDK examples in tab-set format demonstrating how to use
and configure the app-llm.env and content-safety-llm.env files, and ending with
a "Next Steps" section that links to related benchmark documentation and guides
for running benchmarks with these configurations.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/ci.yaml:
- Around line 1142-1147: The analyze job's if condition prevents it from running
when the guardrails-benchmark job fails, even though the job handles missing
artifacts gracefully with continue-on-error. Modify the if condition that
currently checks for !cancelled() and the workflow_dispatch or changes.outputs
conditions to use always() instead, so the analyze job runs regardless of the
guardrails-benchmark job's status and can process partial artifacts when
available.
- Around line 1151-1154: The checkout action for nemo-platform is missing the
persist-credentials security setting. Add `persist-credentials: false` to the
`with:` section of the "Checkout nemo-platform" step to prevent the GitHub
Actions authentication token from being stored in the Git configuration, since
this job only needs to download and upload artifacts without performing Git
operations.

In `@plugins/nemo-guardrails/benchmarks/README.md`:
- Around line 227-229: The worked example in the benchmark README for c=16
tolerance incorrectly states the tolerance as 300 ms when the actual value used
in analyze.py is 200 ms. Update lines 227-229 to replace the 300 ms tolerance
value with the correct 200 ms value from analyze.py line 46, and recalculate the
pass/fail example thresholds accordingly so that the example accurately
demonstrates how the 200 ms tolerance determines whether a performance run
passes or fails based on the delta_p50 values.

In `@plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/seeding.py`:
- Around line 181-194: The control VM creation in the seeding flow can reuse an
existing virtual model unchanged when create(..., exist_ok=True) returns it, so
the empty request_middleware and response_middleware settings may be ignored.
Update the no_guardrails_vm handling in the seeding logic by either calling
client.inference.virtual_models.update() after creation to explicitly clear
middleware, or deleting any existing NO_GUARDRAILS_VM_NAME before recreating it,
so the control VM always stays middleware-free.

---

Outside diff comments:
In @.github/workflows/ci.yaml:
- Around line 1306-1334: The ci-status job is missing guardrails-benchmark and
guardrails-benchmark-analyze from its needs dependencies list. Add both of these
job names to the needs array under the ci-status job (after the other existing
job names like web-studio-e2e and opa-policy-test) to ensure benchmark job
failures are reflected in the overall CI status gate.

---

Nitpick comments:
In `@plugins/nemo-guardrails/benchmarks/configs/mock_llm/README.md`:
- Around line 1-22: Restructure the Mock LLM configurations README to follow
Diataxis REFERENCE format by adding a prerequisites section at the top that
lists required dependencies and setup steps, reorganizing the existing
descriptive content into a properly structured REFERENCE page, adding a new
section with concrete CLI and Python SDK examples in tab-set format
demonstrating how to use and configure the app-llm.env and
content-safety-llm.env files, and ending with a "Next Steps" section that links
to related benchmark documentation and guides for running benchmarks with these
configurations.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1cd5064a-464d-418f-a424-914014ecf570

📥 Commits

Reviewing files that changed from the base of the PR and between c48f52c and 98f5d45.

📒 Files selected for processing (13)

.github/actions/changes/action.yaml
.github/workflows/ci.yaml
plugins/nemo-guardrails/benchmarks/README.md
plugins/nemo-guardrails/benchmarks/configs/mock_llm/README.md
plugins/nemo-guardrails/benchmarks/configs/mock_llm/app-llm.env
plugins/nemo-guardrails/benchmarks/configs/mock_llm/content-safety-llm.env
plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/aiperf_runner.py
plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/analyze.py
plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/constants.py
plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/paths.py
plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/run.py
plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/seeding.py
plugins/nemo-guardrails/tests/unit/benchmarks/test_seeding.py

albcui · 2026-06-24T19:03:59Z

Why not add guardrails-benchmark/guardrails-benchmark-analyze to this list to gate CI? Is the reasoning that we haven't established tight baselines yet?

Since this only runs right now on changes to services/guardrails and plugins/nemo-guardrails, we should add it to the list. Updated.

albcui · 2026-06-24T19:11:19Z

+```
+
+Why only delta_p50 (and not absolute with-guardrails p50)? delta_p50
+isolates the middleware's contribution — shared CI runner noise cancels


shared CI runner noise cancels ...

This is not quite true because in CI, the two jobs could run on separate runners (machines). If runnerA happens to be slow and B fast, the delta is inflated (and vice versa). So the inter-runner noise is actually not cancelled.

Good catch, updated

albcui · 2026-06-24T19:14:41Z

+  the numbers.
+
+Worked example: at c=16 the override is 200 ms, so a run with observed
+delta_p50 = 1689 (diff +199 from baseline 1390) passes; observed


the math is off... 1390 + 199 = 1589

Heh thanks, fixed.

Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>

github-actions Bot added the feat label Jun 16, 2026

JashG force-pushed the jgulabrai/guardrails-benchmark-analysis branch from 6115c89 to f78ffa8 Compare June 22, 2026 14:12

JashG marked this pull request as ready for review June 23, 2026 14:10

JashG requested review from a team as code owners June 23, 2026 14:10

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread .github/workflows/ci.yaml

Comment thread .github/workflows/ci.yaml

Comment thread plugins/nemo-guardrails/benchmarks/README.md Outdated

Comment thread plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/seeding.py

JashG force-pushed the jgulabrai/guardrails-benchmark-analysis branch from ad12028 to dcaf12f Compare June 23, 2026 14:36

JashG requested review from albcui and gabwow June 23, 2026 14:37

JashG changed the title ~~feat(guardrails): Benchmark analysis~~ feat(guardrails): Update benchmark job; add latency analysis step Jun 23, 2026

albcui reviewed Jun 24, 2026

View reviewed changes

JashG force-pushed the jgulabrai/guardrails-benchmark-analysis branch from 08025e7 to b3c9b97 Compare June 29, 2026 22:01

JashG added 9 commits June 29, 2026 18:06

feat(guardrails): Benchmark analysis in-progress

af48582

Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>

Fix CI

53c430d

Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>

always run benchmark

ed87986

Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>

Cleanup

047129f

Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>

Updates

2603eea

Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>

Clean up CI config and baselines

0f25cc4

Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>

Minor ReadME cleanup

d6105f2

Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>

Address CodeRabbit

6c7b9fe

Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>

Address MR comments

db48258

Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>

JashG force-pushed the jgulabrai/guardrails-benchmark-analysis branch from b3c9b97 to db48258 Compare June 29, 2026 22:06

Uh oh!

Conversation

JashG commented Jun 16, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

guardrails-benchmark job

guardrails-benchmark-analyze job

Benchmark harness

Summary by CodeRabbit

Uh oh!

github-actions Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

albcui Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

JashG Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

albcui Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

JashG Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

albcui Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

JashG Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

JashG commented Jun 16, 2026 •

edited by coderabbitai Bot

Loading

github-actions Bot commented Jun 16, 2026 •

edited

Loading

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading