Adds Agentic Retrieval into harness by mahikaw · Pull Request #2267 · NVIDIA/NeMo-Retriever

mahikaw · 2026-06-25T00:10:39Z

Summary

Adds opt-in agentic retrieval evaluation to harness runs.

Adds agentic: true harness config support and forwards it through the harness execution path.
Supports agentic BEIR and agentic audio recall evaluation.
Keeps standard retrieval unchanged by default.
Adds validation for agentic settings, including required LLM model, supported evaluation modes, candidate depth vs
requested metric depth, runtime bounds, and endpoint-aware temperature limits.
Documents agentic harness usage.
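The validation rules listed above could look roughly like the sketch below. It is illustrative only: `AgenticSettingsSketch`, its field names, the mode labels, and the temperature ceiling are assumptions made for this example and are not taken from the PR's `AgenticRetrievalConfig`.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed labels for the two evaluation modes the PR says are supported.
SUPPORTED_EVAL_MODES = {"beir", "audio_recall"}


@dataclass
class AgenticSettingsSketch:
    """Hypothetical stand-in, not the PR's AgenticRetrievalConfig."""

    enabled: bool = False
    llm_model: Optional[str] = None
    evaluation_mode: str = "beir"
    candidate_depth: int = 20      # candidates the agent may return
    metric_depth: int = 10         # largest k requested by the metrics
    temperature: float = 0.0
    max_temperature: float = 1.0   # assumed per-endpoint ceiling

    def __post_init__(self) -> None:
        if not self.enabled:
            return  # standard retrieval: nothing to validate
        if not self.llm_model:
            raise ValueError("agentic retrieval requires an LLM model")
        if self.evaluation_mode not in SUPPORTED_EVAL_MODES:
            raise ValueError(f"unsupported evaluation mode: {self.evaluation_mode}")
        if self.candidate_depth < self.metric_depth:
            raise ValueError("candidate depth must be >= requested metric depth")
        if not 0.0 <= self.temperature <= self.max_temperature:
            raise ValueError("temperature is outside the endpoint's allowed range")
```

Validating in `__post_init__` means an invalid combination fails when the config is built, before any ingest or LLM call is made.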

Validation

Focused agentic/harness tests: 132 passed
Hosted agentic harness run completed through ingest, indexing, agentic retrieval, final selection, and metric
output.

Follow-ups

Add structured traces for agent-loop and selection as harness artifacts using the repo’s existing runtime/reporting
terminology.
Add agentic runs to routine harness flows, such as sweep/nightly

copy-pr-bot · 2026-06-25T00:10:43Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-06-25T20:31:52Z

Greptile Summary

This PR adds opt-in agentic (ReAct) retrieval evaluation to the harness, routing BEIR and audio-recall benchmarks through a graph-backed LLM agent when agentic: true is configured. Standard dense retrieval is left entirely unchanged.

Harness integration: beir_runner.py forks at retrieval time — dense queries continue through the existing _dense_retrieve loop while the agentic path calls a new concurrent _agentic_retrieve batch; scoring, artifact writing, and metrics are shared between both strategies.
Shared validation helpers: A new agentic_options.py centralises integer-coercion and temperature/backend-depth guards, reused consistently across the CLI, query app, and AgenticRetrievalConfig.__post_init__.
Config propagation: Nine new query.* override paths are added to the harness resolver, and build_agentic_config in workflow.py provides a single derivation point used by both the single-query CLI and the batch harness path.
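A minimal sketch of the fork described above, assuming both strategies return the same `(latencies, run)` pair so that scoring and artifact writing never need to know which one ran. The function name and signatures here are hypothetical and are not the code in `beir_runner.py`.

```python
from typing import Callable, Dict, List, Tuple

Run = Dict[str, Dict[str, float]]          # query_id -> {doc_id: score}
RetrieveResult = Tuple[List[float], Run]   # (per-query latencies, run)
Strategy = Callable[[Dict[str, str]], RetrieveResult]


def retrieve_with_strategy(
    queries: Dict[str, str],
    agentic_enabled: bool,
    dense_retrieve: Strategy,
    agentic_retrieve: Strategy,
) -> RetrieveResult:
    """Pick one retrieval strategy; everything after this point is shared."""
    strategy = agentic_retrieve if agentic_enabled else dense_retrieve
    latencies, run = strategy(queries)
    # Guard the shared contract so downstream scoring sees every query.
    missing = set(queries) - set(run)
    if missing:
        raise RuntimeError(f"retrieval returned no entry for: {sorted(missing)}")
    return latencies, run
```

Keeping the branch this narrow is what lets the dense path stay unchanged: only the callable differs, not the data that flows out of it.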

Confidence Score: 5/5

Safe to merge; the agentic path is strictly opt-in and the dense path is unchanged.

The change is well-scoped: validation helpers are tested in isolation, the harness integration tests cover both the happy path and the structured-failure path, and all 132 unit tests pass. The two findings are a defensive assert that should be a RuntimeError and a stale SPDX year — neither affects runtime correctness on any active code path.

No files require special attention; beir_runner.py has a minor defensive-guard issue worth addressing before the next caller extends the function.
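The defensive-guard finding can be illustrated as follows. Python removes `assert` statements when run with `-O`, so an invariant that must hold in production is safer as an explicit check. The helper name below is hypothetical; only the `query_plan is not None` invariant comes from the review.

```python
from typing import Optional


def require_query_plan(query_plan: Optional[object]) -> object:
    """Return the plan, or fail loudly if the caller did not supply one."""
    # An `assert query_plan is not None` here would be stripped under
    # `python -O`, letting None reach the dense path and fail later with
    # a less helpful AttributeError.
    if query_plan is None:
        raise RuntimeError("dense retrieval requires a resolved query plan")
    return query_plan
```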

Important Files Changed

Filename	Overview
nemo_retriever/src/nemo_retriever/query/agentic_options.py	New shared validation helper module for agentic options; SPDX copyright year shows `2024-25` but the file is new in 2026.
nemo_retriever/src/nemo_retriever/harness/beir_runner.py	Splits dense and agentic retrieval paths; introduces an `assert` guard for the `query_plan is not None` invariant that would be silently dropped in optimized builds.
nemo_retriever/src/nemo_retriever/query/agentic.py	Adds `run_agentic_beir_evaluation`, `run_agentic_audio_recall_evaluation`, and `agentic_beir_retrieve`; validation consolidated into `AgenticRetrievalConfig.__post_init__` via shared helpers.
nemo_retriever/src/nemo_retriever/query/workflow.py	Extracts `build_agentic_config` so both the single-query CLI and batch harness BEIR path share one config-derivation point; clean refactor.
nemo_retriever/src/nemo_retriever/cli/pipeline/main.py	Adds nine agentic Typer options, `_run_agentic_evaluation` helper, and validation block; service-mode guard is correctly registered.
nemo_retriever/tests/test_harness_agentic_eval.py	New test file covering all three harness integration points; patches at the correct seams and verifies both happy-path wiring and structured failure on invalid config.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[harness run_benchmark] --> B[resolve_query_plan]
    A --> C[build_query_request]
    B --> D[run_beir_queries]
    C --> D

    D --> E{query_request.agentic.enabled?}

    E -->|Yes| F[_agentic_retrieve]
    E -->|No| G[_dense_retrieve]

    F --> F1[build_agentic_config]
    F1 --> F2[agentic_beir_retrieve]
    F2 --> F3[AgenticRetriever.retrieve]
    F3 --> F4[_agentic_result_to_ranked_doc_ids]
    F4 --> H

    G --> G1[query_plan.create_retriever]
    G1 --> G2[per-query retriever loop]
    G2 --> G3[build_beir_run_from_hits]
    G3 --> H

    H[build_beir_run_from_ranked_doc_ids] --> I[compute_beir_metrics]
    I --> J[write artifacts: beir_metrics.json / beir_run.trec / query_results.jsonl]
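For readers unfamiliar with the `beir_run.trec` artifact named in the last node, the sketch below formats a run in the conventional six-column TREC layout (`query_id Q0 doc_id rank score tag`). The PR does not show the harness's writer, so the function name, the score precision, and the run tag are assumptions.

```python
from typing import Dict, List


def format_trec_run(run: Dict[str, Dict[str, float]], tag: str = "sketch") -> List[str]:
    """Render {query_id: {doc_id: score}} as TREC run lines."""
    lines: List[str] = []
    for query_id in sorted(run):
        # Highest score first; doc_id breaks ties so output is deterministic.
        ranked = sorted(run[query_id].items(), key=lambda kv: (-kv[1], kv[0]))
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            lines.append(f"{query_id} Q0 {doc_id} {rank} {score:.6f} {tag}")
    return lines
```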

Reviews (8): Last reviewed commit: "Add agentic (ReAct) retrieval and agenti..."

PR #2270 (harness revamp) replaced the subprocess-based harness/run.py and deleted its tests, while this branch (#2267) had added agentic (ReAct) BEIR evaluation to that old harness. Re-port it onto the new in-process harness:

- Agentic options now flow via the benchmark spec query section (query.agentic*, registered in QUERY_OVERRIDE_PATHS) into QueryRequest.agentic, replacing the deleted CLI-flag/HarnessConfig path.
- run_beir_queries forks only at retrieval: shared dataset load + scoring + artifact writing; _dense_retrieve (per-query loop) vs _agentic_retrieve (one concurrent ReAct batch), each returning a uniform (latencies, run).
- Factor build_agentic_config/build_agentic_retriever (workflow.py) and agentic_beir_retrieve (agentic.py), shared by the CLI single-query path and the harness batch path.
- Strip the now-dead HarnessConfig.agentic_* fields (the new run path no longer uses HarnessConfig).
- Add tests/test_harness_agentic_eval.py; fix docs/cli/benchmarking.md override examples to the query.agentic* namespace.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Mahika Wason <mwason@nvidia.com>
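As a rough illustration of how dotted override paths such as `query.agentic` might be checked against a registry and applied to a nested benchmark spec, consider the sketch below. It is not the harness resolver: `ALLOWED_PATHS_SKETCH`, `apply_overrides`, and the `query.agentic_llm_model` and `top_k` keys are hypothetical names chosen for the example.

```python
import copy
from typing import Any, Dict, Set

# Stand-in for a registry like QUERY_OVERRIDE_PATHS; the second entry
# is a made-up key used only for illustration.
ALLOWED_PATHS_SKETCH: Set[str] = {"query.agentic", "query.agentic_llm_model"}


def apply_overrides(
    spec: Dict[str, Any],
    overrides: Dict[str, Any],
    allowed: Set[str],
) -> Dict[str, Any]:
    """Return a copy of spec with each registered dotted path set."""
    result = copy.deepcopy(spec)  # never mutate the caller's spec
    for path, value in overrides.items():
        if path not in allowed:
            raise KeyError(f"unsupported override path: {path}")
        *parents, leaf = path.split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return result
```

Rejecting unregistered paths up front turns a typo in an override into an immediate error instead of a silently ignored setting.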

Adds an opt-in agentic retrieval mode (ReAct loop -> RRF fusion -> source-priority selection, reusing the existing Retriever/LanceDB; the standard dense path is unchanged) and wires it into the revamped, in-process harness as a BEIR evaluation strategy.

- Agentic query options flow via the benchmark spec query section (query.agentic*, registered in QUERY_OVERRIDE_PATHS) into QueryRequest.agentic.
- run_beir_queries forks only at retrieval: shared dataset load + scoring + artifact writing; dense per-query loop vs one concurrent agentic ReAct batch, each returning a uniform (latencies, run).
- build_agentic_config/build_agentic_retriever (query/workflow.py) and agentic_beir_retrieve (query/agentic.py) centralize agentic config derivation, shared by the CLI single-query path and the harness batch path.
- Tests in tests/test_harness_agentic_eval.py; docs in docs/cli/benchmarking.md (query.agentic* override examples).

Signed-off-by: Mahika Wason <mwason@nvidia.com>
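The RRF fusion step mentioned in this commit message refers to reciprocal rank fusion, which combines several ranked lists by summing `1 / (k + rank)` for each document. A minimal sketch follows; the function name is hypothetical, and `k = 60` is the commonly used default rather than a value stated in the PR.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked doc-id lists with reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list contribute the most.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; doc_id breaks ties deterministically.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because RRF uses only ranks, it can merge result lists from different sub-queries of the agent loop without needing their raw scores to be comparable.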

jioffe502 · 2026-07-01T20:16:35Z

+Minimal BEIR override example:
+
+```bash
+retriever harness run --dataset jp20 --preset single_gpu \


we might not have presets still

jioffe502 · 2026-07-01T20:20:15Z

this is the entry point to: "retriever pipeline run /files"

we are deprecating it this release so right now it has to be backwards compatible, but we are cutting off the oxygen to it slowly.

you can see in the actual code, how a lot of it is just invoking the new cli tools retriever ingest and retriever query in a single file

entry point for harness related things is all in the harness folder nemo_retriever/harness

mahikaw marked this pull request as ready for review June 25, 2026 20:21

mahikaw requested review from a team as code owners June 25, 2026 20:21

mahikaw requested a review from drobison00 June 25, 2026 20:21

greptile-apps Bot reviewed Jun 25, 2026

Comment thread nemo_retriever/src/nemo_retriever/query/agentic_options.py Outdated

Comment thread nemo_retriever/src/nemo_retriever/query/agentic_options.py

Comment thread nemo_retriever/src/nemo_retriever/harness/run.py Outdated

mahikaw requested a review from jioffe502 June 26, 2026 17:38

mahikaw force-pushed the dev/mahikaw/agentic_harness_eval_pr branch 2 times, most recently from 9b5e999 to 31b7f7d Compare June 30, 2026 19:20

mahikaw force-pushed the dev/mahikaw/agentic_harness_eval_pr branch 2 times, most recently from ed2c597 to 0f5011c Compare July 1, 2026 19:29

jioffe502 reviewed Jul 1, 2026

jioffe502 approved these changes Jul 1, 2026

mahikaw wants to merge 1 commit into main from dev/mahikaw/agentic_harness_eval_pr
