Adds Agentic Retrieval into harness #2267

Open
mahikaw wants to merge 1 commit into main from dev/mahikaw/agentic_harness_eval_pr
mahikaw commented Jun 25, 2026

Collaborator

Summary

Adds opt-in agentic retrieval evaluation to harness runs.

  • Adds agentic: true harness config support and forwards it through the harness execution path.
  • Supports agentic BEIR and agentic audio recall evaluation.
  • Keeps standard retrieval unchanged by default.
  • Adds validation for agentic settings, including required LLM model, supported evaluation modes, candidate depth vs
    requested metric depth, runtime bounds, and endpoint-aware temperature limits.
  • Documents agentic harness usage.
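
The validation rules in the list above could be sketched roughly as follows. This is a hypothetical illustration, not the PR's code: the class name `AgenticSettingsSketch`, its fields, and every numeric bound are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgenticSettingsSketch:
    """Hypothetical stand-in for an agentic config; names and bounds are assumed."""

    llm_model: str
    candidate_depth: int          # candidates the agent may return per query
    metric_depth: int             # largest k requested by the metrics
    max_steps: int = 8            # assumed runtime bound on agent-loop iterations
    temperature: float = 0.0
    max_temperature: float = 1.0  # assumed endpoint-specific ceiling

    def __post_init__(self) -> None:
        # An agentic run cannot start without an LLM to drive the loop.
        if not self.llm_model.strip():
            raise ValueError("agentic retrieval requires an LLM model")
        # Metrics at depth k need at least k candidates to score.
        if self.candidate_depth < self.metric_depth:
            raise ValueError("candidate_depth must be >= metric_depth")
        # Keep the loop bounded (the range 1..64 is an assumption).
        if not 1 <= self.max_steps <= 64:
            raise ValueError("max_steps must be between 1 and 64")
        # Temperature limits may differ per endpoint; the ceiling is a field here.
        if not 0.0 <= self.temperature <= self.max_temperature:
            raise ValueError("temperature is outside the endpoint's allowed range")
```

Raising in `__post_init__` means an invalid configuration fails at construction time, before any ingest or retrieval work starts.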

Validation

  • Focused agentic/harness tests: 132 passed
  • Hosted agentic harness run completed through ingest, indexing, agentic retrieval, final selection, and metric
    output.

Follow-ups

  • Add structured traces for agent-loop and selection as harness artifacts using the repo’s existing runtime/reporting
    terminology.
  • Add agentic runs to routine harness flows, such as sweep/nightly.

copy-pr-bot Bot commented Jun 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

mahikaw marked this pull request as ready for review June 25, 2026 20:21
mahikaw requested review from a team as code owners June 25, 2026 20:21
mahikaw requested a review from drobison00 June 25, 2026 20:21

greptile-apps Bot commented Jun 25, 2026

Contributor

Greptile Summary

This PR adds opt-in agentic (ReAct) retrieval evaluation to the harness, routing BEIR and audio-recall benchmarks through a graph-backed LLM agent when agentic: true is configured. Standard dense retrieval is left entirely unchanged.

  • Harness integration: beir_runner.py forks at retrieval time — dense queries continue through the existing _dense_retrieve loop while the agentic path calls a new concurrent _agentic_retrieve batch; scoring, artifact writing, and metrics are shared between both strategies.
  • Shared validation helpers: A new agentic_options.py centralises integer-coercion and temperature/backend-depth guards, reused consistently across the CLI, query app, and AgenticRetrievalConfig.__post_init__.
  • Config propagation: Nine new query.* override paths are added to the harness resolver, and build_agentic_config in workflow.py provides a single derivation point used by both the single-query CLI and the batch harness path.
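
The fork described in the first bullet can be pictured with a minimal sketch. It assumes each strategy returns the same `(latencies, run)` pair so that scoring and artifact writing stay shared. All function names, the `Run` shape, and the worker count are hypothetical and are not taken from `beir_runner.py`.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Run = dict[str, list[str]]  # query id -> ranked doc ids (assumed shape)
SearchFn = Callable[[str], list[str]]


def dense_retrieve(queries: dict[str, str], search: SearchFn) -> tuple[list[float], Run]:
    """Sequential per-query loop, timing each query individually."""
    latencies: list[float] = []
    run: Run = {}
    for qid, text in queries.items():
        start = time.perf_counter()
        run[qid] = search(text)
        latencies.append(time.perf_counter() - start)
    return latencies, run


def agentic_retrieve(
    queries: dict[str, str], agent: SearchFn, max_workers: int = 4
) -> tuple[list[float], Run]:
    """One concurrent batch; Executor.map yields results in input order."""

    def timed(text: str) -> tuple[float, list[str]]:
        start = time.perf_counter()
        ranked = agent(text)
        return time.perf_counter() - start, ranked

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(timed, queries.values()))
    latencies = [latency for latency, _ in results]
    run = {qid: ranked for qid, (_, ranked) in zip(queries, results)}
    return latencies, run


def retrieve(
    queries: dict[str, str], *, agentic: bool, search: SearchFn, agent: SearchFn
) -> tuple[list[float], Run]:
    """Fork only at retrieval; callers score either result the same way."""
    if agentic:
        return agentic_retrieve(queries, agent)
    return dense_retrieve(queries, search)
```

Because both branches return the same shape, everything downstream of `retrieve` can stay strategy-agnostic.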

Confidence Score: 5/5

Safe to merge; the agentic path is strictly opt-in and the dense path is unchanged.

The change is well-scoped: validation helpers are tested in isolation, the harness integration tests cover both the happy path and the structured-failure path, and all 132 unit tests pass. The two findings are a defensive assert that should be a RuntimeError and a stale SPDX year — neither affects runtime correctness on any active code path.

No files require special attention; beir_runner.py has a minor defensive-guard issue worth addressing before the next caller extends the function.
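
The assert finding can be illustrated with a small sketch. Python drops `assert` statements when run with `-O`, so an explicit check is the more durable guard. The helper name `require_query_plan` and the error message are hypothetical and do not come from the PR.

```python
from typing import Optional


def require_query_plan(query_plan: Optional[object]) -> object:
    """Explicit invariant check; unlike `assert`, it still runs under `python -O`."""
    if query_plan is None:
        raise RuntimeError("query_plan must not be None")
    return query_plan
```

A caller would use the returned value directly, for example `plan = require_query_plan(query_plan)`, so the guard cannot be skipped by accident.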

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_retriever/src/nemo_retriever/query/agentic_options.py | New shared validation helper module for agentic options; SPDX copyright year shows 2024-25 but the file is new in 2026. |
| nemo_retriever/src/nemo_retriever/harness/beir_runner.py | Splits dense and agentic retrieval paths; introduces an assert guard for the query_plan is not None invariant that would be silently dropped in optimized builds. |
| nemo_retriever/src/nemo_retriever/query/agentic.py | Adds run_agentic_beir_evaluation, run_agentic_audio_recall_evaluation, and agentic_beir_retrieve; validation consolidated into AgenticRetrievalConfig.__post_init__ via shared helpers. |
| nemo_retriever/src/nemo_retriever/query/workflow.py | Extracts build_agentic_config so both the single-query CLI and batch harness BEIR path share one config-derivation point; clean refactor. |
| nemo_retriever/src/nemo_retriever/cli/pipeline/main.py | Adds nine agentic Typer options, _run_agentic_evaluation helper, and validation block; service-mode guard is correctly registered. |
| nemo_retriever/tests/test_harness_agentic_eval.py | New test file covering all three harness integration points; patches at the correct seams and verifies both happy-path wiring and structured failure on invalid config. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[harness run_benchmark] --> B[resolve_query_plan]
    A --> C[build_query_request]
    B --> D[run_beir_queries]
    C --> D

    D --> E{query_request.agentic.enabled?}

    E -->|Yes| F[_agentic_retrieve]
    E -->|No| G[_dense_retrieve]

    F --> F1[build_agentic_config]
    F1 --> F2[agentic_beir_retrieve]
    F2 --> F3[AgenticRetriever.retrieve]
    F3 --> F4[_agentic_result_to_ranked_doc_ids]
    F4 --> H

    G --> G1[query_plan.create_retriever]
    G1 --> G2[per-query retriever loop]
    G2 --> G3[build_beir_run_from_hits]
    G3 --> H

    H[build_beir_run_from_ranked_doc_ids] --> I[compute_beir_metrics]
    I --> J[write artifacts: beir_metrics.json / beir_run.trec / query_results.jsonl]
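
The final steps of the flowchart (ranked doc ids to a run, then a `.trec` artifact) might look like the sketch below. The function names and the reciprocal-rank scoring are assumptions, not the repo's implementation. The line layout follows the conventional TREC run format: `qid Q0 docid rank score tag`.

```python
def run_from_ranked_doc_ids(ranked: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Turn ranked id lists into per-query score maps (assumed reciprocal-rank scores)."""
    run: dict[str, dict[str, float]] = {}
    for qid, doc_ids in ranked.items():
        scores: dict[str, float] = {}
        for rank, doc_id in enumerate(doc_ids, start=1):
            # Keep the first (best) occurrence if an id repeats.
            scores.setdefault(doc_id, 1.0 / rank)
        run[qid] = scores
    return run


def to_trec_lines(run: dict[str, dict[str, float]], tag: str = "sketch") -> list[str]:
    """Render a run as TREC-format lines, best score first within each query."""
    lines: list[str] = []
    for qid, scores in run.items():
        ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ordered, start=1):
            lines.append(f"{qid} Q0 {doc_id} {rank} {score:.6f} {tag}")
    return lines
```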

Reviews (8): Last reviewed commit: "Add agentic (ReAct) retrieval and agenti..."

Comment thread nemo_retriever/src/nemo_retriever/query/agentic_options.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/query/agentic_options.py
Comment thread nemo_retriever/src/nemo_retriever/harness/run.py Outdated
mahikaw requested a review from jioffe502 June 26, 2026 17:38
mahikaw added a commit that referenced this pull request Jun 30, 2026
PR #2270 (harness revamp) replaced the subprocess-based harness/run.py
and deleted its tests, while this branch (#2267) had added agentic (ReAct)
BEIR evaluation to that old harness. Re-port it onto the new in-process
harness:

- Agentic options now flow via the benchmark spec query section
  (query.agentic*, registered in QUERY_OVERRIDE_PATHS) into
  QueryRequest.agentic, replacing the deleted CLI-flag/HarnessConfig path.
- run_beir_queries forks only at retrieval: shared dataset load + scoring
  + artifact writing; _dense_retrieve (per-query loop) vs _agentic_retrieve
  (one concurrent ReAct batch), each returning a uniform (latencies, run).
- Factor build_agentic_config/build_agentic_retriever (workflow.py) and
  agentic_beir_retrieve (agentic.py), shared by the CLI single-query path
  and the harness batch path.
- Strip the now-dead HarnessConfig.agentic_* fields (the new run path no
  longer uses HarnessConfig).
- Add tests/test_harness_agentic_eval.py; fix docs/cli/benchmarking.md
  override examples to the query.agentic* namespace.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Mahika Wason <mwason@nvidia.com>
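
The override flow in the commit message above (dotted `query.agentic*` paths checked against a registry, then written into the spec) can be sketched as follows. The allowlist contents and the function name are hypothetical. In particular, `query.agentic_max_steps` is an invented example path, and the real `QUERY_OVERRIDE_PATHS` entries are not shown in this conversation.

```python
# Hypothetical allowlist standing in for the real override-path registry.
ALLOWED_PATHS = frozenset({"query.agentic", "query.agentic_max_steps"})


def apply_override(spec: dict, path: str, value: object) -> dict:
    """Set a dotted-path override on a nested spec dict, rejecting unknown paths."""
    if path not in ALLOWED_PATHS:
        raise KeyError(f"unsupported override path: {path}")
    *parents, leaf = path.split(".")
    node = spec
    for key in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(key, {})
    node[leaf] = value
    return spec
```

Rejecting unregistered paths up front keeps a typo in an override from being silently ignored.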
mahikaw force-pushed the dev/mahikaw/agentic_harness_eval_pr branch 2 times, most recently from 9b5e999 to 31b7f7d June 30, 2026 19:20
Adds an opt-in agentic retrieval mode (ReAct loop -> RRF fusion ->
source-priority selection, reusing the existing Retriever/LanceDB; the
standard dense path is unchanged) and wires it into the revamped, in-process
harness as a BEIR evaluation strategy.

- Agentic query options flow via the benchmark spec query section
  (query.agentic*, registered in QUERY_OVERRIDE_PATHS) into
  QueryRequest.agentic.
- run_beir_queries forks only at retrieval: shared dataset load + scoring +
  artifact writing; dense per-query loop vs one concurrent agentic ReAct
  batch, each returning a uniform (latencies, run).
- build_agentic_config/build_agentic_retriever (query/workflow.py) and
  agentic_beir_retrieve (query/agentic.py) centralize agentic config
  derivation, shared by the CLI single-query path and the harness batch path.
- Tests in tests/test_harness_agentic_eval.py; docs in
  docs/cli/benchmarking.md (query.agentic* override examples).

Signed-off-by: Mahika Wason <mwason@nvidia.com>
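
For readers unfamiliar with the RRF fusion step named in the commit message above, here is a minimal sketch of reciprocal rank fusion. It is a generic illustration, not the PR's code: the function name is hypothetical, and `k = 60` is only a commonly used default, since the value the PR uses is not stated here.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Documents that several sub-queries rank highly accumulate the largest scores, so `rrf_fuse([["a", "b", "c"], ["b", "c", "d"]])` places `b` and `c` ahead of `a` and `d`.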
mahikaw force-pushed the dev/mahikaw/agentic_harness_eval_pr branch 2 times, most recently from ed2c597 to 0f5011c July 1, 2026 19:29
Minimal BEIR override example:

```bash
retriever harness run --dataset jp20 --preset single_gpu \
```

Collaborator

we might not have presets still

Collaborator

this is the entry point to: "retriever pipeline run /files"

we are deprecating it this release so right now it has to be backwards compatible, but we are cutting off the oxygen to it slowly.

you can see in the actual code, how a lot of it is just invoking the new cli tools retriever ingest and retriever query in a single file

Collaborator

entry point for harness related things is all in the harness folder nemo_retriever/harness
