From 6259003414503694f162931c74666946cecf5493 Mon Sep 17 00:00:00 2001 From: Paulo Lacerda Date: Mon, 1 Jun 2026 16:17:26 -0300 Subject: [PATCH 1/3] feat(skills) + docs(tutorials): close data-plane RBAC gap that blocked first eval run (#224) * docs(tutorials): document data-plane RBAC step missing from Foundry portal Creating a Foundry project through the portal only assigns the user 'Foundry User' at the project scope. That role does not cover OpenAI data-plane actions on the parent AI Services account, where chat completions actually live - so every AI-assisted evaluator and every cloud-eval grader fails with PermissionDenied the first time a fresh workspace tries to run eval. Subscription Owner is also insufficient because the built-in Owner role has actions: ['*'] but dataActions: []. All three tutorials (prompt-agent quickstart, hosted-agent quickstart, end-to-end) now document the one-time 'az role assignment create' that grants 'Cognitive Services OpenAI User' at the resource-group scope of the Foundry account, with the exact error signature so future readers can self-diagnose if they skipped it. A future AgentOps Doctor check will detect the missing assignment pre-run; until then, this step is a documented manual prerequisite. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * feat(skills): preflight data-plane RBAC in agentops-eval skill The agentops-eval coding-agent skill now resolves the Foundry project endpoint from .azure//.env or .agentops/.env, looks up the backing AI Services account + resource group with az cognitiveservices account list, fetches the signed-in object ID, and runs an idempotent az role assignment create for 'Cognitive Services OpenAI User' at the resource-group scope BEFORE 'agentops eval analyze' / 'agentops eval run'. 
This mirrors the new manual step added in the same PR to all three tutorials and keeps the skill experience aligned: users running the skill against a fresh Foundry project no longer hit the 401 PermissionDenied that the portal's default 'Foundry User'-at-project assignment leaves behind. CHANGELOG entry added under [Unreleased]. Plugin skills mirror under plugins/agentops/skills/ regenerated via scripts/sync-skills.ps1 to keep the VS Code extension copy identical. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- CHANGELOG.md | 24 ++++++++++ docs/tutorial-end-to-end.md | 28 +++++++++++ docs/tutorial-hosted-agent-quickstart.md | 26 ++++++++++ docs/tutorial-prompt-agent-quickstart.md | 34 +++++++++++++ .../agentops/skills/agentops-eval/SKILL.md | 48 +++++++++++++++++++ .../templates/skills/agentops-eval/SKILL.md | 48 +++++++++++++++++++ 6 files changed, 208 insertions(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index a5e4392..c0b0334 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,6 +5,30 @@ This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres ## [Unreleased] +### Changed +- **`agentops-eval` coding-agent skill now preflights the data-plane RBAC + step that the Foundry portal does not assign by default.** Creating a + Foundry project through the portal only grants the user `Foundry User` + at the *project* scope, which does not cover + `Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/completions/action` + on the parent AI Services account where chat completions actually live. + Subscription `Owner` is also insufficient because the built-in `Owner` + role definition has `actions: ["*"]` but `dataActions: []`. The first + `agentops eval run` against a fresh workspace therefore failed with + `PermissionDenied` on every AI-assisted evaluator and every cloud-eval + grader. 
The skill's new **Step 0.5 - Ensure data-plane RBAC on the AI + Services account** resolves the Foundry project endpoint from + `.azure//.env` or `.agentops/.env`, looks up the backing AI + Services account + resource group with + `az cognitiveservices account list`, fetches the signed-in object ID + with `az ad signed-in-user show`, and runs an idempotent + `az role assignment create` for `Cognitive Services OpenAI User` at + the resource-group scope before handing off to `agentops eval analyze`. + This keeps the skill experience consistent with the new manual + instructions added to the prompt-agent, hosted-agent, and end-to-end + tutorials, so users running the skill against a fresh Foundry project + no longer hit the same 401 the manual tutorials previously hid. + ## [0.3.4] - 2026-06-01 ### Fixed diff --git a/docs/tutorial-end-to-end.md b/docs/tutorial-end-to-end.md index 27fdbe5..4074a9d 100644 --- a/docs/tutorial-end-to-end.md +++ b/docs/tutorial-end-to-end.md @@ -286,6 +286,34 @@ for creating agents, tools, tracing, evaluation, and red-team scans: https://github.com/Azure-Samples/microsoft-foundry-e2e-agent-observability-workshop/tree/2026-04-aie-europe ``` +### Grant your identity data-plane access to the AI Services account + +Both options above (prompt agent and hosted HTTP agent) eventually drive +an `agentops eval run` that calls chat-completions on the AI Services +account behind your Foundry project — either through Foundry's cloud +graders or through the local AI-assisted evaluators. Creating a project +through the portal assigns you `Foundry User` **only at the project +scope**, which does not cover OpenAI data-plane actions on the parent +account. Subscription `Owner` is also insufficient: its built-in role +definition has `actions: ["*"]` but `dataActions: []`. Skipping this is +what causes the eval to fail later with `PermissionDenied` on +`Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/ +completions/action`. 
+ +Run the assignment once per resource group that hosts a Foundry account +you will evaluate against. Replace ``, +``, and `` with your own values (use +`az ad signed-in-user show --query id -o tsv` to get the object ID): + +```powershell +az role assignment create ` + --assignee ` + --role "Cognitive Services OpenAI User" ` + --scope /subscriptions//resourceGroups/ +``` + +Propagation usually completes within 30–120 seconds. + ## 2. Create the travel eval dataset ```powershell diff --git a/docs/tutorial-hosted-agent-quickstart.md b/docs/tutorial-hosted-agent-quickstart.md index fffdbae..9c7ae2e 100644 --- a/docs/tutorial-hosted-agent-quickstart.md +++ b/docs/tutorial-hosted-agent-quickstart.md @@ -310,6 +310,32 @@ If the deployed endpoint needs a bearer token: $env:HOSTED_AGENT_TOKEN = "" ``` +### Grant your identity data-plane access to the AI Services account + +The local AI-assisted evaluators that AgentOps runs in step 8 call +chat-completions on the AI Services account that backs your Foundry +project. Creating a project through the portal only assigns you +`Foundry User` **at the project scope**, which does not cover the +OpenAI data-plane action on the parent account. Even subscription +`Owner` is insufficient: the built-in `Owner` role has `actions: ["*"]` +but `dataActions: []`. Skipping this step causes the eval to fail with +`PermissionDenied` on `Microsoft.CognitiveServices/accounts/OpenAI/ +deployments/chat/completions/action`. + +Run the assignment once per resource group hosting a Foundry account +you will evaluate against (replace ``, +``, and `` with your values; get the +object ID with `az ad signed-in-user show --query id -o tsv`): + +```powershell +az role assignment create ` + --assignee ` + --role "Cognitive Services OpenAI User" ` + --scope /subscriptions//resourceGroups/ +``` + +Propagation usually completes within 30–120 seconds. + ## 5. 
Initialize AgentOps interactively ```powershell diff --git a/docs/tutorial-prompt-agent-quickstart.md b/docs/tutorial-prompt-agent-quickstart.md index 8126f67..d622b25 100644 --- a/docs/tutorial-prompt-agent-quickstart.md +++ b/docs/tutorial-prompt-agent-quickstart.md @@ -241,6 +241,40 @@ Show me the planned changes and the resulting endpoints before applying. If the skill is not available, use Path A. +### Grant your identity data-plane access to the AI Services account + +Creating a project through the portal only assigns you `Foundry User` **at +the project scope**. That role does not cover the OpenAI data-plane actions +that live on the parent AI Services *account*, namely the chat-completions call +that backs every AI-assisted evaluator and every cloud-eval grader. Even +`Owner` on the subscription is not enough: the built-in `Owner` role +definition has `actions: ["*"]` but `dataActions: []`, so it grants full +control-plane access and no data-plane access on Cognitive Services accounts. + +Skipping this step is what causes the eval grader to fail later with: + + PermissionDenied: The principal `` lacks the required + data action `Microsoft.CognitiveServices/accounts/OpenAI/deployments/ + chat/completions/action` to perform `POST /openai/deployments/...` + +Run the assignment once per resource group that hosts a Foundry account +you will evaluate against. Replace ``, ``, +and `` with your own values (you can get the object ID +with `az ad signed-in-user show --query id -o tsv`): + +```powershell +az role assignment create ` + --assignee ` + --role "Cognitive Services OpenAI User" ` + --scope /subscriptions//resourceGroups/ +``` + +Repeat the command with the `travel-agent-dev` resource group if the dev +project lives in a different RG. The assignment usually propagates within +30–120 seconds. AgentOps Doctor will detect the missing assignment in a +future release, but until then this is a manual one-time setup step per +new environment. + ## 4. 
Seed `travel-agent` in the sandbox project You only author the agent in **one place**: your sandbox Foundry diff --git a/plugins/agentops/skills/agentops-eval/SKILL.md b/plugins/agentops/skills/agentops-eval/SKILL.md index 093a0ea..662fb53 100644 --- a/plugins/agentops/skills/agentops-eval/SKILL.md +++ b/plugins/agentops/skills/agentops-eval/SKILL.md @@ -25,6 +25,54 @@ with a `name:version` or URL. (`--project-endpoint`, `--agent`, `--dataset`, …) for non-interactive runs. Run `agentops init show` later to inspect the resolved config. +## Step 0.5 - Ensure data-plane RBAC on the AI Services account + +AgentOps eval (cloud graders **and** local AI-assisted evaluators) calls +`/openai/deployments/.../chat/completions` on the AI Services account +that backs the Foundry project. Creating a project through the Foundry +portal only assigns the user `Foundry User` at the *project* scope, +which does **not** cover OpenAI data-plane actions on the parent +account. Subscription `Owner` is also insufficient because the built-in +`Owner` role has `actions: ["*"]` but `dataActions: []`. The first +`agentops eval run` against a fresh workspace will otherwise fail with: + +``` +PermissionDenied … lacks the required data action +'Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/completions/action' +``` + +Run this preflight before Step 1 - it is idempotent (Azure returns +`RoleAssignmentExists` if already granted) and takes ~5 seconds: + +```bash +# 1. Resolve the AI Services account from agentops.yaml / .azure//.env +PROJECT_ENDPOINT=$(grep -h '^AZURE_AI_FOUNDRY_PROJECT_ENDPOINT' .azure/*/.env .agentops/.env 2>/dev/null | tail -1 | cut -d= -f2- | tr -d '"') +ACCOUNT_HOST=$(echo "$PROJECT_ENDPOINT" | awk -F[/:] '{print $4}') +ACCOUNT_NAME=$(echo "$ACCOUNT_HOST" | cut -d. -f1) + +# 2. 
Resolve subscription, resource group, and signed-in object ID +SUB_ID=$(az account show --query id -o tsv) +RG=$(az cognitiveservices account list --subscription "$SUB_ID" --query "[?name=='$ACCOUNT_NAME'].resourceGroup | [0]" -o tsv) +OBJ_ID=$(az ad signed-in-user show --query id -o tsv) + +# 3. Grant data-plane access at the RG scope (covers sandbox + future evals) +az role assignment create \ + --assignee "$OBJ_ID" \ + --role "Cognitive Services OpenAI User" \ + --scope "/subscriptions/$SUB_ID/resourceGroups/$RG" +``` + +PowerShell equivalent: replace `$(...)` with the PowerShell variable +assignments shown in `docs/tutorial-prompt-agent-quickstart.md`. + +If the user has not run `az login` yet, do that first. If +`az cognitiveservices account list` returns an empty RG, the AI Services +account lives in a different subscription - ask the user which one. + +Skip this step only if the user explicitly says the role is already +assigned, or if a previous `agentops eval run` succeeded against the +same Foundry account. + ## Step 1 - Analyze evaluation setup Run the deterministic local triage first: diff --git a/src/agentops/templates/skills/agentops-eval/SKILL.md b/src/agentops/templates/skills/agentops-eval/SKILL.md index 093a0ea..662fb53 100644 --- a/src/agentops/templates/skills/agentops-eval/SKILL.md +++ b/src/agentops/templates/skills/agentops-eval/SKILL.md @@ -25,6 +25,54 @@ with a `name:version` or URL. (`--project-endpoint`, `--agent`, `--dataset`, …) for non-interactive runs. Run `agentops init show` later to inspect the resolved config. +## Step 0.5 - Ensure data-plane RBAC on the AI Services account + +AgentOps eval (cloud graders **and** local AI-assisted evaluators) calls +`/openai/deployments/.../chat/completions` on the AI Services account +that backs the Foundry project. Creating a project through the Foundry +portal only assigns the user `Foundry User` at the *project* scope, +which does **not** cover OpenAI data-plane actions on the parent +account. 
Subscription `Owner` is also insufficient because the built-in +`Owner` role has `actions: ["*"]` but `dataActions: []`. The first +`agentops eval run` against a fresh workspace will otherwise fail with: + +``` +PermissionDenied … lacks the required data action +'Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/completions/action' +``` + +Run this preflight before Step 1 - it is idempotent (Azure returns +`RoleAssignmentExists` if already granted) and takes ~5 seconds: + +```bash +# 1. Resolve the AI Services account from agentops.yaml / .azure//.env +PROJECT_ENDPOINT=$(grep -h '^AZURE_AI_FOUNDRY_PROJECT_ENDPOINT' .azure/*/.env .agentops/.env 2>/dev/null | tail -1 | cut -d= -f2- | tr -d '"') +ACCOUNT_HOST=$(echo "$PROJECT_ENDPOINT" | awk -F[/:] '{print $4}') +ACCOUNT_NAME=$(echo "$ACCOUNT_HOST" | cut -d. -f1) + +# 2. Resolve subscription, resource group, and signed-in object ID +SUB_ID=$(az account show --query id -o tsv) +RG=$(az cognitiveservices account list --subscription "$SUB_ID" --query "[?name=='$ACCOUNT_NAME'].resourceGroup | [0]" -o tsv) +OBJ_ID=$(az ad signed-in-user show --query id -o tsv) + +# 3. Grant data-plane access at the RG scope (covers sandbox + future evals) +az role assignment create \ + --assignee "$OBJ_ID" \ + --role "Cognitive Services OpenAI User" \ + --scope "/subscriptions/$SUB_ID/resourceGroups/$RG" +``` + +PowerShell equivalent: replace `$(...)` with the PowerShell variable +assignments shown in `docs/tutorial-prompt-agent-quickstart.md`. + +If the user has not run `az login` yet, do that first. If +`az cognitiveservices account list` returns an empty RG, the AI Services +account lives in a different subscription - ask the user which one. + +Skip this step only if the user explicitly says the role is already +assigned, or if a previous `agentops eval run` succeeded against the +same Foundry account. 
+ ## Step 1 - Analyze evaluation setup Run the deterministic local triage first: From faf48cf311a6629c0d3559ad8b119eea2bb75526 Mon Sep 17 00:00:00 2001 From: Paulo Lacerda Date: Mon, 1 Jun 2026 16:57:58 -0300 Subject: [PATCH 2/3] fix: distinguish grader execution errors from quality-gate failures (#226) When cloud/local evaluator workers error out on a subset of rows (most commonly data-plane RBAC that is still propagating), no dataset row has every grader return a score, so items_passed_all is 0 and agentops eval run reports Threshold status: FAILED even though every computable threshold passed. This produced confusing phantom quality failures on the first run after granting the Cognitive Services OpenAI User role. - CLI now detects errored graders combined with all-thresholds-passed and prints a Warning clarifying this is an execution failure, names the RBAC-propagation cause, surfaces the first grader error, and advises waiting + re-running. Exit-code contract unchanged. - Added _grader_error_summary helper + focused unit tests. - Corrected RBAC propagation guidance (several minutes, intermittent FAILED-with-green-thresholds symptom) in the prompt-agent, hosted-agent and end-to-end tutorials and the agentops-eval skill; re-synced plugin. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- CHANGELOG.md | 22 +++ docs/tutorial-end-to-end.md | 10 +- docs/tutorial-hosted-agent-quickstart.md | 10 +- docs/tutorial-prompt-agent-quickstart.md | 21 ++- .../agentops/skills/agentops-eval/SKILL.md | 11 ++ src/agentops/cli/app.py | 47 ++++++ .../templates/skills/agentops-eval/SKILL.md | 11 ++ tests/unit/test_eval_run_grader_errors.py | 150 ++++++++++++++++++ 8 files changed, 276 insertions(+), 6 deletions(-) create mode 100644 tests/unit/test_eval_run_grader_errors.py diff --git a/CHANGELOG.md b/CHANGELOG.md index 0b3a8bb..d67d95a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,6 +5,28 @@ This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres ## [Unreleased] +### Changed +- **`agentops eval run` now distinguishes a grader *execution* failure from a + quality-gate failure.** When evaluator workers error out on a subset of rows + (auth/RBAC/timeout), no row has every grader return a score, so + `items_passed_all` is `0` and the run reports `Threshold status: FAILED` even + though every threshold that *could* be computed passed. The CLI now detects + this case (errored graders combined with all thresholds passing) and prints a + `Warning` explaining that this is an execution error, not a quality + regression, names the most common cause (data-plane RBAC granted moments + earlier that is still propagating to the evaluator workers), surfaces the + first underlying grader error, and advises waiting a few minutes before + re-running. The exit-code contract is unchanged. Added the + `_grader_error_summary` helper plus focused unit tests. 
+- **Corrected the RBAC propagation guidance in the tutorials and the + `agentops-eval` skill.** Data-plane role assignments on Cognitive Services + accounts can take several minutes (not 30-120 seconds) to reach the + independent, per-row evaluator workers, which can produce an *intermittent* + `FAILED` with otherwise-green thresholds on the first run after granting + access. The prompt-agent, hosted-agent, and end-to-end tutorials and the + skill now describe this symptom and tell readers to wait and re-run rather + than lower thresholds. + ## [0.3.5] - 2026-06-01 ### Changed diff --git a/docs/tutorial-end-to-end.md b/docs/tutorial-end-to-end.md index 4074a9d..e338baf 100644 --- a/docs/tutorial-end-to-end.md +++ b/docs/tutorial-end-to-end.md @@ -312,7 +312,15 @@ az role assignment create ` --scope /subscriptions//resourceGroups/ ``` -Propagation usually completes within 30–120 seconds. +> **Give the assignment a few minutes to propagate.** Data-plane role +> assignments on the AI Services account do **not** take effect +> instantly — propagation to the evaluator workers can take several +> minutes (occasionally up to ~15). Evaluators authenticate per call, so +> the **first eval right after granting the role may show intermittent +> `AuthenticationError` on a subset of graders and report +> `Threshold status: FAILED` even when every threshold is green**. This +> is a grader execution failure, not a quality regression — wait a few +> minutes and re-run the eval. ## 2. Create the travel eval dataset diff --git a/docs/tutorial-hosted-agent-quickstart.md b/docs/tutorial-hosted-agent-quickstart.md index 9c7ae2e..188f076 100644 --- a/docs/tutorial-hosted-agent-quickstart.md +++ b/docs/tutorial-hosted-agent-quickstart.md @@ -334,7 +334,15 @@ az role assignment create ` --scope /subscriptions//resourceGroups/ ``` -Propagation usually completes within 30–120 seconds. 
+> **Give the assignment a few minutes to propagate.** Data-plane role +> assignments on the AI Services account do **not** take effect +> instantly — propagation to the local/Foundry evaluator workers can +> take several minutes (occasionally up to ~15). Evaluators authenticate +> per call, so the **first eval right after granting the role may show +> intermittent `AuthenticationError` on a subset of graders and report +> `Threshold status: FAILED` even when every threshold is green**. This +> is a grader execution failure, not a quality regression — wait a few +> minutes and re-run the eval. ## 5. Initialize AgentOps interactively diff --git a/docs/tutorial-prompt-agent-quickstart.md b/docs/tutorial-prompt-agent-quickstart.md index d622b25..b2843d7 100644 --- a/docs/tutorial-prompt-agent-quickstart.md +++ b/docs/tutorial-prompt-agent-quickstart.md @@ -270,10 +270,23 @@ az role assignment create ` ``` Repeat the command with the `travel-agent-dev` resource group if the dev -project lives in a different RG. The assignment usually propagates within -30–120 seconds. AgentOps Doctor will detect the missing assignment in a -future release, but until then this is a manual one-time setup step per -new environment. +project lives in a different RG. + +> **Give the assignment a few minutes to propagate.** Data-plane role +> assignments on the AI Services account do **not** take effect +> instantly — propagation to the Foundry evaluator workers can take +> several minutes (occasionally up to ~15). The cloud eval runs each +> grader as an independent worker that authenticates separately, so the +> **first run right after granting the role may show intermittent +> `AuthenticationError` on a subset of graders and report +> `Threshold status: FAILED` even when every threshold is green** (no +> single row had all graders succeed). This is a grader execution +> failure, not a quality regression. 
Wait a few minutes and re-run +> `agentops eval run` — once propagation finishes, every grader scores +> and the gate passes. + +AgentOps Doctor will detect the missing assignment in a future release, +but until then this is a manual one-time setup step per new environment. ## 4. Seed `travel-agent` in the sandbox project diff --git a/plugins/agentops/skills/agentops-eval/SKILL.md b/plugins/agentops/skills/agentops-eval/SKILL.md index 662fb53..b5b2701 100644 --- a/plugins/agentops/skills/agentops-eval/SKILL.md +++ b/plugins/agentops/skills/agentops-eval/SKILL.md @@ -73,6 +73,17 @@ Skip this step only if the user explicitly says the role is already assigned, or if a previous `agentops eval run` succeeded against the same Foundry account. +**Propagation:** data-plane role assignments do not take effect +instantly — allow several minutes (occasionally up to ~15) before the +first eval. The cloud/local graders authenticate per call, so if the +user runs an eval immediately after this preflight and sees intermittent +`AuthenticationError` on a subset of graders plus +`Threshold status: FAILED` while the visible thresholds are green, that +is propagation lag (a grader **execution** failure), not a quality +regression. Tell the user to wait a few minutes and re-run +`agentops eval run`; do not treat it as a failing gate or start changing +thresholds. + ## Step 1 - Analyze evaluation setup Run the deterministic local triage first: diff --git a/src/agentops/cli/app.py b/src/agentops/cli/app.py index edeff00..dafdc3e 100644 --- a/src/agentops/cli/app.py +++ b/src/agentops/cli/app.py @@ -2055,10 +2055,57 @@ def _run_flat_schema_eval( if result.summary.overall_passed: typer.echo(f"{_cli_label('Threshold status')}: {style('PASSED', 'bold', 'green')}") return + + # Distinguish a genuine quality-gate failure from grader *execution* + # errors. 
When evaluator workers error (auth/RBAC/timeout) on a subset of + # rows, no row has every grader succeed, so `items_passed_all` is 0 and the + # gate reports FAILED even though every threshold that *could* be computed + # passed. Surfacing this prevents users from chasing a phantom quality + # regression - the most common cause is data-plane RBAC granted moments + # earlier that is still propagating to the evaluator workers. + errored, total, first_error = _grader_error_summary(result) + all_thresholds_passed = ( + result.summary.thresholds_total > 0 + and result.summary.thresholds_passed == result.summary.thresholds_total + ) + if errored and all_thresholds_passed: + typer.echo( + f"{_cli_warn('Warning')}: {errored} of {total} grader execution(s) " + "errored, so no dataset row had every grader return a score. This is " + "a grader execution failure, not a quality regression - every " + "threshold that could be computed passed. The most common cause is " + "data-plane RBAC granted recently that is still propagating to the " + "evaluator workers; wait a few minutes and re-run `agentops eval run`.", + err=True, + ) + if first_error: + typer.echo(f"{_cli_warn('Warning')}: first grader error: {first_error}", err=True) + typer.echo(f"{_cli_label('Threshold status')}: {style('FAILED', 'bold', 'red')}") raise typer.Exit(code=exit_code_from(result)) +def _grader_error_summary(result) -> tuple[int, int, Optional[str]]: + """Return ``(errored_metric_count, total_metric_count, first_error)``. + + Walks every per-row metric in the run so the CLI can tell a grader + *execution* failure (auth/RBAC/timeout) apart from a quality-gate failure. + The first non-empty error string is lifted out as the actionable cause. 
+ """ + errored = 0 + total = 0 + first_error: Optional[str] = None + for row in result.rows: + for metric in row.metrics: + total += 1 + err = getattr(metric, "error", None) + if isinstance(err, str) and err.strip(): + errored += 1 + if first_error is None: + first_error = err.strip() + return errored, total, first_error + + def _default_flat_output_dir(config_path: Path) -> Path: base = config_path.parent / ".agentops" / "results" timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ") diff --git a/src/agentops/templates/skills/agentops-eval/SKILL.md b/src/agentops/templates/skills/agentops-eval/SKILL.md index 662fb53..b5b2701 100644 --- a/src/agentops/templates/skills/agentops-eval/SKILL.md +++ b/src/agentops/templates/skills/agentops-eval/SKILL.md @@ -73,6 +73,17 @@ Skip this step only if the user explicitly says the role is already assigned, or if a previous `agentops eval run` succeeded against the same Foundry account. +**Propagation:** data-plane role assignments do not take effect +instantly — allow several minutes (occasionally up to ~15) before the +first eval. The cloud/local graders authenticate per call, so if the +user runs an eval immediately after this preflight and sees intermittent +`AuthenticationError` on a subset of graders plus +`Threshold status: FAILED` while the visible thresholds are green, that +is propagation lag (a grader **execution** failure), not a quality +regression. Tell the user to wait a few minutes and re-run +`agentops eval run`; do not treat it as a failing gate or start changing +thresholds. + ## Step 1 - Analyze evaluation setup Run the deterministic local triage first: diff --git a/tests/unit/test_eval_run_grader_errors.py b/tests/unit/test_eval_run_grader_errors.py new file mode 100644 index 0000000..565e53c --- /dev/null +++ b/tests/unit/test_eval_run_grader_errors.py @@ -0,0 +1,150 @@ +"""CLI behaviour when graders *execute* but a subset errors out. 
+ +A grader execution error (auth/RBAC/timeout) is not a quality regression, but +because ``items_passed_all`` requires every grader on a row to succeed, a single +errored grader flips ``overall_passed`` to ``False`` and the run reports +``Threshold status: FAILED`` even though every computable threshold passed. + +The CLI must surface that distinction loudly so users (the most common trigger +is data-plane RBAC that is still propagating) do not chase a phantom quality +failure or start lowering thresholds. +""" + +from __future__ import annotations + +import json +from pathlib import Path + +from typer.testing import CliRunner + +from agentops.cli.app import _grader_error_summary, app +from agentops.core.results import ( + RowMetric, + RowResult, + RunResult, + RunSummary, + TargetInfo, + ThresholdEvaluation, +) + +runner = CliRunner() + +_AUTH_ERROR = ( + "FAILED_EXECUTION: (UserError) OpenAI API hits AuthenticationError: " + "Principal does not have access to API/Operation." +) + + +def _result_with_partial_grader_errors() -> RunResult: + """One row where coherence scored but similarity errored on auth.""" + row = RowResult( + row_index=0, + input="plan a trip", + expected="an itinerary", + response="here is an itinerary", + metrics=[ + RowMetric(name="coherence", value=5.0), + RowMetric(name="similarity", value=None, error=_AUTH_ERROR), + ], + ) + summary = RunSummary( + items_total=1, + items_passed_all=0, # the errored grader means no row passed all + items_pass_rate=0.0, + thresholds_total=1, + thresholds_passed=1, # every computable threshold passed + threshold_pass_rate=1.0, + overall_passed=False, + ) + return RunResult( + started_at="2026-06-01T00:00:00+00:00", + finished_at="2026-06-01T00:01:00+00:00", + duration_seconds=60.0, + target=TargetInfo(kind="foundry_prompt", raw="travel-agent:2"), + dataset_path="dataset.jsonl", + evaluators=["CoherenceEvaluator", "SimilarityEvaluator"], + rows=[row], + aggregate_metrics={"coherence": 5.0}, + thresholds=[ + 
ThresholdEvaluation( + metric="coherence", + criteria=">=", + expected=">=3", + actual="5", + passed=True, + ) + ], + summary=summary, + ) + + +def test_grader_error_summary_counts_and_lifts_first_error() -> None: + errored, total, first_error = _grader_error_summary( + _result_with_partial_grader_errors() + ) + assert (errored, total) == (1, 2) + assert first_error is not None + assert "AuthenticationError" in first_error + + +def _write_minimal_config(tmp_path: Path) -> Path: + dataset = tmp_path / "dataset.jsonl" + dataset.write_text(json.dumps({"input": "hi", "expected": "hi"}), encoding="utf-8") + config = tmp_path / "agentops.yaml" + config.write_text( + json.dumps( + {"version": 1, "agent": "model:gpt-4o", "dataset": str(dataset)} + ), + encoding="utf-8", + ) + return config + + +def test_eval_run_warns_on_partial_grader_errors(tmp_path, monkeypatch) -> None: + config = _write_minimal_config(tmp_path) + output = tmp_path / "out" + output.mkdir() + + crafted = _result_with_partial_grader_errors() + import agentops.pipeline.orchestrator as orch + + monkeypatch.setattr(orch, "run_evaluation", lambda *a, **k: crafted) + + result = runner.invoke( + app, + ["eval", "run", "--config", str(config), "--output", str(output)], + ) + + # A grader-execution failure keeps the gate-failed exit code... + assert result.exit_code == 2, result.output + # ...but the user is told it is an execution error, not a quality failure. + assert "grader execution(s) errored" in result.output + assert "propagating" in result.output + assert "AuthenticationError" in result.output + assert "FAILED" in result.output + + +def test_eval_run_no_warning_when_no_grader_errors(tmp_path, monkeypatch) -> None: + config = _write_minimal_config(tmp_path) + output = tmp_path / "out" + output.mkdir() + + clean = _result_with_partial_grader_errors() + # Drop the errored grader so the row is clean and the gate genuinely passes. 
+ clean.rows[0].metrics = [RowMetric(name="coherence", value=5.0)] + clean.summary.items_passed_all = 1 + clean.summary.items_pass_rate = 1.0 + clean.summary.overall_passed = True + + import agentops.pipeline.orchestrator as orch + + monkeypatch.setattr(orch, "run_evaluation", lambda *a, **k: clean) + + result = runner.invoke( + app, + ["eval", "run", "--config", str(config), "--output", str(output)], + ) + + assert result.exit_code == 0, result.output + assert "PASSED" in result.output + assert "grader execution(s) errored" not in result.output From 1ae0a127976c6d7c6bcaa6bcc473edeb7e68f4fa Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" <41898282+github-actions[bot]@users.noreply.github.com> Date: Mon, 1 Jun 2026 19:59:22 +0000 Subject: [PATCH 3/3] chore: prepare release 0.3.6 --- .claude-plugin/marketplace.json | 2 +- .github/plugin/marketplace.json | 2 +- CHANGELOG.md | 2 ++ plugins/agentops/package.json | 2 +- plugins/agentops/plugin.json | 2 +- 5 files changed, 6 insertions(+), 4 deletions(-) diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json index b829338..ae7fec2 100644 --- a/.claude-plugin/marketplace.json +++ b/.claude-plugin/marketplace.json @@ -13,7 +13,7 @@ "name": "agentops-accelerator", "source": "../../plugins/agentops", "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Toolkit and Microsoft Foundry agents.", - "version": "0.3.5", + "version": "0.3.6", "keywords": [ "agentops", "evaluation", diff --git a/.github/plugin/marketplace.json b/.github/plugin/marketplace.json index b829338..ae7fec2 100644 --- a/.github/plugin/marketplace.json +++ b/.github/plugin/marketplace.json @@ -13,7 +13,7 @@ "name": "agentops-accelerator", "source": "../../plugins/agentops", "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Toolkit and Microsoft Foundry agents.", - "version": "0.3.5", + "version": "0.3.6", "keywords": [ "agentops", 
"evaluation", diff --git a/CHANGELOG.md b/CHANGELOG.md index d67d95a..f7d77d3 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,6 +5,8 @@ This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres ## [Unreleased] +## [0.3.6] - 2026-06-01 + ### Changed - **`agentops eval run` now distinguishes a grader *execution* failure from a quality-gate failure.** When evaluator workers error out on a subset of rows diff --git a/plugins/agentops/package.json b/plugins/agentops/package.json index aa6462e..9706810 100644 --- a/plugins/agentops/package.json +++ b/plugins/agentops/package.json @@ -2,7 +2,7 @@ "name": "agentops-accelerator", "displayName": "AgentOps Accelerator — Skills for GitHub Copilot", "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Accelerator and Microsoft Foundry agents.", - "version": "0.3.5", + "version": "0.3.6", "publisher": "AgentOpsAccelerator", "icon": "icon.png", "license": "MIT", diff --git a/plugins/agentops/plugin.json b/plugins/agentops/plugin.json index 1b1e656..59bb9fa 100644 --- a/plugins/agentops/plugin.json +++ b/plugins/agentops/plugin.json @@ -1,7 +1,7 @@ { "name": "agentops-accelerator", "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Accelerator and Microsoft Foundry agents.", - "version": "0.3.5", + "version": "0.3.6", "author": { "name": "AgentOps Accelerator", "url": "https://github.com/Azure/agentops"