From 6259003414503694f162931c74666946cecf5493 Mon Sep 17 00:00:00 2001
From: Paulo Lacerda <pclacerda@gmail.com>
Date: Mon, 1 Jun 2026 16:17:26 -0300
Subject: [PATCH 1/3] feat(skills) + docs(tutorials): close data-plane RBAC gap
 that blocked first eval run (#224)

* docs(tutorials): document data-plane RBAC step missing from Foundry portal

Creating a Foundry project through the portal only assigns the user
'Foundry User' at the project scope. That role does not cover OpenAI
data-plane actions on the parent AI Services account, where chat
completions actually live - so every AI-assisted evaluator and every
cloud-eval grader fails with PermissionDenied the first time a fresh
workspace tries to run eval. Subscription Owner is also insufficient
because the built-in Owner role has actions: ['*'] but dataActions: [].

All three tutorials (prompt-agent quickstart, hosted-agent quickstart,
end-to-end) now document the one-time 'az role assignment create' that
grants 'Cognitive Services OpenAI User' at the resource-group scope of
the Foundry account, with the exact error signature so future readers
can self-diagnose if they skipped it. A future AgentOps Doctor check
will detect the missing assignment pre-run; until then, this step is a
documented manual prerequisite.
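
To make the Owner gap concrete, here is a minimal Python sketch of a
data-plane check that consults only dataActions. The role dictionaries
are simplified stand-ins: the OpenAI User entry lists just the one data
action named in this change, and notDataActions handling is omitted.
allows_data_action is a hypothetical helper, not an Azure API.

```python
from fnmatch import fnmatchcase

CHAT_COMPLETIONS = (
    "Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/completions/action"
)

# Simplified role definitions (assumption: illustrative shapes only).
# Owner mirrors the shape quoted above; the real OpenAI User role
# carries more data actions than the single one listed here.
OWNER = {"actions": ["*"], "dataActions": []}
OPENAI_USER = {"actions": [], "dataActions": [CHAT_COMPLETIONS]}


def allows_data_action(role: dict, action: str) -> bool:
    """True if any dataActions pattern matches; control-plane actions are ignored."""
    return any(
        fnmatchcase(action.lower(), pattern.lower())
        for pattern in role["dataActions"]
    )


print(allows_data_action(OWNER, CHAT_COMPLETIONS))        # False
print(allows_data_action(OPENAI_USER, CHAT_COMPLETIONS))  # True
```

The wildcard in Owner's actions never enters the check, which is why a
subscription Owner can still be denied on chat completions.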

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(skills): preflight data-plane RBAC in agentops-eval skill

The agentops-eval coding-agent skill now resolves the Foundry project
endpoint from .azure/<env>/.env or .agentops/.env, looks up the
backing AI Services account + resource group with az cognitiveservices
account list, fetches the signed-in object ID, and runs an idempotent
az role assignment create for 'Cognitive Services OpenAI User' at the
resource-group scope BEFORE 'agentops eval analyze' / 'agentops eval
run'. This mirrors the new manual step added in the same PR to all
three tutorials and keeps the skill experience aligned: users running
the skill against a fresh Foundry project no longer hit the 401
PermissionDenied that the portal's default 'Foundry User'-at-project
assignment leaves behind. CHANGELOG entry added under [Unreleased].
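
The endpoint-to-account step can be sketched in Python as follows. This
is an illustration rather than the skill's actual code (the skill uses
a shell pipeline), and it assumes the account name is the first DNS
label of the project endpoint host. account_name_from_endpoint and the
sample endpoint are hypothetical.

```python
from urllib.parse import urlparse


def account_name_from_endpoint(project_endpoint: str) -> str:
    """Return the first DNS label of the endpoint host (assumed account name)."""
    # Strip whitespace and the optional quotes a .env value may carry.
    host = urlparse(project_endpoint.strip().strip('"')).hostname
    if not host:
        raise ValueError(f"not a URL with a host: {project_endpoint!r}")
    return host.split(".")[0]


# Hypothetical endpoint, for illustration only.
endpoint = "https://contoso-ai.services.ai.azure.com/api/projects/travel"
print(account_name_from_endpoint(endpoint))  # contoso-ai
```

Raising on a missing host surfaces an unset or malformed endpoint
early, instead of passing an empty account name on to the resource
group lookup.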

Plugin skills mirror under plugins/agentops/skills/ regenerated via
scripts/sync-skills.ps1 to keep the VS Code extension copy identical.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 CHANGELOG.md                                  | 24 ++++++++++
 docs/tutorial-end-to-end.md                   | 28 +++++++++++
 docs/tutorial-hosted-agent-quickstart.md      | 26 ++++++++++
 docs/tutorial-prompt-agent-quickstart.md      | 34 +++++++++++++
 .../agentops/skills/agentops-eval/SKILL.md    | 48 +++++++++++++++++++
 .../templates/skills/agentops-eval/SKILL.md   | 48 +++++++++++++++++++
 6 files changed, 208 insertions(+)
diff --git a/CHANGELOG.md b/CHANGELOG.md
index a5e4392..c0b0334 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,30 @@ This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres
 
 ## [Unreleased]
 
+### Changed
+- **`agentops-eval` coding-agent skill now preflights the data-plane RBAC
+  step that the Foundry portal does not assign by default.** Creating a
+  Foundry project through the portal only grants the user `Foundry User`
+  at the *project* scope, which does not cover
+  `Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/completions/action`
+  on the parent AI Services account where chat completions actually live.
+  Subscription `Owner` is also insufficient because the built-in `Owner`
+  role definition has `actions: ["*"]` but `dataActions: []`. The first
+  `agentops eval run` against a fresh workspace therefore failed with
+  `PermissionDenied` on every AI-assisted evaluator and every cloud-eval
+  grader. The skill's new **Step 0.5 - Ensure data-plane RBAC on the AI
+  Services account** resolves the Foundry project endpoint from
+  `.azure/<env>/.env` or `.agentops/.env`, looks up the backing AI
+  Services account + resource group with
+  `az cognitiveservices account list`, fetches the signed-in object ID
+  with `az ad signed-in-user show`, and runs an idempotent
+  `az role assignment create` for `Cognitive Services OpenAI User` at
+  the resource-group scope before handing off to `agentops eval analyze`.
+  This keeps the skill experience consistent with the new manual
+  instructions added to the prompt-agent, hosted-agent, and end-to-end
+  tutorials, so users running the skill against a fresh Foundry project
+  no longer hit the 401 that the tutorials previously left undocumented.
+
 ## [0.3.4] - 2026-06-01
 
 ### Fixed
diff --git a/docs/tutorial-end-to-end.md b/docs/tutorial-end-to-end.md
index 27fdbe5..4074a9d 100644
--- a/docs/tutorial-end-to-end.md
+++ b/docs/tutorial-end-to-end.md
@@ -286,6 +286,34 @@ for creating agents, tools, tracing, evaluation, and red-team scans:
 https://github.com/Azure-Samples/microsoft-foundry-e2e-agent-observability-workshop/tree/2026-04-aie-europe
 ```
 
+### Grant your identity data-plane access to the AI Services account
+
+Both options above (prompt agent and hosted HTTP agent) eventually drive
+an `agentops eval run` that calls chat-completions on the AI Services
+account behind your Foundry project — either through Foundry's cloud
+graders or through the local AI-assisted evaluators. Creating a project
+through the portal assigns you `Foundry User` **only at the project
+scope**, which does not cover OpenAI data-plane actions on the parent
+account. Subscription `Owner` is also insufficient: its built-in role
+definition has `actions: ["*"]` but `dataActions: []`. Skipping this is
+what causes the eval to fail later with `PermissionDenied` on
+`Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/
+completions/action`.
+
+Run the assignment once per resource group that hosts a Foundry account
+you will evaluate against. Replace `<your-objectId>`,
+`<subscription-id>`, and `<resource-group>` with your own values (use
+`az ad signed-in-user show --query id -o tsv` to get the object ID):
+
+```powershell
+az role assignment create `
+  --assignee <your-objectId> `
+  --role "Cognitive Services OpenAI User" `
+  --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>
+```
+
+Propagation usually completes within 30–120 seconds.
+
 ## 2. Create the travel eval dataset
 
 ```powershell
diff --git a/docs/tutorial-hosted-agent-quickstart.md b/docs/tutorial-hosted-agent-quickstart.md
index fffdbae..9c7ae2e 100644
--- a/docs/tutorial-hosted-agent-quickstart.md
+++ b/docs/tutorial-hosted-agent-quickstart.md
@@ -310,6 +310,32 @@ If the deployed endpoint needs a bearer token:
 $env:HOSTED_AGENT_TOKEN = "<token>"
 ```
 
+### Grant your identity data-plane access to the AI Services account
+
+The local AI-assisted evaluators that AgentOps runs in step 8 call
+chat-completions on the AI Services account that backs your Foundry
+project. Creating a project through the portal only assigns you
+`Foundry User` **at the project scope**, which does not cover the
+OpenAI data-plane action on the parent account. Even subscription
+`Owner` is insufficient: the built-in `Owner` role has `actions: ["*"]`
+but `dataActions: []`. Skipping this step causes the eval to fail with
+`PermissionDenied` on `Microsoft.CognitiveServices/accounts/OpenAI/
+deployments/chat/completions/action`.
+
+Run the assignment once per resource group hosting a Foundry account
+you will evaluate against (replace `<your-objectId>`,
+`<subscription-id>`, and `<resource-group>` with your values; get the
+object ID with `az ad signed-in-user show --query id -o tsv`):
+
+```powershell
+az role assignment create `
+  --assignee <your-objectId> `
+  --role "Cognitive Services OpenAI User" `
+  --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>
+```
+
+Propagation usually completes within 30–120 seconds.
+
 ## 5. Initialize AgentOps interactively
 
 ```powershell
diff --git a/docs/tutorial-prompt-agent-quickstart.md b/docs/tutorial-prompt-agent-quickstart.md
index 8126f67..d622b25 100644
--- a/docs/tutorial-prompt-agent-quickstart.md
+++ b/docs/tutorial-prompt-agent-quickstart.md
@@ -241,6 +241,40 @@ Show me the planned changes and the resulting endpoints before applying.
 
 If the skill is not available, use Path A.
 
+### Grant your identity data-plane access to the AI Services account
+
+Creating a project through the portal only assigns you `Foundry User` **at
+the project scope**. That role does not cover the OpenAI data-plane actions
+that live on the parent AI Services *account* — the chat-completions call
+that backs every AI-assisted evaluator and every cloud-eval grader. Even
+`Owner` on the subscription is not enough: the built-in `Owner` role
+definition has `actions: ["*"]` but `dataActions: []`, so it grants full
+control plane and zero data plane on Cognitive Services accounts.
+
+Skipping this step is what causes the eval grader to fail later with:
+
+    PermissionDenied: The principal `<your-objectId>` lacks the required
+    data action `Microsoft.CognitiveServices/accounts/OpenAI/deployments/
+    chat/completions/action` to perform `POST /openai/deployments/...`
+
+Run the assignment once per resource group that hosts a Foundry account
+you will evaluate against. Replace `<your-objectId>`, `<subscription-id>`,
+and `<resource-group>` with your own values (you can get the object ID
+with `az ad signed-in-user show --query id -o tsv`):
+
+```powershell
+az role assignment create `
+  --assignee <your-objectId> `
+  --role "Cognitive Services OpenAI User" `
+  --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>
+```
+
+Repeat the command with the `travel-agent-dev` resource group if the dev
+project lives in a different RG. The assignment usually propagates within
+30–120 seconds. AgentOps Doctor will detect the missing assignment in a
+future release, but until then this is a manual one-time setup step per
+new environment.
+
 ## 4. Seed `travel-agent` in the sandbox project
 
 You only author the agent in **one place**: your sandbox Foundry
diff --git a/plugins/agentops/skills/agentops-eval/SKILL.md b/plugins/agentops/skills/agentops-eval/SKILL.md
index 093a0ea..662fb53 100644
--- a/plugins/agentops/skills/agentops-eval/SKILL.md
+++ b/plugins/agentops/skills/agentops-eval/SKILL.md
@@ -25,6 +25,54 @@ with a `name:version` or URL.
    (`--project-endpoint`, `--agent`, `--dataset`, …) for non-interactive
    runs. Run `agentops init show` later to inspect the resolved config.
 
+## Step 0.5 - Ensure data-plane RBAC on the AI Services account
+
+AgentOps eval (cloud graders **and** local AI-assisted evaluators) calls
+`/openai/deployments/.../chat/completions` on the AI Services account
+that backs the Foundry project. Creating a project through the Foundry
+portal only assigns the user `Foundry User` at the *project* scope,
+which does **not** cover OpenAI data-plane actions on the parent
+account. Subscription `Owner` is also insufficient because the built-in
+`Owner` role has `actions: ["*"]` but `dataActions: []`. The first
+`agentops eval run` against a fresh workspace will otherwise fail with:
+
+```
+PermissionDenied … lacks the required data action
+'Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/completions/action'
+```
+
+Run this preflight before Step 1 - it is idempotent (Azure returns
+`RoleAssignmentExists` if already granted) and takes ~5 seconds:
+
+```bash
+# 1. Resolve the AI Services account from .azure/<env>/.env or .agentops/.env
+PROJECT_ENDPOINT=$(grep -h '^AZURE_AI_FOUNDRY_PROJECT_ENDPOINT' .azure/*/.env .agentops/.env 2>/dev/null | tail -1 | cut -d= -f2- | tr -d '"')
+ACCOUNT_HOST=$(echo "$PROJECT_ENDPOINT" | awk -F'[/:]' '{print $4}')
+ACCOUNT_NAME=$(echo "$ACCOUNT_HOST" | cut -d. -f1)
+
+# 2. Resolve subscription, resource group, and signed-in object ID
+SUB_ID=$(az account show --query id -o tsv)
+RG=$(az cognitiveservices account list --subscription "$SUB_ID" --query "[?name=='$ACCOUNT_NAME'].resourceGroup | [0]" -o tsv)
+OBJ_ID=$(az ad signed-in-user show --query id -o tsv)
+
+# 3. Grant data-plane access at the RG scope (covers sandbox + future evals)
+az role assignment create \
+  --assignee "$OBJ_ID" \
+  --role "Cognitive Services OpenAI User" \
+  --scope "/subscriptions/$SUB_ID/resourceGroups/$RG"
+```
+
+PowerShell equivalent: replace `$(...)` with the PowerShell variable
+assignments shown in `docs/tutorial-prompt-agent-quickstart.md`.
+
+If the user has not run `az login` yet, do that first. If
+`az cognitiveservices account list` returns an empty RG, the AI Services
+account likely lives in a different subscription - ask the user which one.
+
+Skip this step only if the user explicitly says the role is already
+assigned, or if a previous `agentops eval run` succeeded against the
+same Foundry account.
+
 ## Step 1 - Analyze evaluation setup
 
 Run the deterministic local triage first:
diff --git a/src/agentops/templates/skills/agentops-eval/SKILL.md b/src/agentops/templates/skills/agentops-eval/SKILL.md
index 093a0ea..662fb53 100644
--- a/src/agentops/templates/skills/agentops-eval/SKILL.md
+++ b/src/agentops/templates/skills/agentops-eval/SKILL.md
@@ -25,6 +25,54 @@ with a `name:version` or URL.
    (`--project-endpoint`, `--agent`, `--dataset`, …) for non-interactive
    runs. Run `agentops init show` later to inspect the resolved config.
 
+## Step 0.5 - Ensure data-plane RBAC on the AI Services account
+
+AgentOps eval (cloud graders **and** local AI-assisted evaluators) calls
+`/openai/deployments/.../chat/completions` on the AI Services account
+that backs the Foundry project. Creating a project through the Foundry
+portal only assigns the user `Foundry User` at the *project* scope,
+which does **not** cover OpenAI data-plane actions on the parent
+account. Subscription `Owner` is also insufficient because the built-in
+`Owner` role has `actions: ["*"]` but `dataActions: []`. The first
+`agentops eval run` against a fresh workspace will otherwise fail with:
+
+```
+PermissionDenied … lacks the required data action
+'Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/completions/action'
+```
+
+Run this preflight before Step 1 - it is idempotent (Azure returns
+`RoleAssignmentExists` if already granted) and takes ~5 seconds:
+
+```bash
+# 1. Resolve the AI Services account from .azure/<env>/.env or .agentops/.env
+PROJECT_ENDPOINT=$(grep -h '^AZURE_AI_FOUNDRY_PROJECT_ENDPOINT' .azure/*/.env .agentops/.env 2>/dev/null | tail -1 | cut -d= -f2- | tr -d '"')
+ACCOUNT_HOST=$(echo "$PROJECT_ENDPOINT" | awk -F'[/:]' '{print $4}')
+ACCOUNT_NAME=$(echo "$ACCOUNT_HOST" | cut -d. -f1)
+
+# 2. Resolve subscription, resource group, and signed-in object ID
+SUB_ID=$(az account show --query id -o tsv)
+RG=$(az cognitiveservices account list --subscription "$SUB_ID" --query "[?name=='$ACCOUNT_NAME'].resourceGroup | [0]" -o tsv)
+OBJ_ID=$(az ad signed-in-user show --query id -o tsv)
+
+# 3. Grant data-plane access at the RG scope (covers sandbox + future evals)
+az role assignment create \
+  --assignee "$OBJ_ID" \
+  --role "Cognitive Services OpenAI User" \
+  --scope "/subscriptions/$SUB_ID/resourceGroups/$RG"
+```
+
+PowerShell equivalent: replace `$(...)` with the PowerShell variable
+assignments shown in `docs/tutorial-prompt-agent-quickstart.md`.
+
+If the user has not run `az login` yet, do that first. If
+`az cognitiveservices account list` returns an empty RG, the AI Services
+account likely lives in a different subscription - ask the user which one.
+
+Skip this step only if the user explicitly says the role is already
+assigned, or if a previous `agentops eval run` succeeded against the
+same Foundry account.
+
 ## Step 1 - Analyze evaluation setup
 
 Run the deterministic local triage first:

From faf48cf311a6629c0d3559ad8b119eea2bb75526 Mon Sep 17 00:00:00 2001
From: Paulo Lacerda <pclacerda@gmail.com>
Date: Mon, 1 Jun 2026 16:57:58 -0300
Subject: [PATCH 2/3] fix: distinguish grader execution errors from
 quality-gate failures (#226)

When cloud/local evaluator workers error out on a subset of rows (most
commonly data-plane RBAC that is still propagating), no dataset row has
every grader return a score, so items_passed_all is 0 and agentops eval
run reports Threshold status: FAILED even though every computable
threshold passed. This produced confusing phantom quality failures on
the first run after granting the Cognitive Services OpenAI User role.
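
The failure mode above can be reproduced with a small, self-contained
Python sketch. Metric and summarize are hypothetical stand-ins, not the
AgentOps result types. The sketch also simplifies two things: a row
counts toward items_passed_all once every grader on it has scored, and
thresholds are checked against mean scores of the metrics that produced
any score.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metric:  # hypothetical stand-in for one grader result on one row
    name: str
    value: Optional[float] = None
    error: Optional[str] = None


def summarize(rows: list[list[Metric]], minimums: dict[str, float]) -> dict:
    errored = sum(1 for row in rows for m in row if m.error)
    # A row counts only if every grader on it returned a score.
    items_passed_all = sum(
        1 for row in rows
        if all(m.error is None and m.value is not None for m in row)
    )
    # Check each threshold on the mean of available scores; metrics
    # with no score at all are not computable and are skipped.
    passed: dict[str, bool] = {}
    for name, minimum in minimums.items():
        scores = [
            m.value for row in rows for m in row
            if m.name == name and m.value is not None
        ]
        if scores:
            passed[name] = sum(scores) / len(scores) >= minimum
    return {
        "errored": errored,
        "items_passed_all": items_passed_all,
        "thresholds_passed": passed,
        # Errors plus all-green thresholds: execution failure, not quality.
        "execution_failure": errored > 0 and bool(passed) and all(passed.values()),
    }


rows = [[Metric("coherence", 5.0), Metric("similarity", error="AuthenticationError")]]
print(summarize(rows, {"coherence": 3.0}))
```

With one errored grader on the only row, items_passed_all is 0 while
the coherence threshold passes, so the sketch flags an execution
failure rather than a quality regression.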

- CLI now detects errored graders combined with all-thresholds-passed
  and prints a Warning clarifying this is an execution failure, names the
  RBAC-propagation cause, surfaces the first grader error, and advises
  waiting + re-running. Exit-code contract unchanged.
- Added _grader_error_summary helper + focused unit tests.
- Corrected RBAC propagation guidance (several minutes, intermittent
  FAILED-with-green-thresholds symptom) in the prompt-agent, hosted-agent
  and end-to-end tutorials and the agentops-eval skill; re-synced plugin.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 CHANGELOG.md                                  |  22 +++
 docs/tutorial-end-to-end.md                   |  10 +-
 docs/tutorial-hosted-agent-quickstart.md      |  10 +-
 docs/tutorial-prompt-agent-quickstart.md      |  21 ++-
 .../agentops/skills/agentops-eval/SKILL.md    |  11 ++
 src/agentops/cli/app.py                       |  47 ++++++
 .../templates/skills/agentops-eval/SKILL.md   |  11 ++
 tests/unit/test_eval_run_grader_errors.py     | 150 ++++++++++++++++++
 8 files changed, 276 insertions(+), 6 deletions(-)
 create mode 100644 tests/unit/test_eval_run_grader_errors.py

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0b3a8bb..d67d95a 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,28 @@ This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres
 
 ## [Unreleased]
 
+### Changed
+- **`agentops eval run` now distinguishes a grader *execution* failure from a
+  quality-gate failure.** When evaluator workers error out on a subset of rows
+  (auth/RBAC/timeout), no row has every grader return a score, so
+  `items_passed_all` is `0` and the run reports `Threshold status: FAILED` even
+  though every threshold that *could* be computed passed. The CLI now detects
+  this case (errored graders combined with all thresholds passing) and prints a
+  `Warning` explaining that this is an execution error, not a quality
+  regression, names the most common cause (data-plane RBAC granted moments
+  earlier that is still propagating to the evaluator workers), surfaces the
+  first underlying grader error, and advises waiting a few minutes before
+  re-running. The exit-code contract is unchanged. Added the
+  `_grader_error_summary` helper plus focused unit tests.
+- **Corrected the RBAC propagation guidance in the tutorials and the
+  `agentops-eval` skill.** Data-plane role assignments on Cognitive Services
+  accounts can take several minutes (not 30-120 seconds) to reach the
+  independent, per-row evaluator workers, which can produce an *intermittent*
+  `FAILED` with otherwise-green thresholds on the first run after granting
+  access. The prompt-agent, hosted-agent, and end-to-end tutorials and the
+  skill now describe this symptom and tell readers to wait and re-run rather
+  than lower thresholds.
+
 ## [0.3.5] - 2026-06-01
 
 ### Changed
diff --git a/docs/tutorial-end-to-end.md b/docs/tutorial-end-to-end.md
index 4074a9d..e338baf 100644
--- a/docs/tutorial-end-to-end.md
+++ b/docs/tutorial-end-to-end.md
@@ -312,7 +312,15 @@ az role assignment create `
   --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>
 ```
 
-Propagation usually completes within 30–120 seconds.
+> **Give the assignment a few minutes to propagate.** Data-plane role
+> assignments on the AI Services account do **not** take effect
+> instantly — propagation to the evaluator workers can take several
+> minutes (occasionally up to ~15). Evaluators authenticate per call, so
+> the **first eval right after granting the role may show intermittent
+> `AuthenticationError` on a subset of graders and report
+> `Threshold status: FAILED` even when every threshold is green**. This
+> is a grader execution failure, not a quality regression — wait a few
+> minutes and re-run the eval.
 
 ## 2. Create the travel eval dataset
 
diff --git a/docs/tutorial-hosted-agent-quickstart.md b/docs/tutorial-hosted-agent-quickstart.md
index 9c7ae2e..188f076 100644
--- a/docs/tutorial-hosted-agent-quickstart.md
+++ b/docs/tutorial-hosted-agent-quickstart.md
@@ -334,7 +334,15 @@ az role assignment create `
   --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>
 ```
 
-Propagation usually completes within 30–120 seconds.
+> **Give the assignment a few minutes to propagate.** Data-plane role
+> assignments on the AI Services account do **not** take effect
+> instantly — propagation to the local/Foundry evaluator workers can
+> take several minutes (occasionally up to ~15). Evaluators authenticate
+> per call, so the **first eval right after granting the role may show
+> intermittent `AuthenticationError` on a subset of graders and report
+> `Threshold status: FAILED` even when every threshold is green**. This
+> is a grader execution failure, not a quality regression — wait a few
+> minutes and re-run the eval.
 
 ## 5. Initialize AgentOps interactively
 
diff --git a/docs/tutorial-prompt-agent-quickstart.md b/docs/tutorial-prompt-agent-quickstart.md
index d622b25..b2843d7 100644
--- a/docs/tutorial-prompt-agent-quickstart.md
+++ b/docs/tutorial-prompt-agent-quickstart.md
@@ -270,10 +270,23 @@ az role assignment create `
 ```
 
 Repeat the command with the `travel-agent-dev` resource group if the dev
-project lives in a different RG. The assignment usually propagates within
-30–120 seconds. AgentOps Doctor will detect the missing assignment in a
-future release, but until then this is a manual one-time setup step per
-new environment.
+project lives in a different RG.
+
+> **Give the assignment a few minutes to propagate.** Data-plane role
+> assignments on the AI Services account do **not** take effect
+> instantly — propagation to the Foundry evaluator workers can take
+> several minutes (occasionally up to ~15). The cloud eval runs each
+> grader as an independent worker that authenticates separately, so the
+> **first run right after granting the role may show intermittent
+> `AuthenticationError` on a subset of graders and report
+> `Threshold status: FAILED` even when every threshold is green** (no
+> single row had all graders succeed). This is a grader execution
+> failure, not a quality regression. Wait a few minutes and re-run
+> `agentops eval run` — once propagation finishes, every grader scores
+> and the gate passes.
+
+AgentOps Doctor will detect the missing assignment in a future release,
+but until then this is a manual one-time setup step per new environment.
 
 ## 4. Seed `travel-agent` in the sandbox project
 
diff --git a/plugins/agentops/skills/agentops-eval/SKILL.md b/plugins/agentops/skills/agentops-eval/SKILL.md
index 662fb53..b5b2701 100644
--- a/plugins/agentops/skills/agentops-eval/SKILL.md
+++ b/plugins/agentops/skills/agentops-eval/SKILL.md
@@ -73,6 +73,17 @@ Skip this step only if the user explicitly says the role is already
 assigned, or if a previous `agentops eval run` succeeded against the
 same Foundry account.
 
+**Propagation:** data-plane role assignments do not take effect
+instantly — allow several minutes (occasionally up to ~15) before the
+first eval. The cloud/local graders authenticate per call, so if the
+user runs an eval immediately after this preflight and sees intermittent
+`AuthenticationError` on a subset of graders plus
+`Threshold status: FAILED` while the visible thresholds are green, that
+is propagation lag (a grader **execution** failure), not a quality
+regression. Tell the user to wait a few minutes and re-run
+`agentops eval run`; do not treat it as a failing gate or start changing
+thresholds.
+
 ## Step 1 - Analyze evaluation setup
 
 Run the deterministic local triage first:
diff --git a/src/agentops/cli/app.py b/src/agentops/cli/app.py
index edeff00..dafdc3e 100644
--- a/src/agentops/cli/app.py
+++ b/src/agentops/cli/app.py
@@ -2055,10 +2055,57 @@ def _run_flat_schema_eval(
     if result.summary.overall_passed:
         typer.echo(f"{_cli_label('Threshold status')}: {style('PASSED', 'bold', 'green')}")
         return
+
+    # Distinguish a genuine quality-gate failure from grader *execution*
+    # errors. When evaluator workers error (auth/RBAC/timeout) on a subset of
+    # rows, no row has every grader succeed, so `items_passed_all` is 0 and the
+    # gate reports FAILED even though every threshold that *could* be computed
+    # passed. Surfacing this prevents users from chasing a phantom quality
+    # regression - the most common cause is data-plane RBAC granted moments
+    # earlier that is still propagating to the evaluator workers.
+    errored, total, first_error = _grader_error_summary(result)
+    all_thresholds_passed = (
+        result.summary.thresholds_total > 0
+        and result.summary.thresholds_passed == result.summary.thresholds_total
+    )
+    if errored and all_thresholds_passed:
+        typer.echo(
+            f"{_cli_warn('Warning')}: {errored} of {total} grader execution(s) "
+            "errored, so no dataset row had every grader return a score. This is "
+            "a grader execution failure, not a quality regression - every "
+            "threshold that could be computed passed. The most common cause is "
+            "data-plane RBAC granted recently that is still propagating to the "
+            "evaluator workers; wait a few minutes and re-run `agentops eval run`.",
+            err=True,
+        )
+        if first_error:
+            typer.echo(f"{_cli_warn('Warning')}: first grader error: {first_error}", err=True)
+
     typer.echo(f"{_cli_label('Threshold status')}: {style('FAILED', 'bold', 'red')}")
     raise typer.Exit(code=exit_code_from(result))
 
 
+def _grader_error_summary(result) -> tuple[int, int, Optional[str]]:
+    """Return ``(errored_metric_count, total_metric_count, first_error)``.
+
+    Walks every per-row metric in the run so the CLI can tell a grader
+    *execution* failure (auth/RBAC/timeout) apart from a quality-gate failure.
+    The first non-empty error string is lifted out as the actionable cause.
+    """
+    errored = 0
+    total = 0
+    first_error: Optional[str] = None
+    for row in result.rows:
+        for metric in row.metrics:
+            total += 1
+            err = getattr(metric, "error", None)
+            if isinstance(err, str) and err.strip():
+                errored += 1
+                if first_error is None:
+                    first_error = err.strip()
+    return errored, total, first_error
+
+
 def _default_flat_output_dir(config_path: Path) -> Path:
     base = config_path.parent / ".agentops" / "results"
     timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
diff --git a/src/agentops/templates/skills/agentops-eval/SKILL.md b/src/agentops/templates/skills/agentops-eval/SKILL.md
index 662fb53..b5b2701 100644
--- a/src/agentops/templates/skills/agentops-eval/SKILL.md
+++ b/src/agentops/templates/skills/agentops-eval/SKILL.md
@@ -73,6 +73,17 @@ Skip this step only if the user explicitly says the role is already
 assigned, or if a previous `agentops eval run` succeeded against the
 same Foundry account.
 
+**Propagation:** data-plane role assignments do not take effect
+instantly — allow several minutes (occasionally up to ~15) before the
+first eval. The cloud/local graders authenticate per call, so if the
+user runs an eval immediately after this preflight and sees intermittent
+`AuthenticationError` on a subset of graders plus
+`Threshold status: FAILED` while the visible thresholds are green, that
+is propagation lag (a grader **execution** failure), not a quality
+regression. Tell the user to wait a few minutes and re-run
+`agentops eval run`; do not treat it as a failing gate or start changing
+thresholds.
+
 ## Step 1 - Analyze evaluation setup
 
 Run the deterministic local triage first:
diff --git a/tests/unit/test_eval_run_grader_errors.py b/tests/unit/test_eval_run_grader_errors.py
new file mode 100644
index 0000000..565e53c
--- /dev/null
+++ b/tests/unit/test_eval_run_grader_errors.py
@@ -0,0 +1,150 @@
+"""CLI behaviour when graders *execute* but a subset errors out.
+
+A grader execution error (auth/RBAC/timeout) is not a quality regression, but
+because ``items_passed_all`` requires every grader on a row to succeed, a single
+errored grader flips ``overall_passed`` to ``False`` and the run reports
+``Threshold status: FAILED`` even though every computable threshold passed.
+
+The CLI must surface that distinction loudly so users (the most common trigger
+is data-plane RBAC that is still propagating) do not chase a phantom quality
+failure or start lowering thresholds.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+from typer.testing import CliRunner
+
+from agentops.cli.app import _grader_error_summary, app
+from agentops.core.results import (
+    RowMetric,
+    RowResult,
+    RunResult,
+    RunSummary,
+    TargetInfo,
+    ThresholdEvaluation,
+)
+
+runner = CliRunner()
+
+_AUTH_ERROR = (
+    "FAILED_EXECUTION: (UserError) OpenAI API hits AuthenticationError: "
+    "Principal does not have access to API/Operation."
+)
+
+
+def _result_with_partial_grader_errors() -> RunResult:
+    """One row where coherence scored but similarity errored on auth."""
+    row = RowResult(
+        row_index=0,
+        input="plan a trip",
+        expected="an itinerary",
+        response="here is an itinerary",
+        metrics=[
+            RowMetric(name="coherence", value=5.0),
+            RowMetric(name="similarity", value=None, error=_AUTH_ERROR),
+        ],
+    )
+    summary = RunSummary(
+        items_total=1,
+        items_passed_all=0,  # the errored grader means no row passed all
+        items_pass_rate=0.0,
+        thresholds_total=1,
+        thresholds_passed=1,  # every computable threshold passed
+        threshold_pass_rate=1.0,
+        overall_passed=False,
+    )
+    return RunResult(
+        started_at="2026-06-01T00:00:00+00:00",
+        finished_at="2026-06-01T00:01:00+00:00",
+        duration_seconds=60.0,
+        target=TargetInfo(kind="foundry_prompt", raw="travel-agent:2"),
+        dataset_path="dataset.jsonl",
+        evaluators=["CoherenceEvaluator", "SimilarityEvaluator"],
+        rows=[row],
+        aggregate_metrics={"coherence": 5.0},
+        thresholds=[
+            ThresholdEvaluation(
+                metric="coherence",
+                criteria=">=",
+                expected=">=3",
+                actual="5",
+                passed=True,
+            )
+        ],
+        summary=summary,
+    )
+
+
+def test_grader_error_summary_counts_and_lifts_first_error() -> None:
+    errored, total, first_error = _grader_error_summary(
+        _result_with_partial_grader_errors()
+    )
+    assert (errored, total) == (1, 2)
+    assert first_error is not None
+    assert "AuthenticationError" in first_error
+
+
+def _write_minimal_config(tmp_path: Path) -> Path:
+    dataset = tmp_path / "dataset.jsonl"
+    dataset.write_text(json.dumps({"input": "hi", "expected": "hi"}), encoding="utf-8")
+    config = tmp_path / "agentops.yaml"
+    config.write_text(
+        json.dumps(
+            {"version": 1, "agent": "model:gpt-4o", "dataset": str(dataset)}
+        ),
+        encoding="utf-8",
+    )
+    return config
+
+
+def test_eval_run_warns_on_partial_grader_errors(tmp_path, monkeypatch) -> None:
+    config = _write_minimal_config(tmp_path)
+    output = tmp_path / "out"
+    output.mkdir()
+
+    crafted = _result_with_partial_grader_errors()
+    import agentops.pipeline.orchestrator as orch
+
+    monkeypatch.setattr(orch, "run_evaluation", lambda *a, **k: crafted)
+
+    result = runner.invoke(
+        app,
+        ["eval", "run", "--config", str(config), "--output", str(output)],
+    )
+
+    # A grader-execution failure keeps the gate-failed exit code...
+    assert result.exit_code == 2, result.output
+    # ...but the user is told it is an execution error, not a quality failure.
+    assert "grader execution(s) errored" in result.output
+    assert "propagating" in result.output
+    assert "AuthenticationError" in result.output
+    assert "FAILED" in result.output
+
+
+def test_eval_run_no_warning_when_no_grader_errors(tmp_path, monkeypatch) -> None:
+    config = _write_minimal_config(tmp_path)
+    output = tmp_path / "out"
+    output.mkdir()
+
+    clean = _result_with_partial_grader_errors()
+    # Drop the errored grader so the row is clean and the gate genuinely passes.
+    clean.rows[0].metrics = [RowMetric(name="coherence", value=5.0)]
+    clean.summary.items_passed_all = 1
+    clean.summary.items_pass_rate = 1.0
+    clean.summary.overall_passed = True
+
+    import agentops.pipeline.orchestrator as orch
+
+    monkeypatch.setattr(orch, "run_evaluation", lambda *a, **k: clean)
+
+    result = runner.invoke(
+        app,
+        ["eval", "run", "--config", str(config), "--output", str(output)],
+    )
+
+    assert result.exit_code == 0, result.output
+    assert "PASSED" in result.output
+    assert "grader execution(s) errored" not in result.output
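
Reviewer note (not part of the diff): the tests above pin down the observable contract of `_grader_error_summary` without showing its body. The following is a minimal sketch of logic that would satisfy that contract, assuming the helper simply walks every per-row metric, counts those carrying an `error`, and keeps the first error message. `RowMetric` and `Row` here are simplified stand-ins, and `grader_error_summary` takes a list of rows rather than a full `RunResult`; these shapes are assumptions for illustration, not the project's actual models or implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RowMetric:
    # Simplified stand-in: a grader result is either a value or an error string.
    name: str
    value: Optional[float] = None
    error: Optional[str] = None


@dataclass
class Row:
    # Simplified stand-in for one evaluated dataset row.
    metrics: List[RowMetric] = field(default_factory=list)


def grader_error_summary(rows: List[Row]) -> Tuple[int, int, Optional[str]]:
    """Return (errored executions, total executions, first error message)."""
    errored = 0
    total = 0
    first_error: Optional[str] = None
    for row in rows:
        for metric in row.metrics:
            total += 1
            if metric.error:
                errored += 1
                if first_error is None:
                    # Keep only the first message so the CLI warning stays short.
                    first_error = metric.error
    return errored, total, first_error
```

Counting executions (row x grader pairs) rather than rows matches the `(1, 2)` expectation in `test_grader_error_summary_counts_and_lifts_first_error`, where one row has two graders and one of them errored.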

From 1ae0a127976c6d7c6bcaa6bcc473edeb7e68f4fa Mon Sep 17 00:00:00 2001
From: "github-actions[bot]"
 <41898282+github-actions[bot]@users.noreply.github.com>
Date: Mon, 1 Jun 2026 19:59:22 +0000
Subject: [PATCH 3/3] chore: prepare release 0.3.6

---
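Reviewer note (ignored by `git am`): this release bumps the same version string in four JSON manifests, which is easy to get out of sync. A minimal sketch of a consistency check follows, assuming every manifest stores the release version under a string-valued `"version"` key at some nesting depth. The helper names `iter_versions` and `mismatched_manifests` are hypothetical and not part of this repository; a manifest that also carries an unrelated string `"version"` key would need a narrower lookup.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterable, Iterator, List


def iter_versions(node: Any) -> Iterator[str]:
    """Yield every string value stored under a "version" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "version" and isinstance(value, str):
                yield value
            else:
                yield from iter_versions(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_versions(item)


def mismatched_manifests(paths: Iterable[str], expected: str) -> Dict[str, List[str]]:
    """Map each manifest whose versions differ from `expected` to what it holds."""
    bad: Dict[str, List[str]] = {}
    for path in paths:
        doc = json.loads(Path(path).read_text(encoding="utf-8"))
        found = set(iter_versions(doc))
        if found != {expected}:
            bad[str(path)] = sorted(found)
    return bad
```

Called with the four manifest paths from the diffstat below and `"0.3.6"`, an empty result would mean all of them agree; a manifest with no `"version"` key at all is reported with an empty list.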
 .claude-plugin/marketplace.json | 2 +-
 .github/plugin/marketplace.json | 2 +-
 CHANGELOG.md                    | 2 ++
 plugins/agentops/package.json   | 2 +-
 plugins/agentops/plugin.json    | 2 +-
 5 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
index b829338..ae7fec2 100644
--- a/.claude-plugin/marketplace.json
+++ b/.claude-plugin/marketplace.json
@@ -13,7 +13,7 @@
       "name": "agentops-accelerator",
       "source": "../../plugins/agentops",
       "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Toolkit and Microsoft Foundry agents.",
-      "version": "0.3.5",
+      "version": "0.3.6",
       "keywords": [
         "agentops",
         "evaluation",
diff --git a/.github/plugin/marketplace.json b/.github/plugin/marketplace.json
index b829338..ae7fec2 100644
--- a/.github/plugin/marketplace.json
+++ b/.github/plugin/marketplace.json
@@ -13,7 +13,7 @@
       "name": "agentops-accelerator",
       "source": "../../plugins/agentops",
       "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Toolkit and Microsoft Foundry agents.",
-      "version": "0.3.5",
+      "version": "0.3.6",
       "keywords": [
         "agentops",
         "evaluation",
diff --git a/CHANGELOG.md b/CHANGELOG.md
index d67d95a..f7d77d3 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,8 @@ This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres
 
 ## [Unreleased]
 
+## [0.3.6] - 2026-06-01
+
 ### Changed
 - **`agentops eval run` now distinguishes a grader *execution* failure from a
   quality-gate failure.** When evaluator workers error out on a subset of rows
diff --git a/plugins/agentops/package.json b/plugins/agentops/package.json
index aa6462e..9706810 100644
--- a/plugins/agentops/package.json
+++ b/plugins/agentops/package.json
@@ -2,7 +2,7 @@
   "name": "agentops-accelerator",
   "displayName": "AgentOps Accelerator — Skills for GitHub Copilot",
   "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Accelerator and Microsoft Foundry agents.",
-  "version": "0.3.5",
+  "version": "0.3.6",
   "publisher": "AgentOpsAccelerator",
   "icon": "icon.png",
   "license": "MIT",
diff --git a/plugins/agentops/plugin.json b/plugins/agentops/plugin.json
index 1b1e656..59bb9fa 100644
--- a/plugins/agentops/plugin.json
+++ b/plugins/agentops/plugin.json
@@ -1,7 +1,7 @@
 {
   "name": "agentops-accelerator",
   "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Accelerator and Microsoft Foundry agents.",
-  "version": "0.3.5",
+  "version": "0.3.6",
   "author": {
     "name": "AgentOps Accelerator",
     "url": "https://github.com/Azure/agentops"