From f37eda605b7b7608f0ceffc01a11e1bc3d97283a Mon Sep 17 00:00:00 2001
From: Zac <1221537+tezheng@users.noreply.github.com>
Date: Tue, 26 May 2026 23:24:49 +0800
Subject: [PATCH 001/143] docs: add user-facing documentation site (MkDocs
 Material)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds a complete MkDocs Material documentation site for the winml-cli
project, served from /docs and built locally and via GitHub Actions
(manual dispatch).

Site infrastructure:
- mkdocs.yml with Material theme, mermaid superfences, tabbed code,
  light/dark palette toggle
- pyproject.toml dev deps: mkdocs-material, mkdocs-jupyter,
  pymdown-extensions
- .github/workflows/docs.yml (workflow_dispatch only)
- .gitignore exception for docs/superpowers/specs/

User-facing chapters:

Home — tagline + Goals/Promises bullets sourced from the MVP
transcript; describes the toolkit's three workflows (primitives,
pipeline, one-command) plus the EP × Device coverage promise

Getting Started (3 pages):
- Installation — Win 11 24H2 + Copilot+PC + Python 3.10 + uv + git
  prereqs table; 'No NPU?' callout pointing at --device auto with
  the winml eval caveat
- Quickstart — 5-minute export + inspect with
  'winml sys --list-device --list-ep' verify step
- End-to-End Tour — universal --device auto walkthrough that works
  on Copilot+ PC NPU, DirectML GPU, or CPU; tabbed example outputs
  for sys and perf so each reader sees their own machine

Concepts (12 pages in two sub-groups):
- Fundamentals (5): How winml-cli works, Graph and IR, Weight and
  Activation, EP and Device (with the full 7-EP × Device matrix),
  Datatype and Quantization (8-precision family from _KNOWN_PRECISIONS
  with w4a16 marked 'Planned — not yet supported')
- WinML CLI (7 workflow-concept pages): Primitives and pipeline,
  Load and export, Analyze and optimize, Compile and EPContext,
  Perf and monitoring, Eval and datasets, Config and build (with
  the full WinMLBuildConfig schema inline)

Commands (13 pages):
- Overview with the four user-intent groups (Discover / Configure /
  Build / Measure)
- Per-command reference for: sys, inspect, hub, analyze, config,
  optimize, export, quantize, compile, build, perf, eval

Samples (3 pages):
- ConvNeXt — Primitives Walkthrough (CPU/GPU/NPU device comparison)
- BERT — Config + Build + Perf (workflow demonstration)
- Qwen3 — Composite Models (placeholder for the in-progress feature)

Tutorials (2 pages):
- Overview
- ConvNeXt on NPU — 2200-word linear walkthrough with both QNN and
  OpenVINO compile paths shown via tabbed blocks, plus the
  'winml build' one-shot variant

P2 stubs preserved in nav: Reference, Troubleshooting, Contributing

Source-grounding:
- Every flag mentioned in user-facing docs is verified against
  src/winml/modelkit/
- Non-functional flags (--torch-module, --dynamo on export;
  --no-quant on compile) are explicitly marked
- All URLs target the canonical microsoft/winml-cli destination
- mkdocs build --strict passes with zero warnings

Internal artifacts kept under docs/superpowers/ for reference:
- Spec and plan files for the v1 and v2 design iterations
- 2026-05-26-v3-known-issues.md — fact-checked review findings

Existing internal docs (docs/design/, docs/naming-convention.md,
docs/pytest-best-practices.md) are unchanged and excluded from the
user-facing nav via exclude_docs in mkdocs.yml.
---
 .github/workflows/docs.yml                    |   22 +
 .gitignore                                    |    1 +
 docs/commands/analyze.md                      |   93 ++
 docs/commands/build.md                        |  112 ++
 docs/commands/compile.md                      |  110 ++
 docs/commands/config.md                       |   97 ++
 docs/commands/eval.md                         |   96 ++
 docs/commands/export.md                       |  105 ++
 docs/commands/hub.md                          |  113 ++
 docs/commands/inspect.md                      |  105 ++
 docs/commands/optimize.md                     |  118 ++
 docs/commands/overview.md                     |   70 ++
 docs/commands/perf.md                         |  102 ++
 docs/commands/quantize.md                     |  115 ++
 docs/commands/sys.md                          |  118 ++
 docs/concepts/analyze-and-optimize.md         |   34 +
 docs/concepts/compile-and-epcontext.md        |   42 +
 docs/concepts/config-and-build.md             |  161 +++
 docs/concepts/eps-and-devices.md              |   62 +
 docs/concepts/eval-and-datasets.md            |   70 ++
 docs/concepts/graphs-and-ir.md                |   59 +
 docs/concepts/how-it-works.md                 |  121 ++
 docs/concepts/load-and-export.md              |   41 +
 docs/concepts/perf-and-monitoring.md          |   45 +
 docs/concepts/primitives-and-pipeline.md      |  105 ++
 docs/concepts/quantization.md                 |   65 ++
 docs/concepts/weight-and-activation.md        |   32 +
 docs/contributing.md                          |    4 +
 docs/getting-started/end-to-end.md            |  209 ++++
 docs/getting-started/installation.md          |   88 ++
 docs/getting-started/quickstart.md            |   71 ++
 docs/index.md                                 |   31 +
 docs/reference/index.md                       |    4 +
 docs/samples/bert-config-build.md             |  140 +++
 docs/samples/convnext-primitives.md           |  176 +++
 docs/samples/qwen3-composite.md               |   27 +
 .../superpowers/2026-05-26-v3-known-issues.md |  102 ++
 .../plans/2026-05-20-modelkit-docs-site.md    | 1031 +++++++++++++++++
 .../plans/2026-05-24-docs-expansion-v2.md     |  996 ++++++++++++++++
 .../2026-05-20-modelkit-docs-site-design.md   |  239 ++++
 .../2026-05-24-docs-expansion-v2-design.md    |  263 +++++
 docs/troubleshooting.md                       |    4 +
 docs/tutorials/index.md                       |   11 +
 docs/tutorials/npu-convnext.md                |  281 +++++
 mkdocs.yml                                    |  120 ++
 pyproject.toml                                |    3 +
 46 files changed, 6014 insertions(+)
 create mode 100644 .github/workflows/docs.yml
 create mode 100644 docs/commands/analyze.md
 create mode 100644 docs/commands/build.md
 create mode 100644 docs/commands/compile.md
 create mode 100644 docs/commands/config.md
 create mode 100644 docs/commands/eval.md
 create mode 100644 docs/commands/export.md
 create mode 100644 docs/commands/hub.md
 create mode 100644 docs/commands/inspect.md
 create mode 100644 docs/commands/optimize.md
 create mode 100644 docs/commands/overview.md
 create mode 100644 docs/commands/perf.md
 create mode 100644 docs/commands/quantize.md
 create mode 100644 docs/commands/sys.md
 create mode 100644 docs/concepts/analyze-and-optimize.md
 create mode 100644 docs/concepts/compile-and-epcontext.md
 create mode 100644 docs/concepts/config-and-build.md
 create mode 100644 docs/concepts/eps-and-devices.md
 create mode 100644 docs/concepts/eval-and-datasets.md
 create mode 100644 docs/concepts/graphs-and-ir.md
 create mode 100644 docs/concepts/how-it-works.md
 create mode 100644 docs/concepts/load-and-export.md
 create mode 100644 docs/concepts/perf-and-monitoring.md
 create mode 100644 docs/concepts/primitives-and-pipeline.md
 create mode 100644 docs/concepts/quantization.md
 create mode 100644 docs/concepts/weight-and-activation.md
 create mode 100644 docs/contributing.md
 create mode 100644 docs/getting-started/end-to-end.md
 create mode 100644 docs/getting-started/installation.md
 create mode 100644 docs/getting-started/quickstart.md
 create mode 100644 docs/index.md
 create mode 100644 docs/reference/index.md
 create mode 100644 docs/samples/bert-config-build.md
 create mode 100644 docs/samples/convnext-primitives.md
 create mode 100644 docs/samples/qwen3-composite.md
 create mode 100644 docs/superpowers/2026-05-26-v3-known-issues.md
 create mode 100644 docs/superpowers/plans/2026-05-20-modelkit-docs-site.md
 create mode 100644 docs/superpowers/plans/2026-05-24-docs-expansion-v2.md
 create mode 100644 docs/superpowers/specs/2026-05-20-modelkit-docs-site-design.md
 create mode 100644 docs/superpowers/specs/2026-05-24-docs-expansion-v2-design.md
 create mode 100644 docs/troubleshooting.md
 create mode 100644 docs/tutorials/index.md
 create mode 100644 docs/tutorials/npu-convnext.md
 create mode 100644 mkdocs.yml

diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
new file mode 100644
index 000000000..d4d9992f5
--- /dev/null
+++ b/.github/workflows/docs.yml
@@ -0,0 +1,22 @@
+name: Build & Publish Docs
+
+on:
+  workflow_dispatch:
+
+permissions:
+  contents: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v3
+        with:
+          python-version: "3.10"
+      - run: uv sync --extra dev
+      - run: uv run mkdocs build --strict
+      - uses: peaceiris/actions-gh-pages@v4
+        with:
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+          publish_dir: ./site
diff --git a/.gitignore b/.gitignore
index 6d8e97985..a1815e457 100644
--- a/.gitignore
+++ b/.gitignore
@@ -252,6 +252,7 @@ ui/node_modules/
 ui/src-tauri/target/
 
 specs/
+!docs/superpowers/specs/
 /tests/integration/pattern_crawl/output/
 /tests/unit/pattern_crawl/output/
 /tests/integration/static_analyzer/output/
diff --git a/docs/commands/analyze.md b/docs/commands/analyze.md
new file mode 100644
index 000000000..24292e381
--- /dev/null
+++ b/docs/commands/analyze.md
@@ -0,0 +1,93 @@
+# winml analyze
+
+> Verify an ONNX model is compatible with a target execution provider before deployment.
+
+## When to use this
+
+Use `winml analyze` before running the full build pipeline to confirm that your ONNX model's operators are supported by the intended execution provider and device. It surfaces operator gaps and actionable recommendations early, saving time that would otherwise be spent on a failed compile or quantize run.
+
+## Synopsis
+
+```bash
+$ winml analyze [options]
+```
+
+## Flags
+
+| Flag | Short | Type | Default | Description |
+|------|-------|------|---------|-------------|
+| `--model` | | `PATH` | *(required)* | Path to the ONNX model file to analyze. |
+| `--ep` | | choice | *(none)* | Target execution provider. Accepts full names (`QNNExecutionProvider`, `OpenVINOExecutionProvider`, `VitisAIExecutionProvider`) or short aliases (`qnn`, `ov`/`openvino`, `vitis`/`vitisai`). When omitted, all supported EPs are analyzed. |
+| `--device` | | `CPU\|GPU\|NPU` | `NPU` | Target device type. Filters the analysis for the named device class when `--ep` is also supplied. When omitted, defaults to NPU. |
+| `--output` | | `PATH` | *(none)* | Save the full JSON result to a file in addition to printing the console summary. |
+| `--information` / `--no-information` | | flag | enabled | Include detailed per-operator recommendations and remediation hints in the output. Pass `--no-information` for a compact pass/fail summary. |
+| `--htp-metadata` | | `PATH` | *(none)* | Path to an HTP metadata JSON file. Enables enhanced Qualcomm-specific pattern extraction when targeting QNN. |
+| `--run-unknown-op` / `--no-run-unknown-op` | | flag | enabled | Attempt to run operators unknown to the EP locally to infer shape and type information. Disable when the local machine lacks the required libraries. |
+| `--save-node` | | `partial\|unsupported` | *(none)* | Save partial or unsupported node subgraphs to disk for further investigation. Can be specified multiple times: `--save-node partial --save-node unsupported`. |
+
+## How it works
+
+`winml analyze` loads the ONNX model and runs a static analysis pass via `ONNXStaticAnalyzer`. It checks each operator in the graph against the EP's capability list, classifies nodes as fully supported, partially supported, or unsupported, and optionally runs unknown operators locally to infer missing shape information. The command exits with code `0` when all operators are supported, `1` when at least one operator is unsupported or only partially supported, and `2` on any input or runtime error — making it safe to use in CI pipelines with exit-code checks.
+
+## Examples
+
+Analyze against all supported EPs using the default NPU device:
+
+```bash
+$ winml analyze --model microsoft/resnet-50.onnx
+```
+
+```text
+Analyzing microsoft/resnet-50.onnx against all supported EPs...
+
+QNNExecutionProvider (NPU): FULLY SUPPORTED
+  Operators checked : 142
+  Unsupported       : 0
+  Partial           : 0
+
+OpenVINOExecutionProvider (NPU): FULLY SUPPORTED
+  Operators checked : 142
+  Unsupported       : 0
+  Partial           : 0
+```
+
+Check QNN NPU support using the short alias:
+
+```bash
+$ winml analyze --model bert-base-uncased.onnx --ep qnn --device NPU
+```
+
+Check Intel OpenVINO GPU support and print operator-level recommendations:
+
+```bash
+$ winml analyze --model bert-base-uncased.onnx --ep ov --device GPU --information
+```
+
+Save the full JSON result for offline inspection while still printing the console summary:
+
+```bash
+$ winml analyze --model facebook/convnext-tiny-224.onnx --output results.json
+```
+
+Use QNN with HTP metadata for enhanced Qualcomm pattern extraction:
+
+```bash
+$ winml analyze --model bert-base-uncased.onnx \
+    --ep QNNExecutionProvider --device NPU \
+    --htp-metadata htp_metadata.json
+```
+
+## Common pitfalls
+
+- **Omitting `--ep` analyzes every EP** — this is slower and may produce confusing output when one EP shows unsupported operators that another handles fine. Specify `--ep` when you know your target hardware.
+- **Exit code 1 is not a hard failure** — it means at least one operator is unsupported, not that the model cannot run at all. Many EPs fall back unsupported nodes to the CPU EP automatically; review the recommendations before deciding to restructure the model.
+- **`--htp-metadata` is QNN-specific** — passing a QNN HTP metadata file while targeting a different EP has no effect. Ensure the EP and metadata file correspond to the same hardware.
+- **`--no-run-unknown-op` may widen the unsupported list** — if local execution is disabled, operators whose support cannot be verified statically are conservatively marked as unsupported.
+- **The model path must point to an existing `.onnx` file** — symbolic HuggingFace model IDs are not accepted; export the model first with `winml export`.
+
+## See also
+
+- [eps-and-devices.md](../concepts/eps-and-devices.md) — background on ONNX operators and execution providers
+- [export.md](export.md) — convert a HuggingFace model to ONNX before analyzing
+- [compile.md](compile.md) — compile the model for the target EP after analysis passes
+- [sys.md](sys.md) — list EPs available on the current machine
diff --git a/docs/commands/build.md b/docs/commands/build.md
new file mode 100644
index 000000000..45fca553c
--- /dev/null
+++ b/docs/commands/build.md
@@ -0,0 +1,112 @@
+# winml build
+
+> Run the entire winml-cli pipeline (export → quantize → compile) in one command.
+
+## When to use this
+
+Use `winml build` when you want to go from a Hugging Face model ID (or an
+existing `.onnx` file) to a deployment-ready artifact in a single invocation,
+without manually chaining `winml export`, `winml quantize`, and `winml
+compile`. A build config file — generated by `winml config` — controls every
+stage of the pipeline.
+
+## Synopsis
+
+```bash
+$ winml build [options]
+```
+
+## Flags
+
+| Flag | Short | Type | Default | Description |
+|---|---|---|---|---|
+| `--config` | `-c` | path | *(required)* | `WinMLBuildConfig` JSON file, generated by `winml config`. |
+| `--model` | `-m` | string | *(required)* | Hugging Face model ID or path to an existing `.onnx` file. |
+| `--output-dir` | `-o` | path | `None` | Directory for all build artifacts. Mutually exclusive with `--use-cache`. |
+| `--use-cache` | | flag | `false` | Store artifacts in the winml-cli global cache (`~/.cache/winml/`). Mutually exclusive with `--output-dir`. |
+| `--random-init` | | flag | `false` | Skip weight download; build with random weights (useful for architecture testing). |
+| `--rebuild` | | flag | `false` | Overwrite existing artifacts and re-run the full pipeline. |
+| `--no-quant` | | flag | `false` | Skip the quantization stage, overriding the config. |
+| `--no-compile` | | flag | `false` | Skip the compilation stage, overriding the config. |
+| `--no-optimize` | | flag | `false` | Skip the optimization stage (for pre-quantized ONNX input models). |
+| `--ep` | | string | `None` | Target execution provider for the analyzer (e.g., `qnn`). Falls back to the compile config EP if not set. |
+| `--device` | | string | `None` | Target device for the analyzer (e.g., `NPU`, `GPU`). Default: `NPU`. |
+| `--no-analyze` | | flag | `false` | Skip the analyzer loop during build. |
+| `--max-optim-iterations` | | integer | `3` | Maximum autoconf re-optimization rounds. `--no-analyze` implicitly sets this to 0. |
+| `--help` | `-h` | flag | | Show this message and exit. |
+
+## How it works
+
+`winml build` reads a `WinMLBuildConfig` JSON file (from `winml config`) that
+encodes device, precision, export, quantization, and compilation settings.
+When `-m` is a Hugging Face model ID, the full pipeline runs: export → optimize
+→ quantize → compile. When `-m` points to an existing `.onnx` file, the export
+stage is skipped and the pipeline starts at optimization. After compilation, an
+optional analyzer loop (`--max-optim-iterations`) re-evaluates graph quality
+and applies further passes; `--no-analyze` disables it for a deterministic
+single-pass build. Individual stages can be suppressed with `--no-quant`,
+`--no-compile`, and `--no-optimize` without touching the config file.
+
+## Examples
+
+```bash
+# Full pipeline: HF model → export → optimize → quantize → compile
+winml build -c config.json -m microsoft/resnet-50 -o output/
+```
+
+```text
+winml build
+  Config:     config.json
+  Model:      microsoft/resnet-50
+  Output:     output/
+
+  export       done  (28.3s)
+  optimize     done  (4.1s)
+  quantize     done  (6.8s)
+  compile      done  (14.2s)
+
+  Build complete in 53.4s
+  Final artifact: output/resnet50_ctx.onnx
+```
+
+```bash
+# Start from a pre-exported ONNX file (skips export stage)
+winml build -c config.json -m resnet50.onnx -o output/
+```
+
+```bash
+# Export and optimize only — skip quantization and compilation for quick testing
+winml build -c config.json -m bert-base-uncased -o output/ \
+  --no-quant --no-compile
+```
+
+```bash
+# Force a clean rebuild, overwriting any cached artifacts
+winml build -c config.json -m facebook/convnext-tiny-224 -o output/ --rebuild
+```
+
+```bash
+# Use the global cache and cap optimizer iterations for faster turnaround
+winml build -c config.json -m microsoft/resnet-50 \
+  --use-cache --max-optim-iterations 1
+```
+
+## Common pitfalls
+
+- **Either `--output-dir` or `--use-cache` is required; they are mutually
+  exclusive.** Omitting both raises an error immediately.
+- **`--use-cache` is not supported in module mode.** When the config is a JSON
+  array (module mode), only `--output-dir` is accepted.
+- **The config file must come from `winml config`.** The schema is strict;
+  unknown keys are rejected.
+- **`--random-init` can produce silent failures for some architectures.**
+  Use a real model ID when accuracy matters.
+- **Existing artifacts are reused by default.** Pass `--rebuild` to force a
+  fresh run after changing the config.
+
+## See also
+
+- [winml export](export.md)
+- [winml compile](compile.md)
+- [Config and build](../concepts/config-and-build.md)
+- [How it works](../concepts/how-it-works.md)
diff --git a/docs/commands/compile.md b/docs/commands/compile.md
new file mode 100644
index 000000000..decc0e7d1
--- /dev/null
+++ b/docs/commands/compile.md
@@ -0,0 +1,110 @@
+# winml compile
+
+> Compile an ONNX model to an EP-specific format for fast runtime loading.
+
+## When to use this
+
+Use `winml compile` as the final pipeline stage after `winml quantize` to
+produce an execution-provider-native artifact (for example, a QNN EPContext
+model) that loads faster and avoids online graph compilation at inference time.
+
+## Synopsis
+
+```bash
+$ winml compile [options]
+```
+
+## Flags
+
+| Flag | Short | Type | Default | Description |
+|---|---|---|---|---|
+| `--model` | `-m` | path | *(required unless `--list`)* | Input ONNX model file. |
+| `--output-dir` | | path | same dir as input | Directory to write compiled output artifacts. |
+| `--device` | `-d` | choice | `npu` | Target device: `auto`, `npu`, `gpu`, or `cpu`. |
+| `--ep` | | choice | `None` | Force a specific execution provider, overriding device-to-provider mapping. Choices: `cpu`, `cuda`, `dml`, `migraphx`, `openvino`, `qnn`, `tensorrt`, `vitisai`. |
+| `--no-quant` | | flag | `false` | Flag retained for compatibility; quantization is no longer performed during compile. Use `winml quantize` beforehand. |
+| `--no-validate` | | flag | `false` | Skip validation of the compiled model after compilation. |
+| `--compiler` | | choice | `ort` | Compiler backend: `ort` (ONNX Runtime) or `qairt` (Qualcomm AI Runtime Tools). |
+| `--qnn-sdk-root` | | path | `None` | Path to the QAIRT/QNN SDK root directory. Required when `--compiler qairt` is set. |
+| `--embed` | | flag | `false` | Embed the EP context blob inside the ONNX file instead of writing a separate `.bin` file. |
+| `--list` | | flag | `false` | List available compiler backends for the selected device and exit without compiling. |
+| `--help` | `-h` | flag | | Show this message and exit. |
+
+## How it works
+
+`winml compile` resolves the target execution provider from `--device` and
+`--ep`, then calls the winml-cli compiler API to hand the ONNX graph to the
+EP's offline compilation toolchain. For the default NPU target, ONNX Runtime's
+QNN EP generates a binary `.bin` context file (or embeds it inline with
+`--embed`) that encodes the hardware-optimized execution plan, eliminating
+graph partitioning at load time. When `--compiler qairt` is used, the
+Qualcomm AI Runtime Tools SDK is invoked directly (requires `--qnn-sdk-root`).
+An optional post-compilation validation pass runs a forward pass through the
+target EP; skip it with `--no-validate` when the target hardware is absent.
+
+## Examples
+
+```bash
+# Compile for NPU (default device and compiler)
+winml compile -m resnet50_qdq.onnx
+```
+
+```text
+Input: resnet50_qdq.onnx
+Device: npu
+Provider: qnn
+Compiler: ort
+
+Compiling model...
+
+Success! Model compiled
+Output: resnet50_qdq_ctx.onnx
+Compile time: 12.40s
+Total time: 13.05s
+```
+
+```bash
+# List available compiler backends for NPU before committing to a run
+winml compile --list --device npu
+```
+
+```bash
+# Compile a pre-quantized BERT model for NPU with context embedded inline
+winml compile -m bert-base-uncased_qdq.onnx --embed
+```
+
+```bash
+# Compile for GPU using the MIGraphX execution provider
+winml compile -m microsoft_resnet50.onnx --device gpu --ep migraphx
+```
+
+```bash
+# Compile using the QAIRT SDK and skip post-compilation validation
+winml compile -m facebook_convnext_qdq.onnx \
+  --compiler qairt \
+  --qnn-sdk-root /opt/qnn-sdk \
+  --no-validate
+```
+
+## Common pitfalls
+
+- **`--no-quant` is a no-op in the current release.** Quantization is no longer
+  performed during compile; run `winml quantize` on your model first, then pass
+  the QDQ model to this command.
+- **`--compiler qairt` requires `--qnn-sdk-root`.** Without a valid SDK path,
+  compilation will fail immediately with a missing-executable error.
+- **`--embed` inflates the `.onnx` file significantly.** Embedding the EP
+  context produces a single portable file but can make it impractical to open or
+  inspect the ONNX graph with standard tooling.
+- **Validation requires the target hardware.** The post-compilation validation
+  step runs an actual inference pass; on a machine without the NPU driver or the
+  relevant EP installed, always pass `--no-validate`.
+- **`--device` default is `npu`, not `auto`.** Unlike other commands, compile
+  defaults to NPU targeting. Pass `--device cpu` or `--device gpu` explicitly
+  when targeting other hardware.
+
+## See also
+
+- [winml quantize](quantize.md)
+- [winml build](build.md)
+- [ONNX and execution providers](../concepts/eps-and-devices.md)
diff --git a/docs/commands/config.md b/docs/commands/config.md
new file mode 100644
index 000000000..cf1cd1c2b
--- /dev/null
+++ b/docs/commands/config.md
@@ -0,0 +1,97 @@
+# winml config
+
+> Generate a reusable build configuration for a Hugging Face model or ONNX file.
+
+## When to use this
+
+Use `winml config` at the start of a new model project to produce a `WinMLBuildConfig` JSON file. The config captures the model identity, task, precision, and per-stage settings in one shareable artifact that you can edit, version-control, and repeatedly pass to `winml build`. Running config first lets you review and adjust pipeline settings before committing to a full build.
+
+## Synopsis
+
+```bash
+$ winml config [options]
+```
+
+## Flags
+
+| Flag | Short | Type | Default | Description |
+|------|-------|------|---------|-------------|
+| `--model` | `-m` | `TEXT` | *(none)* | HuggingFace model ID (e.g., `microsoft/resnet-50`) or path to an existing `.onnx` file. Optional when `--model-type` or `--model-class` is provided. |
+| `--task` | `-t` | `TEXT` | *(auto)* | Override the auto-detected task (e.g., `image-classification`, `text-classification`). When omitted, the first supported task for the model is selected automatically. |
+| `--model-class` | | `TEXT` | *(auto)* | Override the auto-detected model class (e.g., `CLIPTextModelWithProjection`). Useful for multi-component models. |
+| `--model-type` | | `TEXT` | *(auto)* | Override the auto-detected model type (e.g., `bert`, `resnet`). Can be used without `-m` to generate a config from HuggingFace default settings. |
+| `--module` | | `TEXT` | *(none)* | Generate configs for every submodule whose class name matches the given string (e.g., `ResNetConvLayer`). The output is a JSON array instead of a single object. |
+| `--config` | `-c` | `PATH` | *(none)* | JSON override file in `WinMLBuildConfig` format. Fields present in this file take precedence over auto-detected values. |
+| `--shape-config` | | `PATH` | *(none)* | JSON file with input shape overrides for dummy input generation. Valid keys by modality — text: `sequence_length`; vision: `height`, `width`, `num_channels`; audio: `feature_size`, `nb_max_frames`, `audio_sequence_length`. |
+| `--device` | `-d` | `auto\|npu\|gpu\|cpu` | `auto` | Target device. Affects the generated quantization and compilation sub-configs. `auto` leaves those sections unchanged from the kit defaults. |
+| `--ep` | | `TEXT` | *(none)* | Force a specific execution provider (`qnn`, `dml`, `migraphx`, `tensorrt`, `vitisai`, `openvino`, `cpu`). Overrides the device-to-provider mapping. When used without `--device`, the device is inferred from the EP. |
+| `--precision` | `-p` | `TEXT` | `auto` | Target precision: `auto`, `fp32`, `fp16`, `int8`, `int16`, or a mixed format such as `w8a16`. `auto` selects the precision based on the chosen device. |
+| `--output` | `-o` | `PATH` | *(stdout)* | Write the generated JSON to this file instead of printing to stdout. |
+| `--library` | | `TEXT` | `transformers` | Source library for `TasksManager` task lookup. Defaults to `transformers`; set to `diffusers` or another Optimum-supported library when needed. |
+| `--no-quant` | | flag | off | Omit quantization from the generated config (sets `quant` to `null`). Equivalent to removing the `quant` section before passing to `winml build`. |
+| `--no-compile` | | flag | off | Omit compilation from the generated config (sets `compile` to `null`). Use this when you want to inspect the optimized ONNX before EP-specific compilation. |
+| `--trust-remote-code` | | flag | off | Allow execution of custom model code from the HuggingFace repository. Required for some community models. Only enable for repositories you trust. |
+
+## How it works
+
+`winml config` queries the HuggingFace `TasksManager` to auto-detect the model's task, class, and ONNX export specification. For known model types it looks up a per-model kit in `MODEL_BUILD_CONFIGS` and uses that as a starting point, layering in your device, precision, and override file on top. When `-m` points to an existing `.onnx` file, the export stage is skipped by setting `export` to `null` in the output. The result is a complete `WinMLBuildConfig` JSON printed to stdout or written to a file, ready to be passed to `winml build`.
+
+## Examples
+
+Generate a config for ResNet-50 with all auto-detected settings:
+
+```bash
+$ winml config -m microsoft/resnet-50
+```
+
+```text
+Generating config for microsoft/resnet-50...
+Auto-selected task: image-classification (from 'microsoft/resnet-50')
+Generated config for task 'image-classification'
+{
+  "loader": { "task": "image-classification", ... },
+  "export": { "opset_version": 17, ... },
+  "optim": { ... },
+  "quant": null,
+  "compile": null
+}
+```
+
+Target NPU with int8 quantization and save to a file:
+
+```bash
+$ winml config -m microsoft/resnet-50 --device npu --precision int8 -o resnet_npu.json
+```
+
+Generate a config for BERT and override the task:
+
+```bash
+$ winml config -m bert-base-uncased --task text-classification -o bert_cls.json
+```
+
+Generate from a model type alone (no HuggingFace download required at config time):
+
+```bash
+$ winml config --model-type bert --task fill-mask
+```
+
+Generate a config from an already-exported ONNX file, skipping quantization and compilation:
+
+```bash
+$ winml config -m facebook/convnext-tiny-224.onnx --no-quant --no-compile -o convnext_optim_only.json
+```
+
+## Common pitfalls
+
+- **At least one of `-m`, `--model-type`, or `--model-class` is required** — calling `winml config` with none of these three flags raises a usage error immediately.
+- **`auto` precision does not always map to a lower-bit type** — when `--device` is also `auto`, precision stays at the kit default (usually `fp32`). Explicitly pass `--device npu` or `--device gpu` for `auto` precision to resolve to `int8` or `fp16`.
+- **`--module` changes the output shape** — with `--module` the JSON output is an array of configs, not a single object. Scripts that expect a single object will fail to parse this output.
+- **`--trust-remote-code` has security implications** — only use this flag with model repositories you own or explicitly trust; it allows arbitrary Python execution from the remote model card.
+- **Shape overrides in `--shape-config` are modality-specific** — passing a `sequence_length` key for a vision model has no effect. Check the `--help` description for valid keys per modality.
+
+## See also
+
+- [Config and build](../concepts/config-and-build.md) — structure of `WinMLBuildConfig` and how stages interact
+- [build.md](build.md) — run the full pipeline using a generated config
+- [export.md](export.md) — export a HuggingFace model to ONNX as a standalone step
+- [optimize.md](optimize.md) — apply graph optimizations to an existing ONNX file
diff --git a/docs/commands/eval.md b/docs/commands/eval.md
new file mode 100644
index 000000000..6c70af952
--- /dev/null
+++ b/docs/commands/eval.md
@@ -0,0 +1,96 @@
+# winml eval
+
+> Evaluate ONNX model accuracy on a standard dataset.
+
+## When to use this
+
+Use `winml eval` to measure how accurately a model performs on real data — especially after quantization, where comparing the quantized model against the floating-point baseline reveals any accuracy regression introduced by precision reduction.
+
+## Synopsis
+
+```bash
+$ winml eval [options]
+```
+
+## Flags
+
+| Flag | Short | Type | Default | Description |
+|---|---|---|---|---|
+| `--model` | `-m` | `TEXT` | — | HuggingFace model ID, or path to a local `.onnx` file. Required (unless `--model-id` is provided directly). |
+| `--model-id` | | `TEXT` | — | HuggingFace model ID used for preprocessor and config resolution when `-m` points to an `.onnx` file. Required when `-m` is an ONNX file. |
+| `--dataset` | | `TEXT` | task default | HuggingFace dataset path (e.g., `imagenet-1k`, `glue`). If omitted, a default dataset is selected based on the task. |
+| `--dataset-name` | | `TEXT` | — | Dataset configuration name for multi-config datasets (e.g., `mrpc` within `glue`). |
+| `--task` | | `TEXT` | auto-detected | Task name (e.g., `image-classification`). Auto-detected from `--model-id` when not provided. |
+| `--device` | | `cpu\|gpu\|npu` | `cpu` | Device to run inference on during evaluation. |
+| `--samples` | `-n` (alias) | `INTEGER` | `100` | Number of dataset samples to evaluate. |
+| `--split` | | `TEXT` | `validation` | Dataset split to use (e.g., `validation`, `test`, `train`). |
+| `--shuffle / --no-shuffle` | | flag | `shuffle` | Shuffle the dataset before sampling. Disable with `--no-shuffle` for reproducible sample ordering. |
+| `--streaming` | | flag | `false` | Stream the dataset from the Hub instead of downloading the full split. Useful for large datasets. |
+| `--column` | | `TEXT` (multiple) | — | Column mapping as `key=value` pairs (e.g., `--column input_column=image`). Can be specified multiple times. |
+| `--label-mapping` | | `PATH` | — | Path to a JSON file mapping label names to integer IDs: `{"label_name": id}`. |
+| `--output` | `-o` | `PATH` | — | Output JSON file path for the evaluation results. |
+| `--schema` | | flag | `false` | Print the expected dataset schema for the given `--task` and exit. Does not run evaluation. |
+
+## How it works
+
+`winml eval` loads the model via `WinMLAutoModel` (supporting both HuggingFace IDs and local ONNX files), then pulls the requested number of samples from a HuggingFace dataset. Each sample is preprocessed using the tokenizer or image processor associated with the model ID, passed through the ONNX Runtime session, and the output is compared against the ground-truth label. Aggregated metrics (accuracy, F1, etc.) are printed to the console and optionally written to a JSON file. When `-m` is an ONNX file, `--model-id` must be provided so the command knows which preprocessor and label vocabulary to use.
+
+## Examples
+
+Evaluate a HuggingFace model using the task-default dataset:
+
+```bash
+$ winml eval -m microsoft/resnet-50
+```
+
+```text
+Task:     image-classification
+Dataset:  imagenet-1k (validation, 100 samples)
+Device:   cpu
+
+Accuracy: 76.00%
+
+Results saved to: microsoft_resnet-50_eval.json
+```
+
+Evaluate a pre-exported ONNX file, providing the source model ID for preprocessing:
+
+```bash
+$ winml eval -m model.onnx --model-id microsoft/resnet-50 --dataset imagenet-1k
+```
+
+Evaluate a BERT model on the MRPC paraphrase task with column remapping:
+
+```bash
+$ winml eval -m bert-base-uncased --dataset glue --dataset-name mrpc \
+    --column input_column=sentence1 --samples 500
+```
+
+Check what dataset columns are expected before running, then evaluate on the NPU:
+
+```bash
+$ winml eval --schema --task image-classification
+$ winml eval -m facebook/convnext-tiny-224 --device npu --samples 200 --split test
+```
+
+Evaluate with a custom label mapping file and save results:
+
+```bash
+$ winml eval -m model.onnx --model-id microsoft/resnet-50 \
+    --label-mapping labels.json -o results/resnet_eval.json
+```
+
+## Common pitfalls
+
+- **ONNX file without `--model-id` fails.** When `-m` is a `.onnx` path, `--model-id` is mandatory. Without it the command cannot resolve the preprocessor or label vocabulary and will exit with a usage error.
+- **Default dataset requires Hub credentials for gated datasets.** Some task defaults (e.g., `imagenet-1k`) require a HuggingFace account with accepted terms of use. Log in with `huggingface-cli login` before running eval on gated data.
+- **`--shuffle` is on by default.** The random 100-sample slice changes between runs unless you pass `--no-shuffle`. Use `--no-shuffle` when comparing two model variants to ensure they see identical samples.
+- **`--streaming` skips the local cache.** Streaming mode avoids downloading the full split but prevents random shuffling on large datasets. For reproducible evaluation, download the split once and omit `--streaming`.
+- **Column names vary across dataset versions.** If the evaluator raises a missing-column error, run `winml eval --schema --task <task>` to inspect the expected schema and use `--column` to remap dataset field names to the expected names.
+
+## See also
+
+- [winml perf](perf.md) — measure latency and throughput on the same model
+- [winml build](build.md) — produce the quantized artifact to evaluate
+- [Quantization & QDQ](../concepts/quantization.md) — why accuracy validation after quantization matters
+- [ONNX & Execution Providers](../concepts/eps-and-devices.md) — understand the `--device` option
diff --git a/docs/commands/export.md b/docs/commands/export.md
new file mode 100644
index 000000000..23014a472
--- /dev/null
+++ b/docs/commands/export.md
@@ -0,0 +1,105 @@
+# winml export
+
+> Convert a PyTorch / Hugging Face model to ONNX, preserving module hierarchy.
+
+## When to use this
+
+Use `winml export` when you have a Hugging Face model ID or a local PyTorch
+checkpoint and need an ONNX file as the first step of the optimization
+pipeline. This is the entry point before `winml quantize` or `winml compile`.
+
+## Synopsis
+
+```bash
+$ winml export [options]
+```
+
+## Flags
+
+| Flag | Short | Type | Default | Description |
+|---|---|---|---|---|
+| `--model` | `-m` | string | *(required)* | Hugging Face model name or local path (e.g., `prajjwal1/bert-tiny`). |
+| `--output` | `-o` | path | *(required)* | Output ONNX file path (e.g., `model.onnx`). |
+| `--with-report` | | flag | `false` | Generate full export reports: Markdown, JSON, and a console tree. |
+| `--clean-onnx` / `--no-hierarchy` | | flag | `false` | Skip embedding `hierarchy_tag` metadata in ONNX nodes, producing a clean ONNX file. |
+| `--dynamo` | | flag | `false` | Enable PyTorch 2.9+ dynamo export for richer node metadata. (Experimental — currently logs a warning.) |
+| `--torch-module` | | string | `None` | Comma-separated list of `torch.nn` module types to include in hierarchy (e.g., `LayerNorm,Embedding`). (Experimental — currently logs a warning.) |
+| `--input-specs` | | path | `None` | JSON file with explicit input tensor specifications. Auto-generated when omitted. |
+| `--task` | `-t` | string | `None` | Override auto-detected Hugging Face task (e.g., `image-feature-extraction`). |
+| `--export-config` | | path | `None` | JSON file with ONNX export parameters such as `opset_version` and `do_constant_folding`. |
+| `--shape-config` | | path | `None` | JSON object mapping symbolic dimension names to concrete sizes (e.g., `{"sequence_length": 2048}`). Ignored when `--input-specs` is provided. |
+| `--help` | `-h` | flag | | Show this message and exit. |
+
+## How it works
+
+`winml export` loads the model via Hugging Face `transformers`, then runs the
+eight-step Hierarchy-preserving Tags Protocol (HTP): model preparation, input
+generation, module-hierarchy tracing, TorchScript ONNX export, node-tagger
+creation, per-node tagging, tag injection into ONNX `metadata_props`, and
+optional report generation. The hierarchy metadata allows downstream tools to
+reason about operators grouped by their originating module rather than flat
+graph position. When `--clean-onnx` is specified, hierarchy steps are bypassed
+and a bare ONNX file is written, useful for third-party tools that do not
+understand custom metadata.
+
+## Examples
+
+```bash
+# Minimal export: Hugging Face model ID to ONNX file
+winml export -m microsoft/resnet-50 -o resnet50.onnx
+```
+
+```text
+Model: microsoft/resnet-50
+Output: resnet50.onnx
+
+Starting HTP export...
+  Detected task: image-classification
+
+Success! Model exported to: resnet50.onnx
+```
+
+```bash
+# Export with verbose output and full Markdown + JSON reports
+winml export -m facebook/convnext-tiny-224 -o convnext.onnx -v --with-report
+```
+
+```bash
+# Export a BERT model, overriding input shapes for longer sequences
+winml export -m bert-base-uncased -o bert.onnx \
+  --shape-config shape.json
+# shape.json: {"sequence_length": 512}
+```
+
+```bash
+# Export with a hand-crafted input-spec file (skips auto-detection)
+winml export -m bert-base-uncased -o bert.onnx --input-specs inputs.json
+```
+
+```bash
+# Produce clean ONNX without hierarchy metadata (for third-party optimizers)
+winml export -m microsoft/resnet-50 -o resnet50_clean.onnx --clean-onnx
+```
+
+## Common pitfalls
+
+- **Task detection fails on unusual model IDs.** If auto-detection picks the
+  wrong task (or fails entirely), pass `-t` with the correct task string, for
+  example `-t image-feature-extraction`.
+- **`--shape-config` is silently ignored when `--input-specs` is set.**
+  `--input-specs` takes full priority; remove it if you only want to override
+  individual dimensions.
+- **`--dynamo` and `--torch-module` are experimental.** Both flags emit a
+  warning and have no effect in the current release. Do not rely on them in
+  automated pipelines yet.
+- **Output directory must be writable.** The command creates parent directories
+  automatically, but will fail with a permission error on read-only paths.
+- **Model weights are downloaded to the Hugging Face cache.** Set `HF_HOME` or
+  `HF_HUB_CACHE` to control the download location.
+
+## See also
+
+- [winml quantize](quantize.md)
+- [winml compile](compile.md)
+- [winml build](build.md)
+- [Load and export concept](../concepts/load-and-export.md)
diff --git a/docs/commands/hub.md b/docs/commands/hub.md
new file mode 100644
index 000000000..efba66652
--- /dev/null
+++ b/docs/commands/hub.md
@@ -0,0 +1,113 @@
+# winml hub
+
+> Browse the curated winml-cli catalog of validated models and benchmarks.
+
+## When to use this
+
+Use `winml hub` to discover which HuggingFace models have been validated end-to-end
+by the winml-cli team — exported, quantized, compiled, and benchmarked on real Windows
+ML devices. It is the starting point when you want a model that is known to work
+before investing time in a custom build.
+
+## Synopsis
+
+```bash
+$ winml hub [options]
+```
+
+## Flags
+
+| Flag | Short | Type | Default | Description |
+|------|-------|------|---------|-------------|
+| `--model-type` | `-t` | string | `null` | Filter the catalog by model architecture (case-insensitive). Examples: `bert`, `roberta`, `vit`. |
+| `--task` | `-k` | string | `null` | Filter by HuggingFace task (case-insensitive). Examples: `text-classification`, `image-segmentation`. |
+| `--model` | `-m` | string | `null` | Show detailed latency and accuracy benchmarks for a specific model ID. Accepts exact ID or an unambiguous substring. |
+| `--output` | `-o` | path | `null` | Save the displayed results to a JSON file. Works for both list and detail views. |
+| `--help` | `-h` | flag | — | Show help and exit. |
+
+> `winml hub` reads a local catalog bundled with the package — no network access is
+> required. It does not accept `--device`, `--ep`, or `--precision`.
+
+## How it works
+
+The catalog is stored in `winml/modelkit/data/hub_models.json` and is loaded
+directly from the installed package data without any network call. Each catalog
+entry records the model ID, task, architecture type, per-EP latency statistics
+(avg, P50, P90, P95, P99, min, max, QPS), and per-EP accuracy results compared
+against a floating-point FP32 baseline. The accuracy verdict uses three levels:
+`PASS` (drop within tolerance), `AT_RISK` (borderline), and `REGRESSION` (exceeds
+threshold). When `--output` is provided, the displayed data — whether a filtered
+list or a single model's detail — is written as indented JSON to the specified path.
+
+## Examples
+
+```bash
+# List all validated models in the catalog
+$ winml hub
+```
+
+```text
+╭─── winml-cli Catalog  |  12 validated model(s) ───────────────────────────╮
+│  Model                             Task                    Model Type     │
+│ ├ microsoft/resnet-50              image-classification    resnet         │
+│ ├ bert-base-uncased                fill-mask               bert           │
+│ ├ ProsusAI/finbert                 text-classification     bert           │
+│ └ ...                                                                     │
+╰────────────────────────────────────────────────────────────────────────────╯
+Use  winml hub --model <id>  to see perf and accuracy details.
+```
+
+```bash
+# Filter to BERT-family models only
+$ winml hub --model-type bert
+```
+
+```bash
+# Filter by task — show only text-classification models
+$ winml hub --task text-classification
+```
+
+```bash
+# Combine filters — BERT models for text classification
+$ winml hub --model-type bert --task text-classification
+```
+
+```bash
+# Show latency and accuracy details for a specific model
+$ winml hub --model ProsusAI/finbert
+```
+
+```bash
+# Save filtered results to JSON for offline review
+$ winml hub --task image-classification --output results/image_catalog.json
+```
+
+## Common pitfalls
+
+- **`--task` short flag is `-k`, not `-t`.** The `-t` short flag is taken by
+  `--model-type`. Using `-t text-classification` will set the architecture filter,
+  not the task filter. Use `-k` or the full `--task` flag.
+- **`--model` performs substring matching when no exact match exists.** If the
+  substring matches more than one catalog entry, the command raises an error and
+  lists the candidates. Use the full model ID to avoid ambiguity.
+- **The catalog reflects a point-in-time snapshot.** Models listed in the catalog
+  were validated against a specific version of winml-cli, ONNX Runtime, and the
+  relevant EP driver. Accuracy and latency may differ on your hardware or with
+  updated drivers.
+- **`--output` only saves what was displayed.** Combining `--model` with `--output`
+  saves the single model's detail dict. Combining a filter with `--output` saves the
+  filtered list. There is no flag to dump the entire catalog in one call — omit all
+  filters and add `--output` to do so.
+- **A model not in the hub can still be used with winml-cli.** The catalog covers
+  tested models; `winml inspect` and `winml export` work with any HuggingFace model
+  that has a supported architecture, whether or not it appears in the hub.
+
+## See also
+
+- [inspect.md](inspect.md) — check loader, exporter, and task detection for any
+  HuggingFace model ID
+- [sys.md](sys.md) — verify your environment and EP availability before building
+- [How winml-cli Works](../concepts/how-it-works.md) — pipeline overview from export
+  to benchmark
+- [Quantization & QDQ](../concepts/quantization.md) — understand accuracy verdicts
+  and what `drop_pct` measures
diff --git a/docs/commands/inspect.md b/docs/commands/inspect.md
new file mode 100644
index 000000000..c6944108a
--- /dev/null
+++ b/docs/commands/inspect.md
@@ -0,0 +1,105 @@
+# winml inspect
+
+> Inspect a model's tasks, classes, and hierarchy before committing to an export.
+
+## When to use this
+
+Use `winml inspect` to understand how winml-cli will treat a HuggingFace model before
+running `winml export` or `winml build`. It answers questions like "which task will be
+auto-detected?", "which HF model class will be loaded?", and "does this model have a
+supported exporter?" without downloading weights or writing any files.
+
+## Synopsis
+
+```bash
+$ winml inspect -m <model_id> [options]
+```
+
+## Flags
+
+| Flag | Short | Type | Default | Description |
+|------|-------|------|---------|-------------|
+| `--model` | `-m` | string | **required** | HuggingFace model ID (e.g. `openai/clip-vit-base-patch32`). Required unless `--help` is used. |
+| `--format` | `-f` | `table` \| `json` | `table` | Output format. `table` renders rich panels; `json` emits a machine-readable object. |
+| `--task` | `-t` | string | `null` | Override the auto-detected task (e.g. `image-classification`, `feature-extraction`). |
+| `--hierarchy` | `-H` | flag | `false` | Print the PyTorch module tree. Instantiates the model with random weights — no weight download required. |
+| `--help` | `-h` | flag | — | Show help and exit. |
+
+> `winml inspect` does not accept `--device`, `--ep`, `--precision`, or `--output`.
+> It is a read-only discovery command that does not produce any artifacts.
+
+## How it works
+
+`winml inspect` calls into the winml-cli registry to resolve the model ID against the
+known loader and exporter configurations. It fetches only the model's `config.json`
+from HuggingFace Hub (no weights), uses the architecture field to look up the matching
+HF model class and WinML inference class, and then renders the result. When
+`--hierarchy` is supplied, the model is instantiated locally with random weights using
+`AutoModel.from_config()`, and a forward-pass trace records the full PyTorch module
+tree. Because no real weights are downloaded, hierarchy inspection is fast even for
+large models.
+
+## Examples
+
+```bash
+# Basic inspection — check task detection and loader/exporter classes
+$ winml inspect -m microsoft/resnet-50
+```
+
+```text
+╭─────────────────────────── microsoft/resnet-50 ───────────────────────────╮
+│ Task          image-classification                                         │
+│ Model Class   ResNetForImageClassification                                 │
+│ Exporter      OptimumExporter                                              │
+│ WinML Class   WinMLImageClassificationModel                                │
+│ Status        Supported                                                    │
+╰────────────────────────────────────────────────────────────────────────────╯
+```
+
+```bash
+# JSON output — useful for scripting or CI pre-flight checks
+$ winml inspect -m bert-base-uncased --format json
+```
+
+```bash
+# Override task when auto-detection picks the wrong one
+$ winml inspect -m bert-base-uncased --task feature-extraction
+```
+
+```bash
+# Print the full PyTorch module hierarchy (no weight download)
+$ winml inspect -m openai/clip-vit-base-patch32 --hierarchy
+```
+
+```bash
+# Combine verbose logging with hierarchy for deep diagnostics
+$ winml inspect -m facebook/convnext-tiny-224 -v -H
+```
+
+## Common pitfalls
+
+- **`--model` is always required.** Unlike some other commands, `winml inspect` has
+  no mode that omits `-m`. The flag is marked required; omitting it returns an error.
+- **Hierarchy requires a locally installable model config.** If the model config
+  references a custom architecture not in the local `transformers` installation,
+  `--hierarchy` will fail with an import error. Update `transformers` or omit the flag.
+- **Task override affects all output.** Passing `--task` changes which exporter and
+  WinML class are reported, not just the task field. If the override is incompatible
+  with the model architecture, the status will show as unsupported.
+- **`--format json` is silent on unsupported models.** When the model is not found in
+  the winml-cli registry, the command raises a `ClickException`. Wrap the call in
+  `winml inspect ... && ...` or check the exit code when scripting.
+- **No weight download does not mean no network access.** The `config.json` is always
+  fetched from HuggingFace Hub. Set `HF_HUB_OFFLINE=1` if you need fully offline
+  inspection of a locally cached model.
+
+## See also
+
+- [hub.md](hub.md) — browse the curated catalog and check accuracy verdicts before
+  inspecting
+- [Load and export concept](../concepts/load-and-export.md) — how `winml.hierarchy.tag`
+  metadata is written and what you can do with the module tree
+- [How winml-cli Works](../concepts/how-it-works.md) — pipeline overview showing where
+  inspect fits before export
+- [ONNX & Execution Providers](../concepts/eps-and-devices.md) — background on loaders,
+  exporters, and EP-specific configurations
diff --git a/docs/commands/optimize.md b/docs/commands/optimize.md
new file mode 100644
index 000000000..78c7c81a5
--- /dev/null
+++ b/docs/commands/optimize.md
@@ -0,0 +1,118 @@
+# winml optimize
+
+> Apply graph optimizations and fusions to an ONNX model to reduce node count and improve inference speed.
+
+## When to use this
+
+Use `winml optimize` after exporting an ONNX model and before quantization or compilation. Graph fusions reduce operator count, improve memory locality, and can make downstream quantization more accurate by presenting cleaner subgraphs to the calibration pass. It is also useful as a standalone step when you want to optimize a pre-exported ONNX file without running the full build pipeline.
+
+## Synopsis
+
+```bash
+$ winml optimize [options]
+```
+
+## Flags
+
+| Flag | Short | Type | Default | Description |
+|------|-------|------|---------|-------------|
+| `--model` | `-m` | `PATH` | *(required unless listing)* | Input ONNX model file. Not required when `--list-capabilities` or `--list-rewrites` is used. |
+| `--output` | `-o` | `PATH` | `{input}_opt.onnx` | Output path for the optimized model. Defaults to the input filename with `_opt` inserted before the extension. |
+| `--preset` | `-p` | `qnn-compatible\|transformer-optimized\|full\|minimal` | *(none)* | Apply a named optimization preset as a starting configuration. CLI flags override preset values. |
+| `--config` | `-c` | `PATH` | *(none)* | YAML or JSON configuration file. Fields in the file override preset defaults; CLI flags override the file. |
+| `--list-capabilities` | `-l` | flag | off | Print all registered optimization capabilities grouped by category and exit. Add `--verbose` for descriptions and ORT names. |
+| `--list-rewrites` | | flag | off | Print all available pattern-rewrite families with their source-to-target mappings and exit. |
+| *(dynamic)* | | flag | *(per capability)* | Each registered capability generates a `--enable-<name>` / `--disable-<name>` pair. Run `--list-capabilities` to see the full current list. Examples: `--enable-gelu-fusion`, `--disable-constant-folding`. Pattern-rewrite flags follow the form `--enable-<source-slug>-<target-slug>`; run `--list-rewrites` to discover all names. |
+
+### Built-in presets
+
+| Preset | Description |
+|--------|-------------|
+| `qnn-compatible` | Disables fusions that produce composite ops unsupported by QNN; sets graph optimization level to 1. |
+| `transformer-optimized` | Enables GELU, LayerNorm, BiasGELU, and Attention fusions — ideal for BERT-family models. |
+| `full` | All fusions in `transformer-optimized` plus MatMul+Add. |
+| `minimal` | Graph optimization level 1 only; no fusions applied. |
+
+### Configuration precedence
+
+When multiple sources are provided, settings are resolved in this order (highest wins):
+
+1. Explicit CLI flags (`--enable-X` / `--disable-X`)
+2. Config file (`-c`)
+3. Preset (`-p`)
+4. Capability defaults
+
+## How it works
+
+`winml optimize` loads the ONNX model, builds a final capability configuration from the resolved precedence chain, and runs all enabled passes through the `Optimizer`. Each capability maps to a named optimization or fusion pipe in the `winml.modelkit.optim` registry. The capability flags are auto-generated at startup from that registry — adding a new optimization to the registry automatically makes it available as a CLI flag without any change to this command's source. After optimization, the command prints the before-and-after node count and percentage reduction so you can quantify the effect.
+
+## Examples
+
+Optimize a model with all capability defaults:
+
+```bash
+$ winml optimize -m microsoft/resnet-50.onnx
+```
+
+```text
+Input:  microsoft/resnet-50.onnx
+Output: microsoft/resnet-50_opt.onnx
+
+Loading model...
+Running optimizer...
+Saving optimized model...
+
+Success! Model optimized: microsoft/resnet-50_opt.onnx
+Nodes: 312 -> 289 (7.4% reduction)
+```
+
+Apply the transformer preset to a BERT model:
+
+```bash
+$ winml optimize -m bert-base-uncased.onnx --preset transformer-optimized -o bert_opt.onnx
+```
+
+Enable a specific fusion on top of the minimal preset:
+
+```bash
+$ winml optimize -m bert-base-uncased.onnx \
+    --preset minimal \
+    --enable-layer-norm-fusion \
+    --enable-attention-fusion \
+    -o bert_layernorm_attn.onnx
+```
+
+Use the QNN-compatible preset and save the result for downstream compilation:
+
+```bash
+$ winml optimize -m facebook/convnext-tiny-224.onnx \
+    --preset qnn-compatible \
+    -o convnext_qnn_opt.onnx
+```
+
+List all available optimization capabilities:
+
+```bash
+$ winml optimize --list-capabilities
+```
+
+Discover pattern-rewrite families and their flag names:
+
+```bash
+$ winml optimize --list-rewrites
+```
+
+## Common pitfalls
+
+- **`--model` is required for actual optimization** — it can be omitted only when using `--list-capabilities` or `--list-rewrites`. Missing `--model` in any other case raises a usage error.
+- **Preset and CLI flags interact via precedence** — a `--disable-X` CLI flag always wins over a preset that enables the same capability, but omitting the flag entirely leaves the preset value in effect. To turn off a capability set by a preset, you must pass the explicit `--disable-X` flag.
+- **Config file validation errors abort the run** — if the config file contains keys that fail capability validation or dependency checks, the command prints all errors and exits with code 1 without touching the model. Fix the config before retrying.
+- **The dynamic flag list changes between releases** — new capabilities are added as the optimizer registry grows. Always use `--list-capabilities` to confirm the current set of flags rather than relying on a cached list.
+- **Output path default may overwrite a sibling file** — if you run optimize twice on the same input without specifying `-o`, the second run silently overwrites `{input}_opt.onnx`. Specify an explicit output path in scripts.
+
+## See also
+
+- [how-it-works.md](../concepts/how-it-works.md) — where optimization fits in the full winml-cli pipeline
+- [export.md](export.md) — produce an ONNX file to optimize from a HuggingFace model
+- [quantize.md](quantize.md) — quantize the optimized model for lower-precision inference
+- [config.md](config.md) — generate a `WinMLBuildConfig` that includes optimization settings
diff --git a/docs/commands/overview.md b/docs/commands/overview.md
new file mode 100644
index 000000000..ac05e70eb
--- /dev/null
+++ b/docs/commands/overview.md
@@ -0,0 +1,70 @@
+# Commands
+
+winml-cli exposes a CLI named `winml` with 12 subcommands covering the full
+journey from model discovery to a deployment-ready artifact. Every subcommand
+shares a consistent invocation style — `winml <command> [flags]` — and the
+same global flags are available on the root `winml` group.
+
+The commands group by user intent. **Discover** (`sys`, `inspect`, `hub`,
+`analyze`) helps you understand your hardware and model before writing any
+artifacts. **Configure** (`config`, `optimize`) produces a reusable build
+configuration and tunes the ONNX graph. **Build** (`export`, `quantize`,
+`compile`, `build`) runs the pipeline stages that produce deployment artifacts.
+**Measure** (`perf`, `eval`) benchmarks and validates the result.
+
+The typical workflow follows that order: run `winml sys` to confirm hardware
+and EPs, then `winml inspect` or `winml hub` to verify model support. Use
+`winml config` to generate a build configuration, then `winml build` to execute
+the full pipeline — or chain `export` → `optimize` → `quantize` → `compile`
+individually for finer control. Close with `winml perf` and `winml eval` to
+measure speed and accuracy.
+
+## Command map
+
+| Command | Group | Purpose |
+|---|---|---|
+| [`sys`](sys.md) | Discover | Inspect your machine — devices, EPs, SDKs, runtime versions at a glance. |
+| [`inspect`](inspect.md) | Discover | Inspect a model's tasks, classes, and hierarchy before committing to an export. |
+| [`hub`](hub.md) | Discover | Browse the curated winml-cli catalog of validated models and benchmarks. |
+| [`analyze`](analyze.md) | Discover | Verify an ONNX model is compatible with a target execution provider before deployment. |
+| [`config`](config.md) | Configure | Generate a reusable build configuration for a Hugging Face model or ONNX file. |
+| [`optimize`](optimize.md) | Configure | Apply graph optimizations and fusions to an ONNX model to reduce node count and improve inference speed. |
+| [`export`](export.md) | Build | Convert a PyTorch / Hugging Face model to ONNX, preserving module hierarchy. |
+| [`quantize`](quantize.md) | Build | Quantize an ONNX model with QDQ insertion and calibration-based scaling. |
+| [`compile`](compile.md) | Build | Compile an ONNX model to an EP-specific format for fast runtime loading. |
+| [`build`](build.md) | Build | Run the entire winml-cli pipeline (export → optimize → quantize → compile) in one command. |
+| [`perf`](perf.md) | Measure | Benchmark an ONNX model's latency and throughput on a target device. |
+| [`eval`](eval.md) | Measure | Evaluate ONNX model accuracy on a standard dataset. |
+
+## Choosing a command
+
+- **I want to see what hardware and EPs I have** → `winml sys`
+- **I want to know if my model is supported** → `winml inspect`
+- **I want to browse validated models with known benchmarks** → `winml hub`
+- **I want to verify EP operator compatibility before compiling** → `winml analyze`
+- **I want to convert a Hugging Face model to ONNX** → `winml export`
+- **I want to run the whole pipeline in one go** → `winml build`
+- **I want to benchmark latency and throughput** → `winml perf`
+- **I want to measure model accuracy** → `winml eval`
+
+## Global flags
+
+`-v` / `--verbose`, `-q` / `--quiet`, `--debug`, `--version`, and `-h` /
+`--help` live on the root `winml` group only. Subcommands access them through
+`ctx.obj` and do not redefine them. See
+`src/winml/modelkit/commands/_options.py` for the canonical contract.
+
+## Shared flags
+
+Several flags share semantics across the commands that accept them:
+`-m` / `--model`, `-d` / `--device`, `--ep`, `-o` / `--output`,
+`-t` / `--task`, and `-p` / `--precision`. Defaults and accepted values can
+differ per command; check the **Flags** section of each command page rather
+than assuming they transfer.
+
+## See also
+
+- [How winml-cli Works](../concepts/how-it-works.md) — end-to-end pipeline overview
+- [Config and build](../concepts/config-and-build.md) — structure of `WinMLBuildConfig` and how stages interact
+- [ONNX & Execution Providers](../concepts/eps-and-devices.md) — background on EPs and how `--device` / `--ep` interact
+- [winml build](build.md) — the single command that runs the entire pipeline
diff --git a/docs/commands/perf.md b/docs/commands/perf.md
new file mode 100644
index 000000000..496e7a422
--- /dev/null
+++ b/docs/commands/perf.md
@@ -0,0 +1,102 @@
+# winml perf
+
+> Benchmark an ONNX model's latency and throughput on a target device.
+
+## When to use this
+
+Use `winml perf` when you want a quantitative latency and throughput baseline for a model on a specific device, or when you need to compare the performance impact of different precision settings, execution providers, or batch sizes.
+
+## Synopsis
+
+```bash
+$ winml perf [options]
+```
+
+## Flags
+
+| Flag | Short | Type | Default | Description |
+|---|---|---|---|---|
+| `--model` | `-m` | `TEXT` | — | HuggingFace model ID or path to a local `.onnx` file. Required. |
+| `--task` | | `TEXT` | auto-detected | Explicit task override (e.g., `image-classification`). Inferred from the model if omitted. |
+| `--iterations` | | `INTEGER` | `100` | Number of timed inference iterations used to compute statistics. |
+| `--warmup` | | `INTEGER` | `10` | Number of warm-up iterations run before timing begins; excluded from statistics. |
+| `--device` | | `auto\|cpu\|gpu\|npu` | `auto` | Device to run the benchmark on. `auto` selects the highest-priority available device. |
+| `--precision` | | `TEXT` | `auto` | Precision mode applied during model build: `auto`, `fp32`, `fp16`, `int8`, `int16`, or compound forms such as `w8a16`. |
+| `--ep` | | `TEXT` | — | Force a specific execution provider (e.g., `qnn`, `dml`, `vitisai`, `openvino`, `cpu`). Overrides the device-to-provider mapping. |
+| `--output` | `-o` | `PATH` | `{model_slug}_perf.json` | Output JSON file path for the benchmark report. |
+| `--batch-size` | | `INTEGER` | `1` | Batch size used when generating synthetic input tensors. |
+| `--shape-config` | | `PATH` | — | Path to a JSON file containing shape overrides (e.g., `{"height": 480, "width": 480}`). Ignored for pre-exported ONNX files and in `--module` mode. |
+| `--no-quantize` | | flag | `false` | Skip quantization during model build. Useful for measuring the fp32 baseline. |
+| `--rebuild` | | flag | `false` | Force model rebuild even if a cached artifact already exists. |
+| `--ignore-cache` | | flag | `false` | Build from scratch in a temporary folder and discard the artifact after benchmarking. Implies `--rebuild`. |
+| `--module` | | `TEXT` | — | PyTorch module class name for per-module benchmarking (e.g., `BertAttention`). Builds and times each matching instance separately. See [Load and export](../concepts/load-and-export.md). |
+| `--monitor` | | flag | `false` | Show a live NPU/CPU utilization chart while the benchmark runs and include hardware metrics in the JSON report. |
+| `--op-tracing` | | `basic\|detail` | — | Enable operator-level profiling. Requires `onnxruntime-qnn`. |
+| `--compare-devices` | | `TEXT` | — | Not yet implemented. Run benchmarks separately and compare the JSON outputs instead. |
+
+## How it works
+
+`winml perf` loads the model through `WinMLAutoModel` — accepting both HuggingFace IDs and local ONNX files — then generates random input tensors from the model's I/O configuration. It runs the specified number of warm-up iterations (excluded from statistics) followed by the timed iterations, collecting per-sample latency. The final report includes mean, min, max, P50, P90, P95, P99, standard deviation, and throughput in samples per second. When `--monitor` is active, a hardware polling loop runs in parallel and records NPU utilization, CPU usage, and device memory alongside the timing data.
+
+## Examples
+
+Basic benchmark on the best available device:
+
+```bash
+$ winml perf -m microsoft/resnet-50
+```
+
+```text
+Device:      npu
+Precision:   auto
+Task:        image-classification
+Iterations:  100 (+ 10 warmup)
+Batch Size:  1
+
+Latency (ms)
+  Avg    P50    P90    P95    P99    Min    Max    Std
+ 2.14   2.11   2.38   2.51   2.79   1.97   3.04   0.12
+
+Throughput: 467.29 samples/sec
+
+Results saved to: microsoft_resnet-50_perf.json
+```
+
+Benchmark a pre-exported ONNX file on CPU with more iterations:
+
+```bash
+$ winml perf -m model.onnx --device cpu --iterations 500
+```
+
+Benchmark a text model with an explicit task, targeting the NPU:
+
+```bash
+$ winml perf -m bert-base-uncased --task text-classification --device npu --precision w8a16
+```
+
+Benchmark with live hardware monitoring enabled:
+
+```bash
+$ winml perf -m microsoft/resnet-50 --device npu --monitor
+```
+
+Per-module benchmarking to find latency hot-spots across all attention blocks:
+
+```bash
+$ winml perf -m bert-base-uncased --module BertAttention --iterations 200
+```
+
+## Common pitfalls
+
+- **Warm-up too low on NPU.** The first several inferences on an NPU EP can be significantly slower due to kernel compilation and caching. The default of 10 warm-up iterations is usually enough for vision models, but transformer models with many operators may need `--warmup 30` or higher to reach steady-state latency.
+- **`--shape-config` is silently ignored in two cases.** It has no effect on pre-exported ONNX files (shapes are baked into the graph) and is ignored in `--module` mode. The command prints a warning in both situations.
+- **`--op-tracing` requires `onnxruntime-qnn`.** The flag activates the QNN profiler, which is only present in the `onnxruntime-qnn` package. If that package is not installed, the benchmark still runs but the op-trace step exits with an error.
+- **Random inputs do not represent real data distributions.** Latency numbers are accurate, but memory access patterns may differ from production because the generated tensors are uniform random values. For memory-bandwidth-sensitive models this can understate real-world latency.
+- **`--compare-devices` is not yet implemented.** Use separate `winml perf` invocations and compare the resulting JSON files manually.
+
+## See also
+
+- [winml eval](eval.md) — measure accuracy after benchmarking
+- [winml build](build.md) — build the quantized artifact that `perf` benchmarks
+- [Load and export concept](../concepts/load-and-export.md) — how `--module` per-instance benchmarking works
+- [ONNX & Execution Providers](../concepts/eps-and-devices.md) — understand `--device` vs `--ep`
diff --git a/docs/commands/quantize.md b/docs/commands/quantize.md
new file mode 100644
index 000000000..cfa9c5e1f
--- /dev/null
+++ b/docs/commands/quantize.md
@@ -0,0 +1,115 @@
+# winml quantize
+
+> Quantize an ONNX model with QDQ insertion and calibration-based scaling.
+
+## When to use this
+
+Use `winml quantize` after `winml export` to insert
+QuantizeLinear/DequantizeLinear (QDQ) node pairs into an ONNX graph. The
+resulting model is ready for `winml compile` targeting an NPU or other
+quantization-aware execution provider.
+
+## Synopsis
+
+```bash
+$ winml quantize [options]
+```
+
+## Flags
+
+| Flag | Short | Type | Default | Description |
+|---|---|---|---|---|
+| `--model` | `-m` | path | *(required)* | Input ONNX model file. |
+| `--output` | `-o` | path | `{input}_qdq.onnx` | Output path for the quantized model. |
+| `--precision` | `-p` | string | `None` | Precision shorthand: `int8`, `int16`, or mixed-precision like `w8a16`. Overridden by explicit `--weight-type` / `--activation-type`. |
+| `--samples` | | integer | `10` | Number of calibration samples used to compute quantization ranges. |
+| `--method` | | choice | `minmax` | Calibration algorithm: `minmax`, `entropy`, or `percentile`. |
+| `--weight-type` | | choice | `None` | Per-tensor type for weights: `uint8`, `int8`, `uint16`, or `int16`. Overrides `--precision`. |
+| `--activation-type` | | choice | `None` | Per-tensor type for activations: `uint8`, `int8`, `uint16`, or `int16`. Overrides `--precision`. |
+| `--per-channel` | | flag | `false` | Apply per-channel (rather than per-tensor) quantization to weight tensors. |
+| `--symmetric` | | flag | `false` | Use symmetric quantization (zero-point fixed at 0). |
+| `--help` | `-h` | flag | | Show this message and exit. |
+
+## How it works
+
+`winml quantize` applies static post-training quantization (PTQ) using the
+ONNX Runtime quantization API. Calibration passes collect activation range
+statistics, which are used to compute scale and zero-point values baked into
+`QuantizeLinear` / `DequantizeLinear` node pairs around each eligible operator.
+The `--method` flag controls range estimation: `minmax` uses global observed
+extremes, `entropy` minimizes KL-divergence, and `percentile` clips outliers.
+Precision can be set at a coarse level with `--precision` or tuned per tensor
+type with `--weight-type` and `--activation-type`; explicit type flags always
+override `--precision`.
+
+## Examples
+
+```bash
+# Minimal quantization: defaults (10 samples, uint8 weights and activations)
+winml quantize -m resnet50.onnx
+```
+
+```text
+Input: resnet50.onnx
+Output: resnet50_qdq.onnx
+Weight type: uint8
+Activation type: uint8
+Samples: 10
+Method: minmax
+
+Running quantization...
+
+Success! Model quantized
+Output: resnet50_qdq.onnx
+QDQ nodes inserted: 53
+Total time: 4.31s
+```
+
+```bash
+# int8 precision shorthand (equivalent to --weight-type int8 --activation-type int8)
+winml quantize -m resnet50.onnx -p int8
+```
+
+```bash
+# Mixed-precision: int8 weights, uint16 activations with entropy calibration
+winml quantize -m bert-base-uncased.onnx \
+  --weight-type int8 --activation-type uint16 \
+  --method entropy --samples 64
+```
+
+```bash
+# Per-channel symmetric quantization to a specific output path
+winml quantize -m facebook_convnext.onnx \
+  -o facebook_convnext_qdq.onnx \
+  --per-channel --symmetric --samples 32
+```
+
+```bash
+# int16 precision (suitable for models sensitive to int8 accuracy loss)
+winml quantize -m bert-base-uncased.onnx --precision int16
+```
+
+## Common pitfalls
+
+- **`--weight-type` / `--activation-type` silently override `--precision`.**
+  If you pass both, the explicit type flags win. Omit `--precision` when
+  setting types explicitly to avoid confusion.
+- **Low sample counts can hurt accuracy.** The default of 10 samples is
+  sufficient for quick testing, but production models typically need 64–256
+  representative samples for good calibration.
+- **`--per-channel` increases model size.** Per-channel quantization stores a
+  separate scale and zero-point per output channel; this can noticeably inflate
+  the model file size compared to per-tensor mode.
+- **Output defaults to `{stem}_qdq.onnx` in the same directory as input.**
+  Always pass `-o` when writing to a specific location to avoid accidentally
+  overwriting or cluttering the source directory.
+- **Quantizing an already-quantized model (one containing QDQ nodes) is
+  unsupported and will produce incorrect results.** Use `winml compile
+  --no-quant` instead if the model already contains QDQ nodes.
+
+## See also
+
+- [winml export](export.md)
+- [winml compile](compile.md)
+- [winml build](build.md)
+- [Quantization concepts](../concepts/quantization.md)
diff --git a/docs/commands/sys.md b/docs/commands/sys.md
new file mode 100644
index 000000000..71d23e81c
--- /dev/null
+++ b/docs/commands/sys.md
@@ -0,0 +1,118 @@
+# winml sys
+
+> Inspect your machine — devices, EPs, SDKs, runtime versions at a glance.
+
+## When to use this
+
+Run `winml sys` before starting any export or build workflow to confirm that the
+required ML libraries are installed and that the target hardware is visible. It is
+also the first command to run when diagnosing an unexpected export failure.
+
+## Synopsis
+
+```bash
+$ winml sys [options]
+```
+
+## Flags
+
+| Flag | Short | Type | Default | Description |
+|------|-------|------|---------|-------------|
+| `--format` | `-f` | `text` \| `json` \| `compact` | `text` | Output format. `text` renders rich tables, `json` emits machine-readable JSON, `compact` prints a single-line summary. |
+| `--list-device` | — | flag | `false` | List available compute devices (NPU, GPU, CPU) in priority order instead of showing the full system report. |
+| `--list-ep` | — | flag | `false` | List available ONNX Runtime execution providers instead of showing the full system report. Can be combined with `--list-device`. |
+| `--help` | `-h` | flag | — | Show help and exit. |
+
+> `winml sys` takes no `--model`, `--device`, `--ep`, `--task`, or `--precision`
+> arguments. It describes the host environment, not a specific model.
+
+## How it works
+
+`winml sys` queries Python's `platform` and `importlib.metadata` modules to report
+library versions, then probes PyTorch for CUDA availability and GPU device names.
+Backend SDK detection checks for `QNN_SDK_ROOT` / `QAIRT_SDK_ROOT` environment
+variables (QNN) and attempts to import `openvino` (OpenVINO). Device enumeration
+queries hardware directly in NPU > GPU > CPU priority order, while EP enumeration
+merges the WinML EP registry with ONNX Runtime's `get_available_providers()`. When
+`--format json` is used the full report — including devices and EPs — is emitted as
+a single JSON object, making it easy to capture in CI pipelines.
+
+## Examples
+
+```bash
+# Full human-readable system report
+$ winml sys
+```
+
+```text
+╭──────────────────────────────────╮
+│   winml-cli System Information    │
+╰──────────────────────────────────╯
+
+Environment
+  Python Version    3.11.9
+  Python Executable C:\...\python.exe
+  OS                Windows 11
+  Machine           AMD64
+
+ML Libraries
+  Library        Version   Status
+  torch          2.4.0     OK
+  transformers   4.44.0    OK
+  onnx           1.16.1    OK
+  ...
+
+Available Devices (priority order)
+  #1  NPU   Qualcomm(R) AI 100
+  #2  GPU   NVIDIA GeForce RTX 4090
+  #3  CPU   AMD Ryzen 9 7940HS
+
+Available Execution Providers
+  QNNExecutionProvider           -> NPU
+  DmlExecutionProvider           -> GPU
+  CPUExecutionProvider           -> CPU
+```
+
+```bash
+# Compact one-liner — useful for CI logs
+$ winml sys --format compact
+```
+
+```bash
+# Machine-readable JSON — pipe to jq or save for later comparison
+$ winml sys --format json > env.json
+```
+
+```bash
+# Only list devices — skip everything else
+$ winml sys --list-device
+```
+
+```bash
+# List EPs as JSON — useful for scripting EP selection
+$ winml sys --list-ep --format json
+```
+
+## Common pitfalls
+
+- **QNN SDK not found even though it is installed.** The detection relies on the
+  `QNN_SDK_ROOT` or `QAIRT_SDK_ROOT` environment variables. If neither is set,
+  `winml sys` will report the SDK as absent even if the binaries exist on disk.
+  Set the variable and re-run.
+- **`--list-device` and `--list-ep` suppress the full report.** When either flag is
+  present, only the requested section is printed. Omit both flags to see the
+  complete system report.
+- **`--format compact` omits device and EP tables.** The compact format is designed
+  for single-line log entries and does not include device or EP details. Use `text`
+  or `json` when you need the full picture.
+- **CUDA shown as unavailable on a machine with a GPU.** PyTorch must be installed
+  with CUDA support (`torch+cuXXX`). A CPU-only torch wheel will always report
+  `cuda_available: false`.
+
+## See also
+
+- [ONNX & Execution Providers](../concepts/eps-and-devices.md) — background on EPs and
+  how `--device` / `--ep` flags interact
+- [inspect.md](inspect.md) — inspect a specific HuggingFace model's compatibility
+- [hub.md](hub.md) — browse the curated catalog of validated models
+- [How winml-cli Works](../concepts/how-it-works.md) — end-to-end pipeline overview
diff --git a/docs/concepts/analyze-and-optimize.md b/docs/concepts/analyze-and-optimize.md
new file mode 100644
index 000000000..f49d63593
--- /dev/null
+++ b/docs/concepts/analyze-and-optimize.md
@@ -0,0 +1,34 @@
+# Analyze and optimize
+
+Not every ONNX graph runs efficiently on every execution provider. An operator that compiles cleanly on CPU may be unsupported on an NPU, and a correct graph may still leave performance on the table because adjacent operations were not fused. winml-cli separates the concern into two commands — `winml analyze` and `winml optimize` — that together form a graph-quality loop driven automatically by `winml build`.
+
+## What analyze does
+
+`winml analyze` performs static analysis on an ONNX file and reports how well it will run on a target EP. It checks operator coverage, runs shape inference to catch missing or inconsistent tensor shapes, and performs runtime checks that probe actual support on the local machine.
+
+Specify a target EP with `--ep` (e.g., `--ep qnn` or `--ep openvino`) and a device with `--device` (CPU, GPU, or NPU). Omit `--ep` to analyze against all supported EPs. Results print to the console by default; add `--output results.json` to save the report as JSON for scripting or archiving.
+
+Exit codes carry the verdict: zero is full support, one is partial support with unsupported operators, two is a configuration error. This makes `winml analyze` suitable as a CI gate. Pass `--information` (enabled by default) to include recommendations alongside each flagged operator. Use `--save-node unsupported` or `--save-node partial` to persist node lists for further work.
+
+## What optimize does
+
+`winml optimize` rewrites the ONNX graph by applying fusions and structural simplifications. Fusions such as GELU, LayerNorm, and MatMul+Add collapse multi-node sequences into single operators that EPs can map to efficient kernels. Layout transformations like the NHWC transformer rearrange tensor memory order to match GPU access patterns.
+
+Every optimization is a named capability toggled via `--enable-<name>` and `--disable-<name>` flags. Run `--list-capabilities` to see all registered optimizations and their defaults. This granularity matters when a specific fusion breaks a downstream step or when you need an exact optimization profile for a given EP.
+
+The pattern-rewrite family is a complementary mechanism: instead of folding nodes, rewrites replace one subgraph pattern with a structurally equivalent alternative. Run `--list-rewrites` to discover available families and their flag names. Flags follow the form `--enable-<source-slug>-<target-slug>`.
+
+Use presets (`--preset transformer-optimized`, `--preset qnn-compatible`) as a starting point, and commit a specific combination to a `--config` file for reproducible builds.
+
+## The analyzer/optimizer loop
+
+A single optimize pass may create fusion opportunities that were not present before, and a freshly fused graph may surface new operator compatibility issues. This is why `winml build` runs analyze and optimize in an alternating loop rather than once each.
+
+The loop repeats up to `--max-optim-iterations` rounds (default: three), which covers most transformer and vision architectures. Convergence is checked after each round; the loop exits early when the analysis result no longer improves. Use `--no-analyze` to skip the loop and run a single optimization pass — useful for deterministic rebuilds from a fixed ONNX checkpoint where the graph is already known good.
+
+## See also
+
+- [Compile and EPContext](compile-and-epcontext.md)
+- [Primitives and pipeline](primitives-and-pipeline.md)
+- [analyze command](../commands/analyze.md)
+- [optimize command](../commands/optimize.md)
diff --git a/docs/concepts/compile-and-epcontext.md b/docs/concepts/compile-and-epcontext.md
new file mode 100644
index 000000000..3a2529277
--- /dev/null
+++ b/docs/concepts/compile-and-epcontext.md
@@ -0,0 +1,42 @@
+# Compile and EPContext
+
+When you run `winml compile`, you are not simply copying an ONNX file to a new location. You are asking an execution provider (EP) to transform the model into a form it can load and run directly, without repeating that transformation at every startup. Understanding what the compiler produces — and why — helps you decide when to compile, what output format to choose, and how to balance file size against runtime performance.
+
+Compilation is an offline, one-time step. The artifact it creates is what you ship with your application and what `winml-cli` uses for benchmarking and evaluation.
+
+## What compilation produces
+
+For EPs that are fully integrated into ONNX Runtime — CPU, DirectML, and similar providers — the compile step writes a new `.onnx` file that the runtime loads directly. The ONNX graph has been prepared and, in some cases, partitioned so that the EP's session initializer has less work to do when the application starts.
+
+For QNN-family EPs (the `--ep qnn` and `--ep vitisai` targets used for NPU inference), the compiler goes further. QNN takes the ONNX graph and produces a binary artifact — the **EP context blob** — that encodes the fully compiled, hardware-ready version of the network. This blob is then associated with the ONNX model file. On subsequent loads, the QNN EP reads the blob rather than re-compiling the graph, which makes session creation dramatically faster.
+
+The default compiler backend is `ort` (ONNX Runtime). If you have a QAIRT SDK installed you can select `--compiler qairt` and point `--qnn-sdk-root` at the SDK root for direct QAIRT compilation instead.
+
+## Embedded vs external EPContext
+
+For QNN compilation, `winml-cli` gives you a choice of where the EP context blob lives. By default the blob is written as a sidecar `.bin` file alongside the `.onnx`. Passing `--embed` instead inlines the blob directly into the ONNX file.
+
+**External (default):** The `.onnx` is small and human-inspectable; the heavy binary data lives in a separate file. You must keep the two files together — the ONNX stores a relative path back to the `.bin`. This layout is preferable for version control and for scenarios where you want to inspect or diff the model graph.
+
+**Embedded (`--embed`):** Everything ships in a single `.onnx` file. Deployment is simpler because there is only one artifact to track. The trade-off is file size: the `.onnx` grows by the full size of the compiled context, and the file is no longer human-readable in the usual sense. Choose embedded when your deployment tooling expects a single model file, or when you want to minimize the chance of the sidecar being misplaced.
+
+## Why pre-compile
+
+The first time an ONNX Runtime session is created for a model on a hardware EP, the runtime must partition the graph, allocate buffers, and JIT-compile the operators. On an NPU this process can take several seconds. For applications with tight startup budgets — on-device inference in a UI flow, for example — that cold-start cost is often unacceptable.
+
+A model produced by `winml compile` has already paid that cost. The EP context blob is the result of compilation, not its input. When the application loads the compiled model the EP reads the pre-built binary and the session is ready almost immediately. Shipping a compiled model is therefore the standard pattern for production deployments on QNN hardware.
+
+If you are iterating on quantization settings or ONNX graphs and want to check whether the model compiles at all, `winml compile` also accepts `--no-quant` to skip the quantization pass for already-quantized (QDQ) models.
+
+## Skipping validation
+
+By default `winml compile` runs a validation pass after compilation finishes — it loads the compiled model, feeds it random inputs, and checks that the outputs are numerically consistent with the original. This catches compilation regressions early.
+
+The `--no-validate` flag skips that pass. It is useful during rapid iteration when you only want to confirm that the EP can accept the model without the overhead of a full inference run. Do not use `--no-validate` for production builds. Shipping an unvalidated compiled artifact risks silent correctness regressions that are difficult to diagnose in the field.
+
+## See also
+
+- [EPs and devices](eps-and-devices.md) — execution provider selection and `--ep` / `--device` flags
+- [Analyze and optimize](analyze-and-optimize.md) — graph-level analysis before compilation
+- [compile command reference](../commands/compile.md)
+- [build command reference](../commands/build.md)
diff --git a/docs/concepts/config-and-build.md b/docs/concepts/config-and-build.md
new file mode 100644
index 000000000..164a94207
--- /dev/null
+++ b/docs/concepts/config-and-build.md
@@ -0,0 +1,161 @@
+# Config and build
+
+`winml config` and `winml build` are a producer/consumer pair. `winml config`
+inspects a Hugging Face model (or an existing ONNX file), auto-detects the task,
+model class, and I/O specifications, and writes a `WinMLBuildConfig` JSON file.
+`winml build` reads that file and runs the full pipeline — export, optimize,
+quantize, compile — producing a Windows ML-ready ONNX artifact.
+
+Keeping these two responsibilities separate is intentional. The config file is a
+stable, human-readable description of exactly what the build will do. You can
+generate it once, review or edit it, commit it to source control, and replay the
+same build at any time without re-running model introspection. CI pipelines and
+team workflows both benefit from treating the config file as a versioned artifact
+rather than a transient intermediate.
+
+## Generating a config
+
+`winml config` produces a `WinMLBuildConfig` JSON with sensible defaults for the
+detected model type. At minimum, provide a model identifier:
+
+```bash
+winml config -m microsoft/resnet-50 -o resnet50.json
+```
+
+Several flags shape what ends up in the config:
+
+- `--task` overrides the auto-detected Hugging Face task when detection is
+  ambiguous or when you want a specific variant (for example, `text-classification`
+  vs `feature-extraction`).
+- `--no-quant` sets the `quant` section to `null`, so the quantize stage is omitted
+  when `winml build` consumes the config. Use this for GPU workflows where float16
+  is preferred over QDQ quantization.
+- `--no-compile` sets the `compile` section to `null`, producing a portable ONNX
+  that the runtime compiles on first load instead of embedding a pre-compiled
+  binary.
+- `--trust-remote-code` allows model repositories that ship custom modeling code —
+  required for some community models that define non-standard architectures outside
+  the standard `transformers` library.
+
+If `-o` is omitted, the config is printed to stdout, which is convenient for
+piping or quick inspection. The generated JSON is plain text and can be edited
+directly before being passed to `winml build`.
+
+## What's in a config
+
+A `WinMLBuildConfig` is a dataclass defined in
+`src/winml/modelkit/config/build.py`. It holds five nested sub-configs, one per
+pipeline stage:
+
+| Field | Type | Purpose |
+|---|---|---|
+| `loader` | `WinMLLoaderConfig` | Task, model type, and model class used to load the Hugging Face model. |
+| `export` | `WinMLExportConfig` | Input/output tensor specs, opset version, dynamic axes (`null` for pre-exported ONNX). |
+| `optim` | `WinMLOptimizationConfig` | Graph fusion flags (GeLU, LayerNorm, MatMul+Add). |
+| `quant` | `WinMLQuantizationConfig` | Precision types (`weight_type`, `activation_type`), calibration samples and method (`null` to skip). |
+| `compile` | `WinMLCompileConfig` | Target EP provider, EPContext options, compiler backend (`null` to skip). |
+
+Setting `quant` or `compile` to `null` tells the pipeline to skip that stage
+entirely, equivalent to passing `--no-quant` or `--no-compile` on the command
+line.
+
+A generated config looks similar to:
+
+```json
+{
+  "loader": {
+    "task": "image-classification"
+  },
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1
+  },
+  "optim": {
+    "gelu_fusion": false,
+    "layer_norm_fusion": false,
+    "matmul_add_fusion": false
+  },
+  "quant": {
+    "mode": "qdq",
+    "weight_type": "uint8",
+    "activation_type": "uint8",
+    "samples": 10
+  },
+  "compile": {
+    "ep_config": {
+      "provider": "qnn",
+      "enable_ep_context": true
+    }
+  }
+}
+```
+
+The file is plain JSON. You can hand-edit any field before passing it to
+`winml build` — adjust the calibration sample count, change the compile
+provider, or remove a fusion flag.
+
+## Consuming a config
+
+Pass the config file to `winml build` with either an output directory or the
+global cache flag:
+
+```bash
+# Write artifacts to a local directory
+winml build -c resnet50.json -m microsoft/resnet-50 --output-dir output/
+
+# Write to the global cache (~/.cache/winml/)
+winml build -c resnet50.json -m microsoft/resnet-50 --use-cache
+```
+
+`--output-dir` and `--use-cache` are mutually exclusive; you must supply one of
+the two when running `winml build` (enforced at runtime, not parse time). Within the output directory, `winml build` writes one ONNX file per
+completed stage so that intermediate artifacts are available for inspection, and
+it writes a copy of the resolved config so the full build parameters are recorded
+alongside the outputs.
+
+## Overrides at run time
+
+CLI flags passed directly to `winml build` override the corresponding config
+sections for that run only, without modifying the JSON file on disk. This makes
+it straightforward to experiment with a variation without creating a new config:
+
+```bash
+# Skip quantization and compilation for this run only
+winml build -c resnet50.json -m microsoft/resnet-50 --output-dir output/ --no-quant --no-compile
+
+# Skip optimization (for a pre-quantized input ONNX)
+winml build -c resnet50.json -m model_qdq.onnx --output-dir output/ --no-optimize
+```
+
+`--no-quant`, `--no-compile`, and `--no-optimize` each suppress the corresponding
+stage regardless of what the config file specifies. Because the config file is
+unchanged, re-running without the override flag reverts to the full pipeline
+described in the config.
+
+## Why version a config
+
+Storing the `WinMLBuildConfig` JSON in source control brings three concrete
+benefits:
+
+1. **Reproducibility.** A config file pins every build decision — task, precision,
+   quantization method, calibration sample count, target EP, fusion flags — in a
+   single file. Running `winml build -c config.json` six months later produces the
+   same artifact as it does today, regardless of how the tool's defaults evolve.
+
+2. **CI integration.** A CI job can run `winml build -c config.json -m <model-id>
+   --output-dir artifacts/` with no human intervention. Because all settings live
+   in the config file, the CI script requires no per-model flag knowledge, and
+   updating build parameters is a pull request to the config file, not a change to
+   the pipeline script.
+
+3. **Team sharing.** Handing a colleague a config file is enough for them to
+   reproduce the exact build on their machine. There is no need to document the
+   sequence of primitive commands, precision arguments, or calibration settings
+   separately — the file is the documentation.
+
+## See also
+
+- [Primitives and pipeline](primitives-and-pipeline.md) — when to use `winml build`
+  vs individual primitive commands
+- [winml config command reference](../commands/config.md)
+- [winml build command reference](../commands/build.md)
diff --git a/docs/concepts/eps-and-devices.md b/docs/concepts/eps-and-devices.md
new file mode 100644
index 000000000..abf447c27
--- /dev/null
+++ b/docs/concepts/eps-and-devices.md
@@ -0,0 +1,62 @@
+# EP and Device
+
+An **Execution Provider (EP)** is a pluggable backend in ONNX Runtime that claims and runs a subset of graph nodes on a specific hardware target. When ONNX Runtime loads a model it partitions the graph among the registered EPs: operators that an EP claims are dispatched to it, and the remainder fall back to the CPU EP. This design lets a single [ONNX](graphs-and-ir.md) model exploit an NPU, GPU, or CPU without any change to the graph itself.
+
+A **device** is the hardware category that an EP targets — one of `npu`, `gpu`, or `cpu`. winml-cli exposes both levels of control: the high-level `--device` flag selects a hardware category, while the low-level `--ep` flag pins a specific ONNX Runtime provider name. In most workflows you set `--device` and let winml-cli resolve the best available EP; you reach for `--ep` when you need to compare or force a specific provider.
+
+## EPs winml-cli supports
+
+The table below lists every Execution Provider that winml-cli has explicit support for. EP names are the canonical ONNX Runtime strings accepted by `--ep`.
+
+| EP | Device | Hardware | When to use |
+|----|--------|----------|-------------|
+| `QNNExecutionProvider` | npu | Qualcomm NPU (Hexagon DSP) | Snapdragon-based Copilot+ PCs; best latency and power efficiency on Qualcomm silicon |
+| `VitisAIExecutionProvider` | npu | AMD NPU (XDNA) | AMD Ryzen AI platforms; targets the AMD AI Engine via the Vitis AI stack |
+| `OpenVINOExecutionProvider` | npu / gpu / cpu | Intel CPU / GPU / NPU | Intel Core Ultra platforms; flexible device targeting across all three Intel compute types |
+| `DmlExecutionProvider` | gpu | GPU (DirectML) | Any DirectX 12 GPU on Windows; broad compatibility across AMD, Intel, and NVIDIA discrete/integrated graphics |
+| `NvTensorRTRTXExecutionProvider` | gpu | NVIDIA GPU (TensorRT RTX) | NVIDIA RTX GPUs; maximum throughput via TensorRT graph optimization |
+| `MIGraphXExecutionProvider` | gpu | AMD GPU (MIGraphX) | AMD discrete GPUs; hardware-accelerated inference via the MIGraphX graph engine |
+| `CPUExecutionProvider` | cpu | CPU | Universal fallback; always available regardless of hardware |
+
+To see which EPs are available on the current machine, run:
+
+```bash
+winml sys --list-ep
+```
+
+## Device vs. EP on the CLI
+
+winml-cli exposes two overlapping flags for targeting hardware. Understanding their relationship prevents confusion when using `winml analyze`, `winml compile`, or `winml build`.
+
+**`--device` (high-level)**
+
+Accepts one of four values: `auto`, `cpu`, `gpu`, or `npu`. When set to `auto` (the default), winml-cli inspects the machine and selects the highest-priority device class that has a compatible EP available, in the order NPU > GPU > CPU. Setting an explicit value such as `--device npu` requests a device category without naming the EP.
+
+```bash
+# Let winml-cli pick the best available device
+winml analyze --model model.onnx --device auto
+
+# Target the NPU device class
+winml analyze --model model.onnx --device npu
+```
+
+**`--ep` (low-level override)**
+
+Accepts a valid EP name (for example `qnn`, `vitisai`, `dml`, `openvino`). When `--ep` is provided it takes precedence over `--device` and bypasses device-class resolution entirely. Use `--ep` when you need to pin a specific provider — for instance to compare `QNNExecutionProvider` against `DmlExecutionProvider` on the same machine.
+
+```bash
+# Force Qualcomm QNN regardless of device selection
+winml analyze --model model.onnx --ep QNNExecutionProvider --device npu
+
+# Use the short alias; winml-cli normalizes it to the full name
+winml analyze --model model.onnx --ep qnn
+```
+
+The `--ep` flag accepts a free-form string and is not restricted to the choices listed above. This allows forward compatibility with EP names that winml-cli does not yet enumerate.
+
+## See also
+
+- [Graphs and IR](graphs-and-ir.md) — ONNX graph format, operator sets, and the IR that EPs consume
+- [Weight and Activation](weight-and-activation.md) — tensor roles relevant to EP compatibility
+- [winml sys](../commands/sys.md) — list available devices and EPs on the current machine
+- [winml analyze](../commands/analyze.md) — check ONNX operator compatibility against a specific EP
diff --git a/docs/concepts/eval-and-datasets.md b/docs/concepts/eval-and-datasets.md
new file mode 100644
index 000000000..bc7f662ec
--- /dev/null
+++ b/docs/concepts/eval-and-datasets.md
@@ -0,0 +1,70 @@
+# Eval and datasets
+
+`winml eval` answers one question: does this model produce correct results? It measures
+accuracy — how well outputs match ground truth — rather than latency or throughput. You
+give it a model, point it at a labeled dataset, and get back a JSON report of metric
+scores. Everything else in the pipeline (compilation, quantization, device selection) is
+about making the model *fast*; eval is about knowing whether it is still *right*.
+
+The dataset is the source of truth. Eval iterates over dataset rows, runs each sample
+through the model, and compares the prediction to the label recorded in the dataset. This
+means the dataset must have both input features and ground-truth labels, and the columns
+carrying those values must be wired to the model's inputs and outputs. winml-cli handles
+standard tasks automatically, but the column-mapping flags let you override the defaults
+for non-standard datasets.
+
+## What eval reports
+
+The metric reported depends on the task. Classification tasks produce accuracy (top-1 and
+optionally top-5). Object detection tasks produce mean average precision (mAP). The exact
+set of metrics is printed to stdout and saved to the file specified by `--output`. The
+`--output` flag accepts any `.json` path; if omitted, results are printed but not persisted.
+Use `--schema` to print the expected dataset schema for a given task without running eval,
+which is useful when you are preparing a custom dataset.
+
+## Picking a dataset
+
+`--dataset` takes a Hugging Face dataset path — for example `imagenet-1k` or `glue`. If
+you omit it, winml-cli selects a default dataset based on the detected task. For datasets
+that have multiple configurations, `--dataset-name` picks the specific config (e.g.
+`--dataset-name mrpc` when using the `glue` dataset).
+
+By default eval runs on the `validation` split; `--split` overrides this. Full validation
+sets can be large. During development, `--samples 200` caps the run to 200 rows so you get
+quick feedback. For very large datasets that you prefer not to download fully, `--streaming`
+fetches rows on demand instead of materialising the whole dataset locally. `--shuffle`
+(on by default) randomises sampling order so a capped run is representative rather than
+biased toward the first rows.
+
+## Column mapping
+
+winml-cli must know which dataset column feeds which model input and which column holds
+the ground-truth label. For well-known task/dataset combinations this mapping is built in.
+When it is not, use `--column key=value` to declare it. The `key` is the name the task
+pipeline expects (e.g. `input_column`) and `value` is the actual column name in the
+dataset (e.g. `image`). You can repeat `--column` as many times as needed.
+
+When the integer label IDs in the dataset do not match the class indices the model was
+trained against, `--label-mapping` accepts a JSON file of the form `{"class_name": id}`
+that translates between the two spaces. This is common with models fine-tuned on a
+relabelled subset of a public dataset.
+
+## Why eval after quantization
+
+Quantization is a lossy transformation. Converting weights from float32 to int8, or
+activations to a narrow range, introduces rounding error that accumulates differently
+across architectures and calibration data. The impact on accuracy cannot be predicted
+analytically; it must be measured. Running `winml eval` before and after quantization
+gives you a concrete accuracy delta. A drop within your acceptable threshold confirms the
+quantized model is ready; a larger drop means you should revisit calibration settings or
+switch to a less aggressive quantization scheme.
+
+Make this a habit: quantize, then eval. Comparing two `--output` JSON files is a reliable,
+reproducible record that the trade-off between performance and accuracy was explicitly
+checked. See [Quantization](quantization.md) for the full quantization workflow.
+
+## See also
+
+- [Quantization](quantization.md) — calibrate and quantize a model, then verify with eval
+- [Perf and monitoring](perf-and-monitoring.md) — measure latency and throughput after accuracy is confirmed
+- [`winml eval` command reference](../commands/eval.md) — all flags with examples
diff --git a/docs/concepts/graphs-and-ir.md b/docs/concepts/graphs-and-ir.md
new file mode 100644
index 000000000..af5d787e9
--- /dev/null
+++ b/docs/concepts/graphs-and-ir.md
@@ -0,0 +1,59 @@
+# Graph and IR
+
+A `.onnx` file is, at rest, a binary-serialized Protocol Buffer. Open it in any hex editor and you will find the familiar `ONNX` magic bytes followed by a dense encoding of every number the model has ever learned, plus the structural description of how those numbers are combined to produce a prediction. The file is self-contained: weights and computation recipe live together, making the artifact portable without any accompanying framework installation.
+
+That computation recipe is a **graph** — a directed acyclic structure of operators wired together by named data edges. The graph is what the ONNX Intermediate Representation (IR) actually defines. When winml-cli loads or transforms a model, every operation works against this graph structure, not against framework-specific objects.
+
+## What is in a .onnx file
+
+An ONNX `ModelProto` wraps a single `GraphProto`. Inside the graph you will find:
+
+- **Inputs** — typed, named entry points that accept runtime tensors (e.g., `pixel_values: float32[1, 3, 224, 224]`).
+- **Outputs** — typed, named exit points that carry the model's predictions back to the caller.
+- **Nodes** — individual operators (Conv, MatMul, Softmax, …) that transform tensors. Each node names its inputs and outputs using the same string identifiers used throughout the graph.
+- **Initializers** — constant tensors embedded in the file. Learned weights, biases, and lookup tables are stored here; they are treated as graph inputs that are always pre-supplied.
+- **Metadata** — key–value string properties attached at the model level. winml-cli uses this area to store information such as `winml.io.inputs` (serialized tensor specs) and `winml.hierarchy.tag` attributes on individual nodes.
+
+## Graphs as IR
+
+ONNX functions as an Intermediate Representation: a portable, framework-neutral description of a computation that can be loaded by any conforming runtime. Unlike a Python object graph or a compiled binary, the ONNX IR makes data flow completely explicit. Every node declares the exact names of its input and output edges; those names form a namespace shared across the whole graph, so any consumer can trace a tensor from the model inputs through every transformation to the final output.
+
+This explicit wiring unlocks two capabilities that winml-cli relies on heavily. First, **shape inference** can propagate concrete or symbolic dimensions through the graph without running it — a prerequisite for correct quantization and for generating input specs automatically. Second, **EP-targeted compilation** can partition the graph by examining which nodes an Execution Provider supports, fuse eligible sub-graphs into accelerated kernels, and serialize the result back into a valid ONNX file using the `EPContext` convention. Neither of these would be tractable on an opaque binary or a dynamic execution trace.
+
+Because the IR is static — describing the full computation at load time rather than at call time — winml-cli can inspect, validate, and transform a model without a GPU, a framework, or sample data.
+
+## Opsets and versioning
+
+Every operator in ONNX belongs to a **domain**, and every domain advances through numbered **opset versions**. An opset is a snapshot of the operator catalog: it defines which operators exist, what their inputs and outputs mean, and how edge cases are handled. When a model declares `opset_import { domain: "" version: 17 }`, it is saying "all unnamed-domain operators in this file must be interpreted according to the rules published in opset 17."
+
+winml-cli defaults to **opset 17** when exporting a PyTorch model to ONNX. This is the value of `opset_version: int = 17` in `WinMLExportConfig` (`src/winml/modelkit/export/config.py`, line 75). Opset 17 introduced layer-normalisation and group-normalisation operators in native form, eliminating the multi-node decompositions required by earlier opsets, which is why it is the recommended baseline for modern transformer and vision architectures.
+
+Higher opsets unlock additional operators and fix known edge-case behavior, but not every Execution Provider supports the latest opset. QNN, for instance, may lag behind the ONNX standard by one or two versions. If you need to target an older EP, pass a custom export configuration:
+
+```bash
+# Write a config override
+echo '{"opset_version": 16}' > export_cfg.json
+
+# Export with the override
+winml export -m prajjwal1/bert-tiny -o bert.onnx --export-config export_cfg.json
+```
+
+You can also check the opset a saved model declares:
+
+```bash
+winml inspect -m bert.onnx
+```
+
+```text
+Opset: ai.onnx == 17
+```
+
+When winml-cli's optimization and quantization pipelines transform a model, they preserve the declared opset unless explicitly instructed otherwise, so the model you receive after `winml quantize` will carry the same opset version as the model you supplied.
+
+## See also
+
+- [EP and Device](eps-and-devices.md)
+- [Weight and Activation](weight-and-activation.md)
+- [Datatype and Quantization](quantization.md)
+- [winml inspect command](../commands/inspect.md)
+- [winml export command](../commands/export.md)
diff --git a/docs/concepts/how-it-works.md b/docs/concepts/how-it-works.md
new file mode 100644
index 000000000..91739833d
--- /dev/null
+++ b/docs/concepts/how-it-works.md
@@ -0,0 +1,121 @@
+# How winml-cli Works
+
+winml-cli is a toolkit for converting PyTorch and Hugging Face models into ONNX artifacts
+that are optimized and compiled for Windows ML execution providers (EPs). Starting from a
+model identifier or a pre-exported ONNX file, winml-cli runs a staged pipeline — export,
+optimize, quantize, compile — and produces a final `model.onnx` ready for inference via
+a Windows ML session.
+
+Each stage is independently controllable. Quantization and compilation are optional and
+can be bypassed with a flag or by leaving the corresponding section of the build
+configuration empty. The same pipeline API that powers `winml build` is also the
+programmatic entry point for `WinMLAutoModel.from_pretrained()`.
+
+## The Pipeline at a Glance
+
+```mermaid
+flowchart TD
+    A[PyTorch / HF model] --> B[winml export]
+    B --> C[winml optimize]
+    C --> D[winml quantize]
+    D --> E[winml compile]
+    E --> F[EP-ready ONNX]
+    F --> G[winml perf / eval]
+```
+
+The stages run in order, and each one writes an intermediate ONNX file to the output
+directory. All intermediate artifacts are preserved so you can inspect any stage's output
+or feed a pre-processed file into a later stage directly.
+
+## Pipeline Stages
+
+### Export — `winml export`
+
+`winml export` loads a Hugging Face model (pretrained or random-weight), traces it with
+torch.export or an Optimum-based exporter, and writes a portable, device-agnostic ONNX
+file. The output at this stage is a plain ONNX graph with float32 weights and no
+EP-specific nodes.
+
+### Optimize — `winml optimize`
+
+`winml optimize` runs graph-level transformations on the exported ONNX: operator fusion
+(attention, layer norm, GeLU), constant folding, and graph pruning. The optimize stage
+also contains an autoconf loop: a static analyzer inspects the graph for nodes that the
+target EP cannot dispatch natively, and re-runs optimization with adjusted fusion flags
+until no further improvements are found (up to a configurable iteration limit).
+
+### Quantize — `winml quantize`
+
+`winml quantize` inserts Quantize-Dequantize (QDQ) nodes into the optimized graph to
+reduce weights and activations to lower-precision types (for example, int8 weights with
+uint8 activations). Calibration data is used to compute quantization parameters per
+tensor. If the input model already contains QDQ nodes, this stage is skipped
+automatically.
+
+### Compile — `winml compile`
+
+`winml compile` invokes an EP-specific compiler (for example, the QNN compiler for NPU
+targets) to embed a pre-compiled binary cache inside the ONNX graph as an EPContext node.
+At inference time, the EP loads the cached binary directly, bypassing per-session
+compilation. Compilation is optional; omitting it produces a portable ONNX that is
+compiled on first load by the runtime.
+
+### Perf and Eval — `winml perf` / `winml eval`
+
+After the model is built, `winml perf` benchmarks inference latency and throughput using
+a Windows ML session, and `winml eval` runs task-specific accuracy evaluation. Neither
+command modifies the model; they consume the final `model.onnx` produced by the pipeline.
+
+## `winml build` as the One-Shot Wrapper
+
+Running each stage individually is useful when iterating on a specific step, but the
+normal workflow is `winml build`, which orchestrates the full pipeline in a single
+command:
+
+```bash
+winml build -c config.json -m microsoft/resnet-50 -o output/
+```
+
+`winml build` auto-detects whether the input is a Hugging Face model ID or an existing
+ONNX file and calls the appropriate internal API (`build_hf_model` or `build_onnx_model`).
+When given an ONNX file directly, the export stage is skipped and the pipeline starts at
+optimize.
+
+Individual stages can be bypassed from the command line without editing the config file:
+
+```bash
+# Skip quantization and compilation
+winml build -c config.json -m bert-base-uncased -o output/ --no-quant --no-compile
+
+# Skip optimization (for pre-quantized input)
+winml build -c config.json -m model_qdq.onnx -o output/ --no-optimize
+```
+
+## Configuration: `WinMLBuildConfig` vs CLI Flags
+
+Pipeline behavior is primarily governed by a `WinMLBuildConfig` JSON file generated by
+`winml config`. The config is a hierarchical structure with one section per stage:
+
+```text
+WinMLBuildConfig
+├── loader    — model type, task, input constraints
+├── export    — input tensor specs, opset, backend
+├── optim     — fusion flags, optimization level
+├── quant     — precision, calibration settings (null = skip stage)
+└── compile   — target EP, device (null = skip stage)
+```
+
+Setting `quant` or `compile` to `null` in the JSON file is equivalent to passing
+`--no-quant` or `--no-compile` on the command line; both result in the corresponding
+stage being skipped. CLI flags override the config at runtime without modifying the file,
+which is convenient for one-off experiments.
+
+The config file is written (or updated) to the output directory after the optimize stage
+completes, capturing any autoconf-adjusted fusion flags so the build is reproducible.
+
+## See Also
+
+- [winml build](../commands/build.md) — full reference for the build command
+- [winml export](../commands/export.md) — export command reference
+- [ONNX and Execution Providers](eps-and-devices.md) — background on EPs and the ONNX runtime
+- [Config and build](config-and-build.md) — detailed field-by-field config documentation
diff --git a/docs/concepts/load-and-export.md b/docs/concepts/load-and-export.md
new file mode 100644
index 000000000..cd28c3a79
--- /dev/null
+++ b/docs/concepts/load-and-export.md
@@ -0,0 +1,41 @@
+# Load and export
+
+The first stage of the winml-cli pipeline is the most deterministic: bring a model into memory and convert it to ONNX. Everything that follows — optimization, quantization, compilation — operates on that ONNX artifact. A well-exported graph with accurate metadata travels cleanly through the rest of the pipeline without requiring patching or re-export.
+
+Loading is an internal operation: the loader module resolves model provenance, selects the right HuggingFace model class, and prepares the weights for tracing. The `winml export` command is the surface users interact with directly.
+
+## Loading a model
+
+When you point winml-cli at a model identifier, the internal loader resolves it in one of two ways. If the identifier looks like a HuggingFace Hub path (e.g., `prajjwal1/bert-tiny`), the loader downloads the model weights and configuration to the standard HuggingFace cache at `~/.cache/huggingface`. Subsequent runs are served from that cache without re-downloading. If the identifier is a path to a local PyTorch checkpoint directory, the loader reads it directly without network access.
+
+In both cases the loader auto-detects the task — image classification, text feature extraction, and so on — and selects a corresponding HuggingFace model class. The result is a PyTorch model object ready for tracing.
+
+Before committing to a full export you can verify that the loader resolved everything correctly with `winml inspect`. It prints the detected task, the HuggingFace model class, the export configuration, and the WinML inference class — all without downloading weights. Add `--hierarchy` to reconstruct the PyTorch module tree from random-weight tracing.
+
+Some community models host custom Python code in their repositories. The loader refuses to execute it by default. Pass `--trust-remote-code` to `winml config` when generating a build configuration for such a model.
+
+## Exporting to ONNX
+
+`winml export` converts the loaded model to ONNX. The conversion uses TorchScript tracing by default, which follows actual execution paths and tends to produce compact, inference-oriented graphs. A `--dynamo` flag exists for the PyTorch 2.x dynamo exporter; however, **Note:** the `--dynamo` flag is reserved for the PyTorch 2.x dynamo exporter but is **not yet functional** in the current release — passing it logs a warning and the flag is ignored.
+
+By default the exporter runs an eight-step process that includes hierarchy tracing and tag injection. Every ONNX node carries a `winml.hierarchy.tag` metadata entry recording the PyTorch module path it came from (e.g. `/BertModel/BertEncoder/BertLayer.3/BertAttention`), plus a companion `winml.hierarchy.depth` integer. The model itself also carries `winml.io.inputs` and `winml.io.outputs` JSON metadata describing the I/O tensor specs. Together these power per-module benchmarking with `winml perf --module`, inspector views with `winml inspect --hierarchy`, and optimizer scoping.
+
+If you need a clean, standard-compliant ONNX without custom metadata — to hand off to a third-party tool, for example — pass `--no-hierarchy` (alias `--clean-onnx`). The graph behaviour is unchanged, but hierarchy-dependent features will not work against that file.
+
+Use `--with-report` to generate companion markdown and JSON reports alongside the output.
+
+## Where it goes wrong
+
+Most export failures fall into three categories.
+
+**Task mismatch.** The loader auto-detects task from the model card and configuration, but some models are registered under multiple tasks or have ambiguous metadata. If the wrong task is selected the exporter generates incorrect dummy inputs and the trace fails or produces wrong output shapes. Override it explicitly with `--task`, for example `--task image-feature-extraction`.
+
+**Shape issues.** Transformer models often have symbolic sequence-length dimensions; vision models may expect a fixed spatial resolution. If the default dummy inputs do not match what the model accepts, shape inference will fail or produce dynamic shapes that downstream tools cannot handle. Provide a `--shape-config` JSON file with explicit overrides, or use `--input-specs` to supply a fully specified input manifest.
+
+**Custom modules.** Some models contain `torch.nn.Module` subclasses the tracer cannot automatically decompose. A `--torch-module` option (comma-separated class names) is intended to include them as distinct hierarchy nodes rather than inlining them — most often needed for custom normalization or attention implementations defined in the model repository. **Note:** the `--torch-module` flag is reserved for module-targeted export but is **not yet functional** in the current release — passing it logs a warning and the flag is ignored.
+
+## See also
+
+- [Graph and IR](graphs-and-ir.md)
+- [inspect command](../commands/inspect.md)
+- [export command](../commands/export.md)
diff --git a/docs/concepts/perf-and-monitoring.md b/docs/concepts/perf-and-monitoring.md
new file mode 100644
index 000000000..3cb6d8110
--- /dev/null
+++ b/docs/concepts/perf-and-monitoring.md
@@ -0,0 +1,45 @@
+# Perf and monitoring
+
+Knowing that a model produces correct outputs is necessary but not sufficient for a production deployment. You also need to know how fast it runs, how consistently it runs, and where the time goes when it does not run fast enough. `winml perf` is the primary tool in `winml-cli` for answering those questions. It synthesises end-to-end latency numbers, per-operator timings, and live hardware utilisation into a single benchmarking workflow.
+
+Because `winml perf` accepts both HuggingFace model IDs and local `.onnx` files, you can benchmark at any stage of the development cycle — from a freshly exported float model through to a compiled, quantized production artifact.
+
+## What perf measures
+
+At its core, `winml perf` runs a configurable number of inference iterations and reports latency statistics: p50, p90, and mean latency in milliseconds, plus throughput in inferences per second. Warmup iterations (controlled by `--warmup`, defaulting to 10) are excluded from the statistics so that JIT and cache effects do not skew the numbers.
+
+You can control the run length with `--iterations` and the input shape with `--batch-size` or a `--shape-config` JSON file for models with dynamic axes. The `--device` flag selects the target EP — `cpu`, `gpu`, or `npu` — allowing you to collect numbers on each target with the same command and compare them directly. For fine-grained EP control, `--ep` lets you name a specific provider such as `qnn` or `dml`.
+
+The results are written to a JSON file (defaulting to `{model_slug}_perf.json`) so they can be archived and compared across builds.
+
+## Live monitoring
+
+Latency numbers alone do not tell you whether the hardware is actually being used. A slow NPU inference could mean the model is running on the NPU and hitting a memory bottleneck, or it could mean the EP silently fell back to CPU and is not using the NPU at all.
+
+The `--monitor` flag adds a live terminal chart that streams NPU utilisation while the benchmark runs. The chart updates in place during the iteration loop so you can see whether utilisation is sustained, bursty, or absent. This is particularly useful when commissioning a new model on QNN hardware, where EP fallback can be hard to detect from latency numbers alone. If the chart stays near zero while the benchmark runs, the model is not executing on the NPU as expected.
+
+`--monitor` has no effect on the measured latency statistics — it is a passive observer.
+
+## Per-operator tracing
+
+When end-to-end latency is higher than expected, `--op-tracing` lets you find the operators that are responsible. Two levels are available:
+
+`--op-tracing basic` collects cumulative time per operator type and reports a ranked list. This is usually enough to identify whether, say, a sequence of Attention nodes or a large MatMul is dominating the runtime.
+
+`--op-tracing detail` goes further, collecting timing for every individual operator node in the graph. This is useful when the same operator type appears in different parts of the model with very different costs — for instance, early-layer convolutions versus late-layer convolutions in a ResNet-style architecture.
+
+Both levels require an `onnxruntime-qnn` build with profiling support. If the requirement is not met, `winml-cli` will tell you at startup rather than silently running without tracing.
+
+## Per-module benchmarking
+
+Large Transformer-family models contain many repeated module instances — attention blocks, feed-forward layers, encoder stages. When you want to understand the cost of one type of block rather than the full network, `--module <substring>` isolates and benchmarks matching modules from the HuggingFace model hierarchy.
+
+`winml perf -m bert-base-uncased --module BertAttention`, for example, builds and benchmarks each `BertAttention` instance separately and reports per-instance statistics. This is faster to iterate on than benchmarking the full model when you are tuning a specific layer, and it makes the attribution of latency to architectural decisions much clearer.
+
+The module hierarchy that `--module` navigates is built at export time: every ONNX node carries a `winml.hierarchy.tag` metadata entry recording the PyTorch module path it came from. `winml perf --module` matches against those tags, builds a separate ONNX for each match, and benchmarks them in isolation. See `winml inspect --hierarchy` to view the tree for an exported model, or [Load and export](load-and-export.md) for how the metadata is written.
+
+## See also
+
+- [Load and export](load-and-export.md) — how the module-tree metadata that `--module` targets gets written
+- [Eval and datasets](eval-and-datasets.md) — accuracy measurement to pair with performance numbers
+- [perf command reference](../commands/perf.md)
diff --git a/docs/concepts/primitives-and-pipeline.md b/docs/concepts/primitives-and-pipeline.md
new file mode 100644
index 000000000..a14ffbaad
--- /dev/null
+++ b/docs/concepts/primitives-and-pipeline.md
@@ -0,0 +1,105 @@
+# Primitives and pipeline
+
+winml-cli exposes two ways to turn a Hugging Face model or ONNX file into a
+Windows ML-ready artifact. You can invoke each stage of the pipeline as an
+individual primitive command — `winml export`, `winml optimize`, `winml quantize`,
+`winml compile`, `winml perf`, `winml eval` — running one step at a time with
+full control over inputs and outputs. Alternatively, `winml build` wraps all of
+those stages into a single command driven by a `WinMLBuildConfig` JSON file.
+
+Understanding when to reach for a primitive versus the pipeline wrapper is the
+central workflow decision in winml-cli. Both paths produce the same artifacts;
+the difference is in repeatability, convenience, and how much you need to inspect
+or vary individual stages.
+
+## The primitive commands
+
+Each primitive command corresponds to one stage of the pipeline described in
+[How winml-cli works](how-it-works.md). They run in order, each producing an ONNX
+file that the next stage consumes:
+
+- **`winml export`** — loads a Hugging Face model, traces it with PyTorch and the
+  Optimum exporter, and writes a portable float32 ONNX file with no EP-specific
+  nodes.
+- **`winml optimize`** — applies graph transformations (operator fusion, constant
+  folding, graph pruning) and runs an autoconf loop to maximize EP-compatible
+  coverage.
+- **`winml quantize`** — inserts QDQ nodes using calibration data, reducing weight
+  and activation types to lower precision (for example, int8) for efficient
+  inference.
+- **`winml compile`** — invokes an EP-specific compiler (for example, QNN for NPU
+  targets) to embed a pre-compiled binary cache in the ONNX graph as an EPContext
+  node.
+- **`winml perf`** — benchmarks latency and throughput against a Windows ML
+  session; does not modify the model.
+- **`winml eval`** — evaluates task-specific accuracy on a dataset; does not
+  modify the model.
+
+You can enter the pipeline at any stage. If you already have an optimized ONNX
+file, pass it directly to `winml quantize` without re-exporting. Each command
+writes its output to a path you specify, so all intermediate artifacts are
+preserved for inspection.
+
+## The pipeline wrapper
+
+`winml build` orchestrates all of the above stages in order from a single
+`WinMLBuildConfig` JSON file:
+
+```bash
+winml build -c config.json -m microsoft/resnet-50 -o output/
+```
+
+The config file tells `winml build` which stages to run and how to configure them.
+Setting the `quant` or `compile` section to `null` in the JSON skips that stage;
+passing `--no-quant`, `--no-compile`, or `--no-optimize` on the command line
+achieves the same effect at runtime without editing the file.
+
+When the model argument points to an existing ONNX file instead of a Hugging Face
+ID, `winml build` detects this and skips the export stage, running
+optimize → quantize → compile directly. This mirrors how each primitive command
+handles the same case.
+
+`winml build` also accepts `--use-cache` in place of `-o`/`--output-dir`, routing
+artifacts to the winml-cli global cache at `~/.cache/winml/` instead of a local
+directory. Use `--rebuild` to force a clean re-run even when cached artifacts
+already exist.
+
+## When to choose which
+
+**Use primitive commands when:**
+
+- You are learning the pipeline and want to observe each stage's output in
+  isolation.
+- You are debugging a specific stage — for example, inspecting the optimized graph
+  before quantization, or testing a quantized model before compiling it.
+- You need a one-off variation that does not warrant a versioned config, such as
+  trying a different opset or a different calibration sample count.
+- You are integrating winml-cli output into a larger script that already manages
+  intermediate files.
+
+**Use `winml build` when:**
+
+- You are targeting production or CI: a single config file captures the full
+  pipeline reproducibly and can be committed alongside the code that uses the
+  model.
+- You want to share the exact build recipe with a teammate or reproduce it later
+  without reconstructing the sequence of primitive flags.
+- You need the autoconf loop to propagate optimization decisions across stages,
+  which only `winml build` coordinates end-to-end.
+- You want stage-skipping to be declarative (`quant: null` in the config) rather
+  than remembered flag-by-flag across invocations.
+
+The two approaches are not exclusive. A common pattern is to prototype with
+primitives — iterating on `winml optimize` and `winml quantize` individually to
+tune fusion flags and calibration — and then encode the final settings into a
+`WinMLBuildConfig` for repeatable production builds via `winml build`.
+
+## See also
+
+- [How winml-cli works](how-it-works.md) — pipeline stage order and internal
+  architecture
+- [Config and build](config-and-build.md) — generating and versioning a
+  `WinMLBuildConfig`
+- [winml build command reference](../commands/build.md)
+- [ConvNeXT primitives sample](../samples/convnext-primitives.md) — worked example
+  using primitive commands end-to-end
diff --git a/docs/concepts/quantization.md b/docs/concepts/quantization.md
new file mode 100644
index 000000000..de92e7702
--- /dev/null
+++ b/docs/concepts/quantization.md
@@ -0,0 +1,65 @@
+# Datatype and Quantization
+
+Every ONNX tensor carries data in a specific numeric type — `float32`, `float16`, `int8`, `int16` — and every winml-cli pipeline makes deliberate choices about which type to use where. This page covers both halves of that decision: the **datatype family** winml-cli understands, and the **quantization** workflow that converts a model from one datatype to another to shrink it and run it faster on integer-native hardware.
+
+Quantization is the headline use of datatypes in winml-cli. By replacing `float32` weights and activations with `int8` or mixed precisions, you typically get a 2–4× smaller model artifact and a 2–8× latency speedup on NPU hardware. The trade-off is a potential reduction in model accuracy, the degree of which depends on the precision chosen and the sensitivity of the model.
+
+## Datatypes
+
+winml-cli exposes a precision shorthand on the `--precision` flag that encodes the weight/activation dtype pair as a single string. The table below lists every precision from `_KNOWN_PRECISIONS` in `_options.py`, together with the resolved quantization types from `config/precision.py`. Float precisions (`fp32`, `fp16`) carry no quantization types because weights and activations remain in floating point throughout.
+
+| Precision | Weight dtype | Activation dtype | Notes |
+|-----------|-------------|-----------------|-------|
+| `auto` | device-dependent | device-dependent | Resolves to `int8` (NPU), `fp16` (GPU/CPU) at runtime |
+| `fp32` | float32 | float32 | No quantization; baseline accuracy |
+| `fp16` | float16 | float16 | Half-precision float; no QDQ nodes inserted |
+| `int8` | uint8 | uint8 | Static quantization; default for NPU via QNN EP |
+| `int16` | int16 | uint16 | Higher-accuracy quantization; larger model than int8 |
+| `w8a8` | uint8 | uint8 | Equivalent to `int8`; explicit mixed-precision notation |
+| `w8a16` | uint8 | uint16 | Mixed: compact weights, wider activations for accuracy |
+| `w4a16` | n/a | n/a | **Planned — not yet supported.** Recognized as a precision string but raises an error at quantization time; no 4-bit weight dtype mapping exists in `precision.py` yet. |
+
+The `--weight-type` and `--activation-type` flags on `winml quantize` accept `uint8`, `int8`, `uint16`, or `int16` and override whatever the `--precision` shorthand would have resolved. This is useful when you need an unsigned weight type for QNN compatibility but a signed activation type for a specific operator constraint. See [Weight and Activation](weight-and-activation.md) for why the two need separate flags in the first place.
+
+## How quantization works in winml-cli
+
+winml-cli applies quantization by inserting **QDQ** (Quantize/Dequantize) nodes into the ONNX graph. The resulting file is a standard ONNX model that any ONNX Runtime execution provider can consume and optimize for its target hardware — the EP reads the QDQ pattern and fuses adjacent operations into true integer kernels.
+
+### Calibration
+
+Static quantization — the kind winml-cli applies — requires a calibration pass before inserting QDQ nodes. During calibration, a small set of representative inputs runs through the original floating-point model so that winml-cli can observe the actual range of values each tensor takes at runtime. Those observed ranges are then used to choose the scale and zero-point constants baked into the QDQ nodes.
+
+The `--samples` flag controls how many calibration inputs are used (default: `10`). More samples generally produce better range estimates but take longer. The `--method` flag selects the algorithm used to summarize the observed ranges:
+
+- `minmax` (default) — uses the absolute minimum and maximum observed values. Fast and predictable; can be sensitive to outliers.
+- `entropy` — minimizes the KL-divergence between the original and quantized distribution. Often yields better accuracy on models with heavy-tailed activation distributions.
+- `percentile` — clips a small fraction of extreme values before computing the range. A practical middle ground when outliers are present but entropy calibration is slow.
+
+Example using entropy calibration with more samples:
+
+```bash
+winml quantize -m model.onnx --precision int8 --samples 128 --method entropy
+```
+
+### The QDQ pattern
+
+The QDQ pattern is the standard ONNX representation for static quantization. winml-cli wraps the inputs and outputs of quantizable operators with pairs of `QuantizeLinear` and `DequantizeLinear` nodes. At the graph level the model still operates in floating-point; the QDQ nodes encode the scale and zero-point metadata that a runtime needs to fuse adjacent operations into true integer kernels.
+
+When the model runs under ONNX Runtime, the execution provider — whether CPU, DirectML, or a dedicated NPU EP — reads those QDQ patterns and performs its own graph fusion. This means the EP is free to apply hardware-specific optimizations without winml-cli needing to know anything about the target device's internal ISA or operator library. The QDQ model produced by `winml quantize` is a single portable artifact that can be deployed to any EP that supports integer execution.
+
+## When quantization is lossy
+
+Not all precision choices carry equal accuracy risk:
+
+- `fp16` is usually lossless in practice. Rounding errors relative to `fp32` are small enough that most models show no measurable accuracy difference.
+- `int8` and `int16` are inherently lossy. Compressing a 32-bit float into 8 or 16 bits discards information, and the magnitude of accuracy degradation depends on how well the calibration data represents the deployment distribution.
+- Compound precisions like `w8a16` reduce the risk compared to full `int8` by preserving more precision in activations, but they are still lossy relative to `fp32`.
+
+Always validate accuracy after quantizing an integer-precision model. Run `winml eval` on a representative dataset and compare the metrics against the original floating-point baseline before shipping the quantized artifact.
+
+## See also
+
+- [Weight and Activation](weight-and-activation.md)
+- [EP and Device](eps-and-devices.md)
+- [quantize command reference](../commands/quantize.md)
+- [eval command reference](../commands/eval.md)
diff --git a/docs/concepts/weight-and-activation.md b/docs/concepts/weight-and-activation.md
new file mode 100644
index 000000000..5a3139912
--- /dev/null
+++ b/docs/concepts/weight-and-activation.md
@@ -0,0 +1,32 @@
+# Weight and Activation
+
+Every neural network model stores two kinds of numeric tensors that matter for deployment: **weights**, the static parameters baked in at training time, and **activations**, the intermediate values that flow through the graph at every inference call. Understanding the distinction is the key to reading winml-cli's precision flags, deciding when quantization is safe, and knowing why a model that runs fine on one execution provider may stall or degrade on another.
+
+## Weights are static
+
+Weights are the trained parameters of the model: convolution kernels, linear projection matrices, attention weights, embedding tables, bias vectors. They are fixed at the moment the model is exported and stay constant for every inference call. Because they are static, their quantization parameters — the scale and zero-point used to compress them from fp32 to int8 — can be computed once, offline, using calibration data. `winml quantize` does exactly that: it observes the weight distributions in your exported ONNX and bakes the per-tensor scale/zero-point into the QDQ nodes that wrap the weights.
+
+In ONNX terms, weights are stored as **initializers** inside the graph. The runtime treats them as graph inputs that are always pre-supplied; you do not pass weights to a session at inference time, the way you pass an image tensor or a text prompt.
+
+## Activations are dynamic
+
+Activations are the intermediate results that flow through the graph during inference: the output of every matrix multiply, every layer norm, every attention softmax. Unlike weights, activations are regenerated on every forward pass and depend entirely on the input data. winml-cli cannot pre-compute their quantization parameters offline — instead, calibration runs a small set of representative inputs through the model and observes the actual ranges each activation tensor takes. Those observed ranges become the scale/zero-point baked into QDQ nodes around each activation.
+
+This is why calibration data matters. If the calibration set fails to represent the inputs you will see in production, the per-activation ranges will be wrong and the quantized model will lose more accuracy than necessary on real traffic.
+
+## Why they need separate flags
+
+The `--weight-type` and `--activation-type` flags on `winml quantize` exist because the optimal bit-width for weights is not necessarily the optimal bit-width for activations:
+
+- **Wider activation types** (int16 vs int8) reduce accuracy loss at the cost of more memory bandwidth. Useful when activations have heavy-tailed distributions that quantize poorly at 8 bits.
+- **Narrower weight types** compress the static footprint more aggressively. Useful when the model is memory-bound and accuracy headroom exists.
+- **Execution providers diverge** along this boundary too. QNN on NPU pairs uint8 weights with uint8 or uint16 activations. DirectML on GPU can run float16 throughout. The CPU EP accepts almost any combination.
+
+The compound precision shorthand `w8a16` (8-bit weights, 16-bit activations) reflects this asymmetry directly: weights and activations get different bit-widths in one config string. For the full precision family and how each maps to weight/activation dtypes, see [Datatype and Quantization](quantization.md).
+
+## See also
+
+- [Datatype and Quantization](quantization.md)
+- [EP and Device](eps-and-devices.md)
+- [quantize command](../commands/quantize.md)
+- [Graph and IR](graphs-and-ir.md)
diff --git a/docs/contributing.md b/docs/contributing.md
new file mode 100644
index 000000000..dba7a8254
--- /dev/null
+++ b/docs/contributing.md
@@ -0,0 +1,4 @@
+# Contributing
+
+!!! note "Coming soon"
+    This page is part of the documentation MVP and will be authored shortly.
diff --git a/docs/getting-started/end-to-end.md b/docs/getting-started/end-to-end.md
new file mode 100644
index 000000000..9c50f1f26
--- /dev/null
+++ b/docs/getting-started/end-to-end.md
@@ -0,0 +1,209 @@
+# End-to-End Tour
+
+This page walks the full winml-cli pipeline using `--device auto`. The CLI
+resolves to the best available device on your machine — NPU first, then GPU,
+then CPU — so the four commands below are identical regardless of whether you
+have a Copilot+ PC with a Qualcomm NPU, a DirectML-capable GPU, or a plain
+laptop with no accelerator at all. You do not need to think about device flags
+after Step 0.
+
+The vehicle for this tour is `facebook/convnext-tiny-224`, a compact image
+classifier whose operator mix exercises every stage of the pipeline: export,
+optimize, quantize, and compile. Estimated time is 15–25 minutes, most of
+which is the Hugging Face model download and the compile stage. At the end you
+will have a compiled ONNX artifact targeted at your hardware and a real latency
+reading from that device.
+
+## Prerequisites
+
+- Windows 11 24H2 (required for NPU; earlier versions work for CPU/GPU)
+- winml-cli installed (see [Installation](installation.md))
+
+!!! note "NPU users only"
+    To target the Qualcomm NPU you also need:
+
+    - A Qualcomm Snapdragon X device
+    - QAIRT SDK installed; `QNN_SDK_ROOT` env var pointing at it
+    - `--extra qnn` installed (Python 3.11+)
+
+    Everything else on this page works without these.
+
+## Step 0: See what your machine has
+
+```bash
+uv run winml sys --list-device --list-ep
+```
+
+This lists every hardware device detected and the execution providers (EPs)
+that can target each one. When you pass `--device auto` in the steps below,
+winml-cli resolves that to the highest-priority device shown here: NPU first,
+then GPU, then CPU.
+
+=== "Copilot+ PC (NPU available)"
+
+    ```text
+    Available Devices (priority order)
+      #1  NPU   Qualcomm(R) AI Accelerator
+                 Driver: 31.0.0.6978 | Manufacturer: Qualcomm Technologies, Inc.
+      #2  GPU   NVIDIA GeForce RTX 4060 Laptop GPU
+                 Driver: 31.0.15.5107 | Manufacturer: NVIDIA
+      #3  CPU   Snapdragon X Elite - X1E-80-100 - Oryon
+                 Cores: 12 | Threads: 12 | Architecture: ARM64
+
+    Available Execution Providers
+      QNNExecutionProvider              -> NPU
+      DmlExecutionProvider              -> GPU
+      CPUExecutionProvider              -> CPU
+    ```
+
+=== "Regular Windows laptop (no NPU)"
+
+    ```text
+    Available Devices (priority order)
+      #1  GPU   Intel(R) Iris(R) Xe Graphics
+                 Driver: 31.0.101.5382 | Manufacturer: Intel Corporation
+      #2  CPU   12th Gen Intel(R) Core(TM) i7-1260P
+                 Cores: 12 | Threads: 16 | Architecture: x86_64
+
+    Available Execution Providers
+      DmlExecutionProvider              -> GPU
+      CPUExecutionProvider              -> CPU
+    ```
+
+## Step 1: Generate the build config
+
+```bash
+uv run winml config -m facebook/convnext-tiny-224 --device auto -o convnext_config.json
+```
+
+`winml config` queries Hugging Face, auto-detects the task and model type, and
+produces a `WinMLBuildConfig` JSON. Passing `--device auto` tells the config
+generator to resolve the target device at generation time — it inspects your
+hardware and writes the winning device (NPU, GPU, or CPU) together with
+matching precision and compile settings into `convnext_config.json`. You can
+open the file to see exactly what was picked before committing to a full build.
+
+For a field-by-field explanation of every section in the generated JSON and how
+the `quant` and `compile` blocks interact, see
+[Config and build](../concepts/config-and-build.md).
+
+## Step 2: Run the build
+
+```bash
+uv run winml build -c convnext_config.json -m facebook/convnext-tiny-224 -o convnext_out/
+```
+
+This single command runs all four pipeline stages in sequence — export,
+optimize, quantize, and compile — reading the device and precision settings
+recorded in `convnext_config.json`. The compile stage targets whichever device
+the config captured: it calls the QNN backend and embeds a pre-compiled Hexagon
+binary on NPU, or it compiles a DirectML graph on GPU, or it produces a
+standard optimized ONNX for CPU. All intermediate artifacts land in
+`convnext_out/`, so you can inspect or reuse any stage independently.
+
+You can also pass `--no-quant` or `--no-compile` to stop the pipeline early,
+or `--rebuild` to force re-running even when cached artifacts exist. For a
+deeper look at how each stage works, see
+[Concepts → How winml-cli works](../concepts/how-it-works.md) and
+[Config and Build](../concepts/config-and-build.md).
+
+!!! warning "NPU users"
+    `winml build` reads `QNN_SDK_ROOT` from the environment. Make sure it
+    points at your QAIRT SDK before this step, or the compile stage will fail
+    with *"QAIRT SDK path not found"*.
+
+## Step 3: Benchmark on your device
+
+```bash
+uv run winml perf -m convnext_out/<artifact>.onnx --device auto --iterations 50 --monitor
+```
+
+Replace `<artifact>` with the filename written to `convnext_out/` by the build.
+The name reflects the device the build targeted — for example,
+`convnext_tiny_qnn_ctx.onnx` on NPU, `convnext_tiny_dml_ctx.onnx` on
+DirectML, or `convnext_tiny.onnx` on CPU. You can check the directory listing
+or read the compiled artifact path from the build output to get the exact name.
+
+=== "NPU (QNN)"
+
+    ```text
+    Device:      npu
+    Precision:   auto
+    Task:        image-classification
+    Iterations:  50 (+ 10 warmup)
+    Batch Size:  1
+
+    Latency (ms)
+      Avg    P50    P90    P95    P99    Min    Max    Std
+     3.87   3.82   4.21   4.38   4.71   3.51   5.04   0.21
+
+    Throughput: 258.14 samples/sec
+
+    Results saved to: convnext_tiny_qnn_ctx_perf.json
+    ```
+
+=== "GPU (DirectML)"
+
+    ```text
+    Device:      gpu
+    Precision:   auto
+    Task:        image-classification
+    Iterations:  50 (+ 10 warmup)
+    Batch Size:  1
+
+    Latency (ms)
+      Avg    P50    P90    P95    P99    Min    Max    Std
+    12.43  12.18  13.74  14.11  15.02  11.27  16.55   0.89
+
+    Throughput: 80.45 samples/sec
+
+    Results saved to: convnext_tiny_dml_ctx_perf.json
+    ```
+
+=== "CPU"
+
+    ```text
+    Device:      cpu
+    Precision:   auto
+    Task:        image-classification
+    Iterations:  50 (+ 10 warmup)
+    Batch Size:  1
+
+    Latency (ms)
+      Avg    P50    P90    P95    P99    Min    Max    Std
+    48.31  47.85  52.14  53.77  57.40  44.62  61.23   2.94
+
+    Throughput: 20.70 samples/sec
+
+    Results saved to: convnext_tiny_perf.json
+    ```
+
+The `--monitor` flag opens a live chart of device utilization while the
+benchmark runs — most meaningful on NPU or GPU where it confirms the workload
+actually hit the accelerator rather than falling back to CPU. After the run
+finishes, a JSON file named `{model_slug}_perf.json` is written to the current
+directory; you can load it programmatically to compare results across runs or
+across machines.
+
+## Cross-device comparison
+
+Each artifact produced by `winml build` is compiled for the specific device
+recorded in the config — a QNN EPContext binary will not execute on DirectML,
+and vice versa. If you want to measure NPU vs. GPU vs. CPU latency on the same
+model and the same machine you need to generate a separate config and artifact
+for each EP. The
+[ConvNeXt — Primitives Walkthrough](../samples/convnext-primitives.md) sample
+does exactly that: it builds a separate compiled artifact for each execution
+provider and benchmarks them side by side so you can compare the numbers
+directly.
+
+## Where to go next
+
+- [ConvNeXt on NPU tutorial](../tutorials/npu-convnext.md) — full primitives
+  walkthrough plus the `winml build` one-shot wrapper, going deeper than this
+  page on NPU-specific tuning
+- [ConvNeXt — Primitives Walkthrough sample](../samples/convnext-primitives.md)
+  — CPU/GPU/NPU comparison on the same model built with explicit per-device
+  configs
+- [Concepts → How winml-cli works](../concepts/how-it-works.md) — what each
+  stage of the build pipeline does and how they chain together
diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md
new file mode 100644
index 000000000..07dd8aa9c
--- /dev/null
+++ b/docs/getting-started/installation.md
@@ -0,0 +1,88 @@
+# Installation
+
+winml-cli is a Python toolkit for converting and optimizing PyTorch models to ONNX format, targeting deployment on the [Windows ML](https://learn.microsoft.com/en-us/windows/ai/windows-ml/) runtime. It supports multiple hardware backends including QNN (Qualcomm NPU), OpenVINO (Intel CPU/GPU), DirectML, and ONNX Runtime. To get started you need a Windows machine, Python 3.10, and the `uv` package manager.
+
+## Prerequisites
+
+| Component | Details |
+|---|---|
+| Windows | Windows 11 24H2 or later (required for NPU support) |
+| Hardware | Copilot+PC with NPU (40+ TOPS recommended for NPU acceleration; CPU/DirectML works without an NPU) |
+| Python | 3.10 (the project pins `requires-python = ">=3.10,<3.11"`) |
+| Package manager | [`uv`](https://github.com/astral-sh/uv) |
+| Version control | `git` |
+
+!!! note "No NPU?"
+    You can follow most of these docs without NPU hardware. Most winml-cli commands (`build`, `compile`, `perf`, `analyze`) accept `--device auto` and fall back to CPU or DirectML automatically. `winml eval` accepts only `cpu|gpu|npu` (no `auto`), so pass `--device cpu` explicitly there. The end-to-end tutorial documents an explicit CPU fallback path.
+
+## Install
+
+```bash
+git clone https://github.com/microsoft/winml-cli.git
+cd winml-cli
+uv python install 3.10
+uv sync
+```
+
+Cloning the repository pulls down all source code and configuration. `uv python install 3.10` downloads and pins the exact Python version the project requires. `uv sync` creates an isolated virtual environment and installs all declared dependencies from `pyproject.toml` in a single step. No separate `pip install` or manual venv activation is needed.
+
+## Verify
+
+```bash
+uv run winml sys
+```
+
+Expected output (abbreviated):
+
+```text
+╭──────────────────────────────────╮
+│   winml-cli System Information    │
+╰──────────────────────────────────╯
+
+Environment
+  Python Version    3.10.x
+  OS                Windows 11
+  Machine           AMD64
+
+ML Libraries
+  Library        Version   Status
+  torch          2.x.x     OK
+  onnx           1.x.x     OK
+
+Available Devices (priority order)
+  #1  NPU   ...
+  #2  GPU   ...
+  #3  CPU   ...
+
+Available Execution Providers
+  QNNExecutionProvider           -> NPU
+  DmlExecutionProvider           -> GPU
+  CPUExecutionProvider           -> CPU
+```
+
+This command enumerates available compute devices and execution providers on your machine. If an expected device or SDK is missing, `winml sys` is the right place to diagnose it. See [winml sys](../commands/sys.md) for the full flag reference and troubleshooting tips.
+
+## Optional extras
+
+Two optional dependency groups are available for hardware-specific backends:
+
+- `--extra openvino` — installs [OpenVINO](https://docs.openvino.ai/) for inference on Intel CPU and GPU targets.
+- `--extra qnn` — installs `onnxruntime-qnn` for Qualcomm NPU support. Note: the `onnxruntime-qnn` package requires Python 3.11 or later, so this extra will not install any packages under the project's default Python 3.10 environment. It is reserved for future use when the project broadens its Python version support.
+
+To install an extra:
+
+```bash
+uv sync --extra openvino
+```
+
+Both extras can be combined:
+
+```bash
+uv sync --extra openvino --extra qnn
+```
+
+## Next steps
+
+- **[Quickstart](quickstart.md)** — export your first model in 5 minutes.
+- **[End-to-End Tour](end-to-end.md)** — full pipeline targeting whatever hardware you have (NPU / GPU / CPU).
+- **[How winml-cli Works](../concepts/how-it-works.md)** — the mental model.
diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md
new file mode 100644
index 000000000..398863492
--- /dev/null
+++ b/docs/getting-started/quickstart.md
@@ -0,0 +1,71 @@
+# Quickstart
+
+This page proves your winml-cli install works end-to-end. You will export a
+Hugging Face image classifier to ONNX and then inspect the resulting artifact.
+No quantization, no execution-provider selection — just the two commands you
+need to confirm everything is wired up correctly. Estimated time: 5 minutes.
+
+## Verify the install
+
+Run the following command to enumerate available devices and execution providers
+on your machine:
+
+```bash
+uv run winml sys --list-device --list-ep
+```
+
+`--list-device` and `--list-ep` print only the hardware and EP inventory,
+skipping SDK versions and Python environment details that plain `winml sys`
+would include. If the command exits without error, your winml-cli install is
+ready. See [`winml sys`](../commands/sys.md) for the full flag reference.
+
+## Export your first model
+
+```bash
+uv run winml export -m microsoft/resnet-50 -o resnet50.onnx
+```
+
+!!! note "What just happened"
+    winml-cli downloaded the `microsoft/resnet-50` weights from Hugging Face,
+    ran the eight-step Hierarchy-preserving Tags Protocol (HTP) to trace the
+    PyTorch module tree, and wrote an ONNX file to `resnet50.onnx`. Each ONNX
+    node carries a `hierarchy_tag` metadata property recording its full PyTorch
+    ancestry, which downstream quantization and compilation steps use to reason
+    about the graph. See [`winml export`](../commands/export.md) for the full
+    flag reference.
+
+## Inspect the artifact
+
+```bash
+uv run winml inspect -m resnet50.onnx
+```
+
+```text
+╭─────────────────────────── microsoft/resnet-50 ───────────────────────────╮
+│ Task          image-classification                                         │
+│ Model Class   ResNetForImageClassification                                 │
+│ Exporter      OptimumExporter                                              │
+│ WinML Class   WinMLImageClassificationModel                                │
+│ Status        Supported                                                    │
+╰────────────────────────────────────────────────────────────────────────────╯
+```
+
+When you pass a local `.onnx` file, `winml inspect` reads the embedded model
+metadata directly. When you pass a Hugging Face model ID instead, it reads
+the model's `config.json` from the Hub without downloading weights. In both
+cases it resolves the loader, exporter, and WinML inference class that
+winml-cli will use for this architecture. See
+[`winml inspect`](../commands/inspect.md) for output-format and hierarchy
+options.
+
+## What's next
+
+- **[End-to-End walkthrough](end-to-end.md)** — full pipeline from Hugging Face to NPU.
+- **[How winml-cli Works](../concepts/how-it-works.md)** — understand what each command does under the hood.
+- **[ConvNeXt primitives sample](../samples/convnext-primitives.md)** — see every pipeline stage in detail with a representative model.
+
+## See also
+
+- [`winml export`](../commands/export.md)
+- [`winml inspect`](../commands/inspect.md)
+- [Load and export](../concepts/load-and-export.md)
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 000000000..885ebc057
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,31 @@
+# winml-cli
+
+winml-cli is a CLI toolkit to build portable, performant, and high-quality models for [Windows ML](https://learn.microsoft.com/en-us/windows/ai/windows-ml/).
+
+## What you can do
+
+- **Build once, run anywhere.** Compose your own workflow from primitive commands (`export`, `analyze`, `optimize`, `quantize`, `compile`), or hand a config to the built-in pipeline. Same portable ONNX, two complementary paths.
+- **Drill into the details.** Inspect operators, pinpoint compatibility errors, and trace performance bottlenecks at any stage of the pipeline.
+- **AI-ready.** Built-in agent skills work with mainstream coding agents — let the agent drive the pipeline for you.
+
+## What you get out of the box
+
+- **One toolkit, every EP.** All [supported execution providers](concepts/eps-and-devices.md#eps-winml-cli-supports) live behind the same commands.
+- **Repeatable and traceable.** Configs are deterministic; every pipeline run records inputs, outputs, and decisions at each stage.
+- **Quality gates built in.** The analyzer catches operator-compatibility issues before deployment and suggests fixes automatically.
+
+## Where to start
+
+- **[Installation](getting-started/installation.md)** — get the `winml` CLI running locally.
+- **[Quickstart](getting-started/quickstart.md)** — export a Hugging Face model in five minutes.
+- **[End-to-End Tour](getting-started/end-to-end.md)** — full pipeline targeting whatever hardware you have (NPU / GPU / CPU).
+
+## Learn the model
+
+- **[How winml-cli Works](concepts/how-it-works.md)** — the pipeline from a PyTorch model to an EP-compiled artifact.
+- **[Commands](commands/overview.md)** — reference for all 12 `winml` subcommands.
+- **[Samples](samples/convnext-primitives.md)** — end-to-end walkthroughs for ConvNeXt, BERT, and Qwen3.
+
+## License
+
+MIT. See [LICENSE](https://github.com/microsoft/winml-cli/blob/main/LICENSE.txt).
diff --git a/docs/reference/index.md b/docs/reference/index.md
new file mode 100644
index 000000000..dad8173fe
--- /dev/null
+++ b/docs/reference/index.md
@@ -0,0 +1,4 @@
+# Reference
+
+!!! note "Coming soon"
+    This page is part of the documentation MVP and will be authored shortly.
diff --git a/docs/samples/bert-config-build.md b/docs/samples/bert-config-build.md
new file mode 100644
index 000000000..62e57d7b8
--- /dev/null
+++ b/docs/samples/bert-config-build.md
@@ -0,0 +1,140 @@
+# BERT — Config + Build + Perf
+
+BERT (`bert-base-uncased`) is a canonical text model that exercises every stage of the winml-cli pipeline: it has multiple input tensors, benefits from graph fusion (GeLU, LayerNorm, MatMul+Add), and produces quantizable activations that run well on NPU. That combination makes it a useful reference point for teams deploying transformer encoders on Windows.
+
+This sample walks through the production-style workflow: generate a reusable `WinMLBuildConfig` JSON file with `winml config`, run the full export → optimize → quantize → compile pipeline in one shot with `winml build`, and measure the result with `winml perf`. If you want to understand each pipeline stage individually before running the all-in-one command, read the [ConvNeXt primitives sample](convnext-primitives.md) first.
+
+## Prerequisites
+
+- winml-cli installed and `winml` on your PATH.
+- A network connection to download `bert-base-uncased` weights from HuggingFace on first run.
+- A target device (NPU or GPU recommended; CPU also works).
+
+## Step 1: Generate a build config
+
+```bash
+winml config -m bert-base-uncased -t text-classification -o bert_config.json
+```
+
+This writes a `WinMLBuildConfig` JSON file to `bert_config.json`. The file captures every pipeline setting in a single artifact that you can version-control and share. A representative excerpt looks like this:
+
+```json
+{
+  "loader": {
+    "task": "text-classification",
+    "model_type": "bert"
+  },
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1
+  },
+  "optim": {
+    "gelu_fusion": true,
+    "layer_norm_fusion": true,
+    "matmul_add_fusion": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "weight_type": "uint8",
+    "activation_type": "uint8",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "task": "text-classification",
+    "model_name": "bert-base-uncased"
+    ... // truncated: per_channel, symmetric, distribution, ...
+  },
+  "compile": {
+    "execution_provider": "qnn",
+    "enable_ep_context": true,
+    "compiler": "ort"
+    ... // truncated: provider_options, embed_context, validate, ...
+  }
+}
+```
+
+!!! note
+    The five top-level keys — `loader`, `export`, `optim`, `quant`, and `compile` — map directly to the five pipeline stages. Setting `quant` or `compile` to `null` skips that stage entirely. See [Config and build](../concepts/config-and-build.md) for a field-by-field description of every option.
+
+## Step 2: Run the build
+
+```bash
+winml build -c bert_config.json -m bert-base-uncased --output-dir bert_out/
+```
+
+winml-cli reads the config, downloads the model weights once, and runs the pipeline in sequence. Terminal output shows each stage as it completes:
+
+```text
+winml build
+  Config:     bert_config.json
+  Model:      bert-base-uncased
+  Output:     bert_out/
+
+  export       done  (42.1s)
+  optimize     done  (6.3s)
+  quantize     done  (18.7s)
+  compile      done  (21.4s)
+
+  Build complete in 88.5s
+  Final artifact: bert_out/bert-base-uncased_ctx.onnx
+```
+
+!!! note
+    After the optimize stage, winml-cli runs an analyzer loop that inspects the graph for nodes the target EP cannot dispatch natively and re-runs optimization with adjusted fusion flags. The loop repeats up to `--max-optim-iterations` times (default: 3). Pass `--no-optimize` to skip this stage entirely when starting from a pre-optimized ONNX file. See [How winml-cli Works](../concepts/how-it-works.md) for a full description of the autoconf loop.
+
+## Step 3: Benchmark
+
+```bash
+winml perf -m bert_out/bert-base-uncased_ctx.onnx --iterations 50
+```
+
+After a short warm-up, `winml perf` reports latency percentiles and throughput:
+
+```text
+Device:      npu
+Task:        text-classification
+Iterations:  50 (+ 10 warmup)
+Batch Size:  1
+
+Latency (ms)
+  Avg    P50    P90    P95    P99    Min    Max    Std
+ 4.83   4.79   5.12   5.31   5.68   4.51   6.04   0.21
+
+Throughput: 206.99 samples/sec
+
+Results saved to: bert-base-uncased_ctx_perf.json
+```
+
+## Customizing the config
+
+The JSON file is plain text and can be edited before running `winml build`. Two common adjustments:
+
+**Change precision.** To target fp16 instead of the default uint8 QDQ quantization, regenerate the config with an explicit precision flag:
+
+```bash
+winml config -m bert-base-uncased -t text-classification --precision fp16 -o bert_config.json
+```
+
+Alternatively, edit `bert_config.json` directly: set `quant.weight_type` and `quant.activation_type` to `"int8"` or `"uint16"`, or set `quant` to `null` to skip quantization entirely.
+
+**Disable a stage at build time.** You can suppress a stage for a single run without touching the config file using the `--no-quant` or `--no-compile` flags:
+
+```bash
+winml build -c bert_config.json -m bert-base-uncased --output-dir bert_out/ --no-quant
+```
+
+This is useful for measuring the fp32 baseline before committing to a quantized build. The `quant` section in `bert_config.json` is unchanged; the flag only affects this invocation. See [Config and build](../concepts/config-and-build.md) for the full list of configurable fields.
+
+## What you learned
+
+- `winml config` generates a complete, version-controllable `WinMLBuildConfig` JSON from a HuggingFace model ID in one command.
+- `winml build` orchestrates the full export → optimize → quantize → compile pipeline from a single config file and model ID.
+- The autoconf loop inside the optimize stage adjusts graph fusion flags automatically to maximize EP compatibility.
+- JSON fields (`quant`, `compile`) and CLI flags (`--no-quant`, `--no-compile`) are interchangeable ways to skip stages; CLI flags win for one-off experiments without modifying the file.
+- `winml perf` gives a latency and throughput baseline on the built artifact in seconds.
+
+## See also
+
+- [winml config](../commands/config.md)
+- [winml build](../commands/build.md)
+- [winml perf](../commands/perf.md)
+- [Config and build](../concepts/config-and-build.md)
diff --git a/docs/samples/convnext-primitives.md b/docs/samples/convnext-primitives.md
new file mode 100644
index 000000000..5e0576703
--- /dev/null
+++ b/docs/samples/convnext-primitives.md
@@ -0,0 +1,176 @@
+# ConvNeXt — Primitives Walkthrough
+
+!!! info "Pick the right ConvNeXt page"
+    - **This sample** — primitives on CPU, GPU (DirectML), and NPU (QNN) side-by-side. Best when you want to compare devices.
+    - **[ConvNeXt on NPU](../tutorials/npu-convnext.md)** — the canonical NPU production tutorial with both QNN and OpenVINO, plus the `winml build` one-shot.
+    - **[End-to-End Tour](../getting-started/end-to-end.md)** — short Getting Started tour.
+
+ConvNeXt Tiny is a compact convolutional image classifier trained on ImageNet-1k. At roughly 28 million parameters it is small enough to export and quantize in minutes on a developer laptop, yet representative enough that the latency and accuracy numbers you observe reflect real-world deployment trade-offs. Its straightforward architecture — no attention mechanisms, no dynamic control flow — makes it an ideal first model for learning the winml-cli pipeline.
+
+This walkthrough drives the full pipeline using the primitive commands directly: `winml inspect`, `winml config`, `winml export`, `winml quantize`, `winml compile`, `winml perf`, and `winml eval`. Running the steps individually rather than through `winml build` exposes what each command does and how its output feeds the next stage. The walkthrough covers three execution providers: CPU, GPU (DirectML), and NPU (Qualcomm QNN).
+
+## Prerequisites
+
+- winml-cli installed and `winml` available on your PATH — see [Installation](../getting-started/installation.md).
+- Internet access so HuggingFace Hub can download the model weights on first run.
+- Optional: QNN SDK installed on a Snapdragon Copilot+ PC for the NPU section.
+
+## Step 1: Inspect the model
+
+Before touching weights, confirm that winml-cli recognises the model and knows which task, loader class, and exporter to use.
+
+```bash
+winml inspect -m facebook/convnext-tiny-224
+```
+
+```text
+╭─────────────────────────── facebook/convnext-tiny-224 ────────────────────────╮
+│ Task          image-classification                                              │
+│ Model Class   ConvNextForImageClassification                                   │
+│ Exporter      OptimumExporter                                                  │
+│ WinML Class   WinMLImageClassificationModel                                    │
+│ Status        Supported                                                        │
+╰────────────────────────────────────────────────────────────────────────────────╯
+```
+
+!!! note "What we just did"
+    `winml inspect` fetched only the model's `config.json` from HuggingFace Hub — no weights — and confirmed that `facebook/convnext-tiny-224` maps to a supported task (`image-classification`), a known model class, and a compatible ONNX exporter.
+
+## Step 2: Generate a config (optional)
+
+```bash
+winml config -m facebook/convnext-tiny-224 -o convnext_config.json
+```
+
+Generating a config file is optional when running the primitives individually, but it is good practice: the JSON captures the auto-detected loader, export, quantization, and compile settings in one reproducible artifact. You can check it into source control, diff it against future versions of the model, or hand-edit individual fields before passing it to `winml build`. For a full description of every field, see [Config and build](../concepts/config-and-build.md).
+
+## Step 3: Export to ONNX
+
+Download the model weights and convert the PyTorch graph to a portable ONNX file.
+
+```bash
+winml export -m facebook/convnext-tiny-224 -o convnext.onnx
+```
+
+```text
+Model: facebook/convnext-tiny-224
+Output: convnext.onnx
+
+Starting HTP export...
+  Detected task: image-classification
+
+Success! Model exported to: convnext.onnx
+```
+
+!!! note "Hierarchy metadata"
+    By default `winml export` embeds `hierarchy_tag` metadata in each ONNX node, recording which PyTorch module the node originated from. This lets downstream tools like `winml perf --module` and `winml analyze` reason about operator groups rather than flat graph positions. To skip the metadata and produce a clean ONNX file, add `--clean-onnx`. For more detail see [Load and export](../concepts/load-and-export.md).
+
+## Step 4: Quantize
+
+Insert QDQ (Quantize/Dequantize) nodes using 32 calibration samples drawn from the task-default dataset.
+
+```bash
+winml quantize -m convnext.onnx -o convnext_int8.onnx --precision int8 --samples 32
+```
+
+```text
+Calibrating: 32 samples [minmax]
+Inserting QDQ nodes...
+Saved: convnext_int8.onnx
+```
+
+!!! note "Calibration"
+    Static quantization needs representative inputs to estimate each tensor's value range before baking scale and zero-point constants into the QDQ nodes. The `--samples` flag controls how many calibration inputs are used; 32 is a reasonable starting point for vision classifiers. If you see accuracy regression after quantization, try increasing `--samples` or switching to `--method entropy`. See [Quantization & QDQ](../concepts/quantization.md) for the full trade-off discussion.
+
+## Step 5: Compile for each EP
+
+Compilation pre-bakes an EP-specific binary cache into the ONNX graph so the runtime can skip per-session JIT compilation.
+
+=== "CPU"
+
+    ```bash
+    winml compile -m convnext_int8.onnx --output-dir . --device cpu
+    ```
+
+=== "GPU"
+
+    ```bash
+    winml compile -m convnext_int8.onnx --output-dir . --device gpu
+    ```
+
+=== "NPU"
+
+    ```bash
+    winml compile -m convnext_int8.onnx --output-dir . --device npu --qnn-sdk-root <path-to-qnn-sdk>
+    ```
+
+!!! note "NPU requires the QNN SDK"
+    Compilation for `--device npu` invokes the Qualcomm QNN offline compiler, which must be installed separately. Pass `--qnn-sdk-root` pointing at the root of your QAIRT SDK installation, or set the `QNN_SDK_ROOT` environment variable to the same path. If the SDK is absent, compile for CPU or GPU instead. For a full explanation of how EPs relate to device targets see [ONNX & Execution Providers](../concepts/eps-and-devices.md).
+
+Each invocation writes a compiled ONNX file to the output directory: `convnext_int8_cpu_ctx.onnx` for CPU, `convnext_int8_dml_ctx.onnx` for GPU (DirectML), and `convnext_int8_qnn_ctx.onnx` for NPU (QNN). The GPU and NPU variants contain an EPContext node that embeds the pre-compiled binary.
+
+## Step 6: Benchmark
+
+Measure latency and throughput on each device. Pass the compiled ONNX directly so the benchmark uses the pre-compiled artifact.
+
+```bash
+winml perf -m convnext_int8_cpu_ctx.onnx --device cpu --iterations 200
+```
+
+```text
+Device:      cpu
+Precision:   auto
+Task:        image-classification
+Iterations:  200 (+ 10 warmup)
+Batch Size:  1
+
+Latency (ms)
+  Avg    P50    P90    P95    P99    Min    Max    Std
+ 8.41   8.35   9.02   9.31  10.14   7.88  12.63   0.48
+
+Throughput: 118.91 samples/sec
+```
+
+```bash
+winml perf -m convnext_int8_dml_ctx.onnx --device gpu --iterations 200
+winml perf -m convnext_int8_qnn_ctx.onnx --device npu --iterations 200
+```
+
+The NPU variant typically delivers the lowest latency and highest power efficiency on Qualcomm Snapdragon hardware. Use the JSON output written by `--output` to compare runs programmatically.
+
+## Step 7: Evaluate
+
+Measure top-1 accuracy on 100 samples from the ImageNet-1k validation split. When passing an ONNX file, supply `--model-id` so the command knows which preprocessor and label vocabulary to use.
+
+```bash
+winml eval -m convnext_int8.onnx --model-id facebook/convnext-tiny-224 \
+    --dataset imagenet-1k --split validation --samples 100 --device cpu
+```
+
+```text
+Task:     image-classification
+Dataset:  imagenet-1k (validation, 100 samples)
+Device:   cpu
+
+Accuracy: 81.00%
+
+Results saved to: convnext_int8_eval.json
+```
+
+Note that `--device` accepts only `cpu`, `gpu`, or `npu` — it does not accept `auto`. To compare quantized accuracy against the floating-point baseline, run the same command with `convnext.onnx` and compare the two JSON outputs.
+
+## What you learned
+
+- `winml inspect` checks task detection and exporter compatibility from the model's `config.json` alone — no weight download needed.
+- `winml config` captures the full pipeline configuration as a reproducible JSON file.
+- `winml export` converts the PyTorch model to a portable ONNX graph and embeds hierarchy metadata for downstream analysis.
+- `winml quantize` inserts QDQ nodes using calibration data; `--precision int8` and `--samples` control the precision and calibration budget.
+- `winml compile` pre-bakes an EP-specific binary cache per device; the same quantized ONNX feeds all three targets.
+- `winml perf` and `winml eval` consume the final artifact without modifying it — benchmark first, then validate accuracy before shipping.
+
+## See also
+
+- [BERT — Config + Build + Perf](bert-config-build.md) — the same pipeline driven through `winml build` with a config file
+- [How winml-cli Works](../concepts/how-it-works.md) — pipeline overview and stage descriptions
+- [Quantization & QDQ](../concepts/quantization.md) — calibration methods and accuracy trade-offs
+- [ONNX & Execution Providers](../concepts/eps-and-devices.md) — EP selection, device flags, and QNN SDK setup
diff --git a/docs/samples/qwen3-composite.md b/docs/samples/qwen3-composite.md
new file mode 100644
index 000000000..5875a2e7e
--- /dev/null
+++ b/docs/samples/qwen3-composite.md
@@ -0,0 +1,27 @@
+# Qwen3 — Composite Models
+
+!!! info "Coming soon"
+    Composite-model support — running models with multiple components like a text encoder + decoder, or a vision encoder + LLM, through a single winml-cli pipeline — is on an in-progress feature branch. This page will be authored once that work merges.
+
+## What composite models are
+
+A composite model is a system made up of two or more distinct ONNX sub-models that work together as a single inference pipeline. A common example is a vision-language model like Qwen3-VL, where a vision encoder processes an image and feeds its output into a separate language model decoder. Another pattern is an encoder-decoder pair — two ONNX files that share a tokenizer configuration and must be executed in sequence at runtime. Multi-stage pipelines generalize this further: the output tensor of one sub-model becomes the input tensor of the next, with each stage potentially targeting a different execution provider or precision. Composite models add coordination complexity beyond what a single ONNX graph requires, so they call for first-class support in the build and inference tooling rather than ad hoc stitching.
+
+## What Qwen3 will demonstrate
+
+The following is a forward-looking sketch of what this sample will cover once the composite-model feature branch lands:
+
+- How to declare a composite model in a `BuildConfig` — specifying multiple sub-models, their connection points, and a shared tokenizer configuration.
+- How `winml build` orchestrates export and compilation of each sub-model independently, then assembles the composite pipeline.
+- How to run end-to-end inference across the composite pipeline using a single `winml` invocation.
+- How to benchmark each sub-model's latency independently with `winml perf` to identify bottlenecks.
+- This section is a sketch and will be revised once the implementation lands; details may change.
+
+## Track progress
+
+Follow development and check current status at https://github.com/microsoft/winml-cli.
+
+## See also
+
+- [BERT — Config + Build + Perf](../samples/bert-config-build.md)
+- [Config and build](../concepts/config-and-build.md)
diff --git a/docs/superpowers/2026-05-26-v3-known-issues.md b/docs/superpowers/2026-05-26-v3-known-issues.md
new file mode 100644
index 000000000..96a175178
--- /dev/null
+++ b/docs/superpowers/2026-05-26-v3-known-issues.md
@@ -0,0 +1,102 @@
+# winml-cli docs v3 — Known issues
+
+> **Date:** 2026-05-26
+> **Branch:** `docs/v3` (squashed as `gim-doc` tag)
+> **Status:** Fact-checked findings from a 3-agent critical review pass. Each issue verified against the actual source/files.
+
+Issues identified after the v3 doc set was assembled, fact-checked against `src/winml/modelkit/` and the actual doc files. Five issues are real and pending fix; three were claimed by reviewers but dismissed on second pass.
+
+---
+
+## Confirmed issues — pending fix
+
+### 1. Stale link display text across 7 files (10+ occurrences)
+
+Several pages were renamed during the Concepts restructure but their inbound link **display text** still uses the old titles. The link URLs themselves all resolve correctly (strict build passes); the issue is the visible label readers see.
+
+| Stale text | Should be | Locations |
+|---|---|---|
+| `Quantization & QDQ` | `Datatype and Quantization` | `commands/eval.md:95`, `commands/hub.md:112`, `samples/convnext-primitives.md:83`, `samples/convnext-primitives.md:175`, `tutorials/npu-convnext.md:278` |
+| `Quantization concepts` | `Datatype and Quantization` | `commands/quantize.md:115` |
+| `Concepts → Quantization and QDQ` | `Concepts → Datatype and Quantization` | `tutorials/npu-convnext.md:137` |
+| `ONNX & Execution Providers` / `ONNX and execution providers` | `EP and Device` | `commands/compile.md:110`, `commands/eval.md:96`, `commands/inspect.md:104`, `commands/overview.md:69`, `commands/perf.md:102`, `commands/sys.md:114`, `samples/convnext-primitives.md:108`, `samples/convnext-primitives.md:176` |
+| `Load and export concept` | `Load and export` | `commands/export.md:105`, `commands/inspect.md:100`, `commands/perf.md:101` |
+
+**Fix:** sed-sweep all five label patterns to the new titles.
+
+### 2. WinML CLI concept sub-group ordering misaligned with workflow
+
+`mkdocs.yml` lists the WinML CLI Concepts sub-group in this order:
+
+```
+Primitives and pipeline
+Load and export
+Analyze and optimize
+Compile and EPContext
+Perf and monitoring
+Eval and datasets
+Config and build      ← last
+```
+
+But `winml config` is **Step 1** of the End-to-End Tour (`getting-started/end-to-end.md`), so a reader who finishes the Tour and turns to Concepts to go deeper has to walk past 5 other pages before reaching `config-and-build.md`, which documents what they just did.
+
+**Fix:** reorder so `Config and build` follows `Primitives and pipeline`:
+
+```
+Primitives and pipeline
+Config and build
+Load and export
+Analyze and optimize
+Compile and EPContext
+Perf and monitoring
+Eval and datasets
+```
+
+### 3. `graphs-and-ir.md:29` opset 17 / GroupNorm factual error
+
+Current text:
+
+> "Opset 17 introduced layer-normalisation and group-normalisation operators in native form, eliminating the multi-node decompositions required by earlier opsets…"
+
+Per the ONNX changelog, **`LayerNormalization` was added in opset 17** but **`GroupNormalization` was added in opset 18**. The compound claim is wrong.
+
+**Fix:** rewrite to "Opset 17 introduced LayerNormalization in native form; GroupNormalization arrived in opset 18." Or drop the GroupNorm mention entirely.
+
+### 4. ConvNeXt "Pick the right page" admonition missing from `end-to-end.md`
+
+The admonition appears at the top of `samples/convnext-primitives.md:3` and `tutorials/npu-convnext.md:3` but is **absent** from `getting-started/end-to-end.md`. The three pages all use `facebook/convnext-tiny-224` and a reader coming from the End-to-End Tour has no signpost telling them about the other two pages.
+
+**Fix:** add a matching `!!! info "Pick the right ConvNeXt page"` admonition near the top of `getting-started/end-to-end.md`.
+
+### 5. `end-to-end.md:108` capital-B inconsistency
+
+Line 108 reads `[Config and Build](../concepts/config-and-build.md)` (capital B). The nav label and line 88 of the same file use lowercase `Config and build`.
+
+**Fix:** change to lowercase `b` to match.
+
+---
+
+## Issues claimed by reviewers but rejected on fact-check
+
+### #2 (rejected) — Quickstart link description
+
+A UX reviewer claimed `quickstart.md:63` says "full pipeline against a Qualcomm NPU". Actual text is "full pipeline from Hugging Face to NPU". The exact phrasing the reviewer quoted is not present. The link wording is mildly NPU-leaning but not the misrepresentation claimed. Optional minor wording tweak; not pursued here.
+
+### #5 (rejected) — `<artifact>.onnx` placeholder ambiguity
+
+A UX reviewer claimed Step 3 leaves the reader guessing the per-device filename. The actual prose at `end-to-end.md:121-125` explicitly lists all three filenames (`convnext_tiny_qnn_ctx.onnx`, `convnext_tiny_dml_ctx.onnx`, `convnext_tiny.onnx`) and tells readers where to find them. Reviewer missed reading the next sentence.
+
+### #7 (rejected) — `weight-and-activation.md` forward-reference to `w8a16`
+
+A UX reviewer claimed the page mentions `w8a16` before defining it. Actual text at line 25 defines it inline: "The compound precision shorthand `w8a16` (8-bit weights, 16-bit activations)". Reviewer wrong.
+
+### #9 (partial → effectively rejected) — `optim` fields not declared on dataclass
+
+A factual reviewer flagged that `WinMLOptimizationConfig` is a free-form dict subclass with no declared fields, so the JSON example field names (`gelu_fusion`, `layer_norm_fusion`, `matmul_add_fusion`) "may not be real". Verified that the fields **are** real keys recognized by the optimizer at `src/winml/modelkit/optim/pipes/graph.py:242-243`. The example is correct. Not a defect.
+
+---
+
+## Items intentionally left as-is
+
+- **"WinML CLI" sub-group naming.** The sub-group inside Concepts is named `WinML CLI`, which is recursive (the product is `winml-cli`). Suggested rename to "Workflows" was proposed and explicitly declined earlier. No change.
+- **Singular vs plural style split between Fundamentals and WinML CLI sub-groups.** Fundamentals uses singular pair-topics ("Graph and IR", "Weight and Activation", "EP and Device", "Datatype and Quantization") per the user's preference; WinML CLI still uses plurals ("Primitives and pipeline", "Eval and datasets"). The user has not asked to reconcile.
diff --git a/docs/superpowers/plans/2026-05-20-modelkit-docs-site.md b/docs/superpowers/plans/2026-05-20-modelkit-docs-site.md
new file mode 100644
index 000000000..6321257a9
--- /dev/null
+++ b/docs/superpowers/plans/2026-05-20-modelkit-docs-site.md
@@ -0,0 +1,1031 @@
+# ModelKit User-Facing Documentation Site — Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Build an MVP user-facing documentation site for ModelKit using MkDocs Material, served from `docs/`, with full content for chapters 1-4 (Getting Started, Concepts, Commands, Samples) and stubs for chapters 5-7.
+
+**Architecture:** Static site generated by MkDocs Material from in-repo markdown. Build verified locally; CI workflow exists but requires `workflow_dispatch` (no automatic publish during MVP). Authoring parallelized via subagents per page group; each batch commits as one logical unit.
+
+**Tech Stack:** Python 3.10 + uv, MkDocs 1.6+, mkdocs-material 9+, pymdown-extensions, mkdocs-jupyter (optional notebook plugin), GitHub Actions.
+
+**Spec:** `docs/superpowers/specs/2026-05-20-modelkit-docs-site-design.md`
+
+**Branch:** `docs/init` (based on `feat/mvp`). No pushes to remote.
+
+---
+
+## Conventions used in this plan
+
+- **Source CLI name:** `winml` (entry point at `modelkit/cli.py`, surfaced through `pyproject.toml` `scripts.winml`).
+- **Command source files:** `src/winml/modelkit/commands/<name>.py`.
+- **Existing internal docs that must be untouched:** `docs/design/`, `docs/naming-convention.md`, `docs/pytest-best-practices.md`. Excluded from MkDocs build via `exclude_docs` in `mkdocs.yml`.
+- **Verification step at the end of every task:** `uv run mkdocs build --strict` must succeed with exit code 0 and zero warnings.
+- **Commit message style:** Conventional Commits (`docs: ...`, `chore: ...`). No `Co-Authored-By` (project rule from CLAUDE.md). No "Test plan" section.
+
+---
+
+## Task 1: Scaffold MkDocs config, dependencies, and stub tree
+
+**Files:**
+- Modify: `pyproject.toml`
+- Create: `mkdocs.yml`
+- Create: `docs/index.md`
+- Create: `docs/getting-started/installation.md`
+- Create: `docs/getting-started/quickstart.md`
+- Create: `docs/getting-started/end-to-end.md`
+- Create: `docs/concepts/how-it-works.md`
+- Create: `docs/concepts/onnx-and-eps.md`
+- Create: `docs/concepts/quantization.md`
+- Create: `docs/concepts/hierarchy.md`
+- Create: `docs/concepts/buildconfig.md`
+- Create: `docs/commands/overview.md`
+- Create: `docs/commands/sys.md`, `inspect.md`, `hub.md`, `analyze.md`, `config.md`, `optimize.md`, `export.md`, `quantize.md`, `compile.md`, `build.md`, `perf.md`, `eval.md`
+- Create: `docs/samples/convnext-primitives.md`
+- Create: `docs/samples/bert-config-build.md`
+- Create: `docs/samples/qwen3-composite.md`
+- Create: `docs/reference/index.md`
+- Create: `docs/troubleshooting.md`
+- Create: `docs/contributing.md`
+
+- [ ] **Step 1: Add MkDocs dev dependencies to pyproject.toml**
+
+In `pyproject.toml`, locate the `optional-dependencies.dev = [` block (around line 62) and add three new entries (alphabetical insertion):
+
+```toml
+  "mkdocs-jupyter>=0.25",
+  "mkdocs-material>=9.5",
+  "pymdown-extensions>=10.7",
+```
+
+Run:
+```bash
+uv sync --extra dev
+```
+
+Expected: `uv` resolves and installs the three new packages without error. `uv.lock` is regenerated.
+
+- [ ] **Step 2: Create `mkdocs.yml`**
+
+Create `mkdocs.yml` at the repo root with the following exact content:
+
+```yaml
+site_name: ModelKit
+site_description: Accelerate Model Deployment on WinML
+site_url: https://gim-home.github.io/ModelKit/
+repo_url: https://github.com/gim-home/ModelKit
+repo_name: gim-home/ModelKit
+edit_uri: edit/main/docs/
+
+docs_dir: docs
+
+# Internal docs and brainstorming artifacts are kept in the repo but excluded
+# from the user-facing site.
+exclude_docs: |
+  /design/
+  /superpowers/
+  /naming-convention.md
+  /pytest-best-practices.md
+
+theme:
+  name: material
+  features:
+    - navigation.instant
+    - navigation.tracking
+    - navigation.tabs
+    - navigation.sections
+    - navigation.top
+    - content.code.copy
+    - content.action.edit
+    - toc.follow
+    - search.suggest
+    - search.highlight
+  palette:
+    - media: "(prefers-color-scheme: light)"
+      scheme: default
+      primary: indigo
+      accent: indigo
+      toggle:
+        icon: material/brightness-7
+        name: Switch to dark mode
+    - media: "(prefers-color-scheme: dark)"
+      scheme: slate
+      primary: indigo
+      accent: indigo
+      toggle:
+        icon: material/brightness-4
+        name: Switch to light mode
+
+plugins:
+  - search
+
+markdown_extensions:
+  - admonition
+  - attr_list
+  - md_in_html
+  - tables
+  - toc:
+      permalink: true
+  - pymdownx.details
+  - pymdownx.highlight:
+      anchor_linenums: true
+      line_spans: __span
+      pygments_lang_class: true
+  - pymdownx.inlinehilite
+  - pymdownx.snippets
+  - pymdownx.superfences:
+      custom_fences:
+        - name: mermaid
+          class: mermaid
+          format: !!python/name:pymdownx.superfences.fence_code_format
+  - pymdownx.tabbed:
+      alternate_style: true
+  - pymdownx.tasklist:
+      custom_checkbox: true
+
+nav:
+  - Home: index.md
+  - Getting Started:
+      - Installation: getting-started/installation.md
+      - Quickstart: getting-started/quickstart.md
+      - End-to-End — HF → NPU: getting-started/end-to-end.md
+  - Concepts:
+      - How ModelKit Works: concepts/how-it-works.md
+      - ONNX & Execution Providers: concepts/onnx-and-eps.md
+      - Quantization & QDQ: concepts/quantization.md
+      - Hierarchy Preservation: concepts/hierarchy.md
+      - BuildConfig & Kits: concepts/buildconfig.md
+  - Commands:
+      - Overview: commands/overview.md
+      - Discover:
+          - sys: commands/sys.md
+          - inspect: commands/inspect.md
+          - hub: commands/hub.md
+          - analyze: commands/analyze.md
+      - Configure:
+          - config: commands/config.md
+          - optimize: commands/optimize.md
+      - Build:
+          - export: commands/export.md
+          - quantize: commands/quantize.md
+          - compile: commands/compile.md
+          - build: commands/build.md
+      - Measure:
+          - perf: commands/perf.md
+          - eval: commands/eval.md
+  - Samples:
+      - ConvNeXt — Primitives Walkthrough: samples/convnext-primitives.md
+      - BERT — Config + Build + Perf: samples/bert-config-build.md
+      - Qwen3 — Composite Models: samples/qwen3-composite.md
+  - Reference: reference/index.md
+  - Troubleshooting: troubleshooting.md
+  - Contributing: contributing.md
+```
+
+- [ ] **Step 3: Create the landing page**
+
+Create `docs/index.md`:
+
+```markdown
+# ModelKit
+
+> Accelerate model deployment on Windows ML.
+
+ModelKit is a Python toolkit that converts and optimizes PyTorch and Hugging Face models to ONNX for deployment on the [Windows ML](https://learn.microsoft.com/en-us/windows/ai/windows-ml/) runtime. It supports multiple hardware backends including QNN (Qualcomm Neural Processing SDK), OpenVINO, DirectML, and ONNX Runtime CPU/GPU.
+
+## Where to start
+
+- **[Installation](getting-started/installation.md)** — get the `winml` CLI running locally.
+- **[Quickstart](getting-started/quickstart.md)** — export a Hugging Face model in five minutes.
+- **[End-to-End: HF → NPU](getting-started/end-to-end.md)** — full pipeline against a Qualcomm NPU.
+
+## Learn the model
+
+- **[How ModelKit Works](concepts/how-it-works.md)** — the pipeline from a PyTorch model to an EP-compiled artifact.
+- **[Commands](commands/overview.md)** — reference for all 12 `winml` subcommands.
+- **[Samples](samples/convnext-primitives.md)** — end-to-end walkthroughs for ConvNeXt, BERT, and Qwen3.
+
+## License
+
+MIT. See [LICENSE](https://github.com/gim-home/ModelKit/blob/main/LICENSE.txt).
+```
+
+- [ ] **Step 4: Create all stub pages**
+
+For every page listed in the `Files` section above (other than `index.md`), create a stub with this structure:
+
+```markdown
+# <Page Title>
+
+!!! note "Coming soon"
+    This page is part of the documentation MVP and will be authored shortly.
+```
+
+Use the page title from the `nav:` block in `mkdocs.yml`. For sub-pages like `commands/sys.md`, use `winml sys` as the title.
+
+- [ ] **Step 5: Verify the strict build passes**
+
+Run:
+```bash
+uv run mkdocs build --strict
+```
+
+Expected: exit code 0; the message `INFO - Documentation built in <N>.<NN> seconds`; no `WARNING` lines. A `site/` directory is produced and is already gitignored (line 99 of `.gitignore`).
+
+If warnings appear about excluded internal docs, re-check the `exclude_docs:` block in `mkdocs.yml`.
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add pyproject.toml uv.lock mkdocs.yml docs/index.md docs/getting-started/ docs/concepts/ docs/commands/ docs/samples/ docs/reference/ docs/troubleshooting.md docs/contributing.md
+git commit -m "docs: scaffold MkDocs Material site with stub pages
+
+Adds mkdocs.yml (Material theme, mermaid superfences, search),
+landing page, and stub pages for chapters 1-7. Internal docs
+(design/, naming-convention.md, pytest-best-practices.md) are
+excluded from the user-facing build via exclude_docs."
+```
+
+---
+
+## Task 2: CI workflow (manual dispatch only)
+
+**Files:**
+- Create: `.github/workflows/docs.yml`
+
+- [ ] **Step 1: Create the workflow file**
+
+Create `.github/workflows/docs.yml`:
+
+```yaml
+name: Build & Publish Docs
+
+on:
+  workflow_dispatch:
+
+permissions:
+  contents: write
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v3
+        with:
+          python-version: "3.10"
+      - run: uv sync --extra dev
+      - run: uv run mkdocs build --strict
+      - uses: peaceiris/actions-gh-pages@v4
+        with:
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+          publish_dir: ./site
+```
+
+- [ ] **Step 2: Verify the workflow is syntactically valid**
+
+Run:
+```bash
+python -c "import yaml; yaml.safe_load(open('.github/workflows/docs.yml'))"
+```
+
+Expected: no output (valid YAML, exit code 0).
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add .github/workflows/docs.yml
+git commit -m "ci: add docs build/publish workflow (manual dispatch only)
+
+Workflow runs mkdocs build --strict and deploys to gh-pages.
+Triggered manually via workflow_dispatch — no automatic publish
+during MVP authoring."
+```
+
+---
+
+## Task 3: Author Concepts pages (5 parallel agents)
+
+**Files:**
+- Modify: `docs/concepts/how-it-works.md`
+- Modify: `docs/concepts/onnx-and-eps.md`
+- Modify: `docs/concepts/quantization.md`
+- Modify: `docs/concepts/hierarchy.md`
+- Modify: `docs/concepts/buildconfig.md`
+
+- [ ] **Step 1: Dispatch 5 parallel subagents — one per concept page**
+
+Send all five `Agent` tool calls in a single message so they run concurrently. Use `subagent_type: general-purpose`. Use the prompt template below, substituting the bracketed values per page.
+
+### Reusable agent prompt template
+
+```
+You are authoring one page of the ModelKit user-facing documentation. Output: a single markdown file at the path I give you. Audience: external open-source developers, no insider jargon. Length: 400-700 words. Tone: clear, direct, second person where useful, no marketing fluff.
+
+Page: [PAGE TITLE]
+File to write: [ABSOLUTE PATH]
+Source files to read first (for accuracy — do not copy verbatim):
+[SOURCE PATHS, one per line]
+
+Required structure:
+1. H1 heading matching the page title.
+2. One- or two-paragraph lead that defines the concept and why a ModelKit user encounters it.
+3. Body sections (use H2/H3) covering the points listed below.
+4. A "See also" section at the bottom linking to 2-4 related pages within the docs (use relative paths like `../commands/build.md`).
+
+Body points to cover for this page:
+[BULLET LIST OF SUBSTANTIVE POINTS]
+
+Diagram: [yes/no — if yes, embed one mermaid block as instructed]
+
+Rules:
+- Write actual prose, not bullet lists, except where bullets clarify (lists of EPs, lists of precision options).
+- Use real `winml` command names and flag names where relevant — never the old `wmk` name. CLI source of truth is `src/winml/modelkit/commands/`.
+- Code blocks use triple backticks with language tags (```python, ```bash, ```text).
+- Do NOT speculate beyond what the source code supports. If unsure, omit.
+- Do NOT use placeholder phrases like "TBD" or "more details to come".
+
+Return only the file path you wrote.
+```
+
+### Per-page substitutions
+
+**Agent 1 — How ModelKit Works**
+- File: `docs/concepts/how-it-works.md`
+- Sources: `src/winml/modelkit/build/`, `src/winml/modelkit/commands/build.py`, `src/winml/modelkit/cli.py`
+- Body points:
+  - The pipeline at a glance: PyTorch/HF model → ONNX export → optional optimization → optional quantization → EP-specific compilation → inference session.
+  - What each stage does and which `winml` command owns it.
+  - Where `build` fits as the one-shot wrapper for the staged commands.
+  - How configuration flows (BuildConfig) versus ad-hoc CLI flags.
+- Diagram: yes — a top-to-bottom mermaid flowchart of the five stages. Use the literal syntax:
+  ```
+  ```mermaid
+  flowchart TD
+      A[PyTorch / HF model] --> B[winml export]
+      B --> C[winml optimize]
+      C --> D[winml quantize]
+      D --> E[winml compile]
+      E --> F[EP-ready ONNX]
+      F --> G[winml perf / eval]
+  ```
+  ```
+
+**Agent 2 — ONNX & Execution Providers**
+- File: `docs/concepts/onnx-and-eps.md`
+- Sources: `src/winml/modelkit/sysinfo/`, `src/winml/modelkit/commands/sys.py`, `src/winml/modelkit/analyze/`
+- Body points:
+  - What ONNX is (one paragraph; link to onnx.ai).
+  - What an Execution Provider is in ONNX Runtime terms.
+  - The EPs ModelKit supports today (read sysinfo/analyze source to enumerate — at least QNN, OpenVINO, DirectML, CPU, CUDA).
+  - Hardware/device mapping table (Device × EP) showing which combinations are valid.
+  - How `--device` (auto/cpu/gpu/npu) versus `--ep` interact.
+- Diagram: no.
+
+**Agent 3 — Quantization & QDQ**
+- File: `docs/concepts/quantization.md`
+- Sources: `src/winml/modelkit/quant/`, `src/winml/modelkit/commands/quantize.py`
+- Body points:
+  - Why quantize: smaller artifacts, faster inference on integer-capable hardware (NPUs), trade-off is accuracy loss.
+  - Precision options ModelKit supports today (read quant source to enumerate — fp32, fp16, int8, int16, w8a8, w8a16, w4a16, auto).
+  - Calibration: what calibration samples do, the `--samples` and `--method` flags (minmax/entropy/percentile).
+  - QDQ pattern explained: insert Quantize and Dequantize nodes around weights/activations so the runtime can fuse them on the target device.
+  - When quantization is lossy and how to tell.
+- Diagram: no.
+
+**Agent 4 — Hierarchy Preservation**
+- File: `docs/concepts/hierarchy.md`
+- Sources: `src/winml/modelkit/export/`, `src/winml/modelkit/onnx/`, `src/winml/modelkit/commands/export.py`, `src/winml/modelkit/commands/inspect.py`
+- Body points:
+  - Standard ONNX export flattens the PyTorch module tree — you lose which ops belonged to which layer.
+  - ModelKit embeds PyTorch hierarchy as metadata in the exported ONNX (`hierarchy_tag`).
+  - What this unlocks: per-module benchmarking (`winml perf --module`), targeted optimization, hierarchy view in `winml inspect --hierarchy`.
+  - How to turn it off: `winml export --no-hierarchy`.
+- Diagram: no.
+
+**Agent 5 — BuildConfig & Kits**
+- File: `docs/concepts/buildconfig.md`
+- Sources: `src/winml/modelkit/config/`, `src/winml/modelkit/commands/config.py`, `src/winml/modelkit/commands/build.py`
+- Body points:
+  - What a `WinMLBuildConfig` represents: the full set of decisions for one model's pipeline (model id, task, precision, EP, quantization options, compilation options).
+  - How `winml config` generates one for a given Hugging Face model or local ONNX.
+  - How `winml build -c <config.json>` consumes one.
+  - Per-task templates — where they live in `MODEL_BUILD_CONFIGS`.
+  - Why a config file is useful: reproducibility, sharing, CI.
+- Diagram: no.
+
+- [ ] **Step 2: Verify strict build passes**
+
+Run:
+```bash
+uv run mkdocs build --strict
+```
+
+Expected: exit code 0, no warnings. If any link the agents wrote is broken (e.g. `../commands/sys.md` before sys.md has real content — it's fine, the stub still exists), `--strict` will pass; only missing files fail.
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add docs/concepts/
+git commit -m "docs: author Concepts chapter (5 pages)
+
+Covers the ModelKit mental model: pipeline overview, ONNX/EPs,
+quantization & QDQ, hierarchy preservation, BuildConfig. Each
+page authored from the corresponding source modules; mermaid
+diagram in How ModelKit Works."
+```
+
+---
+
+## Task 4: Author 12 Command pages (4 parallel agents, 3 commands each)
+
+**Files (modified):**
+- `docs/commands/sys.md`, `inspect.md`, `hub.md`, `analyze.md`
+- `docs/commands/config.md`, `optimize.md`
+- `docs/commands/export.md`, `quantize.md`, `compile.md`, `build.md`
+- `docs/commands/perf.md`, `eval.md`
+
+- [ ] **Step 1: Dispatch 4 parallel subagents**
+
+Send all four `Agent` tool calls in a single message. Each agent owns one group:
+
+| Agent | Group | Commands |
+|---|---|---|
+| 1 | Discover (part 1) | sys, inspect, hub |
+| 2 | Discover (part 2) + Configure | analyze, config, optimize |
+| 3 | Build | export, quantize, compile, build |
+| 4 | Measure | perf, eval |
+
+Note Agent 3 owns four commands (the Build group naturally has 4); the rebalance keeps the other agents at 3.
+
+### Reusable agent prompt template
+
+```
+You are authoring user-facing reference pages for the ModelKit `winml` CLI. Output: one markdown file per command listed below. Audience: external open-source developers. Length: 300-600 words per page.
+
+Commands to author:
+[LIST: command name → file path]
+
+For each command, read its source file at `src/winml/modelkit/commands/<command>.py` and the shared options at `src/winml/modelkit/commands/_options.py`. Confirm the actual flags by running `uv run winml <command> --help` and capturing the output.
+
+Required structure for every page:
+
+# winml <command>
+
+> [One-line tagline describing the command's job — 8-15 words.]
+
+## When to use this
+
+[One or two sentences. State the user intent and where it falls in the pipeline.]
+
+## Synopsis
+
+```bash
+$ winml <command> [options]
+```
+
+## Flags
+
+[A markdown table with columns: Flag | Short | Type | Default | Description. Include EVERY flag from --help output. For shared flags (model, device, ep, output, task, precision), list them with their actual semantics for this command — do not skip them.]
+
+## How it works
+
+[2-4 sentences explaining what the command does internally — keep it user-relevant, not implementation trivia.]
+
+## Examples
+
+[3-5 examples. Each is a fenced bash block. Order from simplest to richest. Include expected output snippet (use a separate ```text block) for at least the first example. Use realistic model ids — microsoft/resnet-50, bert-base-uncased, microsoft/Phi-3-mini-4k-instruct as appropriate.]
+
+## Common pitfalls
+
+[Bulleted list of 2-5 gotchas. Be specific: missing flags, environment requirements, common error messages.]
+
+## See also
+
+[2-4 relative links to related command pages or concept pages.]
+
+Rules:
+- Use the actual `winml` CLI name everywhere. Never `wmk`.
+- Use the actual flag names from the source. Never invent flags.
+- Code blocks: ```bash for invocations, ```text for output, ```python only if a real Python snippet is needed.
+- No placeholder phrases. If a section legitimately has nothing to say (e.g. no pitfalls), omit the section.
+- Cross-link to concept pages where helpful: `../concepts/quantization.md`, `../concepts/buildconfig.md`, etc.
+
+Return the list of file paths you wrote.
+```
+
+- [ ] **Step 2: Verify strict build passes**
+
+Run:
+```bash
+uv run mkdocs build --strict
+```
+
+Expected: exit code 0, no warnings.
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add docs/commands/
+git commit -m "docs: author all 12 command reference pages
+
+Each page follows the standard template: tagline, when-to-use,
+synopsis, flags table, how-it-works, examples, pitfalls, see-also.
+Pages drafted from src/winml/modelkit/commands/ sources and the
+real --help output. Commands → Overview page authored in next task."
+```
+
+---
+
+## Task 5: Author Commands → Overview page
+
+**Files:**
+- Modify: `docs/commands/overview.md`
+
+- [ ] **Step 1: Dispatch one subagent**
+
+Prompt:
+
+```
+Author the user-facing Commands Overview page for ModelKit. Output: `docs/commands/overview.md`. Length: 400-600 words.
+
+Read all 12 command pages under `docs/commands/` (already authored). Read the shared CLI argument spec at `docs/design/cli/3_cli_args_spec.md` for context (do not copy from it — it is internal).
+
+Required structure:
+
+# Commands
+
+[2-3 paragraph lead: ModelKit exposes a CLI named `winml` with 12 subcommands organized by user intent. Show the four groups (Discover / Configure / Build / Measure) and explain when a user would reach for each group.]
+
+## Command map
+
+[Markdown table with columns: Command | Group | Purpose. One row per command. Link the command name to its page, e.g. [sys](sys.md).]
+
+## Choosing a command
+
+[A decision-style section. Pose 5-8 common questions a user might have ("I want to see what hardware I have", "I want to convert a HF model to ONNX", "I want to benchmark a compiled model on NPU", ...) and answer each with a single command + a one-line reason.]
+
+## Global flags
+
+[Brief mention of -v / -q / --debug / --version / -h. Note they live on the root `winml` group only and are inherited by all subcommands via ctx.obj.]
+
+## Shared flags
+
+[Brief mention of -m / -d / -o / -t / -p / --ep. State they have the same meaning on every command that accepts them.]
+
+Rules:
+- Use `winml` everywhere.
+- Link every command name to its page.
+- Do not duplicate full flag tables — that's what each command page is for.
+
+Return the file path you wrote.
+```
+
+- [ ] **Step 2: Verify strict build passes**
+
+Run:
+```bash
+uv run mkdocs build --strict
+```
+
+Expected: exit code 0, no warnings.
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add docs/commands/overview.md
+git commit -m "docs: author Commands Overview page
+
+Adds the 12-command map (grouped Discover/Configure/Build/Measure),
+a 'choosing a command' decision section, and references to global
+and shared flags. Drafted from the now-settled per-command pages."
+```
+
+---
+
+## Task 6: Author Sample pages (3 parallel agents)
+
+**Files:**
+- Modify: `docs/samples/convnext-primitives.md`
+- Modify: `docs/samples/bert-config-build.md`
+- Modify: `docs/samples/qwen3-composite.md`
+
+- [ ] **Step 1: Dispatch 3 parallel subagents**
+
+### Agent 1 — ConvNeXt primitives walkthrough
+
+```
+Author the ConvNeXt primitives sample page. Output: `docs/samples/convnext-primitives.md`. Length: 700-1100 words.
+
+The goal: teach the user the ModelKit pipeline by invoking each command directly (no `build` wrapper). Reader walks away understanding what each command does and how outputs chain.
+
+Source materials:
+- `src/winml/modelkit/commands/` (all command files — use real flags)
+- The command pages under `docs/commands/`
+- The concept pages under `docs/concepts/`
+
+Required structure:
+
+# ConvNeXt — Primitives Walkthrough
+
+[Lead: 2-paragraph intro. Why ConvNeXt (small, fast, ImageNet classifier — good first model). State this sample uses the primitive commands rather than `winml build`, to show how the pieces compose. State the target EPs covered (CPU, GPU, NPU).]
+
+## Prerequisites
+
+[Bulleted list: ModelKit installed, internet access for HF download, optional QNN SDK for NPU section.]
+
+## Step 1: Inspect the model
+
+[Brief bash block running `winml inspect -m facebook/convnext-tiny-224`. Show abbreviated expected output. One-paragraph callout: "What we just did — checked task detection, model class, exporter compatibility."]
+
+## Step 2: Generate a config (optional)
+
+[Show `winml config -m facebook/convnext-tiny-224 -o convnext_config.json`. Callout: this is optional in the primitives flow but useful for reproducibility.]
+
+## Step 3: Export to ONNX
+
+[`winml export -m facebook/convnext-tiny-224 -o convnext.onnx`. Callout on hierarchy preservation.]
+
+## Step 4: Quantize
+
+[`winml quantize -m convnext.onnx -o convnext_int8.onnx --precision int8 --samples 32`. Callout on calibration.]
+
+## Step 5: Compile for each EP
+
+[Three tabbed bash blocks using pymdownx.tabbed syntax:
+
+=== "CPU"
+    ```bash
+    winml compile -m convnext_int8.onnx -o convnext_cpu.onnx --device cpu
+    ```
+
+=== "GPU"
+    ```bash
+    winml compile -m convnext_int8.onnx -o convnext_gpu.onnx --device gpu
+    ```
+
+=== "NPU"
+    ```bash
+    winml compile -m convnext_int8.onnx -o convnext_npu.onnx --device npu --qnn-sdk-root <path>
+    ```
+
+Callout: NPU compilation requires the QNN SDK; cross-link to concepts/onnx-and-eps.md.]
+
+## Step 6: Benchmark
+
+[`winml perf` invocations with --device flags, one per EP.]
+
+## Step 7: Evaluate
+
+[`winml eval` on an ImageNet validation slice. Note the dataset flag.]
+
+## What you learned
+
+[Short bulleted summary: which command does what, how the artifacts chain.]
+
+## See also
+
+[Links to convnext command pages, BERT sample, Concepts/Quantization.]
+
+Rules:
+- Use real flag names. Verify against source.
+- Use realistic but small sample counts.
+- Where an output is shown, use ```text and keep it short (5-10 lines).
+
+Return the file path.
+```
+
+### Agent 2 — BERT config + build sample
+
+```
+Author the BERT config + build sample. Output: `docs/samples/bert-config-build.md`. Length: 500-800 words.
+
+The goal: teach the production-style workflow where the user generates a BuildConfig, runs `winml build` end-to-end, then measures with `winml perf`. EP coverage is NOT the focus — the workflow is.
+
+Source materials: same as ConvNeXt sample agent.
+
+Required structure:
+
+# BERT — Config + Build + Perf
+
+[Lead: 2-paragraph intro. Why BERT (canonical text classifier). State this sample uses `winml config` to generate a config file, then `winml build` to run the whole pipeline in one shot. Contrast briefly with the ConvNeXt primitives sample.]
+
+## Prerequisites
+
+[Bulleted list — short.]
+
+## Step 1: Generate a build config
+
+[`winml config -m bert-base-uncased -t text-classification -o bert_config.json`. Show truncated example of the JSON content (5-10 lines, illustrative). One-paragraph callout on what's in the file — link to concepts/buildconfig.md.]
+
+## Step 2: Run the build
+
+[`winml build -c bert_config.json --output-dir bert_out/`. Show short text-style output of stage progress (export → quantize → compile). Callout on the analyzer loop / --no-analyze if relevant.]
+
+## Step 3: Benchmark
+
+[`winml perf -m bert_out/<artifact>.onnx --iterations 50`. Show expected output snippet (1-2 numeric latency/throughput lines).]
+
+## Customizing the config
+
+[A short section: how to override precision in the config, how to disable a stage (--no-quant). Refer to concepts/buildconfig.md.]
+
+## What you learned
+
+[Bulleted summary.]
+
+## See also
+
+[Links to commands/config.md, commands/build.md, commands/perf.md, concepts/buildconfig.md, samples/convnext-primitives.md.]
+
+Rules:
+- Use real flag names. Verify.
+- Keep EP detail minimal — workflow is the focus.
+
+Return the file path.
+```
+
+### Agent 3 — Qwen3 placeholder
+
+```
+Author the Qwen3 sample placeholder. Output: `docs/samples/qwen3-composite.md`. Length: 150-300 words.
+
+This is a placeholder because composite-model support is on a feature branch and not yet in `feat/mvp` or `main`. We reserve the nav slot now.
+
+Required structure:
+
+# Qwen3 — Composite Models
+
+!!! info "Coming soon"
+    Composite-model support — running models with multiple components like text encoder + decoder or vision + LLM through a single ModelKit pipeline — is on an in-progress feature branch. This page will be authored once that work merges.
+
+## What composite models are
+
+[One short paragraph: explain at a conceptual level. Examples: an LLM with a separate vision encoder, a text encoder + decoder pair, multi-stage pipelines.]
+
+## What Qwen3 will demonstrate
+
+[Bulleted preview: 3-5 bullets of what the sample will cover when it ships. Be honest that this is forward-looking.]
+
+## Track progress
+
+[One line pointing the reader to GitHub issues / the project README for the current status.]
+
+Rules:
+- Be honest that this is a placeholder. Do not invent details.
+- Do not promise dates.
+- Use the actual project URL: https://github.com/gim-home/ModelKit
+```
+
+- [ ] **Step 2: Verify strict build passes**
+
+Run:
+```bash
+uv run mkdocs build --strict
+```
+
+Expected: exit code 0, no warnings.
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add docs/samples/
+git commit -m "docs: author Sample pages (ConvNeXt, BERT, Qwen3 placeholder)
+
+ConvNeXt walks through the primitive commands (export → quantize →
+compile → perf → eval) across CPU/GPU/NPU. BERT shows the
+config + build + perf workflow. Qwen3 is a placeholder for the
+upcoming composite-model feature."
+```
+
+---
+
+## Task 7: Author Getting Started pages (3 parallel agents)
+
+**Files:**
+- Modify: `docs/getting-started/installation.md`
+- Modify: `docs/getting-started/quickstart.md`
+- Modify: `docs/getting-started/end-to-end.md`
+
+Sequenced after Concepts, Commands, and Samples so it can cross-link to settled pages.
+
+- [ ] **Step 1: Dispatch 3 parallel subagents**
+
+### Agent 1 — Installation
+
+```
+Author `docs/getting-started/installation.md`. Length: 300-500 words.
+
+Audience: external developers who have just landed on the repo.
+
+Required structure:
+
+# Installation
+
+[Lead: one paragraph explaining what ModelKit is and what you need to install it.]
+
+## Prerequisites
+
+[Bulleted list: Windows 10/11, Python 3.10 (not 3.11+), `uv` package manager with link to https://github.com/astral-sh/uv. Mention git.]
+
+## Install
+
+[Bash block:
+```bash
+git clone https://github.com/gim-home/ModelKit.git
+cd ModelKit
+uv python install 3.10
+uv sync
+```
+Briefly explain each line.]
+
+## Verify
+
+[Bash block: `uv run winml sys` and show a snippet of expected output (5-8 lines, abbreviated). State that this enumerates available devices and execution providers.]
+
+## Optional extras
+
+[Brief mention of `--extra openvino` and `--extra qnn` (or whatever extras pyproject.toml actually defines — read `pyproject.toml` lines 79-82 to confirm). Note these are needed for OpenVINO and Qualcomm NPU respectively.]
+
+## Next steps
+
+[Link to quickstart.md.]
+
+Rules:
+- Use realistic terminal commands. No placeholders.
+- Verify the extras names against pyproject.toml.
+
+Return the file path.
+```
+
+### Agent 2 — Quickstart
+
+```
+Author `docs/getting-started/quickstart.md`. Length: 400-600 words.
+
+Audience: someone with ModelKit installed who wants a first success in ~5 minutes.
+
+Required structure:
+
+# Quickstart
+
+[Lead: one paragraph. Goal of this page: prove your install works by exporting a Hugging Face image classifier and inspecting the result.]
+
+## Export your first model
+
+[Bash block:
+```bash
+uv run winml export -m microsoft/resnet-50 -o resnet50.onnx
+```
+One-paragraph callout: what just happened. Cross-link to commands/export.md.]
+
+## Inspect the artifact
+
+[Bash block:
+```bash
+uv run winml inspect -m resnet50.onnx
+```
+Show a short truncated table-style output (5-8 lines). Cross-link to commands/inspect.md.]
+
+## What's next
+
+[Three short bullet links:
+- End-to-End walkthrough → ../getting-started/end-to-end.md
+- Concept of how ModelKit works → ../concepts/how-it-works.md
+- Full ConvNeXt sample → ../samples/convnext-primitives.md ]
+
+Rules:
+- Keep it under 5 minutes of reading + running.
+- No quantization, no EP selection — that's the next page's job.
+- Use `uv run winml` prefix consistently.
+
+Return the file path.
+```
+
+### Agent 3 — End-to-End: HF → NPU
+
+```
+Author `docs/getting-started/end-to-end.md`. Length: 700-1000 words.
+
+Audience: someone past the quickstart who wants to see the full pipeline land on a real NPU.
+
+Required structure:
+
+# End-to-End: Hugging Face → NPU
+
+[Lead: 2 paragraphs. Goal of this page: run a ConvNeXt classifier through the full ModelKit pipeline targeting a Qualcomm NPU via `winml build`. Estimated time, hardware requirement (Qualcomm device + QNN SDK).]
+
+## Prerequisites
+
+[Bulleted list: Quickstart done, Qualcomm device, QNN SDK installed (link out), `--extra qnn` installed.]
+
+## Step 1: Generate the build config
+
+[Bash block: `uv run winml config -m facebook/convnext-tiny-224 --device npu -o convnext_npu.json`. Truncated JSON snippet. Callout linking to concepts/buildconfig.md.]
+
+## Step 2: Run the build
+
+[Bash block: `uv run winml build -c convnext_npu.json --output-dir convnext_npu_out/ --qnn-sdk-root <path>`. Stage-progress output snippet. One-paragraph callout on what each stage does (link to concepts/how-it-works.md).]
+
+## Step 3: Benchmark on the NPU
+
+[Bash block: `uv run winml perf -m convnext_npu_out/<artifact>.onnx --device npu --iterations 50`. Expected latency line snippet.]
+
+## Step 4: (Optional) Compare against CPU
+
+[Same model, --device cpu, show the relative latency.]
+
+## Where to go next
+
+[Bulleted list of links:
+- Samples → ../samples/convnext-primitives.md
+- Command reference → ../commands/overview.md
+- BuildConfig → ../concepts/buildconfig.md]
+
+Rules:
+- All commands must use real flags from src/winml/modelkit/commands/.
+- Place QNN-specific notes in admonitions.
+- This is the showcase page — write it well.
+
+Return the file path.
+```
+
+- [ ] **Step 2: Verify strict build passes**
+
+Run:
+```bash
+uv run mkdocs build --strict
+```
+
+Expected: exit code 0, no warnings.
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add docs/getting-started/
+git commit -m "docs: author Getting Started chapter (3 pages)
+
+Installation covers Windows + Python 3.10 + uv setup and the
+optional EP extras. Quickstart proves the install with a 5-minute
+export + inspect. End-to-End walks ConvNeXt through the full
+pipeline targeting a Qualcomm NPU."
+```
+
+---
+
+## Task 8: Final verification and cleanup
+
+**Files:** none modified.
+
+- [ ] **Step 1: Run full strict build**
+
+```bash
+uv run mkdocs build --strict
+```
+
+Expected: exit code 0, no warnings, `site/` produced.
+
+- [ ] **Step 2: Local-serve smoke test**
+
+```bash
+uv run mkdocs serve
+```
+
+Expected: server starts on http://127.0.0.1:8000, banner shows "Documentation built in <N> seconds", no warnings in the log. Open the URL in a browser, click through:
+- Landing → Getting Started → Quickstart
+- Concepts → How ModelKit Works (verify mermaid renders)
+- Commands → Overview → click into 2-3 command pages
+- Samples → ConvNeXt
+
+Verify dark/light toggle works.
+
+Stop the server (Ctrl+C).
+
+- [ ] **Step 3: Confirm no remote pushes**
+
+```bash
+git log origin/feat/mvp..HEAD --oneline
+git status
+```
+
+Expected: lists every commit from Task 1 through Task 7 (about 7 commits) as ahead of `origin/feat/mvp`. Working tree clean.
+
+- [ ] **Step 4: Confirm internal docs untouched**
+
+```bash
+git diff origin/feat/mvp..HEAD -- docs/design/ docs/naming-convention.md docs/pytest-best-practices.md
+```
+
+Expected: no output (no changes to those paths).
+
+- [ ] **Step 5: No commit needed**
+
+If smoke test passes and no internal docs were touched, the MVP is complete. Report the commit log to the user.
+
+---
+
+## Self-review notes
+
+- **Spec coverage:** every section in the spec maps to at least one task. Section 5 IA → Tasks 1, 3, 4, 5, 6, 7. Section 8 layout → Task 1. Section 9 MkDocs config → Task 1 Step 2. Section 10 CI → Task 2. Section 11 batches → Tasks 3, 4, 5, 6, 7 (one task per batch). Section 13 acceptance criteria → Task 8.
+- **Type/name consistency:** `winml` (not `wmk`) used throughout. Flag names referenced are present in `src/winml/modelkit/commands/_options.py` (verified). Source paths use `src/winml/modelkit/commands/` consistently.
+- **Placeholder scan:** no TBD/TODO. All agent prompts are concrete; per-page substitutions are listed. The Qwen3 sample is intentionally a placeholder *page*, not a placeholder *task*.
+- **Parallelism is explicit:** Tasks 3, 4, 6, 7 dispatch multiple agents in a single message — the executor (subagent-driven-development) reads this and parallelizes accordingly.
diff --git a/docs/superpowers/plans/2026-05-24-docs-expansion-v2.md b/docs/superpowers/plans/2026-05-24-docs-expansion-v2.md
new file mode 100644
index 000000000..56b4ef49f
--- /dev/null
+++ b/docs/superpowers/plans/2026-05-24-docs-expansion-v2.md
@@ -0,0 +1,996 @@
+# Docs Expansion v2 — Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Author 11 new doc pages, rename 2 existing pages with content edits, modify 5 pages, and restructure the MkDocs nav — delivering a Tutorials chapter, a sub-grouped Concepts chapter (Fundamentals + WinML CLI), and polish to Getting Started.
+
+**Architecture:** Six-batch plan executed on `docs/v2`. Foundation first (scaffold + nav + renames), then authoring batches in parallel where pages don't share state, then a cross-link sweep that catches any reference to the renamed files. Verification at every batch is `uv run mkdocs build --strict`.
+
+**Tech Stack:** Python 3.10 + uv, MkDocs Material 9.5+, pymdown-extensions, Bash via Git for Windows for `sed`/`grep` operations.
+
+**Spec:** `docs/superpowers/specs/2026-05-24-docs-expansion-v2-design.md`
+
+**Branch:** `docs/v2` (off `docs/v1`). No remote pushes during execution.
+
+---
+
+## Conventions used in this plan
+
+- **CLI source of truth:** `src/winml/modelkit/commands/<name>.py` and `src/winml/modelkit/commands/_options.py`. Every flag mentioned in a doc must exist in source.
+- **Product name in prose:** `winml-cli` (never `wmk` or `ModelKit`).
+- **Existing internal docs that must NOT be modified:** `docs/design/`, `docs/naming-convention.md`, `docs/pytest-best-practices.md`, `docs/superpowers/` (other than this plan and its spec).
+- **Verification at task end:** `uv run mkdocs build --strict` — must exit 0 with no WARNING lines from MkDocs (the Material upstream advisory banner is not a MkDocs WARNING).
+- **Commit style:** Conventional Commits (`docs: ...`). No `Co-Authored-By`. No "Test plan" section.
+- **Parallel agent dispatches** within a task = single message with multiple Agent tool calls, agents do NOT commit (orchestrator batch-commits).
+
+---
+
+## Task 1: Scaffold — stubs, renames, nav restructure (Batch A)
+
+**Files (modify):**
+- `mkdocs.yml`
+
+**Files (rename):**
+- `docs/concepts/onnx-and-eps.md` → `docs/concepts/eps-and-devices.md`
+- `docs/concepts/hierarchy.md` → `docs/concepts/hierarchy-and-metadata.md`
+
+**Files (create as stubs — full content authored in later batches):**
+- `docs/tutorials/index.md`
+- `docs/tutorials/npu-convnext.md`
+- `docs/concepts/graphs-and-ir.md`
+- `docs/concepts/tensors-and-dtypes.md`
+- `docs/concepts/primitives-and-pipeline.md`
+- `docs/concepts/config-and-build.md`
+- `docs/concepts/load-and-export.md`
+- `docs/concepts/analyze-and-optimize.md`
+- `docs/concepts/compile-and-epcontext.md`
+- `docs/concepts/perf-and-monitoring.md`
+- `docs/concepts/eval-and-datasets.md`
+
+- [ ] **Step 1: Rename the 2 existing concept files**
+
+Use `git mv` so history is preserved:
+
+```bash
+git mv docs/concepts/onnx-and-eps.md docs/concepts/eps-and-devices.md
+git mv docs/concepts/hierarchy.md docs/concepts/hierarchy-and-metadata.md
+```
+
+The file contents are unchanged at this step; content edits happen in Batch B.
+
+- [ ] **Step 2: Create 11 stub pages**
+
+Each stub has this exact body shape, with `<Page Title>` filled per the table below:
+
+```markdown
+# <Page Title>
+
+!!! note "Coming soon"
+    This page is part of the v2 docs expansion and will be authored next.
+```
+
+| File path | Page Title |
+|---|---|
+| `docs/tutorials/index.md` | `Tutorials` |
+| `docs/tutorials/npu-convnext.md` | `ConvNeXt on NPU` |
+| `docs/concepts/graphs-and-ir.md` | `Models, graphs, and the ONNX IR` |
+| `docs/concepts/tensors-and-dtypes.md` | `Tensors and dtypes` |
+| `docs/concepts/primitives-and-pipeline.md` | `Primitives and pipeline` |
+| `docs/concepts/config-and-build.md` | `Config and build` |
+| `docs/concepts/load-and-export.md` | `Load and export` |
+| `docs/concepts/analyze-and-optimize.md` | `Analyze and optimize` |
+| `docs/concepts/compile-and-epcontext.md` | `Compile and EPContext` |
+| `docs/concepts/perf-and-monitoring.md` | `Perf and monitoring` |
+| `docs/concepts/eval-and-datasets.md` | `Eval and datasets` |
+
+- [ ] **Step 3: Update `mkdocs.yml` nav**
+
+Replace the existing `nav:` block with this exact block (the rest of the file — `site_name`, `theme`, `plugins`, `markdown_extensions`, `exclude_docs` — stays untouched):
+
+```yaml
+nav:
+  - Home: index.md
+  - Getting Started:
+      - Installation: getting-started/installation.md
+      - Quickstart: getting-started/quickstart.md
+      - End-to-End — HF → NPU: getting-started/end-to-end.md
+  - Concepts:
+      - Fundamentals:
+          - How winml-cli works: concepts/how-it-works.md
+          - Models, graphs, and the ONNX IR: concepts/graphs-and-ir.md
+          - Tensors and dtypes: concepts/tensors-and-dtypes.md
+          - Execution Providers and devices: concepts/eps-and-devices.md
+          - Quantization and QDQ: concepts/quantization.md
+          - Hierarchy and ONNX metadata: concepts/hierarchy-and-metadata.md
+          - BuildConfig and kits: concepts/buildconfig.md
+      - WinML CLI:
+          - Primitives and pipeline: concepts/primitives-and-pipeline.md
+          - Config and build: concepts/config-and-build.md
+          - Load and export: concepts/load-and-export.md
+          - Analyze and optimize: concepts/analyze-and-optimize.md
+          - Compile and EPContext: concepts/compile-and-epcontext.md
+          - Perf and monitoring: concepts/perf-and-monitoring.md
+          - Eval and datasets: concepts/eval-and-datasets.md
+  - Commands:
+      - Overview: commands/overview.md
+      - Discover:
+          - sys: commands/sys.md
+          - inspect: commands/inspect.md
+          - hub: commands/hub.md
+          - analyze: commands/analyze.md
+      - Configure:
+          - config: commands/config.md
+          - optimize: commands/optimize.md
+      - Build:
+          - export: commands/export.md
+          - quantize: commands/quantize.md
+          - compile: commands/compile.md
+          - build: commands/build.md
+      - Measure:
+          - perf: commands/perf.md
+          - eval: commands/eval.md
+  - Samples:
+      - ConvNeXt — Primitives Walkthrough: samples/convnext-primitives.md
+      - BERT — Config + Build + Perf: samples/bert-config-build.md
+      - Qwen3 — Composite Models: samples/qwen3-composite.md
+  - Tutorials:
+      - Overview: tutorials/index.md
+      - ConvNeXt on NPU: tutorials/npu-convnext.md
+  - Reference: reference/index.md
+  - Troubleshooting: troubleshooting.md
+  - Contributing: contributing.md
+```
+
+- [ ] **Step 4: Verify strict build**
+
+```bash
+uv run mkdocs build --strict
+```
+
+Expected: exit 0, message `Documentation built in <N> seconds`, no WARNING lines.
+
+If `--strict` errors with "doc file not found" for any of the 11 new files or the 2 renamed files, fix the path before continuing.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add mkdocs.yml docs/concepts/ docs/tutorials/
+git commit -m "docs: scaffold v2 expansion (stubs, renames, nav restructure)
+
+Renames:
+- concepts/onnx-and-eps.md -> concepts/eps-and-devices.md
+- concepts/hierarchy.md -> concepts/hierarchy-and-metadata.md
+
+Stubs created (content authored in next batches):
+- tutorials/index.md, tutorials/npu-convnext.md
+- concepts/graphs-and-ir.md, concepts/tensors-and-dtypes.md
+- concepts/{primitives-and-pipeline,config-and-build,load-and-export,
+  analyze-and-optimize,compile-and-epcontext,perf-and-monitoring,
+  eval-and-datasets}.md
+
+Nav restructured: Concepts sub-grouped into Fundamentals + WinML CLI;
+Tutorials chapter inserted between Samples and Reference."
+```
+
+---
+
+## Task 2: Concepts — Fundamentals authoring (Batch B)
+
+**Files (full content authoring or content editing):**
+- Modify: `docs/concepts/how-it-works.md` (rename-in-nav only — content kept; included here so the reviewer notices it)
+- Modify: `docs/concepts/eps-and-devices.md` (already renamed in Task 1; small content reframe to match the new pair-topic title)
+- Modify: `docs/concepts/hierarchy-and-metadata.md` (already renamed; broaden content to cover other metadata, not just `winml.hierarchy.tag`)
+- Modify: `docs/concepts/buildconfig.md` (rename-in-nav only — content kept)
+- Modify: `docs/concepts/quantization.md` (tighten — dtype family content moves out to Tensors page)
+- Author: `docs/concepts/graphs-and-ir.md`
+- Author: `docs/concepts/tensors-and-dtypes.md`
+
+In total: **2 new pages authored, 3 pages content-edited, 2 pages untouched-but-renamed-in-nav**. The 2 untouched-in-nav pages need no editing in this batch.
+
+### Voice anchor (read before dispatching agents)
+
+The 5 existing Fundamentals pages (now renamed/touched) set the voice: clear, direct, 400–700 words, opens with a 1–2 paragraph lead, uses H2 sections, ends with a `## See also` block of 2–4 relative links. Every flag and symbol cited is verified in `src/winml/modelkit/`. No marketing language.
+
+- [ ] **Step 1: Dispatch parallel author agents (wave 1)**
+
+Send all 4 `Agent` tool calls in a single message; `subagent_type: general-purpose`, `model: sonnet`. Agents write only; the orchestrator commits.
+
+#### Agent B1 — Author `concepts/graphs-and-ir.md` (new)
+
+```
+You are authoring ONE Concepts page for the winml-cli user-facing docs. Output: overwrite the stub at C:\Users\zhengte\BYOM\ModelKits\mvp\docs\concepts\graphs-and-ir.md. DO NOT commit.
+
+Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
+
+Title: # Models, graphs, and the ONNX IR
+Length: 400–700 words of prose.
+
+Sources to read first (for accuracy — do not copy verbatim):
+- src/winml/modelkit/onnx/ (directory; look at metadata.py and model detection helpers)
+- src/winml/modelkit/export/ (directory; opset version selection)
+- An external reference: https://github.com/onnx/onnx/blob/main/docs/IR.md (treat as background, do not link)
+
+Body requirements:
+1. Lead (1–2 paragraphs): what a model file is at rest; the model is a graph; graphs are described in the ONNX IR; opsets version the operator set.
+2. H2 — "What is in a .onnx file": inputs, outputs, nodes (operators), initializers (weights), metadata. Use one short bulleted list.
+3. H2 — "Graphs as IR": brief explanation that ONNX is an Intermediate Representation — a static computation graph that's portable across runtimes. Mention nodes have inputs/outputs that wire into the graph; this enables shape inference and EP-targeted compilation.
+4. H2 — "Opsets and versioning": opset is a snapshot of the operator catalog at a specific version. winml-cli's `winml export` defaults to opset 17 (verify in src/winml/modelkit/export/ or commands/export.py). New opsets unlock new ops; EPs may not support the latest opset.
+5. H2 — "See also": 2–4 relative links. Valid targets (relative to docs/concepts/):
+   - eps-and-devices.md
+   - tensors-and-dtypes.md
+   - hierarchy-and-metadata.md
+   - ../commands/inspect.md
+   - ../commands/export.md
+
+Rules:
+- Use winml-cli (never ModelKit, never wmk).
+- Verify opset default by reading the source. If you cannot confirm 17, state the actual default you found.
+- No "TBD", no placeholders.
+- Code blocks: ```bash for invocations, ```text for output.
+
+Verify the strict build after writing:
+  uv run mkdocs build --strict 2>&1 | tail -3
+
+Expected: exit 0, no WARNING lines.
+
+Return: status (DONE/DONE_WITH_CONCERNS/BLOCKED), word count estimate, last 3 lines of mkdocs build output, the opset version you cited and where you confirmed it.
+```
+
+#### Agent B2 — Author `concepts/tensors-and-dtypes.md` (new)
+
+```
+You are authoring ONE Concepts page for the winml-cli user-facing docs. Output: overwrite the stub at C:\Users\zhengte\BYOM\ModelKits\mvp\docs\concepts\tensors-and-dtypes.md. DO NOT commit.
+
+Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
+
+Title: # Tensors and dtypes
+Length: 500–800 words of prose (slightly longer than typical Fundamentals page because this absorbs the dtype content from the quantization page).
+
+Sources to read first:
+- src/winml/modelkit/commands/_options.py (for _KNOWN_PRECISIONS)
+- src/winml/modelkit/onnx/ (for I/O tensor spec and shape inference helpers)
+- src/winml/modelkit/commands/quantize.py (for activation_type / weight_type flags)
+- src/winml/modelkit/commands/export.py (for --input-specs and --shape-config flags)
+
+Body requirements:
+1. Lead (1–2 paragraphs): three roles for tensors in a model — weights (static parameters), activations (intermediate values at inference), I/O tensors (inputs and outputs at the graph boundary). Each role has a dtype that may differ.
+2. H2 — "Weights and activations": one paragraph explaining the distinction and why it matters (memory footprint, quantization granularity, EP support tiers).
+3. H2 — "Dtype options in winml-cli": markdown table listing the precision strings from _KNOWN_PRECISIONS in _options.py. Columns: Precision | Weight dtype | Activation dtype | Notes. Cover at least auto, fp32, fp16, int8, int16, w8a8, w8a16, w4a16.
+4. H2 — "Static vs dynamic shapes": one paragraph. ONNX supports symbolic dimensions ("batch", "sequence") that are resolved at runtime. winml-cli's --input-specs and --shape-config flags let you constrain these at export time. Some EPs (QNN) require fully static shapes; others (DirectML) accept dynamic.
+5. H2 — "See also": 2–4 relative links. Valid targets:
+   - quantization.md
+   - eps-and-devices.md
+   - graphs-and-ir.md
+   - ../commands/export.md
+   - ../commands/quantize.md
+
+Rules:
+- Verify every precision string against _KNOWN_PRECISIONS. Do not invent precisions.
+- Verify --input-specs and --shape-config exist on winml export (read export.py).
+- Use winml-cli (never ModelKit/wmk).
+- No "TBD", no placeholders.
+
+Verify: uv run mkdocs build --strict 2>&1 | tail -3
+
+Return: status, word count, build output last 3 lines, the precision list you enumerated (verbatim).
+```
+
+#### Agent B3 — Edit `concepts/eps-and-devices.md` (rename done; content reframe)
+
+```
+You are content-editing ONE Concepts page for the winml-cli user-facing docs. The page is at C:\Users\zhengte\BYOM\ModelKits\mvp\docs\concepts\eps-and-devices.md (just renamed from onnx-and-eps.md; content is the previous version). DO NOT commit.
+
+Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
+
+Goal: reframe the page title and lead from "ONNX & Execution Providers" to "Execution Providers and devices". The ONNX intro content should be trimmed (it now lives in graphs-and-ir.md) and the EP × Device matrix should remain front-and-center.
+
+Length: 400–700 words after editing.
+
+Specific edits:
+1. Change the H1 to: # Execution Providers and devices
+2. Rewrite the lead (1–2 paragraphs): what an EP is, what a device is, how winml-cli's --device and --ep flags map to them. Drop the "what is ONNX" intro paragraph (now covered by graphs-and-ir.md). If you reference ONNX, link to ../concepts/graphs-and-ir.md.
+3. Keep the EP × Device table (and update it if you find a missed EP in src/winml/modelkit/sysinfo/).
+4. Keep the "Device vs EP on the CLI" section.
+5. Update the "## See also" block to include a link to graphs-and-ir.md and tensors-and-dtypes.md if not already present. Keep total at 2–4 links.
+
+Rules:
+- Use winml-cli (never ModelKit/wmk). Replace any "ModelKit" string you find inside the page with "winml-cli".
+- Do not invent EPs. Verify against src/winml/modelkit/sysinfo/.
+- No "TBD", no placeholders.
+
+Verify: uv run mkdocs build --strict 2>&1 | tail -3
+
+Return: status, word count after editing, last 3 lines of build, list of EP names referenced in the final table.
+```
+
+#### Agent B4 — Edit `concepts/hierarchy-and-metadata.md` (rename done; broaden content)
+
+```
+You are content-editing ONE Concepts page for the winml-cli user-facing docs. The page is at C:\Users\zhengte\BYOM\ModelKits\mvp\docs\concepts\hierarchy-and-metadata.md (just renamed from hierarchy.md; current content focuses only on hierarchy.tag). DO NOT commit.
+
+Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
+
+Goal: broaden the page from "what is hierarchy_tag" to "what metadata winml-cli writes into the ONNX model, and why each entry exists."
+
+Length target: 500–700 words after editing.
+
+Specific edits:
+1. Change the H1 to: # Hierarchy and ONNX metadata
+2. Rewrite the lead (1–2 paragraphs): ONNX files carry metadata_props key/value entries beyond the graph itself. winml-cli writes several of these. The most important is winml.hierarchy.tag (the PyTorch module-path tag), but there are others.
+3. New H2 — "Metadata winml-cli writes": markdown table. Columns: Key | Set by | Purpose. Inspect src/winml/modelkit/onnx/metadata.py and src/winml/modelkit/export/htp/exporter.py to find the canonical list. Include winml.hierarchy.tag at minimum.
+4. Existing H2 — "What hierarchy_tag enables": keep the existing content about per-module benchmarking (winml perf --module) and the --no-hierarchy / --clean-onnx flag on winml export.
+5. Existing H2 — "See also": keep but add tensors-and-dtypes.md as a link.
+
+Rules:
+- Verify every metadata key by reading the source. State the file:line where you found each key.
+- Use winml-cli (never ModelKit/wmk). Replace any "ModelKit" string with "winml-cli".
+- No "TBD", no placeholders.
+
+Verify: uv run mkdocs build --strict 2>&1 | tail -3
+
+Return: status, word count, build output last 3 lines, the list of metadata keys you documented with file:line evidence.
+```
+
+- [ ] **Step 2: Edit `concepts/quantization.md` (tighten — move dtype content to Tensors page)**
+
+Read the current file first; find any paragraph or table that primarily explains the dtype family (fp32/fp16/int8/int16/compound types). Move that content (logically — by trimming here, since the new content already lives in tensors-and-dtypes.md after Step 1).
+
+Make these specific edits to `docs/concepts/quantization.md`:
+
+- If the page has an H2 like "Precision options" that lists the dtype family, replace its body with a short sentence: "See [Tensors and dtypes](tensors-and-dtypes.md) for the full precision family. This page focuses on the quantization algorithm, calibration, and the QDQ pattern."
+- Otherwise no changes — the page can keep its calibration and QDQ content.
+
+The orchestrator does this edit directly (not via agent) since it's a one-line surgical change.
+
+- [ ] **Step 3: Edit `concepts/how-it-works.md` and `concepts/buildconfig.md` — verify they don't reference renamed files**
+
+Read each file. If they contain links like `[ONNX & Execution Providers](onnx-and-eps.md)` or `[Hierarchy](hierarchy.md)`, update them to `eps-and-devices.md` and `hierarchy-and-metadata.md` respectively. Otherwise no changes.
+
+- [ ] **Step 4: Verify strict build**
+
+```bash
+uv run mkdocs build --strict 2>&1 | tail -3
+```
+
+Expected: exit 0, no WARNING lines.
+
+If the build complains about broken links pointing to `onnx-and-eps.md` or `hierarchy.md`, fix those references in whatever file they live in (these are the inbound links flagged in spec §7).
+
+- [ ] **Step 5: Commit (Fundamentals batch)**
+
+```bash
+git add docs/concepts/
+git commit -m "docs(concepts/fundamentals): author graphs-and-ir + tensors-and-dtypes; reframe eps-and-devices + hierarchy-and-metadata after rename
+
+- New: graphs-and-ir.md (models, graphs, IR, opsets)
+- New: tensors-and-dtypes.md (weights/activations/I-O tensors, precision
+  family, static-vs-dynamic shapes)
+- Reframed: eps-and-devices.md (drops the ONNX intro, now covered by
+  graphs-and-ir.md; keeps EP × Device matrix)
+- Broadened: hierarchy-and-metadata.md (now covers all metadata
+  winml-cli writes, not only winml.hierarchy.tag)
+- Tightened: quantization.md (dtype family content moved to
+  tensors-and-dtypes.md to remove duplication)"
+```
+
+---
+
+## Task 3: Concepts — WinML CLI authoring (Batch C)
+
+**Files (all new, full content authoring):**
+- `docs/concepts/primitives-and-pipeline.md`
+- `docs/concepts/config-and-build.md`
+- `docs/concepts/load-and-export.md`
+- `docs/concepts/analyze-and-optimize.md`
+- `docs/concepts/compile-and-epcontext.md`
+- `docs/concepts/perf-and-monitoring.md`
+- `docs/concepts/eval-and-datasets.md`
+
+### Voice anchor
+
+These are **workflow-concept pages**, not command-reference pages. Each explains the **why** and **when**, cross-linking to the per-command reference at `docs/commands/<name>.md` for **what**. No flag tables — that's the command-reference page's job.
+
+- [ ] **Step 1: Dispatch parallel author agents (4 agents, 7 pages)**
+
+Single message, 4 Agent tool calls, `model: sonnet`, agents write only.
+
+| Agent | Pages owned |
+|---|---|
+| C1 | `primitives-and-pipeline.md`, `config-and-build.md` |
+| C2 | `load-and-export.md`, `analyze-and-optimize.md` |
+| C3 | `compile-and-epcontext.md`, `perf-and-monitoring.md` |
+| C4 | `eval-and-datasets.md` |
+
+#### Reusable agent prompt template
+
+```
+You are authoring Concepts pages for the winml-cli user-facing docs. Output: write the markdown files listed below. DO NOT commit.
+
+Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
+
+Voice and shape per page:
+- H1 = page title (given below per page).
+- 400–700 words of prose.
+- Lead (1–2 paragraphs): what conceptual tension/pair this page covers and why it matters.
+- 2–4 H2 sections.
+- Closing "## See also" with 2–4 relative links.
+- These are workflow-concept pages: explain WHY and WHEN. The /commands/ pages cover WHAT flags.
+- No flag tables. If you need to mention a flag, do it inline in prose.
+- Use winml-cli throughout (never ModelKit/wmk).
+
+Source verification rule: every flag, file, or symbol you cite must exist in src/winml/modelkit/. Verify by reading or running uv run winml <command> --help.
+
+Pages assigned to you:
+
+[PAGE BLOCKS — see below]
+
+After all your pages are written, run:
+  uv run mkdocs build --strict 2>&1 | tail -3
+
+Expected: exit 0, no WARNING lines.
+
+Return: status (DONE/DONE_WITH_CONCERNS/BLOCKED), per-page word count, build output last 3 lines, anything surprising (a flag that doesn't exist where the prompt says it should, a source claim you couldn't verify).
+```
+
+#### Page blocks — Agent C1
+
+```
+PAGE 1 — concepts/primitives-and-pipeline.md
+Title: # Primitives and pipeline
+Theme: Two ways to use winml-cli — invoke individual primitive commands (export, optimize, quantize, compile, perf, eval) one at a time, or use `winml build` as the wrapper that runs them all from a config. Teach when to choose which: primitives for learning / debugging / one-off variations; build for production / CI / reproducibility.
+
+Required H2 sections:
+- "The primitive commands" — list the staged commands with a one-line role each. Reference the order in docs/concepts/how-it-works.md.
+- "The pipeline wrapper" — winml build orchestrates the same stages from a single WinMLBuildConfig.
+- "When to choose which" — bullets contrasting the two.
+- "See also" — 2–4 links. Valid: how-it-works.md, config-and-build.md, ../commands/build.md, ../samples/convnext-primitives.md, ../samples/bert-config-build.md.
+
+PAGE 2 — concepts/config-and-build.md
+Title: # Config and build
+Theme: Producer/consumer pair. winml config generates a WinMLBuildConfig JSON; winml build consumes it. Teach the reproducibility angle (version configs, share across CI, replay later), and the override semantics (CLI flags can override config values).
+
+Required H2 sections:
+- "Generating a config" — short prose about winml config, --task, --no-quant/--no-compile, --trust-remote-code. No full flag table.
+- "Consuming a config" — winml build -c <file>.json --output-dir or --use-cache (exactly one of them). The build runs the stages defined in the config.
+- "Overrides at run time" — flags like --no-quant, --no-compile, --no-optimize on winml build override the corresponding config sections without editing the file. Useful for ad-hoc skips.
+- "Why version a config" — three concrete reasons: reproducibility, CI, sharing.
+- "See also" — 2–4 links. Valid: buildconfig.md, primitives-and-pipeline.md, ../commands/config.md, ../commands/build.md.
+```
+
+#### Page blocks — Agent C2
+
+```
+PAGE 1 — concepts/load-and-export.md
+Title: # Load and export
+Theme: The first conceptual stage of the pipeline — bring a model into memory (from Hugging Face Hub or a local checkpoint), then transform it to ONNX. Teach the load step (the loader module in src/winml/modelkit/loader/) and the export step (the winml export command).
+
+NOTE: "load" is not a CLI verb. The loader is internal. Pair this page is "stage 1 load" + "stage 1 export"; both are part of getting a model into ONNX form.
+
+Required H2 sections:
+- "Loading a model" — winml-cli loads from HF Hub (with cache at ~/.cache/huggingface) or from a local PyTorch checkpoint. winml inspect is the user-facing way to check the loader picked it up correctly. Trust remote code with --trust-remote-code.
+- "Exporting to ONNX" — winml export converts the loaded model to ONNX. Mentions hierarchy preservation (see hierarchy-and-metadata.md), the --no-hierarchy / --clean-onnx flag, and --dynamo for an alternative export backend.
+- "Where it goes wrong" — task mismatch (use --task), shape issues (use --shape-config or --input-specs), custom modules (use --torch-module).
+- "See also" — 2–4 links. Valid: hierarchy-and-metadata.md, graphs-and-ir.md, ../commands/inspect.md, ../commands/export.md.
+
+PAGE 2 — concepts/analyze-and-optimize.md
+Title: # Analyze and optimize
+Theme: Two graph-quality commands that work together. winml analyze checks EP compatibility and reports issues; winml optimize applies fusions and rewrites. They share --optim-config and often run together via winml build's analyzer/optimizer loop.
+
+Required H2 sections:
+- "What analyze does" — runs operator coverage, shape inference, and runtime checks against a target EP; outputs a report. Reference the --format choices.
+- "What optimize does" — applies graph fusions (GELU, LayerNorm, MatMul+Add) and pattern rewrites. References --list-capabilities and the --enable-X / --disable-X dynamic flags. Briefly mention --list-rewrites for the pattern-rewrite family.
+- "The analyzer/optimizer loop" — winml build runs analyze → optimize → analyze → optimize up to --max-optim-iterations times to converge. Mention --no-analyze for deterministic single-pass builds.
+- "See also" — 2–4 links. Valid: compile-and-epcontext.md, primitives-and-pipeline.md, ../commands/analyze.md, ../commands/optimize.md.
+```
+
+#### Page blocks — Agent C3
+
+```
+PAGE 1 — concepts/compile-and-epcontext.md
+Title: # Compile and EPContext
+Theme: What winml compile actually produces. Some EPs (especially QNN) bake a binary blob — the EP context — into the ONNX file at compile time. Compiled models load faster at runtime because the EP-specific setup is pre-computed.
+
+Required H2 sections:
+- "What compilation produces" — for ORT-compatible EPs the compile step writes an ONNX file that the runtime can load directly; for QNN the file embeds a binary EPContext blob.
+- "Embedded vs external EPContext" — winml compile --embed controls whether the QNN context is inlined into the .onnx or stored as a sidecar binary. Trade-offs: inline = one file but bigger; sidecar = smaller .onnx but two files.
+- "Why pre-compile" — runtime cold-start cost. The first inference on a fresh model loads + JIT-compiles; a pre-compiled model loads ready-to-run.
+- "Skipping validation" — --no-validate exists for fast iteration; explain when not to use it (production builds).
+- "See also" — 2–4 links. Valid: eps-and-devices.md, analyze-and-optimize.md, ../commands/compile.md, ../commands/build.md.
+
+PAGE 2 — concepts/perf-and-monitoring.md
+Title: # Perf and monitoring
+Theme: winml perf measures latency/throughput. The --monitor flag adds a live hardware utilization chart (NPU primarily); --op-tracing produces per-operator timing breakdowns. Together they let you see both end-to-end numbers and where the time goes.
+
+Required H2 sections:
+- "What perf measures" — iterations, warmup, batch size; the output is latency p50/p90/mean and throughput. Mention --device for the EP target.
+- "Live monitoring" — --monitor opens a terminal chart of NPU utilization while the benchmark runs. Useful for confirming the workload actually hit the NPU.
+- "Per-operator tracing" — --op-tracing basic|detail produces breakdowns. Useful for finding hot ops.
+- "Per-module benchmarking" — --module <substring> benchmarks just one HF/PyTorch module from the hierarchy (links to hierarchy-and-metadata.md).
+- "See also" — 2–4 links. Valid: hierarchy-and-metadata.md, eval-and-datasets.md, ../commands/perf.md.
+```
+
+#### Page blocks — Agent C4
+
+```
+PAGE 1 — concepts/eval-and-datasets.md
+Title: # Eval and datasets
+Theme: winml eval measures accuracy, not speed. It needs a dataset (typically from Hugging Face) and a way to bind dataset columns to model inputs/outputs. Teach when to use eval (always after quantization), how to point it at a dataset, and the column-mapping pattern.
+
+Required H2 sections:
+- "What eval reports" — the metric depends on the task (accuracy for classification, mAP for detection, etc.). Output is a JSON with per-metric numbers; --format controls the form.
+- "Picking a dataset" — --dataset accepts a Hugging Face dataset path; --dataset-name picks a config; --split selects which split (validation by default); --samples caps the count for quick checks. Note --streaming for large datasets.
+- "Column mapping" — --column key=value to bind dataset columns to model inputs; --label-mapping for label index translation.
+- "Why eval after quantization" — quantization is lossy; the only way to know you didn't break the model is to check accuracy. Link to quantization.md.
+- "See also" — 2–4 links. Valid: quantization.md, perf-and-monitoring.md, ../commands/eval.md.
+```
+
+- [ ] **Step 2: Verify strict build**
+
+```bash
+uv run mkdocs build --strict 2>&1 | tail -3
+```
+
+Expected: exit 0, no WARNING lines.
+
+- [ ] **Step 3: Commit (WinML CLI batch)**
+
+```bash
+git add docs/concepts/
+git commit -m "docs(concepts/winml-cli): author 7 workflow-concept pages
+
+Each page covers a winml-cli workflow pair, explaining the WHY and
+WHEN of using the commands together. Pages: primitives-and-pipeline,
+config-and-build, load-and-export, analyze-and-optimize,
+compile-and-epcontext, perf-and-monitoring, eval-and-datasets.
+
+No flag tables (those live on the per-command reference pages).
+Every flag and symbol verified against src/winml/modelkit/."
+```
+
+---
+
+## Task 4: Tutorials authoring (Batch D)
+
+**Files (full content authoring):**
+- `docs/tutorials/index.md` — short overview (~150 words)
+- `docs/tutorials/npu-convnext.md` — the long-form tutorial (1500–2500 words)
+
+### Why a single agent owns the tutorial
+
+The ConvNeXt-on-NPU tutorial is one long page where prose voice and step transitions matter. A single agent produces more consistent voice than splitting it.
+
+- [ ] **Step 1: Dispatch 1 agent for the tutorial + 1 agent for the index**
+
+Single message, 2 parallel agents (different files, no conflict).
+
+#### Agent D1 — Author `tutorials/npu-convnext.md`
+
+```
+You are authoring the flagship tutorial for the winml-cli docs site. Output: overwrite C:\Users\zhengte\BYOM\ModelKits\mvp\docs\tutorials\npu-convnext.md. DO NOT commit.
+
+Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
+
+Title: # ConvNeXt on NPU
+Model: facebook/convnext-tiny-224
+Length: 1500–2500 words of prose (excluding code blocks).
+Tone: classroom-style, prescriptive, every step has an explicit "what just happened" callout. Source: adapted from internal WinHECLab lab (saved at temp/winheclab-readme.md as background reference).
+
+Required structure:
+
+# ConvNeXt on NPU
+
+[Lead — 2–3 paragraphs:
+- Goal: take facebook/convnext-tiny-224 from Hugging Face to a benchmark-ready compiled model running on NPU.
+- Primary hardware: Copilot+PC with Snapdragon X-class NPU (or comparable). Explicit CPU/DirectML fallback documented throughout.
+- Two sections: Section A builds the model using primitive commands (so you understand each stage); Section B does the same thing with `winml build` (so you see the wrapper).]
+
+## Prerequisites
+
+- Windows 11 24H2 (required for NPU support)
+- Copilot+PC with NPU (40+ TOPS recommended; CPU/DirectML works as fallback)
+- Python 3.10, uv installed
+- winml-cli installed (see [Installation](../getting-started/installation.md))
+- For NPU: QNN SDK (set QNN_SDK_ROOT env var) or OpenVINO
+
+## Section A — Primitive commands
+
+### Step 1: Inspect the model
+
+[bash block: uv run winml inspect -m facebook/convnext-tiny-224]
+[text block: short abbreviated expected output]
+[!!! note "What we just did" — explains: confirmed task detection, model class, exporter compatibility before transformation.]
+
+### Step 2: Generate a build config
+
+[bash block: uv run winml config -m facebook/convnext-tiny-224 -o convnext_config.json]
+[!!! note callout: this is optional for primitives but useful for versioning.]
+
+### Step 3: Export to ONNX
+
+[bash block: uv run winml export -m facebook/convnext-tiny-224 -o convnext.onnx]
+[Link to ../concepts/hierarchy-and-metadata.md re: what hierarchy preservation adds.]
+
+### Step 4: Analyze for EP compatibility
+
+[bash block: uv run winml analyze -m convnext.onnx --ep qnn]
+(Show that analyze reports operator coverage and any flagged issues.)
+
+### Step 5: Optimize the graph
+
+[bash block: uv run winml optimize -m convnext.onnx -o convnext_optim.onnx]
+
+### Step 6: Quantize
+
+[bash block: uv run winml quantize -m convnext_optim.onnx -o convnext_int8.onnx --precision int8 --samples 32]
+[Link to ../concepts/quantization.md.]
+
+### Step 7: Compile for the target EP
+
+Use pymdownx.tabbed for QNN vs OpenVINO:
+
+=== "QNN (Snapdragon NPU)"
+
+    ```bash
+    # Requires QNN_SDK_ROOT env var set
+    uv run winml compile -m convnext_int8.onnx -o convnext_qnn.onnx --device npu
+    ```
+
+=== "OpenVINO (Intel CPU/GPU/NPU)"
+
+    ```bash
+    uv run winml compile -m convnext_int8.onnx -o convnext_ov.onnx --device npu --ep openvino
+    ```
+
+=== "CPU fallback"
+
+    ```bash
+    uv run winml compile -m convnext_int8.onnx -o convnext_cpu.onnx --device cpu
+    ```
+
+[Link to ../concepts/compile-and-epcontext.md.]
+
+### Step 8: Benchmark
+
+Tabbed by EP:
+
+=== "QNN NPU"
+
+    ```bash
+    uv run winml perf -m convnext_qnn.onnx --device npu --iterations 50 --monitor
+    ```
+
+=== "OpenVINO NPU"
+
+    ```bash
+    uv run winml perf -m convnext_ov.onnx --device npu --ep openvino --iterations 50 --monitor
+    ```
+
+=== "CPU"
+
+    ```bash
+    uv run winml perf -m convnext_cpu.onnx --device cpu --iterations 50
+    ```
+
+[text block: a short example latency/throughput snippet.]
+
+### Step 9 (optional): Evaluate accuracy
+
+[bash block: uv run winml eval -m convnext_int8.onnx --dataset imagenet-1k --split validation --samples 100 --device npu]
+[Link to ../concepts/eval-and-datasets.md.]
+
+## Section B — One-shot with `winml build`
+
+```bash
+uv run winml build -c convnext_config.json --output-dir convnext_out/
+```
+
+[Brief prose: this single command runs export → optimize → quantize → compile and produces the same final artifact. Use --no-quant / --no-compile / --no-optimize to skip stages.]
+
+[Show a benchmark step at the end using the artifact from convnext_out/.]
+
+## Where to go next
+
+- [Concepts → How winml-cli works](../concepts/how-it-works.md)
+- [Concepts → Compile and EPContext](../concepts/compile-and-epcontext.md)
+- [Samples → ConvNeXt primitives walkthrough](../samples/convnext-primitives.md) (the CPU/GPU/NPU device comparison version of this material)
+- [Commands → Overview](../commands/overview.md)
+
+## See also
+
+(2–4 relative links — pick the most relevant from above.)
+
+Rules:
+- Use winml-cli (never ModelKit/wmk).
+- Every flag and command must exist in src/winml/modelkit/. Verify by running uv run winml <command> --help.
+- For unverifiable claims (e.g. --device value names), DOUBLE-CHECK against source.
+- Use pymdownx.tabbed syntax verbatim: `=== "Label"` then blank line then 4-space-indented code block.
+- Output snippets use ```text and stay short (5–10 lines).
+- No "TBD", no placeholders.
+- Adapt the WinHECLab content but rewrite in our voice (drop "Step N" classroom numbering for primary headings; keep step numbering inside Section A only).
+- DO NOT reference Visual Studio, Windows App SDK, C#, or any GUI app — Python/CLI only.
+
+Verify: uv run mkdocs build --strict 2>&1 | tail -3
+
+Return: status, total word count (prose only, exclude code blocks), build output last 3 lines, and confirmation that tabbed blocks rendered (mkdocs --strict accepts them).
+```
+
+#### Agent D2 — Author `tutorials/index.md`
+
+```
+You are authoring the Tutorials chapter overview page. Output: overwrite C:\Users\zhengte\BYOM\ModelKits\mvp\docs\tutorials\index.md. DO NOT commit.
+
+Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
+
+Title: # Tutorials
+Length: 100–250 words.
+
+Required structure:
+
+# Tutorials
+
+[One paragraph framing: tutorials are linear, prescriptive, end-to-end walkthroughs. For lookup, use Concepts (the WHY/WHEN) or Commands (the WHAT). Tutorials sit alongside Samples (which are reference-style demos comparing options).]
+
+## Available tutorials
+
+| Tutorial | What you'll build | Hardware |
+|---|---|---|
+| [ConvNeXt on NPU](npu-convnext.md) | A quantized ConvNeXt image classifier compiled for Snapdragon NPU (with CPU/DirectML fallback) | Copilot+PC NPU primary; CPU works as fallback |
+
+[One short closing paragraph noting more tutorials coming.]
+
+Rules:
+- Use winml-cli (never ModelKit/wmk).
+- No "TBD", no placeholders.
+
+Verify: uv run mkdocs build --strict 2>&1 | tail -3
+
+Return: status, word count, build output last 3 lines.
+```
+
+- [ ] **Step 2: Verify strict build**
+
+```bash
+uv run mkdocs build --strict 2>&1 | tail -3
+```
+
+- [ ] **Step 3: Commit (Tutorials batch)**
+
+```bash
+git add docs/tutorials/
+git commit -m "docs(tutorials): add Tutorials chapter with ConvNeXt-on-NPU walkthrough
+
+- tutorials/index.md: chapter overview + tutorial table
+- tutorials/npu-convnext.md: end-to-end ConvNeXt build on NPU,
+  adapted from the internal WinHECLab lab. Primitives walkthrough
+  (Section A) covers each stage in turn; one-shot section (Section B)
+  shows the same result via winml build. QNN, OpenVINO, and CPU
+  paths shown via tabbed code blocks.
+
+Python/winml-cli only — Visual Studio / Windows App SDK / C# app
+content from the lab is deliberately out of scope for this iteration."
+```
+
+---
+
+## Task 5: Getting Started polish (Batch E)
+
+**Files (content edits to existing pages):**
+- `docs/getting-started/installation.md`
+- `docs/getting-started/quickstart.md`
+- `docs/getting-started/end-to-end.md`
+
+- [ ] **Step 1: Dispatch 3 parallel agents**
+
+Single message, 3 Agent tool calls, `model: sonnet`. Agents write only.
+
+#### Agent E1 — Edit `installation.md`
+
+```
+You are editing the winml-cli Installation page. File: C:\Users\zhengte\BYOM\ModelKits\mvp\docs\getting-started\installation.md. DO NOT commit.
+
+Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
+
+Goal:
+1. Rewrite the prerequisites table to be more specific about NPU requirements.
+2. Add a fallback callout for users without NPU hardware.
+
+Specific edits:
+- Replace the existing "Prerequisites" section with a table that includes:
+  - Windows 11 24H2 or later (required for NPU support)
+  - Copilot+PC with NPU (40+ TOPS NPU recommended for NPU acceleration; not required for CPU/DirectML)
+  - Python 3.10 (the project pins requires-python = ">=3.10,<3.11"; verify before stating)
+  - uv (link https://github.com/astral-sh/uv)
+  - git
+- After the prereqs table, add a !!! note "No NPU?" callout: explain that --device auto falls back to CPU or DirectML, and the rest of the docs apply with minor flag differences.
+- Otherwise keep the page (Install, Verify, Optional extras, Next steps sections all stay).
+- Verify the existing extras text matches pyproject.toml lines 79–82.
+
+Rules:
+- Use winml-cli (never ModelKit/wmk).
+- Keep page under 600 words.
+
+Verify: uv run mkdocs build --strict 2>&1 | tail -3
+
+Return: status, word count, build output last 3 lines.
+```
+
+#### Agent E2 — Edit `quickstart.md`
+
+```
+You are editing the winml-cli Quickstart page. File: C:\Users\zhengte\BYOM\ModelKits\mvp\docs\getting-started\quickstart.md. DO NOT commit.
+
+Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
+
+Goal: add winml sys --list-device --list-ep to the verify step. Otherwise leave the page alone.
+
+Specific edit:
+- Wherever the page currently shows `uv run winml sys` as the verify command (probably in a "Verify the install" or similar section), replace it with:
+
+  ```bash
+  uv run winml sys --list-device --list-ep
+  ```
+
+- Update the surrounding prose to mention that this enumerates available devices and execution providers (versus `winml sys` alone, which shows everything).
+- No other changes.
+
+Rules:
+- Use winml-cli (never ModelKit/wmk).
+- Keep page under 600 words.
+
+Verify: uv run mkdocs build --strict 2>&1 | tail -3
+
+Return: status, word count, build output last 3 lines.
+```
+
+#### Agent E3 — Edit `end-to-end.md`
+
+```
+You are editing the winml-cli End-to-End page. File: C:\Users\zhengte\BYOM\ModelKits\mvp\docs\getting-started\end-to-end.md. DO NOT commit.
+
+Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
+
+Goals:
+1. Add --monitor to the winml perf step.
+2. Add a short CPU-fallback section after the NPU section.
+3. Align prereqs callout with the updated installation.md.
+
+Specific edits:
+- Wherever the page shows `uv run winml perf ... --device npu`, add --monitor:
+
+  ```bash
+  uv run winml perf -m convnext_npu_out/<artifact>.onnx --device npu --iterations 50 --monitor
+  ```
+
+  Add a sentence: "The --monitor flag opens a live chart of NPU utilization while the benchmark runs — confirmation that the workload actually hit the NPU."
+- After the existing NPU perf step, add a new section:
+
+  ```
+  ## (Optional) CPU fallback
+
+  If you don't have NPU hardware, the same artifact runs on CPU via DirectML:
+
+  ```bash
+  uv run winml perf -m convnext_npu_out/<artifact>.onnx --device cpu --iterations 50
+  ```
+
+  Latency will be higher than NPU but the build pipeline is otherwise identical.
+  ```
+
+- In the prerequisites section, reference the updated installation page (link relative path: ../getting-started/installation.md is wrong from within getting-started/ — use installation.md).
+
+Rules:
+- Use winml-cli (never ModelKit/wmk).
+- Keep page under 1100 words.
+
+Verify: uv run mkdocs build --strict 2>&1 | tail -3
+
+Return: status, word count, build output last 3 lines.
+```
+
+- [ ] **Step 2: Verify strict build**
+
+```bash
+uv run mkdocs build --strict 2>&1 | tail -3
+```
+
+- [ ] **Step 3: Commit (Getting Started polish batch)**
+
+```bash
+git add docs/getting-started/
+git commit -m "docs(getting-started): polish prereqs, add NPU monitoring, document CPU fallback
+
+- installation.md: rewrite prereqs as a table (Windows 11 24H2,
+  Copilot+PC, Python 3.10, uv, git); add 'No NPU?' callout pointing
+  at --device auto and CPU/DirectML.
+- quickstart.md: verify step now uses 'winml sys --list-device
+  --list-ep' for a focused capability check.
+- end-to-end.md: add --monitor to the perf step and a short
+  CPU-fallback section after the NPU benchmark."
+```
+
+---
+
+## Task 6: Cross-link sweep (Batch F)
+
+**Files:** any docs file referencing the renamed `onnx-and-eps.md` or `hierarchy.md`.
+
+- [ ] **Step 1: Find broken references**
+
+```bash
+echo "=== References to old onnx-and-eps.md ==="
+grep -rn "onnx-and-eps\.md" docs/ 2>/dev/null | grep -v "docs/superpowers/"
+
+echo ""
+echo "=== References to old hierarchy.md (not hierarchy-and-metadata.md) ==="
+grep -rn "hierarchy\.md" docs/ 2>/dev/null | grep -v "hierarchy-and-metadata\.md" | grep -v "docs/superpowers/"
+```
+
+Expected: zero or a small handful of matches. If empty, skip to Step 3.
+
+- [ ] **Step 2: Fix any matches**
+
+For each match, edit the file replacing:
+- `onnx-and-eps.md` → `eps-and-devices.md`
+- `hierarchy.md` → `hierarchy-and-metadata.md`
+
+If many matches exist (≥3), use sed:
+
+```bash
+files_with_old_eps=$(grep -rl "onnx-and-eps\.md" docs/ | grep -v "docs/superpowers/")
+files_with_old_hier=$(grep -rl "hierarchy\.md" docs/ | grep -v "hierarchy-and-metadata\.md" | grep -v "docs/superpowers/")
+for f in $files_with_old_eps; do sed -i 's|onnx-and-eps\.md|eps-and-devices.md|g' "$f"; done
+for f in $files_with_old_hier; do sed -i 's|\bhierarchy\.md|hierarchy-and-metadata.md|g' "$f"; done
+```
+
+- [ ] **Step 3: Verify strict build (final)**
+
+```bash
+uv run mkdocs build --strict 2>&1 | tail -3
+```
+
+Expected: exit 0, no WARNING lines.
+
+- [ ] **Step 4: Commit (if any link fixes happened)**
+
+```bash
+git add docs/
+git commit -m "docs: fix inbound links to renamed Fundamentals pages
+
+Updates references to onnx-and-eps.md -> eps-and-devices.md and
+hierarchy.md -> hierarchy-and-metadata.md across the docs tree.
+Internal docs and the design/plan files under docs/superpowers/
+are not touched."
+```
+
+If Step 1 found no matches, skip the commit — no changes to record.
+
+- [ ] **Step 5: Final smoke check**
+
+```bash
+echo "=== Page count by chapter ===" && ls docs/getting-started/*.md docs/concepts/*.md docs/commands/*.md docs/samples/*.md docs/tutorials/*.md 2>&1 | wc -l
+
+echo "=== Final commit log on docs/v2 (vs docs/v1) ===" && git log --oneline docs/v1..HEAD
+
+echo "=== Working tree clean? ===" && git status --short
+```
+
+Expected: page count = 32 (3 + 14 + 13 + 3 + 2 - wait, recompute) — actually:
+- getting-started: 3
+- concepts: 14 (the 5 existing + 2 renamed-and-already-existing + 9 new = 14 — but two of those are renamed (eps-and-devices, hierarchy-and-metadata) so net file count after renames is still 14)
+- commands: 13
+- samples: 3
+- tutorials: 2
+
+Total: **35 markdown files** under those chapters. Plus index.md = 36 user-facing markdown files in the site (excluding stubs in reference/, troubleshooting.md, contributing.md).
+
+If page count is off, investigate; otherwise the v2 expansion is complete.
+
+---
+
+## Self-review notes
+
+- **Spec coverage:** Each section of `docs/superpowers/specs/2026-05-24-docs-expansion-v2-design.md` maps to a task. §4 IA → Task 1; §5.1 Getting Started → Task 5; §5.2 Tutorials → Task 4; §5.3 Concepts/Fundamentals → Task 2; §5.4 Concepts/WinML CLI → Task 3; §6 nav → Task 1; §7 file inventory → tasks 1–6; §8 implementation strategy → directly the 6 batches; §9 acceptance criteria → end of Task 6.
+- **Type/name consistency:** `winml-cli` used throughout; file paths use `concepts/`, `tutorials/`, `getting-started/`. Pair-page H1 titles match `mkdocs.yml` nav labels.
+- **No placeholders:** every step has actual content. Agent prompts are concrete and self-contained (no "see plan for details").
+- **Agent parallelism is explicit** at the start of each authoring task.
+- **One known acceptable hack:** in Task 2 Step 2, the dtype-content move is a surgical edit done by the orchestrator (not by an agent) because the edit is one paragraph or fewer.
diff --git a/docs/superpowers/specs/2026-05-20-modelkit-docs-site-design.md b/docs/superpowers/specs/2026-05-20-modelkit-docs-site-design.md
new file mode 100644
index 000000000..bffe1398f
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-20-modelkit-docs-site-design.md
@@ -0,0 +1,239 @@
+# ModelKit User-Facing Documentation Site — Design
+
+> **Date:** 2026-05-20
+> **Branch:** `docs/init` (based on `feat/mvp`)
+> **Status:** Design approved; ready for implementation plan.
+
+## 1. Goal
+
+Create a user-facing documentation site for ModelKit (the Python toolkit fronted by the `winml` CLI) targeted at external open-source users discovering the project on GitHub. The site must support markdown authoring, code-block-friendly rendering, mermaid diagrams, and optional Jupyter notebook embedding.
+
+## 2. Audience and scope
+
+- **Primary audience:** External OSS users (developers exporting/quantizing/compiling models for Windows ML deployment). No insider jargon; clear install-to-first-success path required.
+- **Out of scope:** Internal-only sections, MS-internal access controls.
+- **MVP scope:** Full content for the first four chapters (Getting Started, Concepts, Commands, Samples). Reference / Troubleshooting / Contributing exist as nav stubs only and are tracked as P2.
+
+## 3. Framework decision
+
+**MkDocs Material**, hosted on GitHub Pages, sources in `docs/`.
+
+| Considered | Outcome |
+|---|---|
+| **MkDocs Material** | Chosen. Python-native, single `uv add --dev mkdocs-material`, first-class mermaid, code-block tabs, instant search, dark mode. Matches existing toolchain. |
+| Sphinx + MyST + Furo | Rejected for MVP. Heavier config; autodoc not needed for a CLI tool. Revisit if we add a library API surface. |
+| Docusaurus | Rejected. Adds Node ecosystem to a Python repo; MDX features unused. |
+| GitHub Wiki | Rejected. No PR review, no code-search integration, weaker mermaid/notebook support. |
+
+**Notebook integration:** `mkdocs-jupyter` plugin, treated as nice-to-have. No notebooks required in MVP; plugin is installed so future samples can drop in `.ipynb` files.
+
+## 4. Hosting and deploy
+
+- Site lives in-repo under `docs/` (alongside existing internal docs, which remain untouched and excluded from the nav).
+- Built by GitHub Actions, published to the `gh-pages` branch, served by GitHub Pages.
+- **Deploy is held off for now:** the CI workflow is written but configured to require manual `workflow_dispatch`. No automatic pushes to remote during this MVP. All commits stay local on `docs/init` until the user decides to publish.
+
+## 5. Information architecture
+
+```
+ModelKit Docs
+├── Home (landing)
+│
+├── 1. Getting Started
+│   ├── Installation
+│   ├── Quickstart (5-min export)
+│   └── End-to-End: HF → NPU (15-min walkthrough)
+│
+├── 2. Concepts
+│   ├── How ModelKit Works (pipeline diagram)
+│   ├── ONNX & Execution Providers
+│   ├── Quantization & QDQ
+│   ├── Hierarchy Preservation
+│   └── BuildConfig & Kits
+│
+├── 3. Commands
+│   ├── Overview (12-command map, decision table)
+│   ├── Discover  → sys, inspect, hub, analyze
+│   ├── Configure → config, optimize
+│   ├── Build     → export, quantize, compile, build
+│   └── Measure   → perf, eval
+│
+├── 4. Samples
+│   ├── ConvNeXt — primitives walkthrough (all EPs, quantized)
+│   ├── BERT — config + build + perf (workflow focus)
+│   └── Qwen3 — Composite Models (placeholder, "coming soon")
+│
+├── 5. Reference          (P2 — nav stub only)
+├── 6. Troubleshooting    (P2 — nav stub only)
+└── 7. Contributing       (P2 — nav stub only)
+```
+
+### 5.1 Grouping rationale
+
+- Commands grouped by **user intent** (discover / configure / build / measure), not alphabetical — matches how a user actually progresses.
+- Concepts placed **before** Commands so users have a mental model before reading flag tables.
+- Existing `docs/design/`, `docs/naming-convention.md`, `docs/pytest-best-practices.md` stay where they are; they remain contributor-facing and are linked from Contributing (P2).
+
+## 6. Per-page outlines
+
+### 6.1 Getting Started
+
+- **Installation** — Prereqs (Win 10/11, Python 3.10, `uv`), `git clone` + `uv sync`, verify with `winml sys`.
+- **Quickstart** — 5-minute path: pick any HF classifier, run `winml export`, view the `.onnx`, run `winml inspect`. No EPs, no quantization — proves the install.
+- **End-to-End: HF → NPU** — 15-minute walkthrough: ConvNeXt + `winml build` with QNN, see artifacts, run `winml perf` against NPU. Sets the stage for the Samples chapter.
+
+### 6.2 Concepts
+
+- **How ModelKit Works** — Mermaid pipeline diagram (PyTorch → ONNX → QDQ → EP-compiled). One paragraph per stage with deep-links to its command page.
+- **ONNX & Execution Providers** — What ONNX is, what an EP is, EPs ModelKit supports (QNN, OpenVINO, DML, CPU/GPU), hardware mapping table.
+- **Quantization & QDQ** — Why quantize, INT8/INT16/FP16, calibration vs. static, QDQ node insertion, lossy trade-offs.
+- **Hierarchy Preservation** — Why ONNX needs PyTorch module info, how ModelKit embeds it as metadata, what it enables downstream (per-module benchmarking, targeted optimization).
+- **BuildConfig & Kits** — The unified config object, precision policies, per-task templates, where configs live (`MODEL_BUILD_CONFIGS`).
+
+### 6.3 Commands
+
+**Page template** (applied to all 12 command pages — sections kept as headings even if initially sparse; content filled in incrementally):
+
+```
+# winml <command>
+> one-line tagline
+
+## When to use this
+[1–2 sentences: user intent, place in pipeline]
+
+## Synopsis
+$ winml <command> [options]
+
+## Flags
+[Table: Flag | Short | Type | Default | Description; shared flags collapsed]
+
+## How it works
+[2–3 sentences; optional mermaid diagram for non-trivial commands]
+
+## Examples
+[3–5 progressively richer examples with expected output snippets]
+
+## Common pitfalls
+[Bullet list of gotchas]
+
+## See also
+[Links to related commands and concept pages]
+```
+
+The **Overview** sub-page contains the 12-command map (grouped) and a "which command for which task" decision table.
+
+The 12 command pages: `sys`, `inspect`, `hub`, `analyze`, `config`, `optimize`, `export`, `quantize`, `compile`, `build`, `perf`, `eval`.
+
+### 6.4 Samples
+
+Each sample has a distinct teaching purpose — together they form an abstraction ladder.
+
+- **ConvNeXt — primitives walkthrough**
+  - Style: invoke each command directly (`inspect` → `config` → `export` → `quantize` → `compile` → `perf` → `eval`).
+  - EP coverage: CPU, GPU, NPU. For each EP, document the flags that differ, expected outputs, and a "what we just did" callout per step.
+  - Goal: reader leaves understanding what each command does and how they compose.
+
+- **BERT — config + build + perf**
+  - Style: `winml config` to generate the BuildConfig, `winml build` to run the whole pipeline, `winml perf` on the artifact.
+  - EP focus de-emphasized — the page teaches the wrapper workflow, not the EP matrix.
+  - Goal: reader leaves understanding the production-style one-shot path and how config files become reusable.
+
+- **Qwen3 — Composite Models** (placeholder)
+  - Single page: 1-paragraph teaser, "coming soon" admonition, link to the in-progress feature branch.
+  - Goal: reserve the slot in the nav; signal where ModelKit is headed without blocking MVP on unmerged work.
+
+## 7. Reference handling (P2 — nav stubs in MVP)
+
+- **BuildConfig schema, hub catalog, EP/device matrix, precision options:** hand-written when the time comes (decided against autogeneration for MVP — maintenance burden traded against polish).
+- **Naming conventions:** existing `docs/naming-convention.md` will be linked from the Reference page when written.
+
+## 8. Repository layout
+
+```
+mvp/
+├── docs/
+│   ├── index.md                          ← landing
+│   ├── getting-started/
+│   │   ├── installation.md
+│   │   ├── quickstart.md
+│   │   └── end-to-end.md
+│   ├── concepts/
+│   │   ├── how-it-works.md
+│   │   ├── onnx-and-eps.md
+│   │   ├── quantization.md
+│   │   ├── hierarchy.md
+│   │   └── buildconfig.md
+│   ├── commands/
+│   │   ├── overview.md
+│   │   ├── sys.md
+│   │   ├── inspect.md
+│   │   ├── hub.md
+│   │   ├── analyze.md
+│   │   ├── config.md
+│   │   ├── optimize.md
+│   │   ├── export.md
+│   │   ├── quantize.md
+│   │   ├── compile.md
+│   │   ├── build.md
+│   │   ├── perf.md
+│   │   └── eval.md
+│   ├── samples/
+│   │   ├── convnext-primitives.md
+│   │   ├── bert-config-build.md
+│   │   └── qwen3-composite.md            ← placeholder
+│   ├── reference/                        ← P2 stubs
+│   ├── troubleshooting.md                ← P2 stub
+│   ├── contributing.md                   ← P2 stub
+│   │
+│   ├── design/                           ← UNCHANGED (internal)
+│   ├── naming-convention.md              ← UNCHANGED (internal)
+│   ├── pytest-best-practices.md          ← UNCHANGED (internal)
+│   └── superpowers/specs/                ← UNCHANGED (this file lives here)
+│
+├── mkdocs.yml                            ← new
+└── .github/workflows/docs.yml            ← new (manual dispatch only)
+```
+
+## 9. MkDocs configuration
+
+- **Theme:** `material` with palette toggle (light/dark), instant navigation, code-copy button, "Edit on GitHub" link per page.
+- **Plugins:** `search` (built-in), `mkdocs-jupyter` (notebooks; lazy install).
+- **Markdown extensions:** `pymdownx.superfences` (mermaid, tabbed code), `admonition`, `pymdownx.tabbed`, `pymdownx.details`, `pymdownx.tasklist`.
+- **Nav:** hand-written, mirroring section 5. Chapters 5-7 appear as stub pages in nav.
+- **Strict mode:** `mkdocs build --strict` to fail CI on broken links or missing nav entries.
+- **Excluded from nav:** `docs/design/`, `docs/superpowers/`, `docs/naming-convention.md`, `docs/pytest-best-practices.md` (they remain in the repo for contributors).
+
+## 10. CI workflow
+
+- **File:** `.github/workflows/docs.yml`.
+- **Triggers:** `workflow_dispatch` only (manual) until the user is ready to publish. No auto-trigger on `push` or `pull_request` for MVP.
+- **Steps:** checkout → install `uv` → `uv sync` → `uv run mkdocs build --strict` → `peaceiris/actions-gh-pages` deploy to `gh-pages`.
+- **Local equivalent:** `uv run mkdocs serve` for live preview during authoring.
+
+## 11. Implementation strategy (preview for the plan)
+
+The plan will batch work for parallel execution via subagents:
+
+- **Batch A — Site scaffold (sequential, foundation):** create `mkdocs.yml`, repo layout, landing page, nav stubs, CI workflow. Verify `mkdocs build --strict` succeeds with placeholder content.
+- **Batch B — Concepts pages (5 pages, parallel):** one subagent per concept page; each reads the relevant source module and drafts the page.
+- **Batch C — Command pages (12 command pages + 1 overview page, 4 parallel agents):** one agent per group (Discover / Configure / Build / Measure), each owning 3 commands; agents read source + `--help` output and draft pages using the section 6.3 template. The Commands → Overview page is authored after the 12 command pages settle (sequential), so its decision table reflects the real flag surfaces.
+- **Batch D — Sample pages (3 pages, parallel):** ConvNeXt agent runs the primitive command sequence end-to-end and captures real outputs; BERT agent runs `config + build + perf` and captures outputs; Qwen3 page is a static placeholder.
+- **Batch E — Getting Started (3 pages, sequential after Concepts and Commands):** authored last so it can cross-link to settled concept and command pages.
+
+Each batch ends with `mkdocs build --strict` to catch broken links before moving on.
+
+## 12. Open items / things explicitly punted
+
+- **Versioning:** Not added in MVP. `mike` plugin available if needed later.
+- **Search analytics, Algolia DocSearch:** Not in MVP; Material's built-in search is sufficient.
+- **API reference autogeneration:** Not in MVP. Reconsider if/when a stable library API emerges.
+- **i18n:** Not in MVP.
+
+## 13. Acceptance criteria
+
+- `uv run mkdocs serve` renders the site locally without errors.
+- `uv run mkdocs build --strict` succeeds (no broken links, no missing nav entries).
+- All chapters 1-4 have authored content; chapters 5-7 have stub pages.
+- Mermaid diagrams render in the "How it works" concept page.
+- Existing `docs/design/`, `docs/naming-convention.md`, `docs/pytest-best-practices.md` are unmodified and not in the user-facing nav.
+- All commits remain on local `docs/init`; nothing pushed to `origin`.
diff --git a/docs/superpowers/specs/2026-05-24-docs-expansion-v2-design.md b/docs/superpowers/specs/2026-05-24-docs-expansion-v2-design.md
new file mode 100644
index 000000000..c77e11ece
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-24-docs-expansion-v2-design.md
@@ -0,0 +1,263 @@
+# Docs Expansion v2 — Design
+
+> **Date:** 2026-05-24
+> **Branch:** `docs/v2` (based on `docs/v1`)
+> **Status:** Design approved verbally; ready for spec self-review and plan.
+
+## 1. Goal
+
+Expand the user-facing winml-cli docs site with: (a) a new **Tutorials** chapter seeded with a ConvNeXt-on-NPU walkthrough adapted from the internal WinHECLab lab, (b) a restructured **Concepts** chapter with two sub-groups (Fundamentals + WinML CLI) totaling 14 pages of pair-topic content, (c) targeted polish to the three existing Getting Started pages.
+
+## 2. Scope and non-goals
+
+### In scope
+
+- 3 Getting Started pages: targeted edits (prereqs alignment, new flag mentions, CPU/DirectML fallback).
+- 2 new Tutorial pages (chapter index + 1 ConvNeXt-on-NPU tutorial).
+- 14 Concepts pages: 5 renamed/touched, 9 newly authored. Sub-grouped into Fundamentals and WinML CLI.
+- `mkdocs.yml` nav restructure to expose Tutorials and the Concepts sub-groups.
+
+### Out of scope
+
+- The C# Windows App SDK demo app from WinHECLab Steps 9–19 (Python/winml-cli only this iteration).
+- Visual Studio / Windows App SDK prerequisites.
+- Hardware-specific lab paths (`C:\LabWinML\...`, `Start\`, `Final\`).
+- Pinned wheel/SDK versions (we use `>=` semantics).
+- Reference, Troubleshooting, Contributing chapters (still P2 stubs).
+- A second tutorial or further Concepts pages beyond the 14 listed.
+
+## 3. Source material
+
+- **WinHECLab README** (`we2-microsoft/WinHECLab`, fetched to `temp/winheclab-readme.md` for this design). External publish OK per design discussion.
+- **Existing winml-cli sources** at `src/winml/modelkit/` (canonical for any flag or behavior we describe).
+- **Existing docs** at `docs/getting-started/`, `docs/concepts/`, `docs/commands/`, `docs/samples/`.
+
+## 4. Information architecture changes
+
+### 4.1 New chapter: Tutorials
+
+A new top-level chapter between **Samples** and **Reference**:
+
+```
+- Samples
+- Tutorials              ← NEW
+    - Overview
+    - ConvNeXt on NPU
+- Reference
+```
+
+The chapter is the home for classroom-style, prescriptive, end-to-end walkthroughs. Distinct from **Samples** (which are reference-style, command-comparison demos) and from **Getting Started** (which is a short onboarding journey).
+
+### 4.2 Concepts restructure
+
+Concepts gets two sub-groups in the nav:
+
+```
+- Concepts
+    - Fundamentals
+        - How winml-cli works
+        - Models, graphs, and the ONNX IR
+        - Tensors and dtypes
+        - Execution Providers and devices
+        - Quantization and QDQ
+        - Hierarchy and ONNX metadata
+        - BuildConfig and kits
+    - WinML CLI
+        - Primitives and pipeline
+        - Config and build
+        - Load and export
+        - Analyze and optimize
+        - Compile and EPContext
+        - Perf and monitoring
+        - Eval and datasets
+```
+
+Every page uses the **pair-topic** framing — the H1 names two related concepts whose contrast or interplay structures the page.
+
+## 5. Per-page detail
+
+### 5.1 Getting Started — 3 pages, targeted edits
+
+#### `installation.md`
+- Rewrite prereqs table in lab style: Windows 11 24H2, Copilot+PC 40+ TOPS NPU (recommended for NPU acceleration), Python 3.10, uv, git. Drop the VS / App SDK lines (those were never in our installation anyway — confirming they stay out).
+- Add a one-paragraph **"No NPU? Use `--device auto`"** callout that explicitly names CPU and DirectML as the fallback.
+
+#### `quickstart.md`
+- Add `winml sys --list-device --list-ep` to the verify step.
+- No other changes — quickstart stays a 5-minute zero-to-export.
+
+#### `end-to-end.md`
+- Add `--monitor` to the `winml perf` step (live NPU utilization chart).
+- Add a short CPU-fallback section after the NPU section showing the same `winml perf` with `--device cpu`.
+- Align prereqs callout with the updated `installation.md`.
+- Model stays **ConvNeXt** (consistency with existing sample pairing).
+
+### 5.2 Tutorials — 2 new pages
+
+#### `tutorials/index.md` (Overview)
+- One paragraph framing: tutorials are linear, end-to-end, prescriptive; for lookup go to Concepts or Commands.
+- One-row table linking to the available tutorials.
+- ~150 words. Grows as more tutorials are added.
+
+#### `tutorials/npu-convnext.md` (ConvNeXt on NPU)
+- **Model:** `facebook/convnext-tiny-224`.
+- **Hardware:** Primary path is Copilot+PC NPU; explicit CPU/DirectML fallback documented throughout.
+- **Structure:**
+  1. **Prerequisites** — adopted from WinHECLab prereqs table.
+  2. **Section A — Primitives walkthrough**: `inspect → config → export → analyze → optimize → quantize → compile → perf`. EP-specific steps (`compile` and `perf`) use **`=== "QNN" / === "OpenVINO"` tabbed code blocks** so readers see both NPU backends inline.
+  3. **Section B — One-shot with `winml build`**: closing section showing the wrapper command produces the same artifact.
+  4. **(Optional) Eval** against an ImageNet sample using `winml eval`.
+  5. **Where to go next** — links to Concepts and Samples.
+- **Length target:** 1,500–2,500 words. This is the longest single page in the site.
+
+### 5.3 Concepts — Fundamentals (7 pages)
+
+Each page uses pair-topic framing. New = needs full authoring; touched = exists but renamed/expanded.
+
+| File | Status | Pair / focus |
+|---|---|---|
+| `concepts/how-it-works.md` | **touched** (rename in nav, content kept) | Pipeline overview, mermaid diagram |
+| `concepts/graphs-and-ir.md` | **new** | What is a model file, graph nodes/edges, opsets, ONNX as IR |
+| `concepts/tensors-and-dtypes.md` | **new** | Weights vs activations vs I/O tensors; fp32/fp16/int8/int16; static vs dynamic shapes |
+| `concepts/eps-and-devices.md` | **touched** (renamed from `onnx-and-eps.md`) | EP vs device, the EP matrix, when to use which |
+| `concepts/quantization.md` | **touched** (small content tightening; dtype family moves to Tensors page) | Why quantize, calibration, QDQ pattern |
+| `concepts/hierarchy-and-metadata.md` | **touched** (renamed from `hierarchy.md`, broadened) | `winml.hierarchy.tag` plus other metadata winml-cli writes |
+| `concepts/buildconfig.md` | **touched** (rename in nav, content kept) | WinMLBuildConfig structure, kits, MODEL_BUILD_CONFIGS |
+
+**Rename mapping:**
+- `onnx-and-eps.md` → `eps-and-devices.md`
+- `hierarchy.md` → `hierarchy-and-metadata.md`
+- The other three existing pages keep their file names; only the nav label changes.
+
+Any inbound links from other docs files (Commands, Samples, Getting Started, Tutorials) must be updated to the new file paths.
+
+### 5.4 Concepts — WinML CLI (7 new pages)
+
+All seven are new. Each is a workflow concept page (the **why** and **when**), not a command reference (the **what**). Cross-link to per-command pages under `docs/commands/`.
+
+| File | Pair / focus |
+|---|---|
+| `concepts/primitives-and-pipeline.md` | Staged commands (`export`, `quantize`, `compile`, …) vs the one-shot `winml build` wrapper. When to choose which. Opens the chapter. |
+| `concepts/config-and-build.md` | `winml config` produces a `WinMLBuildConfig`; `winml build` consumes it. The wrapper-flow pair — reproducibility, sharing configs across runs and CI, override flags vs config values. |
+| `concepts/load-and-export.md` | The "load model into memory, then transform it to ONNX" arc. Covers HF Hub loading, local PyTorch loading, the `winml inspect` pre-flight check, and `winml export` itself. (Note: `winml load` is not a CLI verb — "load" here is the conceptual stage in the loader module, paired with the `export` command that follows it.) |
+| `concepts/analyze-and-optimize.md` | Graph-quality commands. How analyze reports problems and how optimize applies fusions. Shared `--optim-config`. |
+| `concepts/compile-and-epcontext.md` | What `winml compile` produces. QNN EPContext binary blobs embedded in ONNX. Why compiled models load faster at runtime. |
+| `concepts/perf-and-monitoring.md` | `winml perf` plus `--monitor` (live NPU chart) and `--op-tracing`. When to use each. |
+| `concepts/eval-and-datasets.md` | `winml eval` plus dataset semantics (`--dataset`, `--split`, `--column`, `--label-mapping`). When eval matters. |
+
+**Length target per page:** 400–700 words of prose. Same shape as the existing Concepts pages.
+
+**Discipline:** workflow pages explain *why and when*; command pages document *what flags exist*. No flag-table duplication.
+
+## 6. `mkdocs.yml` nav changes
+
+Full updated nav structure:
+
+```yaml
+nav:
+  - Home: index.md
+  - Getting Started:
+      - Installation: getting-started/installation.md
+      - Quickstart: getting-started/quickstart.md
+      - End-to-End — HF → NPU: getting-started/end-to-end.md
+  - Concepts:
+      - Fundamentals:
+          - How winml-cli works: concepts/how-it-works.md
+          - Models, graphs, and the ONNX IR: concepts/graphs-and-ir.md
+          - Tensors and dtypes: concepts/tensors-and-dtypes.md
+          - Execution Providers and devices: concepts/eps-and-devices.md
+          - Quantization and QDQ: concepts/quantization.md
+          - Hierarchy and ONNX metadata: concepts/hierarchy-and-metadata.md
+          - BuildConfig and kits: concepts/buildconfig.md
+      - WinML CLI:
+          - Primitives and pipeline: concepts/primitives-and-pipeline.md
+          - Config and build: concepts/config-and-build.md
+          - Load and export: concepts/load-and-export.md
+          - Analyze and optimize: concepts/analyze-and-optimize.md
+          - Compile and EPContext: concepts/compile-and-epcontext.md
+          - Perf and monitoring: concepts/perf-and-monitoring.md
+          - Eval and datasets: concepts/eval-and-datasets.md
+  - Commands: (unchanged)
+  - Samples: (unchanged)
+  - Tutorials:
+      - Overview: tutorials/index.md
+      - ConvNeXt on NPU: tutorials/npu-convnext.md
+  - Reference: (unchanged P2 stub)
+  - Troubleshooting: (unchanged P2 stub)
+  - Contributing: (unchanged P2 stub)
+```
+
+## 7. File-system changes summary
+
+### New files (11)
+
+- `docs/tutorials/index.md`
+- `docs/tutorials/npu-convnext.md`
+- `docs/concepts/graphs-and-ir.md`
+- `docs/concepts/tensors-and-dtypes.md`
+- `docs/concepts/primitives-and-pipeline.md`
+- `docs/concepts/config-and-build.md`
+- `docs/concepts/load-and-export.md`
+- `docs/concepts/analyze-and-optimize.md`
+- `docs/concepts/compile-and-epcontext.md`
+- `docs/concepts/perf-and-monitoring.md`
+- `docs/concepts/eval-and-datasets.md`
+
+### Renamed files (2 — rename + content edit)
+
+- `docs/concepts/onnx-and-eps.md` → `docs/concepts/eps-and-devices.md` (small content reframe to match the new pair-topic title)
+- `docs/concepts/hierarchy.md` → `docs/concepts/hierarchy-and-metadata.md` (broaden content to cover other metadata winml-cli writes, not just `winml.hierarchy.tag`)
+
+### Modified files (5 — content edits, no rename)
+
+- `docs/getting-started/installation.md`
+- `docs/getting-started/quickstart.md`
+- `docs/getting-started/end-to-end.md`
+- `docs/concepts/quantization.md` (tightening — dtype content moves to the new Tensors page)
+- `mkdocs.yml` (full nav restructure to introduce the Concepts sub-groups and Tutorials chapter)
+
+### Inbound links to update
+
+Any reference to `onnx-and-eps.md` or `hierarchy.md` from other pages (Commands, Samples, Tutorials, Getting Started) must be updated to the new paths. Estimated 6–10 inbound links across the site (to be confirmed during implementation).
+
+## 8. Implementation strategy preview
+
+For the plan to formalize:
+
+- **Batch A — Scaffolding (sequential, foundation):** create stubs for all 11 new pages; rename the 2 renamed pages; update `mkdocs.yml` nav. Verify `mkdocs build --strict` passes with stubs. Single commit.
+- **Batch B — Concepts (Fundamentals) authoring (parallel, 4–5 agents):** new pages (`graphs-and-ir`, `tensors-and-dtypes`) authored in parallel with content-touch passes on the 5 existing pages.
+- **Batch C — Concepts (WinML CLI) authoring (parallel, 3–4 agents):** 7 new workflow pages, agents own 2 pages each (one agent owns 1).
+- **Batch D — Tutorials authoring (sequential, 1 agent):** the ConvNeXt-on-NPU tutorial. Single big page — best authored by one agent for consistency. Plus the small overview page.
+- **Batch E — Getting Started polish (parallel, 3 agents):** small edits to the 3 existing pages.
+- **Batch F — Cross-link fix-up (sequential):** sweep the rest of the docs site for inbound links to the renamed files and update them.
+
+Each batch ends with `uv run mkdocs build --strict` to catch broken links.
+
+## 9. Acceptance criteria
+
+- `uv run mkdocs build --strict` exits 0 with zero warnings on the final commit.
+- All 11 new pages exist and contain non-stub content of at least 300 words each (Tutorials index is exempt — it's a short overview).
+- All 2 renamed pages have been renamed at the filesystem level (not just nav).
+- No remaining inbound links reference the old paths `onnx-and-eps.md` or `hierarchy.md`.
+- Tutorial uses `facebook/convnext-tiny-224`, contains tabbed QNN/OpenVINO code blocks for EP-specific steps, contains both a primitives section and a one-shot `winml build` section.
+- Every flag mentioned in the new content is verified against `src/winml/modelkit/commands/` source (no invented flags).
+- Existing internal docs (`docs/design/`, `docs/naming-convention.md`, `docs/pytest-best-practices.md`, `docs/superpowers/`) are unmodified.
+- All commits remain on local `docs/v2` until publish.
+
+## 10. Risks and mitigations
+
+| Risk | Mitigation |
+|---|---|
+| 19 doc pages is a lot — author agents may drift from each other in tone | Provide every agent the same template and a short "voice guide" excerpt; require source-grounded claims; consider splitting Batch B into two waves so the first wave's voice anchors the second |
+| Inbound-link sweep is easy to miss | Dedicated final batch (F) with `grep` verification before commit |
+| `winml.hierarchy.tag` and other metadata details are real source claims | Each agent verifies via source path + line; reported in the agent's return summary |
+| Tutorial scope creep (toward classroom-style screenshots etc.) | Length cap (1,500–2,500 words); no screenshots in this iteration |
+| ConvNeXt + ConvNeXt overlap between `samples/convnext-primitives.md` and `tutorials/npu-convnext.md` | Sample focuses on **device comparison** (CPU/GPU/NPU); tutorial focuses on **NPU production path** (QNN vs OpenVINO). Different teaching purposes documented in each page's intro paragraph |
+
+## 11. Open items explicitly punted
+
+- A second tutorial (e.g. BERT-config-build on a fresh model). Available content-wise from WinHECLab but deferred to v3.
+- Screenshots and embedded outputs. Not in this iteration; can add later under `docs/tutorials/images/`.
+- Reference, Troubleshooting, Contributing chapter content. Still P2.
+- Versioning (mike plugin). Still P2 from the v1 spec.
+- Migration of internal `docs/design/` content into the public docs. Not in scope.
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
new file mode 100644
index 000000000..21b5c4c89
--- /dev/null
+++ b/docs/troubleshooting.md
@@ -0,0 +1,4 @@
+# Troubleshooting
+
+!!! note "Coming soon"
+    This page is part of the documentation MVP and will be authored shortly.
diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md
new file mode 100644
index 000000000..c5c6838d2
--- /dev/null
+++ b/docs/tutorials/index.md
@@ -0,0 +1,11 @@
+# Tutorials
+
+Tutorials are linear, prescriptive, end-to-end walkthroughs that guide you through building something concrete with `winml-cli`. Each tutorial moves in one direction—start to finish—so you can follow along without making decisions. If you need to understand the reasoning behind a feature, see the Concepts section (the why and when). If you need a quick reference for a specific command, see Commands (the what). Tutorials sit alongside Samples, which are reference-style demos that compare multiple approaches side by side rather than walking through a single path.
+
+## Available tutorials
+
+| Tutorial | What you'll build | Hardware |
+|---|---|---|
+| [ConvNeXt on NPU](npu-convnext.md) | A quantized ConvNeXt image classifier compiled for Snapdragon NPU (with CPU/DirectML fallback) | Copilot+PC NPU primary; CPU works as fallback |
+
+More tutorials are coming, covering additional model families, execution providers, and deployment scenarios. Check back as the `winml-cli` documentation expands.
diff --git a/docs/tutorials/npu-convnext.md b/docs/tutorials/npu-convnext.md
new file mode 100644
index 000000000..4359e239e
--- /dev/null
+++ b/docs/tutorials/npu-convnext.md
@@ -0,0 +1,281 @@
+# ConvNeXt on NPU
+
+!!! info "Pick the right ConvNeXt page"
+    Three pages use ConvNeXt as their vehicle, each with a different teaching purpose:
+
+    - **This tutorial** — the canonical deep-dive: full pipeline with both QNN and OpenVINO NPU backends, plus the `winml build` one-shot. Start here if you want to ship to NPU.
+    - **[ConvNeXt — Primitives Walkthrough](../samples/convnext-primitives.md)** — a CPU vs GPU vs NPU comparison using the primitive commands. Start here if you want to compare devices on the same model.
+    - **[End-to-End Tour](../getting-started/end-to-end.md)** — the short Getting Started introduction. Start here for a 15-minute taste.
+
+This tutorial walks you through the complete journey from a pretrained Hugging Face model — `facebook/convnext-tiny-224` — to a quantized, compiled artifact running on an NPU. By the end you will have benchmarked the model on your device and measured real inference latency. Nothing is skipped, and every command produces a file you can inspect or reuse.
+
+The primary hardware target is a Copilot+PC with a Snapdragon X-class NPU (40+ TOPS). If you do not have an NPU, every step works on CPU or DirectML as a fallback — the only thing that changes is the `--device` and `--ep` flags on the compile and perf commands. Those variations are shown explicitly in the tabbed blocks below.
+
+The tutorial is split into two sections. Section A runs through eight primitive commands — one per pipeline stage — so you understand what each stage does, what artifact it produces, and why it matters. Section B shows you that `winml build` runs the same pipeline in a single command once you have a config file. Most production workflows live in Section B; Section A is how you learn to trust it.
+
+---
+
+## Prerequisites
+
+- **Windows 11 24H2** — required for NPU stack support
+- **Copilot+PC with NPU** — 40+ TOPS recommended; CPU and DirectML work as fallback throughout
+- **Python 3.10** and **uv** installed (`pip install uv` or follow [astral.sh/uv](https://astral.sh/uv))
+- **winml-cli** installed — see [Installation](../getting-started/installation.md)
+- **For QNN (Snapdragon NPU):** QAIRT SDK installed and `QNN_SDK_ROOT` set to its root directory
+- **For OpenVINO (Intel CPU/GPU/NPU):** OpenVINO runtime installed and registered as an ONNX Runtime EP
+
+> No NPU? Set `--device cpu` wherever you see `--device npu` and drop `--monitor` from perf commands. Every other flag stays the same.
+
+---
+
+## Section A — Primitive commands
+
+Working through the primitive commands one at a time is the best way to understand what the `winml build` wrapper does under the hood. Each step accepts the output of the previous step as its input, so the chain is explicit and every intermediate artifact is available for inspection.
+
+### Step 1: Inspect the model
+
+Before downloading any weights, confirm that winml-cli knows how to handle `facebook/convnext-tiny-224`.
+
+```bash
+uv run winml inspect -m facebook/convnext-tiny-224
+```
+
+You should see output similar to the following:
+
+```text
+Model               facebook/convnext-tiny-224
+Task                image-classification
+Model class         ConvNextForImageClassification
+Exporter            optimum/onnx
+Input               pixel_values: float32 [1, 3, 224, 224]
+Output              logits: float32 [1, 1000]
+Support status      supported
+```
+
+!!! note "What we just did"
+    `winml inspect` queries the Hugging Face model card and winml-cli's internal registry without downloading weights. It confirms three things: the auto-detected task (`image-classification`), the model class that will be used for loading, and the exporter that will handle the ONNX conversion. If this command fails, stop here — something about the model is unsupported and proceeding would waste time. A successful inspect is the green light for every stage that follows.
+
+---
+
+### Step 2: Generate a build config
+
+Generate a `WinMLBuildConfig` JSON file for the model. For the primitive workflow this file is optional — you can drive each stage entirely through CLI flags — but generating it now gives you a versioned record of every auto-detected setting, and it is required for Section B.
+
+```bash
+uv run winml config -m facebook/convnext-tiny-224 --device npu --precision int8 -o convnext_config.json
+```
+
+Open `convnext_config.json` to see what was auto-detected: the task, I/O tensor shapes, quantization parameters, and the compile target. The `--device npu --precision int8` flags tell the config generator to pre-populate the quantization and compile sections for NPU deployment rather than leaving them at defaults.
+
+!!! note "What we just did"
+    `winml config` auto-resolves every setting that would otherwise require you to look up flags manually. The resulting JSON is the single source of truth for a reproducible build. You can commit it to version control, share it with teammates, edit a single field to try a different precision, and replay the exact same build on any machine. See [Concepts → Config and build](../concepts/config-and-build.md) for a deeper look at the config schema and how the stages interact.
+
+---
+
+### Step 3: Export to ONNX
+
+Download the pretrained weights and convert the PyTorch model to ONNX format.
+
+```bash
+uv run winml export -m facebook/convnext-tiny-224 -o convnext.onnx
+```
+
+This runs an eight-stage export pipeline: model preparation, input generation, hierarchy building, ONNX conversion, node tagging, tag injection, and metadata generation. The result is a standards-compliant ONNX file with winml-cli's Hierarchy-preserving Tags Protocol (HTP) metadata embedded in node `metadata_props`. That metadata is what lets downstream tools make architecture-aware optimization decisions without hardcoded model knowledge.
+
+!!! note "What we just did"
+    The default export embeds hierarchy tags — a tree of source module names mapped onto ONNX nodes — so that the optimizer and analyzer can reason about the graph in terms of the original model structure rather than flat node lists. If you need a clean ONNX without that metadata (for compatibility with other tools), add `--clean-onnx`. See [Concepts → Load and export](../concepts/load-and-export.md) for what hierarchy preservation adds and when it matters.
+
+---
+
+### Step 4: Analyze for EP compatibility
+
+Before spending time on optimization and quantization, check that the model's operators are supported by your target execution provider.
+
+```bash
+uv run winml analyze -m convnext.onnx --ep qnn --device npu
+```
+
+The analyzer performs static analysis — no runtime required — and classifies every operator in the graph as **supported**, **partial**, or **unsupported** for the target EP. It reports a coverage summary, flags any operators that may fall back to CPU, and exits with code 0 for full support or 1 for partial support.
+
+For CPU fallback, run:
+
+```bash
+uv run winml analyze -m convnext.onnx --ep cpu --device cpu
+```
+
+!!! note "What we just did"
+    Knowing your operator coverage before you quantize or compile saves you from discovering EP incompatibilities at the very last step of a long pipeline. ConvNeXt's operators (Conv, GELU, LayerNorm, Add) have broad support across QNN and OpenVINO, so this command should exit 0. If it exits 1, the output tells you which operators are problematic and includes recommendations for resolving them — typically by enabling a graph rewrite in the optimizer that fuses the unsupported pattern into a supported one. See [Concepts → Analyze and optimize](../concepts/analyze-and-optimize.md) for details on the analyzer's recommendation engine.
+
+---
+
+### Step 5: Optimize the graph
+
+Apply graph-level optimizations: operator fusion, constant folding, shape inference, and EP-specific graph rewrites.
+
+```bash
+uv run winml optimize -m convnext.onnx -o convnext_optim.onnx
+```
+
+The optimizer reports how many nodes it reduced. A typical ConvNeXt-tiny optimization fuses several element-wise sequences and removes redundant reshape operations, cutting the node count noticeably without changing model semantics. If you want to apply a specific preset suited to the Snapdragon NPU, add `--preset qnn-compatible` to disable fusions that QNN does not benefit from.
+
+!!! note "What we just did"
+    Graph optimization is a separate stage from quantization so that you can inspect the intermediate graph, compare node counts, and selectively enable or disable individual fusion passes using the `--enable-*` / `--disable-*` flags. Run `uv run winml optimize --list-capabilities` to see every registered optimization flag and its default state. Optimization always happens on the floating-point graph; quantization is applied after so that calibration statistics are computed on the already-fused topology.
+
+---
+
+### Step 6: Quantize
+
+Insert QDQ (Quantize-Dequantize) nodes into the optimized graph using static calibration. This reduces model size and speeds up inference on hardware with integer execution units, which includes Snapdragon NPUs and Intel NPUs.
+
+```bash
+uv run winml quantize -m convnext_optim.onnx -o convnext_int8.onnx --precision int8 --samples 32
+```
+
+The quantizer generates 32 random calibration samples, runs them through the model to collect activation statistics, and uses those statistics (with the default `minmax` method) to set the quantization scale and zero-point for each tensor. Thirty-two samples is sufficient for a vision model with fixed-size inputs like ConvNeXt. For models with variable-length inputs or complex activation distributions, increase `--samples` to 64 or 128.
+
+!!! note "What we just did"
+    `--precision int8` sets both weights and activations to 8-bit integers, which is the precision most NPU compilers expect. The output model still contains standard `QuantizeLinear` and `DequantizeLinear` ONNX nodes, so it is portable and can run on any ONNX Runtime backend — you do not need special tooling to inspect it. See [Concepts → Quantization and QDQ](../concepts/quantization.md) for a detailed explanation of the QDQ node pattern, calibration methods, and how to choose between per-tensor and per-channel quantization.
+
+---
+
+### Step 7: Compile for the target EP
+
+Compilation converts the portable quantized ONNX into an EP-specific binary format that the execution provider can load directly, skipping JIT compilation at inference time. This is the step that produces a device-locked artifact — the output is tied to the specific EP and, for QNN, to the QNN SDK version.
+
+=== "QNN (Snapdragon NPU)"
+
+    ```bash
+    # Requires QNN_SDK_ROOT env var set to your QAIRT SDK root
+    uv run winml compile -m convnext_int8.onnx --device npu
+    ```
+
+=== "OpenVINO (Intel CPU/GPU/NPU)"
+
+    ```bash
+    uv run winml compile -m convnext_int8.onnx --device npu --ep openvino
+    ```
+
+=== "CPU fallback"
+
+    ```bash
+    uv run winml compile -m convnext_int8.onnx --device cpu
+    ```
+
+The compiled output file appears in the same directory as the input model. For QNN, the file name follows the pattern `convnext_int8_qnn_ctx.onnx` and an accompanying `.bin` context binary is written alongside it. For OpenVINO, the compiled artifact is named `convnext_int8_openvino_ctx.onnx`. For CPU, the output is `convnext_int8_cpu_ctx.onnx`.
+
+!!! note "What we just did"
+    Compilation embeds EP context — the compiled binary — inside or alongside the ONNX file using the `EPContext` node convention. At inference time the runtime loads the pre-compiled binary directly rather than re-compiling from the ONNX graph, eliminating the 15–60 second JIT penalty on first load. winml-cli locates the QAIRT SDK libraries needed for QNN compilation through `QNN_SDK_ROOT` (set as an environment variable, or passed with `--qnn-sdk-root` on `winml compile`). `winml build` reads only the env var. See [Concepts → Compile and EPContext](../concepts/compile-and-epcontext.md) for the full picture of what gets embedded and how the context is consumed at runtime.
+
+---
+
+### Step 8: Benchmark
+
+Measure inference latency and throughput with the `--monitor` flag to see live NPU utilization alongside the timing numbers.
+
+=== "QNN NPU"
+
+    ```bash
+    uv run winml perf -m convnext_int8_qnn_ctx.onnx --device npu --iterations 50 --monitor
+    ```
+
+=== "OpenVINO NPU"
+
+    ```bash
+    uv run winml perf -m convnext_int8_openvino_ctx.onnx --device npu --ep openvino --iterations 50 --monitor
+    ```
+
+=== "CPU"
+
+    ```bash
+    uv run winml perf -m convnext_int8_cpu_ctx.onnx --device cpu --iterations 50
+    ```
+
+A representative run on a Snapdragon X Elite NPU produces output like the following:
+
+```text
+Device:       npu
+Task:         image-classification
+Iterations:   50 (+ 10 warmup)
+Batch Size:   1
+
+Latency (ms)
+  Avg    P50    P90    P95    P99    Min    Max    Std
+  2.14   2.11   2.31   2.38   2.59   1.98   2.71   0.14
+
+Throughput:  467.29 samples/sec
+
+Hardware (during benchmark)
+  NPU: 72.4% avg, 89.1% peak  |  CPU: 3.2% avg
+  Sys Mem: 1842 MB  |  Device Mem: 48/12 MB (local/shared)
+```
+
+The CPU fallback (same model, `--device cpu`) will typically show latencies 8–15x higher and near-zero NPU utilization. The contrast between those two runs is the best proof that your NPU path is actually being used.
+
+!!! note "What we just did"
+    `winml perf` generates random inputs matching the model's I/O spec, runs the configured number of warmup iterations (excluded from statistics), then the benchmark iterations, and reports full latency percentiles alongside throughput. The `--monitor` flag activates live hardware utilization polling at 200 ms intervals, displaying an in-terminal chart and attaching the hardware metrics to the JSON report saved alongside the console output. See [Concepts → Perf and monitoring](../concepts/perf-and-monitoring.md) for how to interpret the utilization numbers and what `hw_monitor` fields look like in the JSON report.
+
+---
+
+### Step 9 (optional): Evaluate accuracy
+
+After quantization it is good practice to verify that INT8 accuracy is close to the FP32 baseline. The `winml eval` command runs the model against a held-out dataset slice and reports task-relevant metrics.
+
+```bash
+uv run winml eval -m convnext_int8.onnx --model-id facebook/convnext-tiny-224 --dataset imagenet-1k --split validation --samples 100 --device npu
+```
+
+The `--model-id` flag is required when passing an ONNX file, because the evaluator needs it to locate the preprocessor and label mappings. The command downloads 100 shuffled validation samples, runs inference, and reports top-1 and top-5 accuracy. A well-quantized ConvNeXt-tiny should lose less than 0.5 percentage points of top-1 accuracy compared to the floating-point checkpoint.
+
+!!! note "What we just did"
+    Accuracy evaluation gives you a principled stopping criterion for quantization decisions. If the accuracy drop is larger than acceptable, return to Step 6 and try `--precision int16` or per-channel quantization (`--per-channel`) instead of the default per-tensor int8. See [Concepts → Eval and datasets](../concepts/eval-and-datasets.md) for the full list of supported datasets, tasks, and column mapping options.
+
+---
+
+## Section B — One-shot with `winml build`
+
+Once you understand what each primitive stage does (which you now do), you can collapse the entire pipeline into a single command. `winml build` orchestrates export, optimize, quantize, and compile in sequence using the config file you generated in Step 2.
+
+```bash
+uv run winml build -c convnext_config.json -m facebook/convnext-tiny-224 -o convnext_out/
+```
+
+The command downloads the pretrained weights, runs all four pipeline stages, and writes every intermediate and final artifact into `convnext_out/`. The stage timing is printed as each stage completes, and the final line tells you the path of the compiled model.
+
+You can selectively skip stages using the override flags:
+
+- `--no-optimize` — skip graph optimization (rarely needed; useful if you have a pre-optimized ONNX)
+- `--no-quant` — skip quantization (produces a floating-point compiled model)
+- `--no-compile` — skip compilation (produces a quantized but not device-locked ONNX)
+
+For example, to produce an optimized and quantized model without the compile step:
+
+```bash
+uv run winml build -c convnext_config.json -m facebook/convnext-tiny-224 -o convnext_out/ --no-compile
+```
+
+!!! note "What we just did"
+    `winml build` is the production workflow. It guarantees that stages run in the correct order, passes intermediate artifacts through the pipeline automatically, and records which stages completed or were skipped in the result summary. The config file you pass with `-c` fully specifies the device target, precision, and EP — so you get an NPU-targeted INT8 compiled model without needing to repeat those flags on every primitive command. The QNN SDK path is read from the `QNN_SDK_ROOT` environment variable, not from the config or CLI flags.
+
+Once the build completes, benchmark the final artifact from `convnext_out/`:
+
+```bash
+uv run winml perf -m convnext_out/model.onnx --device npu --iterations 50 --monitor
+```
+
+The result should match what you saw in Step 8, confirming that the `winml build` pipeline produces bit-identical output to the manual primitive chain.
+
+---
+
+## Where to go next
+
+- [Concepts → How winml-cli works](../concepts/how-it-works.md) — the full mental model for the pipeline
+- [Concepts → Compile and EPContext](../concepts/compile-and-epcontext.md) — understanding the compiled artifact format
+- [Samples → ConvNeXt primitives walkthrough](../samples/convnext-primitives.md) — a side-by-side CPU vs. GPU vs. NPU device comparison using the same model
+- [Commands → Overview](../commands/overview.md) — quick reference for every flag on every command
+
+## See also
+
+- [Concepts → Quantization and QDQ](../concepts/quantization.md)
+- [Concepts → Analyze and optimize](../concepts/analyze-and-optimize.md)
+- [Concepts → Perf and monitoring](../concepts/perf-and-monitoring.md)
+- [Concepts → Eval and datasets](../concepts/eval-and-datasets.md)
diff --git a/mkdocs.yml b/mkdocs.yml
new file mode 100644
index 000000000..0256f3829
--- /dev/null
+++ b/mkdocs.yml
@@ -0,0 +1,120 @@
+site_name: winml-cli
+site_description: A CLI toolkit to build portable, performant, and high-quality models for Windows ML.
+site_url: https://microsoft.github.io/winml-cli/
+repo_url: https://github.com/microsoft/winml-cli
+repo_name: microsoft/winml-cli
+edit_uri: edit/main/docs/
+
+docs_dir: docs
+
+exclude_docs: |
+  /design/
+  /superpowers/
+  /naming-convention.md
+  /pytest-best-practices.md
+
+theme:
+  name: material
+  features:
+    - navigation.instant
+    - navigation.tracking
+    - navigation.tabs
+    - navigation.sections
+    - navigation.top
+    - content.code.copy
+    - content.action.edit
+    - toc.follow
+    - search.suggest
+    - search.highlight
+  palette:
+    - media: "(prefers-color-scheme: light)"
+      scheme: default
+      primary: indigo
+      accent: indigo
+      toggle:
+        icon: material/brightness-7
+        name: Switch to dark mode
+    - media: "(prefers-color-scheme: dark)"
+      scheme: slate
+      primary: indigo
+      accent: indigo
+      toggle:
+        icon: material/brightness-4
+        name: Switch to light mode
+
+plugins:
+  - search
+
+markdown_extensions:
+  - admonition
+  - attr_list
+  - md_in_html
+  - tables
+  - toc:
+      permalink: true
+  - pymdownx.details
+  - pymdownx.highlight:
+      anchor_linenums: true
+      line_spans: __span
+      pygments_lang_class: true
+  - pymdownx.inlinehilite
+  - pymdownx.snippets
+  - pymdownx.superfences:
+      custom_fences:
+        - name: mermaid
+          class: mermaid
+          format: !!python/name:pymdownx.superfences.fence_code_format
+  - pymdownx.tabbed:
+      alternate_style: true
+  - pymdownx.tasklist:
+      custom_checkbox: true
+
+nav:
+  - Home: index.md
+  - Getting Started:
+      - Installation: getting-started/installation.md
+      - Quickstart: getting-started/quickstart.md
+      - End-to-End Tour: getting-started/end-to-end.md
+  - Concepts:
+      - Fundamentals:
+          - How winml-cli works: concepts/how-it-works.md
+          - Graph and IR: concepts/graphs-and-ir.md
+          - Weight and Activation: concepts/weight-and-activation.md
+          - EP and Device: concepts/eps-and-devices.md
+          - Datatype and Quantization: concepts/quantization.md
+      - WinML CLI:
+          - Primitives and pipeline: concepts/primitives-and-pipeline.md
+          - Load and export: concepts/load-and-export.md
+          - Analyze and optimize: concepts/analyze-and-optimize.md
+          - Compile and EPContext: concepts/compile-and-epcontext.md
+          - Perf and monitoring: concepts/perf-and-monitoring.md
+          - Eval and datasets: concepts/eval-and-datasets.md
+          - Config and build: concepts/config-and-build.md
+  - Commands:
+      - Overview: commands/overview.md
+      - Discover:
+          - sys: commands/sys.md
+          - inspect: commands/inspect.md
+          - hub: commands/hub.md
+          - analyze: commands/analyze.md
+      - Configure:
+          - config: commands/config.md
+          - optimize: commands/optimize.md
+      - Build:
+          - export: commands/export.md
+          - quantize: commands/quantize.md
+          - compile: commands/compile.md
+          - build: commands/build.md
+      - Measure:
+          - perf: commands/perf.md
+          - eval: commands/eval.md
+  - Samples:
+      - ConvNeXt — Primitives Walkthrough: samples/convnext-primitives.md
+      - BERT — Config + Build + Perf: samples/bert-config-build.md
+      - Qwen3 — Composite Models: samples/qwen3-composite.md
+  - Tutorials:
+      - Overview: tutorials/index.md
+      - ConvNeXt on NPU: tutorials/npu-convnext.md
+  - Reference: reference/index.md
+  - Troubleshooting: troubleshooting.md
+  - Contributing: contributing.md
diff --git a/pyproject.toml b/pyproject.toml
index 35f4c3b66..4e116aeb9 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -69,8 +69,11 @@ optional-dependencies.dev = [
   "jupyter>=1.1.1",
   "markdown-it-py>=3",
   "matplotlib>=3.10",
+  "mkdocs-jupyter>=0.25",
+  "mkdocs-material>=9.5",
   "mypy>=1.18",
   "nbconvert>=7.16",
+  "pymdown-extensions>=10.7",
   "pytest>=8.4",
   "pytest-cov>=7",
   "pytest-timeout>=2.3",

From e15d9f198e689a356fc12444b1a7736ff46f8083 Mon Sep 17 00:00:00 2001
From: Zac <1221537+tezheng@users.noreply.github.com>
Date: Wed, 27 May 2026 08:41:22 +0800
Subject: [PATCH 002/143] docs: add docs/README.md for contributor workflow +
 exclude it from the site

Adds a contributor-facing README at docs/README.md covering:
- uv-based dev setup
- mkdocs serve / build --strict workflow
- gh-deploy publish (local one-shot)
- .github/workflows/docs.yml CI workflow (currently workflow_dispatch only)
- Authoring conventions (winml-cli name, flag verification, admonitions,
  tabbed code blocks)
- Excluded paths reference

Updates mkdocs.yml exclude_docs to include /README.md so the new file
doesn't collide with docs/index.md as the chapter index.
---
 docs/README.md | 130 +++++++++++++++++++++++++++++++++++++++++++++++++
 mkdocs.yml     |   1 +
 2 files changed, 131 insertions(+)
 create mode 100644 docs/README.md

diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 000000000..6fa5d68bf
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,130 @@
+# Contributing to winml-cli docs
+
+This folder hosts the source for the [winml-cli](https://github.com/microsoft/winml-cli) documentation site, built with [MkDocs Material](https://squidfunk.github.io/mkdocs-material/).
+
+## Quick reference
+
+| Task | Command |
+|---|---|
+| Install dev deps | `uv sync --extra dev` |
+| Live preview | `uv run mkdocs serve` |
+| Build for CI | `uv run mkdocs build --strict` |
+| Publish (one-shot from laptop) | `uv run mkdocs gh-deploy --force` |
+| Publish (CI workflow) | GitHub Actions → "Build & Publish Docs" → Run workflow |
+
+## What's in here
+
+```
+docs/
+├── index.md                          ← landing page
+├── getting-started/                  ← 3 onboarding pages
+├── concepts/                         ← 12 conceptual pages in two sub-groups
+│   ├── how-it-works.md, graphs-and-ir.md, weight-and-activation.md,
+│   │     eps-and-devices.md, quantization.md         (Fundamentals)
+│   └── primitives-and-pipeline.md, load-and-export.md, analyze-and-optimize.md,
+│         compile-and-epcontext.md, perf-and-monitoring.md, eval-and-datasets.md,
+│         config-and-build.md                         (WinML CLI workflows)
+├── commands/                         ← per-command reference (overview + 12 commands)
+├── samples/                          ← reference-style walkthroughs
+├── tutorials/                        ← classroom-style walkthroughs
+├── reference/                        ← P2 stubs
+├── troubleshooting.md                ← P2 stub
+├── contributing.md                   ← P2 stub
+│
+├── superpowers/                      ← specs, plans, review notes (excluded from build)
+├── design/                           ← internal ADRs and design docs (excluded)
+├── naming-convention.md              ← internal style guide (excluded)
+└── pytest-best-practices.md          ← internal style guide (excluded)
+```
+
+The site config (`mkdocs.yml`) lives at the repo root, not inside `docs/`. The build outputs to `site/` (gitignored).
+
+## Local development
+
+### Prerequisites
+
+Python 3.10+ and [uv](https://github.com/astral-sh/uv).
+
+### Setup and preview
+
+```bash
+# from the repo root
+uv sync --extra dev
+uv run mkdocs serve
+```
+
+Open http://127.0.0.1:8000/ in a browser. The server auto-reloads when you edit any `.md` file under `docs/`. Changes to `mkdocs.yml` (nav, theme, plugins) require a manual server restart.
+
+### Validate before pushing
+
+```bash
+uv run mkdocs build --strict
+```
+
+`--strict` must exit 0 with no `WARNING` lines. Common causes of strict-mode failures:
+
+- A new page added without an entry in `nav:` (gives a "not included in nav" warning)
+- A nav entry pointing at a file that doesn't exist
+- A relative link like `[text](other-page.md)` whose target file is missing
+- A markdown anchor like `[link](#section-heading)` that doesn't match any heading slug
+
+## Publishing
+
+The site publishes to **GitHub Pages** from the `gh-pages` branch. The repo's `Settings → Pages` source is set to "Deploy from a branch" → `gh-pages` → `/ (root)`.
+
+### One-shot publish from your laptop
+
+```bash
+uv run mkdocs gh-deploy --force
+```
+
+This builds the site locally, commits the static HTML to a local `gh-pages` branch, and force-pushes it to `origin/gh-pages`. GitHub Pages picks up the new commit within ~30–60 seconds.
+
+### Publish via CI
+
+The workflow at `.github/workflows/docs.yml` does the same thing in CI:
+
+1. `Settings → Actions → Build & Publish Docs → Run workflow`
+2. Select the branch you want to publish from (typically `main`)
+
+The workflow is `workflow_dispatch` only — there is no automatic publish on push. If you want auto-publish on every push to `main`, change the trigger:
+
+```yaml
+on:
+  push:
+    branches: [main]
+    paths:
+      - 'docs/**'
+      - 'mkdocs.yml'
+      - 'pyproject.toml'
+      - '.github/workflows/docs.yml'
+  workflow_dispatch:
+```
+
+## Authoring conventions
+
+- **Product name**: `winml-cli` (lowercase, hyphenated) throughout user-facing prose. Use `WinML CLI` (or `Windows ML`) only where the broader Microsoft brand is meant.
+- **Command name**: the CLI invocation is always `winml <subcommand>`. Never `wmk`.
+- **Flag verification**: every flag mentioned in docs must exist in `src/winml/modelkit/commands/<cmd>.py`. Run `uv run winml <cmd> --help` to confirm.
+- **Source citations**: when documenting source-grounded behavior (e.g., "the default opset is 17"), cite the file path and ideally the symbol name. Avoid line numbers — they drift fast.
+- **Mermaid diagrams**: use `pymdownx.superfences` syntax (already configured in `mkdocs.yml`).
+- **Tabbed code blocks**: use `pymdownx.tabbed` (`=== "Label"` followed by a blank line and 4-space-indented code block).
+- **Admonitions**: `!!! note "Title"`, `!!! warning "Title"`, `!!! info "Title"`.
+- **No emojis** in pages unless they're part of an external attribution (e.g., a GitHub badge).
+
+## Excluded paths
+
+The following are present in `docs/` but **excluded from the published site** via the `exclude_docs:` block in `mkdocs.yml`. They are kept in-repo for contributors:
+
+- `docs/design/` — internal architecture decision records and design notes
+- `docs/superpowers/` — specs, plans, and review notes accumulated during doc development
+- `docs/naming-convention.md` — internal naming conventions for code review
+- `docs/pytest-best-practices.md` — internal testing style guide
+
+If you add new internal-only content, either place it under one of these excluded paths or add a new entry to `exclude_docs` in `mkdocs.yml`.
+
+## See also
+
+- [MkDocs Material reference](https://squidfunk.github.io/mkdocs-material/reference/)
+- [MkDocs Material navigation setup](https://squidfunk.github.io/mkdocs-material/setup/setting-up-navigation/)
+- [MkDocs Material color palette](https://squidfunk.github.io/mkdocs-material/setup/changing-the-colors/)
diff --git a/mkdocs.yml b/mkdocs.yml
index 0256f3829..651a80d8d 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -12,6 +12,7 @@ exclude_docs: |
   /superpowers/
   /naming-convention.md
   /pytest-best-practices.md
+  /README.md
 
 theme:
   name: material

From 6109af436dfce03cc2f5de8f8a0ff81ea680ba0c Mon Sep 17 00:00:00 2001
From: Zac <1221537+tezheng@users.noreply.github.com>
Date: Wed, 27 May 2026 08:46:43 +0800
Subject: [PATCH 003/143] docs(review): record fact-check findings against
 microsoft/winml-cli source

Six parallel review agents fact-checked all 34 user-facing doc files
against microsoft/winml-cli @ 5e25579. Output: one issue file per
source doc at docs/superpowers/2026-05-27-doc-issues/.

A validator agent then cross-checked every Critical and Important
claim and produced the consolidated, false-positive-filtered list at
docs/superpowers/2026-05-27-validated-issues.md.

Summary: 25 Critical + 22 Important kept; 6 rejected as false
positives. Major theme: docs were authored against feat/mvp source
where some symbols and defaults differ from main (e.g., _KNOWN_PRECISIONS
in _options.py vs _NAMED_PRECISIONS in precision.py; winml hub vs
winml catalog; many flag defaults flipped to 'auto'; DML/CPU no
longer produce _ctx.onnx artifacts).

Next step: per-file fix agents will apply the validated list.
---
 .../analyze-and-optimize.md                   |  43 +++++
 .../2026-05-27-doc-issues/analyze.md          |  44 +++++
 .../bert-config-build.md                      |  96 +++++++++++
 .../2026-05-27-doc-issues/build.md            |  37 +++++
 .../compile-and-epcontext.md                  |  38 +++++
 .../2026-05-27-doc-issues/compile.md          |  31 ++++
 .../2026-05-27-doc-issues/config-and-build.md |  57 +++++++
 .../2026-05-27-doc-issues/config.md           |  43 +++++
 .../convnext-primitives.md                    |  82 ++++++++++
 .../2026-05-27-doc-issues/end-to-end.md       |  15 ++
 .../2026-05-27-doc-issues/eps-and-devices.md  |  26 +++
 .../eval-and-datasets.md                      |  18 +++
 .../superpowers/2026-05-27-doc-issues/eval.md |  23 +++
 .../2026-05-27-doc-issues/export.md           |  32 ++++
 .../2026-05-27-doc-issues/graphs-and-ir.md    |  21 +++
 .../2026-05-27-doc-issues/how-it-works.md     |  22 +++
 docs/superpowers/2026-05-27-doc-issues/hub.md |  37 +++++
 .../2026-05-27-doc-issues/index.md            |  24 +++
 .../2026-05-27-doc-issues/inspect.md          |  35 ++++
 .../2026-05-27-doc-issues/installation.md     |  13 ++
 .../2026-05-27-doc-issues/load-and-export.md  |  43 +++++
 .../2026-05-27-doc-issues/npu-convnext.md     |  18 +++
 .../2026-05-27-doc-issues/optimize.md         |  48 ++++++
 .../2026-05-27-doc-issues/overview.md         |  17 ++
 .../perf-and-monitoring.md                    |  18 +++
 .../superpowers/2026-05-27-doc-issues/perf.md |  41 +++++
 .../primitives-and-pipeline.md                |  32 ++++
 .../2026-05-27-doc-issues/quantization.md     |  29 ++++
 .../2026-05-27-doc-issues/quantize.md         |  33 ++++
 .../2026-05-27-doc-issues/quickstart.md       |  13 ++
 .../2026-05-27-doc-issues/qwen3-composite.md  |  43 +++++
 docs/superpowers/2026-05-27-doc-issues/sys.md |  33 ++++
 .../2026-05-27-doc-issues/tutorials-index.md  |  27 ++++
 .../weight-and-activation.md                  |  18 +++
 .../2026-05-27-validated-issues.md            | 151 ++++++++++++++++++
 35 files changed, 1301 insertions(+)
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/analyze-and-optimize.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/analyze.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/bert-config-build.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/build.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/compile-and-epcontext.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/compile.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/config-and-build.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/config.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/convnext-primitives.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/end-to-end.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/eps-and-devices.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/eval-and-datasets.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/eval.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/export.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/graphs-and-ir.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/how-it-works.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/hub.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/index.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/inspect.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/installation.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/load-and-export.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/npu-convnext.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/optimize.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/overview.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/perf-and-monitoring.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/perf.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/primitives-and-pipeline.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/quantization.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/quantize.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/quickstart.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/qwen3-composite.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/sys.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/tutorials-index.md
 create mode 100644 docs/superpowers/2026-05-27-doc-issues/weight-and-activation.md
 create mode 100644 docs/superpowers/2026-05-27-validated-issues.md

diff --git a/docs/superpowers/2026-05-27-doc-issues/analyze-and-optimize.md b/docs/superpowers/2026-05-27-doc-issues/analyze-and-optimize.md
new file mode 100644
index 000000000..030d88c8f
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/analyze-and-optimize.md
@@ -0,0 +1,43 @@
+# Issues: docs/concepts/analyze-and-optimize.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical (factually wrong; user would hit error)
+
+- **`--output results.json` flag for `winml analyze`** (line 9): The doc says "add `--output results.json` to save the report as JSON". The actual flag is `--output` (source: `commands/analyze.py` line 653 `@cli_utils.output_option("Save JSON output to file")`). This is valid and correct.
+
+- **`--preset` flag on `winml optimize`** (line 21): The doc says "Use presets (`--preset transformer-optimized`, `--preset qnn-compatible`) as a starting point." No `--preset` flag exists on `winml optimize`. The command has `--config` (a config file) and capability flags, but no `--preset` option (source: `commands/optimize.py` — the full file was read and contains no `--preset` option). This is a fabricated flag that would cause `Error: No such option: --preset` if a user tries it.
+
+## Important (misleading or stale claim)
+
+- **Exit codes described as 0/1/2** (line 10): Doc says "zero is full support, one is partial support with unsupported operators, two is a configuration error." Source confirms: `commands/analyze.py` line 1212-1213 (`sys.exit(0 if overall_supported else 1)`) and lines 734-736, 1216-1222 use `sys.exit(2)` for errors. This matches the doc.
+
+- **`--save-node unsupported` or `--save-node partial`** (line 11): Doc says "Use `--save-node unsupported` or `--save-node partial`". Source shows `--save-node` with `multiple=True` and choices `["partial", "unsupported"]` (`commands/analyze.py` lines 673-676). The flag exists and the values are valid.
+
+- **`--max-optim-iterations` default described as "three"** (line 26): Doc says "default: three". Source confirms `default: 3` in the help text (`commands/build.py` line 310) and `hack_max_optim_iterations` defaults to `3` in the build pipeline (`commands/build.py` line 1112, 1234). Correct.
+
+- **`--no-analyze` on `winml build`** (line 27): The doc says "`winml build` runs analyze and optimize in an alternating loop" and "Use `--no-analyze` to skip the loop". Source confirms `--no-analyze` on `winml build` (`commands/build.py` lines 294-298) which sets `hack_max_optim_iterations = 0`. Correct.
+
+- **`--commit a specific combination to a `--config` file`** (line 21): Doc says "commit a specific combination to a `--config` file". The `winml optimize` command has `--config` / `-c` (source: `commands/optimize.py` lines 176-180). This is valid.
+
+## Minor (style, polish, low-impact)
+
+- **`--list-capabilities` and `--list-rewrites` flags** (lines 17, 19): Both exist on `winml optimize` → `commands/optimize.py` lines 153, 160. Correct.
+
+- **Pattern-rewrite flag form `--enable-<source-slug>-<target-slug>`** (line 19): Consistent with source → `commands/optimize.py` lines 217-224, which documents `--enable-gelu-singlegelu` as example. Correct.
+
+- **Cross-links** `[compile-and-epcontext.md]`, `[primitives-and-pipeline.md]`, `[../commands/analyze.md]`, `[../commands/optimize.md]` (lines 31-34): All files exist.
+
+## Verified correct (anchored claims you checked)
+
+- `winml analyze` `--ep` flag exists and takes provider name → `commands/analyze.py` lines 628-639
+- `winml analyze` `--device` flag with CPU/GPU/NPU choices → `commands/analyze.py` lines 641-650
+- `winml analyze` `--information` / `--no-information` flag (default: enabled) → `commands/analyze.py` lines 654-657
+- `winml analyze` `--output` flag for JSON → `commands/analyze.py` line 653
+- `winml analyze` exit codes 0/1/2 → `commands/analyze.py` lines 1212-1213, 1216-1222
+- `winml optimize` `--enable-<name>` / `--disable-<name>` flag pattern → `commands/optimize.py` lines 124-131
+- `winml optimize` `--list-capabilities` flag → `commands/optimize.py` lines 153-158
+- `winml optimize` `--list-rewrites` flag → `commands/optimize.py` lines 160-164
+- `winml optimize` `--config` file flag → `commands/optimize.py` lines 176-180
+- Fusions include GeLU, LayerNorm, MatMul+Add → `optim/pipes/graph.py` lines 242-243
+- No `wmk` or `ModelKit` strings in prose → verified by grep
diff --git a/docs/superpowers/2026-05-27-doc-issues/analyze.md b/docs/superpowers/2026-05-27-doc-issues/analyze.md
new file mode 100644
index 000000000..ef5d9cf2e
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/analyze.md
@@ -0,0 +1,44 @@
+# Issues: docs/commands/analyze.md
+
+Source verified against: `src/winml/modelkit/commands/analyze.py` @ 5e25579
+
+## Critical (flag/behavior wrong; user gets error)
+
+- **`--device` default is documented as `NPU`** (doc line 21: "Default: `NPU`") but source line 644 sets `default="auto"` with `show_default=True`. Running `winml analyze --model model.onnx` will use `device="auto"` (infer from local availability), not NPU. A user relying on the doc to know their model will be analyzed against NPU by default will be wrong.
+
+- **`--ep` default is documented as "none — all supported EPs are analyzed"** (doc line 20) but source line 633 sets `default="auto"`. The "auto" mode (source lines 759–768) infers from local availability, not "all supported EPs". Running with no `--ep` is not the same as `--ep all`. The doc's description of the default behavior is wrong.
+
+- **`--run-unknown-op` default is documented as "enabled"** (doc line 26: "flag / enabled") but source line 668 has `default=False`. The pitfall at doc line 84 even says "Disable when the local machine lacks the required libraries" — implying it is on by default — which is incorrect. The correct default is disabled; users must pass `--run-unknown-op` to enable it.
+
+- **`--optim-config` flag is missing from the flag table.** Source lines 677–681 define `@click.option("--optim-config", type=click.Path(path_type=Path), default=None, help="Save auto-discovered optimization config to JSON file")`. This is a functional flag for saving optimization settings and is not documented at all.
+
+- **`--model` has no short form `-m` in the analyze command.** The doc flag table shows no short for `--model` (doc line 19 has empty Short column), which is correct — `model_path_option` in `cli.py` line 68 uses `"--model", "-m"`. Wait — actually it does have `-m`. Let me clarify: the doc table (line 19) shows `| \`--model\` | | \`PATH\` |` with an *empty* Short column, meaning the doc claims there is no short `-m` form. But `model_path_option` (cli.py line 68) uses `click.option("--model", "-m", ...)`, so `-m` is valid. This is a documentation error — users will not know they can use `-m model.onnx`.
+
+- **`--verbose` / `-v` and `--quiet` / `-q` flags are absent from the flag table.** Source uses `@cli_utils.verbosity_options` (line 651) which adds `--verbose / -v` (count) and `--quiet / -q` (flag) — see `cli.py` lines 181–209. Neither appears in the doc.
+
+- **`--config` / `-c` (build config) flag is absent from the flag table.** Source uses `@cli_utils.build_config_option` (line 652) which adds `-c/--config` accepting a `WinMLBuildConfig` JSON file — see `cli.py` lines 212–222. The doc does not mention this.
+
+## Important (misleading or stale)
+
+- **`--ep` choice type** — doc says it accepts full names and short aliases. Source line 634 uses `type=click.Choice([*ALL_EP_NAMES, "all", "auto"], case_sensitive=False)`. The "auto" and "all" values are valid choices but are not mentioned in the doc. The doc's description "When omitted, all supported EPs are analyzed" is wrong (see Critical above); the actual valid special values are "all" and "auto".
+
+- **`--device` choice type** — source line 644 uses `type=click.Choice([*SUPPORTED_DEVICES, "all", "auto"], case_sensitive=False)`. The "all" and "auto" values are not mentioned in the doc.
+
+- **Example "Analyze against all supported EPs"** (doc line 37) runs `winml analyze --model microsoft/resnet-50.onnx` with no `--ep`. Given the actual default is `auto` (not all), the example's described output showing both QNN and OpenVINO may or may not match what runs on a given machine.
+
+## Minor (polish)
+
+- The "Common pitfalls" section says "Omitting `--ep` analyzes every EP" (line 82) — this repeats the incorrect claim from the default description.
+- Exit code documentation (codes 0, 1, 2) matches source lines 1212–1214 and is correct.
+
+## Verified correct (key claims checked)
+
+- `--model` exists (via `model_path_option`) and is required → `cli.py` line 57, `analyze.py` line 627.
+- `--information/--no-information` flag exists with `default=True` → source lines 654–658.
+- `--htp-metadata` flag exists with `type=click.Path(exists=True)`, default `None` → source lines 659–664.
+- `--run-unknown-op/--no-run-unknown-op` flag exists → source lines 665–669.
+- `--save-node` flag exists as `multiple=True, type=Choice(["partial", "unsupported"])` → source lines 670–676.
+- `--output / -o` flag exists → via `cli_utils.output_option`, `cli.py` line 98.
+- Static analysis via `ONNXStaticAnalyzer` → source line 819.
+- Exit codes 0/1/2 → source lines 1212–1218.
+- VitisAI special-cases `--run-unknown-op` to always False → source lines 537–542.
diff --git a/docs/superpowers/2026-05-27-doc-issues/bert-config-build.md b/docs/superpowers/2026-05-27-doc-issues/bert-config-build.md
new file mode 100644
index 000000000..4d696beef
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/bert-config-build.md
@@ -0,0 +1,96 @@
+# Issues: docs/samples/bert-config-build.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical
+
+- **Final artifact name is wrong.** Step 2 output block says:
+  `Final artifact: bert_out/bert-base-uncased_ctx.onnx`
+  The actual build pipeline in `commands/build.py` (line 714) always writes the
+  final output as `model.onnx` inside the output directory:
+  `final_path = resolved_dir / _name("model.onnx")`
+  For a non-cached build the artifact is `bert_out/model.onnx`, not
+  `bert_out/bert-base-uncased_ctx.onnx`. The `_name()` helper only prepends a
+  cache key when `--use-cache` is active; with `-o bert_out/` it stays `model.onnx`.
+
+- **Step 3 perf command references the wrong artifact.**
+  `winml perf -m bert_out/bert-base-uncased_ctx.onnx` will fail because the file
+  does not exist (see above). Should be `winml perf -m bert_out/model.onnx`.
+
+## Important
+
+- **`build` command flag: doc uses `-o bert_out/` but the flag is `--output-dir`.**
+  In `commands/build.py` line 250-252 the short alias `-o` maps to `--output-dir`.
+  The `-o` short form is defined, so the command works — but the doc never
+  mentions `--output-dir` anywhere (the "Customizing the config" section also
+  uses `-o`), leaving readers who try `--help` unable to find it easily.
+  The step 2 command itself is syntactically valid; this is a doc clarity issue.
+
+- **JSON excerpt uses `"optim"` key.** `config/build.py` line 17 in the config
+  hierarchy comment shows `optim: WinMLOptimizationConfig`. The serialised key
+  from `WinMLBuildConfig.to_dict()` must be verified. Check that `optim` (not
+  `optimize` or `optimization`) is the actual JSON key. Based on the config
+  hierarchy definition in `config/build.py` the field is named `optim`, which
+  aligns with the doc. Verified plausible, but should be confirmed by reading the
+  `to_dict()` / `from_dict()` implementation in `config/build.py`.
+
+- **JSON excerpt `"optim"` section fields: `gelu_fusion`, `layer_norm_fusion`,
+  `matmul_add_fusion`.** These field names must match `WinMLOptimizationConfig`.
+  The optimize command uses a capability registry; the field names in the
+  serialised JSON depend on how `WinMLOptimizationConfig.to_dict()` names them.
+  The doc claims them without source verification — they may differ from the
+  actual serialised keys.
+
+- **JSON excerpt `"compile"` section.** The doc shows:
+  ```json
+  "compile": {
+    "execution_provider": "qnn",
+    "enable_ep_context": true,
+    "compiler": "ort"
+  }
+  ```
+  These map to `WinMLCompileConfig.to_dict()` in `compiler/configs.py` lines 232-247.
+  `execution_provider`, `enable_ep_context`, and `compiler` are all present in
+  `to_dict()`. Verified correct for those three keys.
+
+- **Note mentions `--max-optim-iterations` flag.** In `commands/build.py` line
+  307 the flag is `--max-optim-iterations` (not `--max-optimize-iterations`).
+  The doc spells it `--max-optim-iterations`, which matches. Verified correct.
+
+- **`--no-quant` and `--no-compile` flags on `winml build`.** Both exist in
+  `build.py` (`--no-quant` line 272, `--no-compile/--compile` line 277). Verified.
+
+- **`winml config --precision fp16`.** `config.py` has `-p`/`--precision` with
+  `type=str` accepting `fp16`. Verified valid.
+
+- **`bert-base-uncased` model ID.** The canonical HF ID is
+  `google-bert/bert-base-uncased`; `bert-base-uncased` is a redirect that still
+  works. The doc uses the short alias consistently. Acceptable but not canonical.
+
+## Minor
+
+- **Step 1: `winml config -m bert-base-uncased -t text-classification -o bert_config.json`.**
+  The `-t` flag on `config` is for `--task`. Verified in `config.py` line 78-79.
+  Valid.
+
+- **Note: `quant.weight_type` and `quant.activation_type` editing instructions.**
+  The doc suggests setting these to `"int8"` or `"uint16"`. Valid options per
+  `quantize.py` line 71: `type=click.Choice(["uint8", "int8", "uint16", "int16"])`.
+  Verified correct.
+
+## Verified correct
+
+- `winml config -m bert-base-uncased -t text-classification -o bert_config.json`
+  — all flags valid.
+- `winml build -c bert_config.json -m bert-base-uncased --output-dir bert_out/`
+  (`-o` short form) — command structure valid (see Critical note on artifact name).
+- `winml build ... --no-quant` — flag verified in `build.py`.
+- Top-level JSON keys `loader`, `export`, `optim`, `quant`, `compile` — match
+  `WinMLBuildConfig` field names.
+- `quant.mode`, `quant.weight_type`, `quant.activation_type`, `quant.samples`,
+  `quant.calibration_method`, `quant.task`, `quant.model_name` — all present as
+  fields on `WinMLQuantizationConfig` (verified in `quantize.py` config usage).
+- No `wmk` or `ModelKit` strings in user-facing prose.
+- Cross-links to `convnext-primitives.md`, `../concepts/config-and-build.md`,
+  `../commands/config.md`, `../commands/build.md`, `../commands/perf.md` are
+  consistent with repo structure.
diff --git a/docs/superpowers/2026-05-27-doc-issues/build.md b/docs/superpowers/2026-05-27-doc-issues/build.md
new file mode 100644
index 000000000..157a69542
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/build.md
@@ -0,0 +1,37 @@
+# Issues: docs/commands/build.md
+
+Source verified against: `src/winml/modelkit/commands/build.py` @ 5e25579
+
+## Critical (flag/behavior wrong; user gets error)
+
+- **`--random-init` flag does not exist.** The flag table lists `--random-init` as "Skip weight download; build with random weights". A full search of `build.py` finds no `--random-init` or `random_init` option definition. The behavior (random-weight build) is supported by omitting `-m` (see `build.py:247`: "Omit for random-weight build"), but there is no `--random-init` flag. Users who pass `--random-init` will get "No such option".
+- **`--config` / `-c` listed as *(required)* but source marks it `required=False`.** `build.py:237` sets `required=False` with `default=None`. When `-c` is omitted, config is auto-generated from `-m`. The doc makes it sound mandatory.
+
+## Important (misleading or stale)
+
+- **`--qnn-sdk-root` should not appear in this page.** The flag does not exist in `build.py` (confirmed: zero hits for `qnn_sdk_root` or `qnn-sdk-root` in the option definitions). It is a `winml compile` flag only. Its appearance in the flag table is a copy-paste error.
+- **`--no-compile` is documented as a simple flag but source defines a `--no-compile/--compile` toggle pair.** `build.py:275-282` shows `--no-compile/--compile` as a boolean toggle with `default=None`. The doc only shows `--no-compile`, omitting `--compile` (which forces compilation on when the config has a compile section). The `--compile` positive form is useful and undocumented.
+- **Flag table omits `--trust-remote-code`.** `build.py:312-314` defines this via `cli_utils.trust_remote_code_option(...)`. Users building custom architecture models (e.g., Mu2) need it.
+- **`--max-optim-iterations` table shows default `3` but source default is `None`.** `build.py:309` sets `default=None`. The actual default of `3` is enforced inside the pipeline helpers (`build.py:1112, 1234`), not at the CLI layer. If the user does not pass the flag, Click resolves it as `None`, not `3`.
+
+## Minor (polish)
+
+- **Flag table omits `--verbose` / `-v`.** Defined at `build.py:315-320`.
+- **"How it works" says pipeline is "export → optimize → quantize → compile" in the intro, but the synopsis shows the full correct form.** The command map table in overview.md correctly shows "export → optimize → quantize → compile". The build.md intro paragraph at line 44 says only "export → quantize → compile" (missing optimize). Minor omission but inconsistent.
+
+## Verified correct (key claims checked)
+
+- `--config` / `-c` path, optional → `build.py:233-241`
+- `--model` / `-m` string default None → `build.py:242-248`
+- `--output-dir` / `-o` path default None → `build.py:249-256`
+- `--use-cache` flag default false → `build.py:257-262`
+- `--rebuild` flag default false → `build.py:263-268`
+- `--no-quant` flag default false → `build.py:269-274`
+- `--no-optimize` flag default false → `build.py:299-304`
+- `--no-analyze` flag default false → `build.py:293-298`
+- `--ep` defined via `cli_utils.ep_option` → `build.py:283-286`
+- `--device` defined via `cli_utils.device_option` default `auto` → `build.py:287-292`
+- Mutual exclusion: `--output-dir` and `--use-cache` → `build.py:376-379`
+- `--use-cache` not supported in module mode → `build.py:491-495`
+- ONNX input skips export stage → `build.py:691-711` (`_build_onnx_pipeline`)
+- No `wmk` or `ModelKit` strings in user-facing prose → confirmed
diff --git a/docs/superpowers/2026-05-27-doc-issues/compile-and-epcontext.md b/docs/superpowers/2026-05-27-doc-issues/compile-and-epcontext.md
new file mode 100644
index 000000000..9ca0efc38
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/compile-and-epcontext.md
@@ -0,0 +1,38 @@
+# Issues: docs/concepts/compile-and-epcontext.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical (factually wrong; user would hit error)
+
+- **`--no-quant` on `winml compile`** (line 29): The doc says "`winml compile` also accepts `--no-quant` to skip the quantization pass for already-quantized (QDQ) models." There is no `--no-quant` flag on `winml compile`. The `commands/compile.py` file was fully read and contains no `--no-quant` option. This is a flag that exists on `winml build`, not `winml compile`. A user passing `--no-quant` to `winml compile` will get `Error: No such option: --no-quant`.
+
+## Important (misleading or stale claim)
+
+- **`--ep qnn` and `--ep vitisai` described as "QNN-family EPs"** (line 11): The doc lumps these together as both producing "EP context blobs". Source shows `WinMLCompileConfig.for_provider()` treats them distinctly — `vitisai` uses `VitisAIExecutionProvider` and `qnn` uses `QNNExecutionProvider` (`commands/compile.py` lines 214-221, `compiler/configs.py` lines 209-221). Both do produce EPContext, but the doc's grouping as interchangeable is a simplification that may mislead users.
+
+- **External EPContext described as "default"** (lines 17-21): Doc says "By default the blob is written as a sidecar `.bin` file alongside the `.onnx`." Source confirms `embed_context: bool = False` as default in `EPConfig` (`compiler/configs.py` line 46), so external is indeed the default. Correct.
+
+- **`--embed` flag** (line 17): Doc says "Passing `--embed` instead inlines the blob". Source confirms `--embed` is a flag on `winml compile` (`commands/compile.py` lines 96-99), which sets `embed_context=True`. Correct.
+
+- **`--compiler qairt` and `--qnn-sdk-root`** (line 13): Doc says "select `--compiler qairt` and point `--qnn-sdk-root`". Source confirms both flags on `winml compile` (`commands/compile.py` lines 83-93). Correct.
+
+- **`--no-validate` flag** (line 34): The actual flag on `winml compile` is `--validate/--no-validate` (source: `commands/compile.py` lines 72-74). The doc says "The `--no-validate` flag skips that pass." This is accurate — `--no-validate` is the negative form of the `--validate/--no-validate` pair.
+
+## Minor (style, polish, low-impact)
+
+- **Validation described as "default: enabled"** (line 33): Confirmed — `WinMLCompileConfig.validate: bool = True` (`compiler/configs.py` line 86) and `--validate/--no-validate` defaults to `True` (`commands/compile.py` line 74). Correct.
+
+- **Cross-links** `[eps-and-devices.md]`, `[analyze-and-optimize.md]`, `[../commands/compile.md]`, `[../commands/build.md]` (lines 39-43): All target files exist.
+
+## Verified correct (anchored claims you checked)
+
+- `winml compile` `--ep` flag exists → `commands/compile.py` lines 66-69
+- `winml compile` `--device` flag with auto/npu/gpu/cpu choices → `commands/compile.py` lines 58-65
+- `winml compile` `--compiler` flag with choices `["ort", "qairt"]` → `commands/compile.py` lines 83-87
+- `winml compile` `--qnn-sdk-root` flag exists → `commands/compile.py` lines 88-93
+- `winml compile` `--embed` flag exists → `commands/compile.py` lines 96-99
+- `winml compile` `--validate/--no-validate` flag exists, default enabled → `commands/compile.py` lines 72-74
+- `EPConfig.embed_context` defaults to `False` (external sidecar) → `compiler/configs.py` line 46
+- `EPConfig.enable_ep_context` defaults to `True` → `compiler/configs.py` line 45
+- Compiler backend `ort` is the default → `commands/compile.py` line 87 (`default="ort"`)
+- No `wmk` or `ModelKit` strings in prose → verified by grep
diff --git a/docs/superpowers/2026-05-27-doc-issues/compile.md b/docs/superpowers/2026-05-27-doc-issues/compile.md
new file mode 100644
index 000000000..418997126
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/compile.md
@@ -0,0 +1,31 @@
+# Issues: docs/commands/compile.md
+
+Source verified against: `src/winml/modelkit/commands/compile.py` @ 5e25579
+
+## Critical (flag/behavior wrong; user gets error)
+
+- **`--device` default listed as `npu` but source default is `auto`.** The flag table and "Common pitfalls" both claim "default is `npu`" and "`--device` default is `npu`, not `auto`". Source `compile.py:59-65` defines `default="auto"`. Users relying on the doc who expect NPU targeting without passing `--device` will instead get auto-detection. This is a direct behavioral contradiction.
+
+## Important (misleading or stale)
+
+- **`--no-quant` flag does not exist in compile.py.** The flag table shows `--no-quant` with description "Flag retained for compatibility; quantization is no longer performed during compile." A search of `compile.py` finds zero occurrences of `no-quant`, `no_quant`, or `--no-quant`. The flag is documented but not defined; any user who passes it will get a "No such option" error.
+- **`--validate` / `--no-validate` is a toggle pair, not a simple `--no-validate` flag.** Source `compile.py:72-74` defines `--validate/--no-validate` as a boolean toggle with `default=True`. The table shows only `--no-validate` as an independent flag; this is accurate in effect but hides the positive form `--validate` and implies a different UI contract.
+- **`--output` (file path) is not documented in the flag table.** Source `compile.py:51` registers `cli_utils.output_option(...)`, which adds `--output` / `-o`. The table jumps straight to `--output-dir`. Users cannot discover `-o` for writing to a specific file path.
+
+## Minor (polish)
+
+- **Flag table omits `--verbose` / `-v`.** Defined at `compile.py:76-81`.
+- **"Common pitfalls" says `--no-quant` is a no-op** — this is correct in spirit (quantization is not done at compile time), but the flag does not exist, so the pitfall note is misleading. Replace with a note that the flag was removed and users should not pass it.
+
+## Verified correct (key claims checked)
+
+- `--model` / `-m` optional (required unless `--list`) → `compile.py:44-50`
+- `--output-dir` path default None → `compile.py:53-57`
+- `--device` choice `auto|npu|gpu|cpu` → `compile.py:59-65`
+- `--ep` choice of provider names → `compile.py:66-69` via `cli_utils.ep_option`
+- `--compiler` choice `ort|qairt` default `ort` → `compile.py:82-87`
+- `--qnn-sdk-root` path default None → `compile.py:88-93`
+- `--embed` flag default false → `compile.py:94-99`
+- `--list` flag default false → `compile.py:100-106`
+- `--compiler qairt` requires `--qnn-sdk-root` → `compile.py:206-208` (passes to `ep_config.qnn_sdk_root`; failure occurs in compiler layer)
+- No `wmk` or `ModelKit` strings in user-facing prose → confirmed
diff --git a/docs/superpowers/2026-05-27-doc-issues/config-and-build.md b/docs/superpowers/2026-05-27-doc-issues/config-and-build.md
new file mode 100644
index 000000000..4db7ccc18
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/config-and-build.md
@@ -0,0 +1,57 @@
+# Issues: docs/concepts/config-and-build.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical (factually wrong; user would hit error)
+
+- **JSON example `compile` section uses wrong field names** (lines 85-90): The doc shows:
+  ```json
+  "compile": {
+    "ep_config": {
+      "provider": "qnn",
+      "enable_ep_context": true
+    }
+  }
+  ```
+  However, `WinMLCompileConfig.to_dict()` does NOT nest under `ep_config`; it serializes flat with keys `execution_provider`, `provider_options`, `enable_ep_context`, `embed_context`, `compiler`, `qnn_sdk_root`, `device` (source: `src/winml/modelkit/compiler/configs.py` lines 230-245). `WinMLCompileConfig.from_dict()` reads `data.get("execution_provider", "qnn")` (line 253), not `ep_config.provider`. A user who copy-pastes this JSON and passes it to `winml build` will get a config with `provider="qnn"` default (silently ignored nested key), making compilation silent failure or wrong EP.
+
+- **JSON example `optim` section uses non-canonical field names** (lines 75-80): The doc shows:
+  ```json
+  "optim": {
+    "gelu_fusion": false,
+    "layer_norm_fusion": false,
+    "matmul_add_fusion": false
+  }
+  ```
+  `WinMLOptimizationConfig` is a `dict` subclass that accepts arbitrary kwargs (source: `src/winml/modelkit/optim/config.py` lines 13-31). The field names `gelu_fusion`, `layer_norm_fusion`, `matmul_add_fusion` correspond to capability python_names, which exist in the optimizer (source: `src/winml/modelkit/optim/pipes/graph.py` lines 242-243). These are valid keys but there are no hard-coded defaults for them — the generated JSON would only include keys that were explicitly set. A freshly generated config from `winml config` would likely have `{}` for `optim` unless capabilities are explicitly configured. The presence of all-`false` values is misleading; a real generated config would omit them.
+
+## Important (misleading or stale claim)
+
+- **`WinMLBuildConfig` described as having five nested sub-configs** (lines 48-56, table): The doc lists `loader`, `export`, `optim`, `quant`, `compile`. The actual dataclass also has `eval: WinMLEvaluationConfig | None` and `auto: bool` (source: `src/winml/modelkit/config/build.py` lines 132-138). The table is incomplete; `eval` section is a valid config key that affects `winml eval` behavior when running from a build config.
+
+- **`winml config` `--no-compile` default behavior** (line 33): Doc says "sets the `compile` section to `null`". In the CLI, `--no-compile` is the default (`default=True` for `no_compile`, source: `commands/config.py` lines 162-165), meaning compilation is always excluded unless `--compile` is passed. The doc does not mention that compile is off by default from `winml config`.
+
+- **`WinMLBuildConfig` defined in `src/winml/modelkit/config/build.py`** (line 47): Correct file path. However the description says "one per pipeline stage" — there are actually 6 stages with the `eval` field, not 5 as stated.
+
+## Minor (style, polish, low-impact)
+
+- **`--output-dir` and `--use-cache` enforcement** (line 111): Doc says "enforced at runtime, not parse time". This is accurate — source `commands/build.py` line 377 shows a `click.UsageError` raised in the command body.
+
+- **Cross-links** `[../commands/config.md]` and `[../commands/build.md]` (lines 161-162): Both files exist in `docs/commands/`.
+
+- **Cross-link** `[primitives-and-pipeline.md]` (line 158): File exists in `docs/concepts/`.
+
+## Verified correct (anchored claims you checked)
+
+- `winml config -m microsoft/resnet-50 -o resnet50.json` syntax is valid → `commands/config.py` lines 66-73 (`-m`/`--model`, `-o`/`--output`)
+- `--task` flag exists on `winml config` → `commands/config.py` lines 77-80
+- `--no-quant` flag exists on `winml config` → `commands/config.py` lines 155-159
+- `--trust-remote-code` flag exists on `winml config` → `commands/config.py` line 166
+- `-o` omission prints to stdout → `commands/config.py` lines 487-490
+- `winml build -c resnet50.json -m microsoft/resnet-50 --output-dir output/` valid → `commands/build.py` lines 233-256
+- `--use-cache` writes to `~/.cache/winml/` → `commands/build.py` lines 258-262
+- `--no-quant`, `--no-compile`, `--no-optimize` CLI overrides exist on `winml build` → `commands/build.py` lines 273, 275-282, 300-304
+- `WinMLBuildConfig.from_dict()` reads `loader`, `export`, `optim`, `quant`, `compile`, `eval` sections → `config/build.py` lines 152-172
+- `WinMLLoaderConfig`, `WinMLExportConfig`, `WinMLOptimizationConfig`, `WinMLQuantizationConfig`, `WinMLCompileConfig` all exist → `config/build.py` lines 54-64
+- JSON `quant` section fields `weight_type`, `activation_type`, `samples` exist → `quant/config.py` lines 55, 65-66
+- No `wmk` or `ModelKit` strings in prose → verified by grep
diff --git a/docs/superpowers/2026-05-27-doc-issues/config.md b/docs/superpowers/2026-05-27-doc-issues/config.md
new file mode 100644
index 000000000..7112e9396
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/config.md
@@ -0,0 +1,43 @@
+# Issues: docs/commands/config.md
+
+Source verified against: `src/winml/modelkit/commands/config.py` @ 5e25579
+
+## Critical (flag/behavior wrong; user gets error)
+
+- **`--no-compile` default is wrong.** The doc (line 32) states default is `off` (meaning compile *is* included by default). Source line 163 defines `--no-compile/--compile` with `"no_compile"` and `default=True`. The default is `no_compile=True`, meaning compilation is *excluded* from the generated config by default. A user reading the doc will expect compilation to be in the config and be surprised to find `"compile": null` in the output.
+
+- **`--verbose` flag is missing from the flag table.** Source lines 147–152 define `@click.option("-v", "--verbose", is_flag=True, default=False, ...)`. This is a real flag that enables `logging.DEBUG` (line 226) and is not documented in the flag table.
+
+- **`--ep` short form** — the doc flag table (line 27) shows no short form for `--ep`. The source uses `@cli_utils.ep_option(required=False, ...)` (line 126), and `ep_option` in `cli.py` line 140 registers `"--ep", "--execution-provider"` with no `-e` short. The doc correctly shows no short form, but it lists the full name without mentioning `--execution-provider` as an alias. This is a minor completeness issue but not an error.
+
+## Important (misleading or stale)
+
+- **`--no-compile` documentation**: The doc entry says default is `off` and the description reads "Omit compilation from the generated config (sets `compile` to `null`). Use this when you want to inspect the optimized ONNX before EP-specific compilation." Since `no_compile` defaults to `True`, compilation is omitted *by default* — the entire framing of `--no-compile` as an opt-in is backwards. Users do not need to pass `--no-compile` to skip compilation; they need `--compile` to include it.
+
+- **`--device` Choice values** — the doc says type is `auto|npu|gpu|cpu` (line 28). Source line 121 confirms `type=click.Choice(["auto", "npu", "gpu", "cpu"], case_sensitive=False)`. This is accurate.
+
+- **`--config / -c` help text says "JSON override file in `WinMLBuildConfig` format"** (doc line 24). Source line 103 uses `type=click.Path(exists=True)` and the flag is called `config_file`. The doc correctly describes behavior.
+
+- **`--ep` accepts aliases** — doc says values include `qnn`, `dml`, `migraphx`, `tensorrt`, `vitisai`, `openvino`, `cpu`. The actual choices come from `ALL_EP_NAMES` via `ep_option` (cli.py line 138). The list of aliases in the doc should be verified against `SUPPORTED_EPS` / `ALL_EP_NAMES` constants. The doc lists `dml` and `migraphx` which may or may not be in `ALL_EP_NAMES` — this should be confirmed.
+
+## Minor (polish)
+
+- The doc example `winml config -m facebook/convnext-tiny-224.onnx --no-quant --no-compile` (line 80) uses `--no-compile` as if it toggles something off, but since `no_compile=True` by default, `--no-compile` is a no-op here. The example is not wrong (it still works) but implies `--no-compile` is doing work when it is already the default.
+- `--trust-remote-code` is correctly listed in the flag table and matches source (via `@cli_utils.trust_remote_code_option()` at line 166).
+
+## Verified correct (key claims checked)
+
+- `-m / --model` exists with short `-m`, optional (not required), default `None` → source lines 67–74.
+- `-t / --task` exists with short `-t`, default `None` → source lines 75–79.
+- `--model-class` exists, no short form, default `None` → source lines 80–85.
+- `--model-type` exists, no short form, default `None` → source lines 86–94.
+- `--module` exists, no short form, default `None` → source lines 95–99.
+- `-c / --config` exists, type `Path(exists=True)`, default `None` → source lines 100–107.
+- `--shape-config` exists, type `Path(exists=True)`, default `None` → source lines 108–117.
+- `-d / --device` exists with Choice `["auto","npu","gpu","cpu"]`, default `"auto"` → source lines 118–125.
+- `-p / --precision` exists, type `str`, default `"auto"` → source lines 131–138.
+- `-o / --output` exists → via `cli_utils.output_option`, source line 140.
+- `--library` exists, default `"transformers"` → source lines 141–145.
+- `--no-quant` exists as `is_flag=True, default=False` → source lines 153–158.
+- At least one of `-m`, `--model-type`, `--model-class` required → source lines 229–241.
+- ONNX file input path sets `export=None` → source lines 297–311.
diff --git a/docs/superpowers/2026-05-27-doc-issues/convnext-primitives.md b/docs/superpowers/2026-05-27-doc-issues/convnext-primitives.md
new file mode 100644
index 000000000..87fa70150
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/convnext-primitives.md
@@ -0,0 +1,82 @@
+# Issues: docs/samples/convnext-primitives.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical
+
+- **Compiled artifact filenames are wrong for CPU and GPU (Step 5 + Step 6).**
+  The doc claims `winml compile --device cpu` writes `convnext_int8_cpu_ctx.onnx`
+  and `--device gpu` writes `convnext_int8_dml_ctx.onnx`. Both claims are false.
+  - `WinMLCompileConfig.for_cpu()` sets `enable_ep_context=False`
+    (`compiler/configs.py` line 165). CPUExecutionProvider does not generate an
+    EPContext file, so no `_cpu_ctx.onnx` is written at all.
+  - `WinMLCompileConfig.for_dml()` also sets `enable_ep_context=False`
+    (`compiler/configs.py` line 175). DML does not produce an EPContext either.
+  - Additionally, the session filename convention uses the resolved device string,
+    so if an EPContext were produced it would be `convnext_int8_gpu_ctx.onnx`
+    (device="gpu"), not `convnext_int8_dml_ctx.onnx`.
+  - The paragraph at the end of Step 5 restates the incorrect filenames and must
+    be corrected alongside the tab blocks.
+
+- **`winml perf --device gpu` line uses the non-existent artifact
+  `convnext_int8_dml_ctx.onnx`.** Because DML compile does not produce a ctx file
+  (see above), the benchmark command as written will fail with a file-not-found
+  error. The entire GPU tab in Step 6 is based on a false premise.
+
+## Important
+
+- **`--output` flag on `winml perf` is described as writing a JSON file.**
+  The doc says "Use the JSON output written by `--output`". The actual flag name
+  in `perf.py` is `-o` / `--output`, output defaults to a timestamped path under
+  `~/.cache/winml/perf/`. This description is essentially correct, but the page
+  never shows what the flag looks like in a command, which may confuse readers.
+  Minor wording issue only.
+
+- **Step 7 `winml eval` uses `--dataset imagenet-1k`.** HuggingFace's canonical
+  dataset ID for ImageNet-1k gated access is `imagenet-1k`, which matches. This
+  cannot be independently verified without HF credentials, but the ID is standard
+  and consistent with other pages.
+
+- **Note claims `--device auto` is not valid on `winml eval`.**
+  `eval.py` line 69: `type=click.Choice(["auto", "cpu", "gpu", "npu"])` — `auto`
+  IS listed as a valid choice. The doc's note "Note that `--device` accepts only
+  `cpu`, `gpu`, or `npu` — it does not accept `auto`" is incorrect.
+
+## Minor
+
+- **Cross-link to `../getting-started/end-to-end.md` in the admonition.**
+  Not verifiable without checking that file, but the link pattern is consistent
+  with other pages.
+
+- **Step 2: `winml config -m ... -o convnext_config.json`** — the `-o` flag is
+  correct for `config.py` (`cli_utils.output_option`). Verified correct.
+
+- **Step 3 export output text shows `Starting HTP export...` and
+  `Success! Model exported to: convnext.onnx`** — matches actual console output
+  strings in `export.py` lines 388 and 417. Verified correct.
+
+- **`--method entropy` mentioned in Step 4 note.** `quantize.py` line 65:
+  `type=click.Choice(["minmax", "entropy", "percentile"])`. `entropy` is valid.
+
+## Verified correct
+
+- `winml inspect -m facebook/convnext-tiny-224` — `-m` flag exists, model ID is
+  a real HF repo.
+- `winml config -m facebook/convnext-tiny-224 -o convnext_config.json` — flags
+  all exist in `config.py`.
+- `winml export -m facebook/convnext-tiny-224 -o convnext.onnx` — `-m` and `-o`
+  exist in `export.py`, `-o` is required for export.
+- `winml quantize -m convnext.onnx -o convnext_int8.onnx --precision int8 --samples 32`
+  — all flags verified in `quantize.py`.
+- `winml compile -m convnext_int8.onnx --output-dir . --device npu --qnn-sdk-root`
+  — `--output-dir`, `--device`, `--qnn-sdk-root` all exist in `compile.py`.
+- `winml compile --device npu` requiring `--qnn-sdk-root` or `QNN_SDK_ROOT` —
+  consistent with `compile.py` and source notes.
+- `winml perf` flags `--device`, `--iterations` — verified in `perf.py`.
+- `winml eval` flags `-m`, `--model-id`, `--dataset`, `--split`, `--samples`,
+  `--device` — verified in `eval.py`.
+- NPU artifact `convnext_int8_qnn_ctx.onnx` — consistent with session.py naming
+  (`{stem}_{device}_ctx.onnx` with device="npu"). Verified plausible.
+- "Pick the right ConvNeXt page" admonition links to `../tutorials/npu-convnext.md`
+  — resolves correctly; counterpart admonition in npu-convnext.md links back here.
+- No `wmk` or `ModelKit` strings found in user-facing prose.
diff --git a/docs/superpowers/2026-05-27-doc-issues/end-to-end.md b/docs/superpowers/2026-05-27-doc-issues/end-to-end.md
new file mode 100644
index 000000000..8a1e8c4af
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/end-to-end.md
@@ -0,0 +1,15 @@
+# Issues: docs/getting-started/end-to-end.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical
+- Artifact filename pattern is wrong for DML and CPU (end-to-end.md:123–125). The doc claims the GPU artifact is named `convnext_tiny_dml_ctx.onnx` and the CPU artifact is `convnext_tiny.onnx`. Source: `compiler/configs.py` `for_dml()` sets `enable_ep_context=False` and `for_cpu()` also sets `enable_ep_context=False`. When `enable_ep_context=False`, `compile.py` `_finalize_output` is never called (the `if ep_config.enable_ep_context:` guard in `CompileStage.process`), meaning no `_ctx.onnx` is produced and `winml build --no-compile` leaves only the quantized ONNX. Neither `convnext_tiny_dml_ctx.onnx` nor a special CPU variant filename is produced; the DML and CPU "compile" steps are no-ops that return `None` from `for_provider`. The correct behavior is that only QNN (and OpenVINO, VitisAI, NvTensorRTRTX) produce `_ctx.onnx` artifacts; DML/CPU compile is skipped entirely.
+- `winml build` `--no-quant` / `--no-compile` flags exist in source (build.py:270, 276), but the doc also mentions `--no-optimize` (end-to-end.md:106) — this flag exists (`build.py:300`), so that claim is correct. However, the doc omits any mention that `--no-compile/--compile` is actually a toggle pair and `--compile` can be used to force enable compilation (build.py:277–280). Minor gap but not a factual error.
+
+## Important
+- `winml build` warning box (end-to-end.md:111–113): states the build reads `QNN_SDK_ROOT` from the environment. This is correct for the `winml build` wrapper, which does NOT expose `--qnn-sdk-root` (build.py has no such option). The doc is consistent with the source. No error here.
+- `--device auto` priority order claimed as "NPU first, then GPU, then CPU" (end-to-end.md:7–8): confirmed correct by `sysinfo/device.py` `_DEVICE_PRIORITY: tuple[str, ...] = ("npu", "gpu", "cpu")`.
+- Tabbed `sys` output EP names (end-to-end.md:54–57): `QNNExecutionProvider -> NPU`, `DmlExecutionProvider -> GPU`, `CPUExecutionProvider -> CPU`. Cross-referencing `EP_SUPPORTED_DEVICES` in `constants.py`: `QNNExecutionProvider` maps to `("npu", "gpu")` not just `"npu"`. The display in `_output_ep_text` shows the first device from `get_ep_device_map()` which joins with `/`, so it would render `QNNExecutionProvider -> NPU/GPU`, not just `NPU`. The sample output in the doc shows only `-> NPU`, which is inaccurate.
+
+## Minor
+- Step 3 perf command uses placeholder `<artifact>.onnx` (end-to-end.md:119). Given the critical artifact naming issue above, the example filenames shown in the tabbed blocks (`convnext_tiny_qnn_ctx.onnx`, `convnext_tiny_dml_ctx.onnx`, `convnext_tiny.onnx`) are not the actual file stems that `winml build` produces for a model named `convnext-tiny-224`. The actual stem would depend on the slug generated from the model ID (not verified here), but the `_dml_ctx` and plain `.onnx` names are definitely wrong per the critical issue above.
diff --git a/docs/superpowers/2026-05-27-doc-issues/eps-and-devices.md b/docs/superpowers/2026-05-27-doc-issues/eps-and-devices.md
new file mode 100644
index 000000000..495c81cdb
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/eps-and-devices.md
@@ -0,0 +1,26 @@
+# Issues: docs/concepts/eps-and-devices.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical (factually wrong; user would hit error)
+- (none)
+
+## Important (misleading or stale claim)
+- Line 13 (table row for `QNNExecutionProvider`): The table lists QNN's device as `npu` only. However, `src/winml/modelkit/utils/constants.py:184` declares `"QNNExecutionProvider": ("npu", "gpu")` — QNN also supports `gpu` as a secondary device. The table is therefore incomplete and will mislead users who want to run QNN on a GPU target.
+
+- Lines 35-38: The `--device` description says the default is `auto` and it picks "NPU > GPU > CPU". The source at `src/winml/modelkit/commands/build.py:289-290` sets `default="auto"` for `--device` in the build command, and `src/winml/modelkit/commands/analyze.py:645` also defaults to `"auto"`. Priority logic `NPU > GPU > CPU` is consistent with `EP_SUPPORTED_DEVICES` key order in `src/winml/modelkit/utils/constants.py:178-187`. So far accurate. However, `--device` on `winml analyze` accepts `CPU/GPU/NPU/all/auto` (uppercase; `src/winml/modelkit/commands/analyze.py:644-648`), not lowercase as shown in the doc examples on lines 37-40. The CLI itself normalizes case, so commands work, but showing `--device npu` (lowercase) in examples while the `type=click.Choice([*SUPPORTED_DEVICES, ...])` enumerates uppercase `"CPU"`, `"GPU"`, `"NPU"` (`src/winml/modelkit/utils/constants.py:163-167`) could be confusing. Since Click's `case_sensitive=False` is set on the analyze command, the examples aren't wrong, but readers inspecting help output will see uppercase choices.
+
+- Lines 48-53: Example shows `winml analyze --model model.onnx --ep QNNExecutionProvider --device npu`. The `analyze` command uses `--model` (confirmed at `src/winml/modelkit/utils/cli.py:69`), not `--model-path` or another variant. The example is correct in flag name.
+
+## Minor (style, polish, low-impact)
+- Lines 57-63: All cross-links (`graphs-and-ir.md`, `weight-and-activation.md`, `../commands/sys.md`, `../commands/analyze.md`) resolve to files on disk.
+- Line 22: `winml sys --list-ep` — flag `--list-ep` confirmed at `src/winml/modelkit/commands/sys.py:668-671`.
+
+## Verified correct (anchored claims you checked)
+- Lines 11-19 (EP table): `CPUExecutionProvider`, `DmlExecutionProvider`, `MIGraphXExecutionProvider`, `NvTensorRTRTXExecutionProvider`, `OpenVINOExecutionProvider`, `QNNExecutionProvider`, `VitisAIExecutionProvider` — all seven are in `EPName` Literal at `src/winml/modelkit/utils/constants.py:24-33`.
+- Table: `OpenVINOExecutionProvider` listed as supporting `npu / gpu / cpu` — confirmed by `"OpenVINOExecutionProvider": ("npu", "gpu", "cpu")` at `src/winml/modelkit/utils/constants.py:185`.
+- Table: `VitisAIExecutionProvider` listed as `npu` only — confirmed by `"VitisAIExecutionProvider": ("npu",)` at `src/winml/modelkit/utils/constants.py:183`.
+- Table: `DmlExecutionProvider` listed as `gpu` only — confirmed by `"DmlExecutionProvider": ("gpu",)` at `src/winml/modelkit/utils/constants.py:186`.
+- Table: `MIGraphXExecutionProvider` listed as `gpu` only — confirmed by `"MIGraphXExecutionProvider": ("gpu",)` at `src/winml/modelkit/utils/constants.py:182`.
+- Table: `NvTensorRTRTXExecutionProvider` listed as `gpu` only — confirmed by `"NvTensorRTRTXExecutionProvider": ("gpu",)` at `src/winml/modelkit/utils/constants.py:179`.
+- Lines 44-45: `--ep` accepts aliases `qnn`, `vitisai`, `dml`, `openvino` — confirmed in `EP_ALIASES` at `src/winml/modelkit/utils/constants.py:59-69`.
diff --git a/docs/superpowers/2026-05-27-doc-issues/eval-and-datasets.md b/docs/superpowers/2026-05-27-doc-issues/eval-and-datasets.md
new file mode 100644
index 000000000..84f5031b2
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/eval-and-datasets.md
@@ -0,0 +1,18 @@
+# Issues: docs/concepts/eval-and-datasets.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical
+
+- (none)
+
+## Important
+
+- Lines 1–7: The concept doc lists no `--ep`, `--precision`, `--dataset-script`, or `--trust-remote-code` flags, all of which exist in eval.py (lines with `@cli_utils.ep_option`, `--precision`, `--dataset-script`, `--trust-remote-code`). While a concept page need not enumerate every flag, omitting `--precision` is notable because the page is about post-quantization accuracy checks and `--precision` directly affects which model artifact is built.
+- Line 25 / `--samples` default: The concept doc does not state a default for `--samples`, but line 34 of docs/commands/eval.md lists the default as `100`. Source confirms `default=100` (eval.py). This is consistent, but the concept page example at line 35 uses `--samples 200` without noting the default, which is fine — no defect here on its own.
+
+## Minor
+
+- Line 22: States `--output` "accepts any `.json` path; if omitted, results are printed but not persisted." Source confirms this (no default for `output_path`). Accurate.
+- Line 35: `--streaming` flag description says it "fetches rows on demand instead of materialising the whole dataset locally." Source confirms `is_flag=True, default=False`. Accurate.
+- Line 38: `--column key=value` usage is consistent with source (`multiple=True`, key=value parsing in eval.py). Accurate.
diff --git a/docs/superpowers/2026-05-27-doc-issues/eval.md b/docs/superpowers/2026-05-27-doc-issues/eval.md
new file mode 100644
index 000000000..eff4653ba
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/eval.md
@@ -0,0 +1,23 @@
+# Issues: docs/commands/eval.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical
+
+- Line 24: `--device` type column shows `cpu|gpu|npu` with default `cpu`. Source defines `type=click.Choice(["auto", "cpu", "gpu", "npu"])` with `default="auto"` (eval.py). The `auto` choice is missing and the default is wrong.
+- Line 25: `-n` is listed as a short alias for `--samples`. Source defines `--samples` with no short flag (eval.py `@click.option("--samples", type=int, default=100, ...)`). The `-n` alias does not exist.
+
+## Important
+
+- Flags table is missing the following options that exist in source (eval.py):
+  - `--ep` — execution provider override (`@cli_utils.ep_option`)
+  - `--precision` — precision mode (`--precision`, default `auto`)
+  - `--dataset-script` — path to a dataset-building script
+  - `--trust-remote-code` — required flag when `--dataset-script` is used
+  - `--verbose` / `-v` — verbose output flag
+- Line 36: "How it works" section says `winml eval` loads the model via `WinMLAutoModel`. Source uses `WinMLEvaluationConfig` and calls `evaluate(cfg)` from the `eval` subpackage (eval.py). The class name `WinMLAutoModel` does not appear in eval.py; the description misrepresents the implementation.
+
+## Minor
+
+- Line 19: `--model` description says "Required (unless `--model-id` is provided directly)." Source actually raises `UsageError` if neither `-m` nor `--model-id` resolves a model, and `--model-id` alone (without `-m`) is accepted only to supply a HuggingFace ID. This nuance is slightly misleading but not a breaking inaccuracy.
+- Line 88: Pitfall note "`--streaming` skips the local cache." Source confirms this behaviour. Accurate.
diff --git a/docs/superpowers/2026-05-27-doc-issues/export.md b/docs/superpowers/2026-05-27-doc-issues/export.md
new file mode 100644
index 000000000..387b4fe5c
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/export.md
@@ -0,0 +1,32 @@
+# Issues: docs/commands/export.md
+
+Source verified against: `src/winml/modelkit/commands/export.py` @ 5e25579
+
+## Critical (flag/behavior wrong; user gets error)
+
+- (none)
+
+## Important (misleading or stale)
+
+- **`--dynamo` description says "PyTorch 2.9+" but that version string is invented.** The source (`export.py:376-384`) only warns the flag is unsupported; no PyTorch version requirement is stated. Remove the version number claim to avoid confusion.
+- **`--torch-module` description says "Experimental — currently logs a warning"** — this is accurate, but the phrase "currently logs a warning" hides the fact that the flag is **completely ignored** (the option value is never forwarded to `export_onnx()`). Source at `export.py:364-373` explicitly states `TODO: Add torch_module support`. Use "has no effect" rather than "currently logs a warning".
+- **`--dynamo` same problem.** Source `export.py:376-384`: "dynamo=True is not supported by export_onnx(). TODO: Add dynamo support". The flag has zero effect; the table note says only "currently logs a warning".
+
+## Minor (polish)
+
+- **Flag table missing `--verbose` / `-v`.** `export.py:73-78` defines `--verbose / -v` as an explicit option with a `help` string. Every other command page includes `--verbose` in their tables; its absence on this page is inconsistent.
+- **`--clean-onnx` / `--no-hierarchy` are presented as two separate flags in the table but they are one option.** The source defines them as aliases of a single `--clean-onnx / --no-hierarchy` option with `"no_hierarchy"` as the internal parameter name (`export.py:85-92`). The table formatting (`--clean-onnx` / `--no-hierarchy` in one cell) is technically correct but the slash notation could mislead readers into thinking these are independent toggles.
+
+## Verified correct (key claims checked)
+
+- `--model` / `-m` required string → `export.py:65-70`
+- `--output` / `-o` required path → `export.py:71` via `cli_utils.output_option(required=True)`
+- `--with-report` is_flag default false → `export.py:79-84`
+- `--input-specs` path default None → `export.py:107-111`
+- `--task` / `-t` string default None → `export.py:113-118`
+- `--export-config` path default None → `export.py:119-124`
+- `--shape-config` path default None → `export.py:125-130`
+- `--shape-config` silently ignored when `--input-specs` is provided → `export.py:307-331` (input-specs overrides/patches auto-resolved tensors; shape_config is loaded only before auto-resolution, so if both are present the shape_config still applies to auto-resolution and input-specs then overrides it — the doc's "Ignored when `--input-specs` is provided" is a slight overstatement but matches the spirit)
+- Eight-step HTP export description → `export.py:153-161` (docstring)
+- `--dynamo` and `--torch-module` emit warnings and have no effect → `export.py:364-384`
+- No `wmk` or `ModelKit` strings in user-facing prose → confirmed
diff --git a/docs/superpowers/2026-05-27-doc-issues/graphs-and-ir.md b/docs/superpowers/2026-05-27-doc-issues/graphs-and-ir.md
new file mode 100644
index 000000000..44fd51061
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/graphs-and-ir.md
@@ -0,0 +1,21 @@
+# Issues: docs/concepts/graphs-and-ir.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical (factually wrong; user would hit error)
+- (none)
+
+## Important (misleading or stale claim)
+- Line 29: Citation "(`src/winml/modelkit/export/config.py`, line 75)". The file exists and `opset_version: int = 17` is indeed at line 75 (`src/winml/modelkit/export/config.py:75`). However, the doc says this value lives in `WinMLExportConfig` — correct — but the enclosing class declaration begins at line 33. The citation is precise enough to be useful but readers should be aware `line 75` is inside a `@dataclass`. No factual error, but the explanation "This is the value of `opset_version: int = 17` in `WinMLExportConfig` (`src/winml/modelkit/export/config.py`, line 75)" is accurate and verified.
+
+- Line 38: The export CLI example uses `--export-config export_cfg.json`. Verification of `winml export` is needed. The analyze command uses `--model`; the export command is at `src/winml/modelkit/commands/export.py`. The flag `--export-config` is not confirmed verified here, but is not the focus of this page's claims.
+
+## Minor (style, polish, low-impact)
+- Line 15: Claims metadata includes `winml.io.inputs` and `winml.hierarchy.tag`. Both strings are confirmed to exist in the source (`src/winml/modelkit/onnx/metadata.py` and `src/winml/modelkit/core/node_metadata.py`). The attribution "on individual nodes" for `winml.hierarchy.tag` is correct — it is a node-level attribute. The attribution of `winml.io.inputs` to "model level" is consistent with the metadata module. These are accurate.
+
+- Lines 53-60: All cross-links (`eps-and-devices.md`, `weight-and-activation.md`, `quantization.md`, `../commands/inspect.md`, `../commands/export.md`) resolve to files that exist on disk.
+
+## Verified correct (anchored claims you checked)
+- Line 29: `opset_version: int = 17` at `src/winml/modelkit/export/config.py:75` — confirmed exactly.
+- Line 15: `winml.hierarchy.tag` found in `src/winml/modelkit/export/htp/exporter.py` and `src/winml/modelkit/core/node_metadata.py`; `winml.io.inputs` found in `src/winml/modelkit/onnx/metadata.py` and `src/winml/modelkit/onnx/io.py`.
+- Lines 9-15: ONNX `ModelProto` / `GraphProto` structure description (inputs, outputs, nodes, initializers, metadata) matches standard ONNX format and how winml-cli uses it.
diff --git a/docs/superpowers/2026-05-27-doc-issues/how-it-works.md b/docs/superpowers/2026-05-27-doc-issues/how-it-works.md
new file mode 100644
index 000000000..c33b92b98
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/how-it-works.md
@@ -0,0 +1,22 @@
+# Issues: docs/concepts/how-it-works.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical (factually wrong; user would hit error)
+- (none)
+
+## Important (misleading or stale claim)
+- Line 80: Doc says `winml build` auto-detects ONNX vs HF and calls "`build_hf_model` or `build_onnx_model`". This is inaccurate at the CLI layer. The build command (`src/winml/modelkit/commands/build.py`) orchestrates stages directly via `_build_hf_pipeline()` / `_build_onnx_pipeline()` inline functions. The named public API functions `build_hf_model` / `build_onnx_model` (from `src/winml/modelkit/build/hf.py` and `build/onnx.py`) are only called in module-mode (`_build_modules()`), not in the single-model code path. Telling readers "calls `build_hf_model` or `build_onnx_model`" misrepresents the actual dispatch.
+
+- Line 88: Example flag `--no-optimize` is valid (`src/winml/modelkit/commands/build.py:300`), but the comment "Skip optimization (for pre-quantized input)" is misleading. The source docstring says "Skip optimization (for pre-quantized ONNX models)" (`build.py:303`), and the flag is general-purpose (not limited to pre-quantized inputs). The doc's narrower framing could confuse users with other reasons to skip optimization.
+
+## Minor (style, polish, low-impact)
+- Line 12: Claims the pipeline API "powers `WinMLAutoModel.from_pretrained()`". `WinMLAutoModel` exists (`src/winml/modelkit/models/auto.py`) but the connection to the pipeline described here is not verifiable from the source at the cited commit; may be aspirational or referring to an internal API not exposed in this path.
+
+- Lines 116–122: Cross-links — `../commands/build.md`, `../commands/export.md`, `eps-and-devices.md`, and `config-and-build.md` all resolve to files that exist on disk. No broken links.
+
+## Verified correct (anchored claims you checked)
+- Lines 88-91: `--no-quant` and `--no-compile` flags exist in `src/winml/modelkit/commands/build.py:274` and `279-282` respectively. `--no-optimize` exists at line 300.
+- Lines 99-105: `WinMLBuildConfig` structure (loader/export/optim/quant/compile) matches `src/winml/modelkit/config/build.py:97-138`.
+- Lines 109-110: Setting `quant` or `compile` to null skips that stage; confirmed by `src/winml/modelkit/commands/build.py:948-949` (quant) and `src/winml/modelkit/commands/build.py:1038-1039` (compile).
+- Line 113: Config file written after optimize stage; confirmed by `src/winml/modelkit/commands/build.py:1192` (`config_path.write_text(...)`).
diff --git a/docs/superpowers/2026-05-27-doc-issues/hub.md b/docs/superpowers/2026-05-27-doc-issues/hub.md
new file mode 100644
index 000000000..e69d31e3a
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/hub.md
@@ -0,0 +1,37 @@
+# Issues: docs/commands/hub.md
+
+Source verified against: `src/winml/modelkit/commands/catalog.py` @ 5e25579
+
+## Critical (flag/behavior wrong; user gets error)
+
+- **The command documented is `winml hub` but the source registers it as `winml catalog`.** Source line 362 is `@click.command()` with no `name=` argument; the function is named `catalog` (line 387), and the CLI is wired to `winml catalog` per the docstring (lines 6–17). Every invocation example in the doc uses `winml hub` (e.g. `$ winml hub`, `$ winml hub --model-type bert`) — these will all fail unless there is an alias registered elsewhere. The doc must either be renamed to `catalog.md` and updated throughout, or the alias must be verified.
+
+- **`--model` / `-m` flag for detail view does not exist in source.** The doc table lists `--model / -m` as "Show detailed latency and accuracy benchmarks for a specific model ID" (doc line 23). The source `catalog` command (lines 362–429) has no `--model` option. The source accepts only `--model-type / -t`, `--task / -k`, `--ep`, `--device`, and `--output`. There is no per-model detail view in the source at all. Any user running `winml hub --model ProsusAI/finbert` will get an "unrecognized option" error.
+
+- **`--ep` and `--device` flags are absent from the doc flag table entirely.** Source lines 377–385 add `ep_option(required=False)` and `device_option(required=False, default=None)`. The doc only lists four flags and makes no mention of `--ep` or `--device`. These are functional filters that change output — omitting them is a content gap that will confuse users trying to filter by EP or device.
+
+## Important (misleading or stale)
+
+- **"How it works" describes per-EP latency stats (avg, P50, P90, P95, P99, min, max, QPS) and accuracy verdicts (PASS/AT_RISK/REGRESSION)** — the source `catalog.py` makes no reference to these fields. The catalog data source is `hub_models.json` (line 64) and the rendering code (lines 276–306) shows columns: Model, Task, Size, Model Type, and optionally Devices or EPs. No latency stats or accuracy verdict columns appear in the rendered output. The "How it works" section describes functionality that either does not exist in this command or belongs to a different one (e.g., `winml perf`).
+
+- **Accuracy verdict description (`drop_pct`) in "How it works"** is not supported by any code in `catalog.py`. The `See also` section points to `quantization.md` to explain `drop_pct`, but this doc is describing `winml catalog` which has no such output.
+
+- **Example output shows "winml-cli Catalog"** (doc line 50) but source line 301 renders `"WinML CLI Catalog"`. Minor discrepancy.
+
+- **Pitfall says `--model` performs substring matching** (doc line 90–92) — this flag does not exist in source. The entire pitfall is based on a non-existent feature.
+
+- **Pitfall "no flag to dump entire catalog"** (doc line 97–99) says "omit all filters and add `--output`" — the source does support `--output` with no filters (lines 428–429), so this pitfall hint is correct, but the surrounding text refers to `--model` which does not exist.
+
+## Minor (polish)
+
+- The synopsis `$ winml hub [options]` uses the wrong command name; should be `$ winml catalog [options]`.
+- Cross-reference at doc line 108 reads `hub.md` in `sys.md` which will be a broken link if this doc is renamed.
+- The `--task` short flag warning pitfall ("use `-k`, not `-t`") is correct → source line 373 confirms `-k`.
+
+## Verified correct (key claims checked)
+
+- `--model-type / -t` filter exists, case-insensitive → source lines 363–369.
+- `--task / -k` filter exists, case-insensitive → source lines 370–376.
+- `--output / -o` saves JSON → source lines 428–429 via `cli_utils.output_option`.
+- Catalog loaded from local package data (no network) → source lines 53–65.
+- `_filter_models` applies exact case-insensitive equality on `model_type` and `task` → source lines 68–88.
diff --git a/docs/superpowers/2026-05-27-doc-issues/index.md b/docs/superpowers/2026-05-27-doc-issues/index.md
new file mode 100644
index 000000000..535c889f4
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/index.md
@@ -0,0 +1,24 @@
+# Issues: docs/index.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical
+
+(none)
+
+## Important
+
+- **Anchor `#eps-winml-cli-supports` may not resolve.** The link `concepts/eps-and-devices.md#eps-winml-cli-supports` targets a heading "EPs winml-cli supports" (line 7 of that file). MkDocs lowercases and hyphenates heading text, so "EPs winml-cli supports" becomes `#eps-winml-cli-supports`. The "EPs" acronym normalizes correctly here — the anchor is valid as written, but this depends on MkDocs slug behaviour for acronyms (capitals are lowercased). Treat as worth verifying in the rendered site.
+
+## Minor
+
+- **"12 `winml` subcommands"** — the `docs/commands/` directory contains 12 `.md` files (analyze, build, compile, config, eval, export, hub, inspect, optimize, overview, perf, quantize, sys). `overview.md` is a landing page, not a subcommand. The actual executable subcommands registered in the CLI should be counted and verified; if hub or overview are not registered commands the "12" claim would be wrong.
+
+## Verified correct
+
+- No `wmk` or `ModelKit` strings in user-facing prose.
+- GitHub URL `https://github.com/microsoft/winml-cli` matches `pyproject.toml` URLs.
+- Links to `getting-started/installation.md`, `getting-started/quickstart.md`, `getting-started/end-to-end.md`, `concepts/how-it-works.md`, `commands/overview.md` all resolve to files that exist.
+- Link to `samples/convnext-primitives.md` resolves.
+- MIT licence link points to `https://github.com/microsoft/winml-cli/blob/main/LICENSE.txt`.
+- Tagline and bullets read naturally with no leftover `wmk`/`ModelKit` names.
diff --git a/docs/superpowers/2026-05-27-doc-issues/inspect.md b/docs/superpowers/2026-05-27-doc-issues/inspect.md
new file mode 100644
index 000000000..bcd42e2ea
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/inspect.md
@@ -0,0 +1,35 @@
+# Issues: docs/commands/inspect.md
+
+Source verified against: `src/winml/modelkit/commands/inspect.py` @ 5e25579
+
+## Critical (flag/behavior wrong; user gets error)
+
+- **`--model` is listed as "required" in the flag table** (doc line 22: "Required unless `--help` is used") but the source marks it `required=False` (line 63). The command accepts `--model-type` or `--model-class` as alternatives; source line 165 raises `UsageError` only when all three (`model_id`, `model_type`, `model_class`) are None. Users who read the doc and omit `-m` expecting a usage error will instead succeed with `--model-type`.
+
+- **`--list-tasks` flag is not documented at all.** Source lines 98–103 define `@click.option("--list-tasks", "list_tasks", is_flag=True, ...)`. Omitting it from the flags table means users cannot discover this flag. Running `winml inspect --list-tasks` exits early printing all known tasks (lines 157–161) — a useful shortcut completely hidden from the doc.
+
+- **`--model-type` and `--model-class` flags are not documented.** Source lines 104–116 define `--model-type` (can replace `-m`) and `--model-class` (can replace `-m`). The doc synopsis says `-m <model_id>` is the only input path. Users have no way to discover the type-only or class-only inspection paths shown in the source docstring examples.
+
+## Important (misleading or stale)
+
+- **`-v` / `--verbose` flag is absent from the flag table.** Source lines 78–83 define `@click.option("-v", "--verbose", is_flag=True, ...)`. Verbose mode changes JSON/table output to include full configuration details (passed as `verbose=verbose` to `output_json` and `output_table` at lines 229–231).
+
+- **"How it works" says `--hierarchy` uses `AutoModel.from_config()` and records a "forward-pass trace"** — source lines 449–458 show `extract_hierarchy(model_id)` is called, but this is `from ..inspect.hierarchy import extract_hierarchy` which is a separate module. The source comment at line 451 says "requires model_id" (line 452: `if include_hierarchy and model_id:`), not just a config fetch. The claim that "no real weights are downloaded" should be verified against `extract_hierarchy`.
+
+- **`--format` choices are documented as `table | json`** — source line 74 confirms `click.Choice(["table", "json"])`, so this is correct. However the doc uses backtick-escaped `table` and `json` which is fine.
+
+## Minor (polish)
+
+- The `--help / -h` row in the flag table is auto-added by Click and does not need to be listed explicitly.
+- The synopsis shows `$ winml inspect -m <model_id> [options]` but since `-m` is not required, the synopsis should read `$ winml inspect [options]` or include alternates.
+- The example `winml inspect -m facebook/convnext-tiny-224 -v -H` uses `-v` which is a real and functional flag, but since `-v` is not in the flag table the user has no context for it. Consistent with the missing `--verbose` entry.
+
+## Verified correct (key claims checked)
+
+- `-m` / `--model` short form exists → source line 62.
+- `-f` / `--format` with `Choice(["table", "json"])`, default `"table"` → source lines 70–76.
+- `-t` / `--task` with no required constraint, default `None` → source lines 85–90.
+- `-H` / `--hierarchy` as `is_flag=True, default=False` → source lines 91–97.
+- Command does not accept `--device`, `--ep`, `--precision`, `--output` → confirmed absent.
+- `--format json` output goes to stdout, banners go to stderr → source lines 33–35.
+- `--list-tasks` requires no model and lists `KNOWN_TASKS` → source lines 157–161.
diff --git a/docs/superpowers/2026-05-27-doc-issues/installation.md b/docs/superpowers/2026-05-27-doc-issues/installation.md
new file mode 100644
index 000000000..cb6f0a0bb
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/installation.md
@@ -0,0 +1,13 @@
+# Issues: docs/getting-started/installation.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical
+- Python version wrong: doc states `3.10` and claims `requires-python = ">=3.10,<3.11"` (installation.md:3, 11), but `pyproject.toml` at 5e25579 declares `requires-python = ">=3.11,<3.12"`. The install step (`uv python install 3.10`) and the "Verify" expected output (`Python Version 3.10.x`) are also wrong as a result.
+
+## Important
+- "No NPU?" callout claims `winml eval` accepts only `cpu|gpu|npu` (no `auto`) (installation.md:16). This is **incorrect**: `eval.py` defines `--device` as `click.Choice(["auto", "cpu", "gpu", "npu"])` with `default="auto"` — `auto` is a valid value for `winml eval`.
+- `winml sys --list-device --list-ep` flags: both `--list-device` and `--list-ep` exist in `sys.py` (lines with `@click.option("--list-device", ...)` and `@click.option("--list-ep", ...)`), so this is not an error, but the quickstart.md description (quoted here as context) says these flags "skip SDK versions and Python environment details" — that is not the behavior when both are passed; the full sysinfo is **not** run, only the device/EP lists are printed. Not an issue in installation.md itself.
+
+## Minor
+- The `--extra qnn` footnote claims `onnxruntime-qnn` requires Python 3.11+ and is "reserved for future use" (installation.md:70). `pyproject.toml` at 5e25579 already gates the dep on `python_version>='3.11'` and the project itself requires 3.11+, so the "reserved for future use" framing is inaccurate — it is already effective on the required Python version.
diff --git a/docs/superpowers/2026-05-27-doc-issues/load-and-export.md b/docs/superpowers/2026-05-27-doc-issues/load-and-export.md
new file mode 100644
index 000000000..c126a8c82
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/load-and-export.md
@@ -0,0 +1,43 @@
+# Issues: docs/concepts/load-and-export.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical (factually wrong; user would hit error)
+
+- (none)
+
+## Important (misleading or stale claim)
+
+- **`--dynamo` described as "reserved but not yet functional"** (line 19): The doc says "the `--dynamo` flag is reserved for the PyTorch 2.x dynamo exporter but is **not yet functional** in the current release — passing it logs a warning and the flag is ignored." The source confirms this: `commands/export.py` lines 376-384 show that when `dynamo=True`, a warning is printed and the flag is ignored. The note itself is accurate, but the doc still mentions "PyTorch 2.x" while the CLI help text says "PyTorch 2.9+" (`commands/export.py` line 98: `"Enable PyTorch 2.9+ dynamo export for rich node metadata"`). The version reference in the doc is stale/imprecise.
+
+- **`--torch-module` described as "reserved but not yet functional"** (line 35): Similarly, the source confirms (`commands/export.py` lines 362-373) it logs a warning and is ignored. The doc note is accurate. However, the doc says it is "intended to include them as distinct hierarchy nodes" while the CLI help says "Include torch.nn modules in hierarchy (comma-separated)" — consistent.
+
+- **`winml inspect` described as working "without downloading weights"** (line 13): The doc says `winml inspect` "prints the detected task, the HuggingFace model class, the export configuration, and the WinML inference class — all without downloading weights. Add `--hierarchy` to reconstruct the PyTorch module tree from random-weight tracing." The `commands/inspect.py` file was not read, so this specific claim about not downloading weights cannot be confirmed or denied from available sources. This warrants scrutiny.
+
+- **`--shape-config` vs `--input-specs`** (line 33): The doc says "Provide a `--shape-config` JSON file with explicit overrides, or use `--input-specs` to supply a fully specified input manifest." The `winml export` command has both flags: `--shape-config` (line 126 in `commands/export.py`) and `--input-specs` (line 106-111). This is correct. However, the doc describes them as equivalent alternatives — in the source, `--shape-config` passes shape overrides to auto-resolution while `--input-specs` overrides individual tensor specs after auto-resolution. They work differently, not interchangeably.
+
+## Minor (style, polish, low-impact)
+
+- **`winml.hierarchy.tag` metadata key name** (line 21): Doc says nodes carry `winml.hierarchy.tag` and `winml.hierarchy.depth`. Both keys confirmed at `src/winml/modelkit/export/htp/exporter.py` lines 594-595 and `src/winml/modelkit/core/node_metadata.py` lines 71, 74.
+
+- **`winml.io.inputs` and `winml.io.outputs` described as model-level** (line 21): Confirmed at `src/winml/modelkit/export/htp/exporter.py` lines 556, 564.
+
+- **`--no-hierarchy` alias `--clean-onnx`** (line 23): Source confirms both flags exist as aliases: `commands/export.py` lines 87-92 (`--clean-onnx` / `--no-hierarchy`).
+
+- **`--with-report` flag** (line 25): Exists at `commands/export.py` line 80-83.
+
+- **Cross-links** `[graphs-and-ir.md]`, `[../commands/inspect.md]`, `[../commands/export.md]` (lines 39-41): All files exist.
+
+## Verified correct (anchored claims you checked)
+
+- `winml export` uses TorchScript tracing by default → `commands/export.py` line 157 (docstring: "ONNX Export — Convert to ONNX format (TorchScript by default)")
+- `--dynamo` flag exists on `winml export` → `commands/export.py` lines 94-98
+- `--torch-module` flag exists on `winml export` → `commands/export.py` lines 100-105
+- `--task` flag exists on `winml export` → `commands/export.py` lines 112-117
+- `--input-specs` flag exists on `winml export` → `commands/export.py` lines 106-111
+- `--shape-config` flag exists on `winml export` → `commands/export.py` lines 125-130
+- `winml.hierarchy.tag` is a real metadata key → `core/node_metadata.py` line 71
+- `winml.hierarchy.depth` is a real metadata key → `core/node_metadata.py` line 74
+- `winml.io.inputs` / `winml.io.outputs` are model-level metadata props → `export/htp/exporter.py` lines 556, 564
+- `--trust-remote-code` applies to `winml config` (not `winml export` directly) → `commands/config.py` line 166
+- No `wmk` or `ModelKit` strings in prose → verified by grep
diff --git a/docs/superpowers/2026-05-27-doc-issues/npu-convnext.md b/docs/superpowers/2026-05-27-doc-issues/npu-convnext.md
new file mode 100644
index 000000000..34b02bf57
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/npu-convnext.md
@@ -0,0 +1,18 @@
+# Issues: docs/tutorials/npu-convnext.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical
+- Step 7 CPU compile artifact named `convnext_int8_cpu_ctx.onnx` (npu-convnext.md:164): `compiler/configs.py` `for_cpu()` sets `enable_ep_context=False`, so `CompileStage._finalize_output` is never invoked and no `_cpu_ctx.onnx` file is written. The CPU compile step is silently skipped by `for_provider()` returning `None` when `enable_ep_context=False`. The CPU tab in Step 7 describes a compile command that produces no artifact, and the named output file does not exist.
+- Step 8 CPU perf command references `convnext_int8_cpu_ctx.onnx` (npu-convnext.md:190): this file is never produced (same root cause as above). The CPU benchmark tab would fail to find the input model.
+- Step 9 eval uses `--device npu` (npu-convnext.md:224): `eval.py` declares `--device` as `click.Choice(["auto", "cpu", "gpu", "npu"])` — `npu` is a valid value. However, the tutorial is evaluating `convnext_int8.onnx` (the quantized float ONNX before compilation) on the NPU. This will attempt to run the uncompiled model through QNN EP, which requires JIT compilation at load time and may fail or be extremely slow. This is a usage problem but `npu` is a legal value, so it is not a flag-existence error.
+
+## Important
+- Step 7 OpenVINO compile (npu-convnext.md:155): `winml compile -m convnext_int8.onnx --device npu --ep openvino`. In `compile.py`, `--device` accepts `["auto", "npu", "gpu", "cpu"]` and `--ep` accepts EP aliases. `OpenVINOExecutionProvider` maps to `("npu", "gpu", "cpu")` in `EP_SUPPORTED_DEVICES`, so `--device npu --ep openvino` is a valid combination. No error here.
+- Step 7 claims OpenVINO produces `convnext_int8_openvino_ctx.onnx` (npu-convnext.md:164): `for_openvino()` sets `enable_ep_context=True`, so an EPContext file is produced. The filename pattern `{stem}_{device}_ctx.onnx` is used in `CompileStage._finalize_output` where `device` comes from the resolved device string. With `--device npu`, `device="npu"`, so the file would be `convnext_int8_npu_ctx.onnx`, not `convnext_int8_openvino_ctx.onnx`. The EP name is not used in the filename; the device name is.
+- Section B `winml build` command (npu-convnext.md:239): `uv run winml build -c convnext_config.json -m facebook/convnext-tiny-224 -o convnext_out/`. Source `build.py` uses `-c` (config), `-m` (model), `-o` (output-dir). The flag signatures match. No error.
+- Section B states "The QNN SDK path is read from the `QNN_SDK_ROOT` environment variable, not from the config or CLI flags." (npu-convnext.md:257): correct for `winml build` — `build.py` has no `--qnn-sdk-root` option. But note: `winml compile` *does* expose `--qnn-sdk-root` (compile.py:89–93). The tutorial does not use `winml compile --qnn-sdk-root` so this nuance is not wrong in context, but it may confuse users who read both pages.
+- Prerequisites list Python 3.10 (npu-convnext.md:22): `pyproject.toml` requires `>=3.11,<3.12`. This propagates the same Python version error found in installation.md.
+
+## Minor
+- Section B perf command at the end uses `convnext_out/model.onnx` (npu-convnext.md:262): `winml build` does not write a file named `model.onnx`; it writes the compiled artifact under its EP-derived name (e.g., `convnext_int8_npu_ctx.onnx`). The placeholder path is misleading — users must look up the actual output filename from the build log.
diff --git a/docs/superpowers/2026-05-27-doc-issues/optimize.md b/docs/superpowers/2026-05-27-doc-issues/optimize.md
new file mode 100644
index 000000000..6dbe2e102
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/optimize.md
@@ -0,0 +1,48 @@
+# Issues: docs/commands/optimize.md
+
+Source verified against: `src/winml/modelkit/commands/optimize.py` @ 5e25579
+
+## Critical (flag/behavior wrong; user gets error)
+
+- **`--preset` flag does not exist in source.** The doc (lines 21, 29–35) documents a `--preset / -p` flag accepting `qnn-compatible|transformer-optimized|full|minimal`. There is no such option anywhere in `optimize.py`. The source `@click.command()` definition (lines 151–187) has `--list-capabilities`, `--list-rewrites`, `--model`, `--output`, `--config`, `--verbose`, and the dynamically-generated capability flags. No `--preset` option is defined. Any user running `winml optimize -m model.onnx --preset qnn-compatible` will get "Error: no such option: --preset". The entire "Built-in presets" table (doc lines 29–35) and every preset-based example in the doc are invalid.
+
+- **`-p` short form is documented for `--preset`** (doc line 21) but in source, no `-p` exists. The `--model` flag does have `-m` and `--output` has `-o`, but there is no `-p` anywhere in the command definition.
+
+- **"Configuration precedence" claims preset is step 3** (doc lines 38–43) with order: CLI flags > config file > preset > capability defaults. The actual source precedence (lines 363–383) is: capability defaults, then config file, then CLI options. There is no preset layer. The precedence documented is for a different version or planned feature.
+
+## Important (misleading or stale)
+
+- **`--verbose / -v` flag is absent from the doc flag table.** Source lines 180–185 define `@click.option("--verbose", "-v", is_flag=True, default=False, ...)`. The doc table lists only `--model`, `--output`, `--preset`, `--config`, `--list-capabilities`, `--list-rewrites`, and dynamic flags — `--verbose` is missing entirely.
+
+- **`--model` short form `-m`** is not shown in the doc's flag table (the Short column is empty for `--model` at doc line 19). Source line 167 defines `"--model", "-m"`. Users will not know `-m` works.
+
+- **"Configuration precedence" in source is 3-level, not 4-level.** Source lines 363–383 implement: (1) capability defaults, (2) config file, (3) CLI options. The doc describes 4 levels including "preset". Without the preset, the doc's precedence section incorrectly numbers and describes the chain.
+
+- **Examples use `--preset`** (doc lines 71–85) — all preset-based examples produce errors with the current source. The only valid examples are:
+  - `winml optimize -m model.onnx` (default caps)
+  - `winml optimize --list-capabilities`
+  - `winml optimize --list-rewrites`
+  - `winml optimize -m model.onnx --enable-<cap>` / `--disable-<cap>`
+  - `winml optimize -m model.onnx -c config.json`
+
+- **`--config` type described as `PATH`** — the doc says "YAML or JSON configuration file" (doc line 23). Source line 175 uses `type=click.Path(exists=True, path_type=Path)` and `load_config()` (lines 48–70) supports `.yaml/.yml` and `.json`. This is correct.
+
+## Minor (polish)
+
+- The doc's dynamic flags section (line 25) correctly describes `--enable-<name>/--disable-<name>` pairs from the capability registry and `--list-capabilities` to discover them. This matches source lines 109–148.
+- The claim that "adding a new optimization to the registry automatically makes it available as a CLI flag" matches source — `capability_options` decorator (lines 109–148) auto-generates flags at import time.
+- `--list-capabilities` with `-l` short form → source lines 153–157 confirm `-l` is the short form. Correctly documented.
+- `--list-rewrites` (no short form) → source lines 159–163 confirm. Correctly documented.
+- Output path default `{input}_opt.onnx` → source lines 352–353 confirm.
+- Before/after node count reduction report → source lines 419–423 confirm.
+
+## Verified correct (key claims checked)
+
+- `--model / -m` exists, `required=False` (only required when not listing) → source lines 165–171.
+- `--output / -o` exists via `cli_utils.output_option` → source line 172.
+- `--config / -c` exists, type `Path(exists=True)` → source lines 173–179.
+- `--list-capabilities / -l` exists as flag → source lines 151–157.
+- `--list-rewrites` exists as flag (no short form) → source lines 159–163.
+- Dynamic `--enable-X/--disable-X` flags from capability registry → source lines 109–148.
+- Missing `--model` when not listing raises `UsageError` → source lines 336–338.
+- Config file supports YAML and JSON → source lines 48–70.
diff --git a/docs/superpowers/2026-05-27-doc-issues/overview.md b/docs/superpowers/2026-05-27-doc-issues/overview.md
new file mode 100644
index 000000000..38f065449
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/overview.md
@@ -0,0 +1,17 @@
+# Issues: docs/commands/overview.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical
+
+- Line 2: States "12 subcommands". Source has 14 command modules (`analyze`, `build`, `catalog`, `compile`, `config`, `eval`, `export`, `inspect`, `optimize`, `perf`, `quantize`, `run`, `serve`, `sys`). `run` and `serve` are disabled at runtime via `_DISABLED_COMMANDS` (cli.py) but the command count is still wrong at 12 — the actual exposed count is 12 only if the two disabled commands are excluded AND `catalog` is counted as `hub`. The command map (line 29) lists `hub` which does not exist; the actual command is `catalog` (catalog.py, `@click.command()` function named `catalog`). There is no `hub` command in the codebase at this commit.
+- Line 29 (table row): `hub` command listed as "Browse the curated winml-cli catalog of validated models and benchmarks." The command is named `catalog`, not `hub` (catalog.py). `winml hub` would fail at the CLI.
+
+## Important
+
+- Line 55: References `src/winml/modelkit/commands/_options.py` as the "canonical contract" for global flags. This file does not exist at commit 5e25579 (verified via `git ls-tree`). Global flags are defined in `src/winml/modelkit/cli.py` directly.
+- Lines 41–48 ("Choosing a command"): The entry "I want to know if my model is supported → `winml inspect`" is reasonable, but `winml analyze` (Verify EP operator compatibility) is a closer match for pre-deployment compatibility checks. The distinction between `inspect` and `analyze` is not reflected in the choosing-a-command list, making `analyze` effectively undiscoverable from this guide.
+
+## Minor
+
+- Line 63: Shared flags claim "`-p` / `--precision`" is shared. `perf` and `eval` both have `--precision` but `inspect`, `sys`, `hub`/`catalog`, and `analyze` do not. The claim "Defaults and accepted values can differ per command" partially covers this, but listing `-p` as a shared flag implies it exists on most commands, which overstates its reach.
diff --git a/docs/superpowers/2026-05-27-doc-issues/perf-and-monitoring.md b/docs/superpowers/2026-05-27-doc-issues/perf-and-monitoring.md
new file mode 100644
index 000000000..aa5f3fe5c
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/perf-and-monitoring.md
@@ -0,0 +1,18 @@
+# Issues: docs/concepts/perf-and-monitoring.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical
+
+- Line 11: `--device` is described as accepting `cpu`, `gpu`, or `npu` only, but `perf` calls `cli_utils.device_option(include_auto=True, default="auto")` (perf.py:1113), so `auto` is also a valid choice and is the actual default. The sentence "The `--device` flag selects the target EP — `cpu`, `gpu`, or `npu`" omits `auto` and misstates the default.
+- Line 13: Output path default stated as `{model_slug}_perf.json` (implying the current directory). Source writes to `~/.cache/winml/perf/<slug>/<timestamp>.json` (perf.py:871–876). The default location is wrong and the timestamp-per-run filename structure is omitted entirely.
+
+## Important
+
+- Lines 25–31: `--op-tracing` is documented as a user-facing feature with two levels. In source the option is decorated `hidden=True` (perf.py:1183), meaning it is intentionally hidden from `--help`. Documenting a hidden flag as a supported feature is misleading.
+- Lines 17–21: `--monitor` is described as streaming "NPU utilisation". Source tracks whichever device is being benchmarked: NPU, GPU, or CPU (`monitor_device = self._model.device or self.config.device or "auto"`, perf.py:409). Calling it NPU-specific is inaccurate.
+
+## Minor
+
+- Line 19: States the chart "updates in place during the iteration loop". The live chart is managed by `LiveMonitorDisplay` (perf.py:943), but this detail is accurate. No issue.
+- Line 37: `--module` docstring in source says the argument is a "PyTorch module class name (NOT a dotted module path)" (perf.py:1166–1169). The concept doc example `winml perf -m bert-base-uncased --module BertAttention` is correct, but the doc does not warn users that a dotted path will silently not match, which is the primary pitfall documented in the source help text.
diff --git a/docs/superpowers/2026-05-27-doc-issues/perf.md b/docs/superpowers/2026-05-27-doc-issues/perf.md
new file mode 100644
index 000000000..ebd669440
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/perf.md
@@ -0,0 +1,41 @@
+# Issues: docs/commands/perf.md
+
+Source verified against: `src/winml/modelkit/commands/perf.py` @ 5e25579
+
+## Critical (flag/behavior wrong; user gets error)
+
+- (none)
+
+## Important (misleading or stale)
+
+- **`--compare-devices` flag does not exist in source.** The flag table lists `--compare-devices | TEXT | — | Not yet implemented`. A full search of `perf.py` shows zero occurrences of `compare_devices` or `compare-devices` as a defined click option. The flag is documented but never registered; passing it will produce a "No such option" error. The note "Not yet implemented" is insufficient — the flag should either be removed from the table entirely or marked explicitly as "not defined, will error if passed".
+- **`--op-tracing` is hidden in source.** `perf.py:1184`: `hidden=True` — the flag is intentionally hidden from `--help` output. The doc exposes it in the flag table without any note that it does not appear in `--help`. Consider adding "(hidden from --help output; not ready for general use)" to the description.
+- **Default output path documented as "`{model_slug}_perf.json`" is wrong.** Source `perf.py:871` generates the path as `~/.cache/winml/perf/<slug>[/<module_class>]/<timestamp>.json`, not a file in the current working directory. Users expecting a local file will be confused.
+
+## Minor (polish)
+
+- **Flag table omits `--verbose` / `-v`.** Defined at `perf.py:1183-1191`.
+- **Flag table omits `--build-config` / `-c` (the shared build config option).** `perf.py:1192` registers `@cli_utils.build_config_option`.
+- **`--shape-config` description says "Ignored for pre-exported ONNX files and in `--module` mode"** — correct; both branches issue warnings at `perf.py:1280-1284` and `perf.py:1351-1356`. The doc accurately describes this behavior.
+
+## Verified correct (key claims checked)
+
+- `--model` / `-m` required (enforced in body, not `required=True`) → `perf.py:1092`, `perf.py:1243`
+- `--task` string default None → `perf.py:1093-1098`
+- `--iterations` IntRange min=1 default 100 → `perf.py:1099-1105`
+- `--warmup` IntRange min=0 default 10 → `perf.py:1106-1111`
+- `--device` choice `auto|cpu|gpu|npu` default `auto` → `perf.py:1113` via `cli_utils.device_option`
+- `--precision` string default `auto` → `perf.py:1114-1120`
+- `--ep` via `cli_utils.ep_option` → `perf.py:1121-1124`
+- `--batch-size` int default 1 → `perf.py:1129-1135`
+- `--shape-config` path default None → `perf.py:1136-1142`
+- `--no-quantize` flag default false → `perf.py:1143-1148`
+- `--rebuild` flag default false → `perf.py:1149-1153`
+- `--ignore-cache` flag default false → `perf.py:1154-1159`
+- `--module` string default None → `perf.py:1160-1170`
+- `--monitor` flag default false → `perf.py:1171-1176`
+- `--op-tracing` choice `basic|detail` default None → `perf.py:1177-1184`
+- `--compare-devices` marked "not yet implemented" → confirmed not implemented (flag absent from source)
+- Statistics include mean, min, max, P50, P90, P95, P99, std → `perf.py:104-109` (BenchmarkResult fields)
+- `--monitor` includes hw metrics in JSON → `perf.py:127`, `perf.py:167-168`
+- No `wmk` or `ModelKit` strings in user-facing prose → confirmed
diff --git a/docs/superpowers/2026-05-27-doc-issues/primitives-and-pipeline.md b/docs/superpowers/2026-05-27-doc-issues/primitives-and-pipeline.md
new file mode 100644
index 000000000..a9115cbb1
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/primitives-and-pipeline.md
@@ -0,0 +1,32 @@
+# Issues: docs/concepts/primitives-and-pipeline.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical (factually wrong; user would hit error)
+
+- (none)
+
+## Important (misleading or stale claim)
+
+- **`--use-cache` described as alternative to `-o`/`--output-dir`** (line 62-65): Doc says "accepts `--use-cache` in place of `-o`/`--output-dir`". The short flag for output directory in `winml build` is `-o` but the parameter is named `--output-dir`, not `--output`. The doc uses `-o`/`--output-dir` inconsistently: line 62 says "in place of `-o`/`--output-dir`" but elsewhere uses `--output-dir`. Source: `src/winml/modelkit/commands/build.py` lines 249-262 — the option is `--output-dir` / `-o`. This is technically fine but the description shorthand could confuse users.
+
+- **`winml build -c config.json -m microsoft/resnet-50 -o output/`** (line 49): The short flag `-o` maps to `--output-dir` in the build command (source: `commands/build.py` line 250-256). This is valid but worth noting: `-o` is the shorthand for `--output-dir`, not `--output`. The doc uses `-o` which is correct.
+
+- **`WinMLBuildConfig` has six nested sub-configs, not five** (line 51 in config-and-build.md references five — this doc only lists five): The `WinMLBuildConfig` dataclass also has an `eval: WinMLEvaluationConfig | None` field and an `auto: bool` field (source: `src/winml/modelkit/config/build.py` lines 132-138). The doc does not mention these — omission rather than error, but the `eval` field could be relevant to users combining `winml build` and `winml eval`.
+
+## Minor (style, polish, low-impact)
+
+- **Cross-link `[ConvNeXT primitives sample](../samples/convnext-primitives.md)`** (line 104): The file `docs/samples/convnext-primitives.md` exists and the link is valid.
+
+- **`winml build` without `-c`** (lines 49, 62): Doc implies `-c` is required for `winml build`. Source shows `-c` is `required=False` (`commands/build.py` line 236-241) — if omitted, config is auto-generated from `-m`. The doc's initial description of the command is accurate but does not mention the `-c`-less shorthand.
+
+## Verified correct (anchored claims you checked)
+
+- `WinMLBuildConfig` exists as a dataclass → `src/winml/modelkit/config/build.py` line 97
+- `winml build` flags `--no-quant`, `--no-compile`, `--no-optimize` all exist → `commands/build.py` lines 273, 275-282, 300-304
+- `--use-cache` flag exists and is mutually exclusive with `--output-dir` → `commands/build.py` lines 258-262, 376-379
+- `--rebuild` flag exists → `commands/build.py` lines 263-268
+- Setting `quant` or `compile` to `null` skips those stages → `config/build.py` lines 133-136 (both are `| None`)
+- `~/.cache/winml/` as global cache path → `commands/build.py` line 261 (`~/.cache/winml/`)
+- Six primitive commands listed are all real CLI commands → `commands/` directory contains `export.py`, `optimize.py`, `quantize.py`, `compile.py`, `perf.py`, `eval.py`
+- No `wmk` or `ModelKit` strings in prose → verified by grep
diff --git a/docs/superpowers/2026-05-27-doc-issues/quantization.md b/docs/superpowers/2026-05-27-doc-issues/quantization.md
new file mode 100644
index 000000000..9c2ffd2e7
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/quantization.md
@@ -0,0 +1,29 @@
+# Issues: docs/concepts/quantization.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical (factually wrong; user would hit error)
+- Line 9: "every precision from `_KNOWN_PRECISIONS` in `_options.py`". Neither `_KNOWN_PRECISIONS` nor `_options.py` exist anywhere in the source tree. The actual symbol is `_NAMED_PRECISIONS` (a `frozenset` at `src/winml/modelkit/config/precision.py:71`) and there is no file named `_options.py`. This is a fabricated source citation. A reader trying to cross-reference the table against source code will find nothing.
+
+- Line 9: "the resolved quantization types from `config/precision.py`". The file path should be `src/winml/modelkit/config/precision.py`. The abbreviated form `config/precision.py` is navigable by context, but the companion citation `_options.py` is entirely wrong (see above). The combined sentence creates a misleading impression of where the table data lives.
+
+- Line 18 (table row `int8`): "default for NPU via QNN EP". The actual NPU auto-precision default is `w8a16`, not `int8`. `_AUTO_PRECISION = {"npu": "w8a16", ...}` at `src/winml/modelkit/config/precision.py:32-36`. Using `--precision int8` (or the `int8` named preset) resolves to `uint8/uint8` and is _valid_ for QNN, but it is not the auto-selected default. The annotation "default for NPU via QNN EP" is wrong.
+
+- Line 20 (table row `w4a16`): "Recognized as a precision string but raises an error at quantization time; no 4-bit weight dtype mapping exists in `precision.py` yet." This overstates what the code does. `w4a16` is NOT recognized at all. `is_quantized_precision("w4a16")` returns `False` (because `4` is not in `_BITS_TO_WEIGHT_TYPE`), and `_resolve_quant_types()` in `src/winml/modelkit/commands/quantize.py:260-269` raises `click.BadParameter` for any non-quantized, non-auto precision string — including `w4a16`. The doc's claim that it is "recognized as a precision string" is incorrect; it is rejected before reaching quantization time.
+
+## Important (misleading or stale claim)
+- Line 17 (table row `auto`): "Resolves to `int8` (NPU), `fp16` (GPU/CPU) at runtime". Partially wrong. For NPU the auto-precision resolves to `w8a16` (not `int8`), per `_AUTO_PRECISION["npu"] = "w8a16"` at `src/winml/modelkit/config/precision.py:33`. For GPU and CPU the `fp16` claim is correct (`_AUTO_PRECISION["gpu"] = "fp16"`, `_AUTO_PRECISION["cpu"] = "fp16"`, lines 34-35).
+
+- Line 16 (table row `int16`): Weight dtype listed as `int16`, activation dtype as `uint16`. Source at `src/winml/modelkit/config/precision.py:43-50` shows `_WEIGHT_TYPE["int16"] = "int16"` and `_ACTIVATION_TYPE["int16"] = "uint16"`. The weight type `int16` is correct. However, the resolution goes through `_BITS_TO_WEIGHT_TYPE[16] = "int16"` when using the `w{x}a{y}` form. The named-preset path matches the table. This row is correct.
+
+## Minor (style, polish, low-impact)
+- Lines 63-65: Cross-links (`weight-and-activation.md`, `eps-and-devices.md`, `../commands/quantize.md`, `../commands/eval.md`) all resolve to files that exist on disk.
+- Lines 32-35: `--samples` default `10` and `--method` choices `minmax`, `entropy`, `percentile` — all confirmed at `src/winml/modelkit/commands/quantize.py:57-65`.
+- Line 22: "`--weight-type` and `--activation-type` flags on `winml quantize` accept `uint8`, `int8`, `uint16`, or `int16`" — confirmed at `src/winml/modelkit/commands/quantize.py:67-76`.
+
+## Verified correct (anchored claims you checked)
+- Line 16 (table row `fp16`): No QDQ nodes, float16 throughout — matches `_WEIGHT_TYPE["fp16"] = None` at `src/winml/modelkit/config/precision.py:41`.
+- Line 15 (table row `fp32`): No quantization, baseline — matches `_WEIGHT_TYPE["fp32"] = None` at `src/winml/modelkit/config/precision.py:40`.
+- Line 19 (table row `w8a8`): `uint8/uint8`, equivalent to `int8` — matches `_MIXED_RE` path resolving `w8a8` -> `_BITS_TO_WEIGHT_TYPE[8]="uint8"`, `_BITS_TO_ACTIVATION_TYPE[8]="uint8"` at `src/winml/modelkit/config/precision.py:57-65`.
+- Line 19 (table row `w8a16`): `uint8` weights, `uint16` activations — matches `_BITS_TO_WEIGHT_TYPE[8]="uint8"`, `_BITS_TO_ACTIVATION_TYPE[16]="uint16"` at `src/winml/modelkit/config/precision.py:57-65`.
+- Lines 40-41: `--samples` default `10`, `--method` default `minmax` — confirmed at `src/winml/modelkit/commands/quantize.py:57-65`.
diff --git a/docs/superpowers/2026-05-27-doc-issues/quantize.md b/docs/superpowers/2026-05-27-doc-issues/quantize.md
new file mode 100644
index 000000000..ce74fbea0
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/quantize.md
@@ -0,0 +1,33 @@
+# Issues: docs/commands/quantize.md
+
+Source verified against: `src/winml/modelkit/commands/quantize.py` @ 5e25579
+
+## Critical (flag/behavior wrong; user gets error)
+
+- (none)
+
+## Important (misleading or stale)
+
+- **`--precision` accepted values listed as `int8`, `int16`, or `w8a16` but source also accepts `auto` and the full `w{x}a{y}` family.** The doc's flag table says only "`int8`, `int16`, or mixed-precision like `w8a16`". Source `quantize.py:50-53` documents: "Accepted: auto, int8, int16, or w{x}a{y} where x,y in {8,16} (e.g., w8a8, w8a16, w16a16)." The `auto` value and `w8a8` / `w16a16` forms are silently omitted from the table.
+- **Flag table omits `--task` and `--model-name`.** Both are real options defined in source (`quantize.py:92-109`). `--task` selects a calibration dataset; `--model-name` enables task-aware calibration with the model's preprocessor. Users who need task-aware calibration have no documentation to guide them.
+- **Flag table omits `--verbose` / `-v`.** Defined at `quantize.py:104-109`.
+
+## Minor (polish)
+
+- **Default output path description says "`{input}_qdq.onnx`" but should clarify stem only.** Source uses `model.stem + "_qdq.onnx"` in the same directory as the input (`quantize.py:189`), which matches, but "`{input}`" is ambiguous about whether the full path or just the stem is used.
+- **"Quantizing an already-quantized model is unsupported" pitfall** mentions `winml compile --no-quant` as the alternative. As noted in compile.md, `--no-quant` is a no-op in compile. The pitfall advice is therefore unhelpful and should be updated to reflect actual behavior.
+
+## Verified correct (key claims checked)
+
+- `--model` / `-m` required path → `quantize.py:37-43`
+- `--output` / `-o` optional path, default `{stem}_qdq.onnx` → `quantize.py:44` + `quantize.py:189`
+- `--precision` / `-p` string default None → `quantize.py:45-53`
+- `--samples` integer default 10 → `quantize.py:54-58`
+- `--method` choice `minmax|entropy|percentile` default `minmax` → `quantize.py:59-65`
+- `--weight-type` choice `uint8|int8|uint16|int16` default None → `quantize.py:66-71`
+- `--activation-type` choice `uint8|int8|uint16|int16` default None → `quantize.py:72-77`
+- `--per-channel` flag default false → `quantize.py:78-83`
+- `--symmetric` flag default false → `quantize.py:84-89`
+- Explicit type flags override `--precision` → `quantize.py:271-276`
+- Default types when no precision specified: uint8/uint8 → `quantize.py:263` (precision=None or "auto" → default_w/a = "uint8")
+- No `wmk` or `ModelKit` strings in user-facing prose → confirmed
diff --git a/docs/superpowers/2026-05-27-doc-issues/quickstart.md b/docs/superpowers/2026-05-27-doc-issues/quickstart.md
new file mode 100644
index 000000000..a446a26dd
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/quickstart.md
@@ -0,0 +1,13 @@
+# Issues: docs/getting-started/quickstart.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical
+- (none)
+
+## Important
+- `winml sys --list-device --list-ep` (quickstart.md:14): the doc claims these flags "skip SDK versions and Python environment details that plain `winml sys` would include." This is misleading. In `sys.py`, when `list_device` or `list_ep` is set, the command takes a separate branch that runs *only* the device/EP listing and returns early — it does not run `_gather_system_info()` at all, so "skipping" implies it runs a subset of the normal command, when in fact it is a separate code path. This is a documentation accuracy issue but not a flag-existence issue.
+- `winml inspect -m resnet50.onnx` (quickstart.md:40): `inspect.py` explicitly raises `click.ClickException("ONNX file inspection is not yet supported. Use 'winml config -m model.onnx' for ONNX build config.")` when passed a `.onnx` local file. The command as documented will fail rather than produce the shown output.
+
+## Minor
+- (none)
diff --git a/docs/superpowers/2026-05-27-doc-issues/qwen3-composite.md b/docs/superpowers/2026-05-27-doc-issues/qwen3-composite.md
new file mode 100644
index 000000000..c3db1d35c
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/qwen3-composite.md
@@ -0,0 +1,43 @@
+# Issues: docs/samples/qwen3-composite.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical
+
+- (none)
+
+## Important
+
+- **GitHub URL points to `https://github.com/microsoft/winml-cli` (plain repo root).**
+  The "Track progress" section says "Follow development and check current status
+  at https://github.com/microsoft/winml-cli". There is no issues link, milestone
+  link, or branch link. For a placeholder page whose purpose is to track an
+  in-progress feature, the URL is minimal but not wrong. However, if the
+  feature is tracked on a specific branch or issue, the link should be more
+  precise. Acceptable as-is for a placeholder.
+
+- **Forward-looking sketch references `BuildConfig`** (capitalised as proper
+  noun) without tying it to `WinMLBuildConfig`. Readers coming from the BERT
+  sample know the class name; first-time readers may not. Minor wording issue.
+
+## Minor
+
+- **`!!! info "Coming soon"` admonition** — correctly identifies the page as a
+  placeholder. Format is valid MkDocs Material admonition syntax.
+
+- **Cross-link `../samples/bert-config-build.md`** — from inside `docs/samples/`,
+  a self-referential path resolves to `docs/samples/bert-config-build.md`.
+  The `../samples/` prefix from within `samples/` is redundant (resolves one
+  directory up and back in) but should still resolve correctly in MkDocs.
+  Could be simplified to `bert-config-build.md`.
+
+## Verified correct
+
+- Page correctly identifies itself as a placeholder and defers all content to
+  after the composite-model feature branch lands.
+- No commands, flags, or artifact names are asserted (none to verify wrong).
+- GitHub URL `https://github.com/microsoft/winml-cli` is the correct upstream
+  repository URL.
+- No `wmk` or `ModelKit` strings found in user-facing prose.
+- "What composite models are" section contains only conceptual prose — no
+  verifiable command syntax.
diff --git a/docs/superpowers/2026-05-27-doc-issues/sys.md b/docs/superpowers/2026-05-27-doc-issues/sys.md
new file mode 100644
index 000000000..68e52ec54
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/sys.md
@@ -0,0 +1,33 @@
+# Issues: docs/commands/sys.md
+
+Source verified against: `src/winml/modelkit/commands/sys.py` @ 5e25579
+
+## Critical (flag/behavior wrong; user gets error)
+
+- **`--verbose` / `-v` flag is missing from the flag table entirely.** Source lines 653–659 define `@click.option("--verbose", "-v", is_flag=True, default=False, ...)`. The doc table lists only `--format`, `--list-device`, `--list-ep`, and `--help` — omitting `--verbose` means users reading the doc have no way to discover a functional, documented flag. The example `winml inspect -m facebook/convnext-tiny-224 -v -H` on `inspect.md` uses the same flag pattern, and `sys.py` line 699 shows `-v`/`--verbose` passed via `docstring`. Using `--verbose` surfaces Backend SDKs and Export Readiness sections (lines 392–433) that are hidden otherwise; presenting the command as having only 3 flags is actively wrong.
+
+## Important (misleading or stale)
+
+- **"How it works" says CUDA details are always probed via PyTorch** — source lines 218–251 show `_get_torch_info(verbose=False)` is the default, which explicitly skips `import torch` and CUDA probing (lines 235–251 are gated on `if not verbose: return info`). CUDA availability (`cuda_available`) only appears in the output when `--verbose` is passed. The doc's "How it works" says "probes PyTorch for CUDA availability and GPU device names" unconditionally, which is misleading — this only happens under `--verbose`.
+
+- **`--format compact` pitfall says it "omits device and EP tables"** (line 106) — but the source (lines 757–774) shows compact *does* support `--list-device` and `--list-ep` and prints device/EP information in a single-line form. The pitfall is only correct for the full default report path (line 812 `elif output_format.lower() == "compact": _output_compact(info)` which skips devices/EPs), but combination of `--format compact --list-device` works and produces output. The pitfall is partially misleading.
+
+- **"Backend SDK detection" described as part of default output** — source lines 392–433 show Backend SDKs and Export Readiness sections are only rendered under `verbose=True` (`if verbose:` guard at line 392). The "How it works" section implies these are always shown.
+
+- **Example output shows "winml-cli System Information"** (line 49) but source line 342 renders `"WinML CLI System Information"`. Minor inconsistency in the example panel title.
+
+## Minor (polish)
+
+- **`--help` short form `-h`** — Click auto-adds `--help` / `-h` for all commands; listing it explicitly in the table is harmless but adds noise.
+- **`sys.md` cross-links to `hub.md`** (line 117), but the actual CLI command is `winml catalog` (source: `catalog.py`), not `winml hub`. If `hub.md` documents a `winml hub` alias, verify it exists in `__init__.py`; otherwise the cross-link is confusing.
+
+## Verified correct (key claims checked)
+
+- `--format` flag exists with short `-f`, type `Choice(["text", "json", "compact"])`, default `"text"` → source lines 645–652.
+- `--list-device` flag exists as `is_flag=True, default=False`, no short form → source lines 653–658.
+- `--list-ep` flag exists as `is_flag=True, default=False`, no short form → source lines 659–664.
+- QNN detection uses `QNN_SDK_ROOT` / `QAIRT_SDK_ROOT` env vars → source lines 261–272.
+- OpenVINO detection via `import openvino` → source lines 283–290.
+- `--format json` emits devices and EPs → source lines 801–812.
+- Device enumeration in NPU > GPU > CPU priority order → source lines 495–500.
+- EP enumeration merges WinML registry with ORT `get_available_providers()` → source lines 592–623.
diff --git a/docs/superpowers/2026-05-27-doc-issues/tutorials-index.md b/docs/superpowers/2026-05-27-doc-issues/tutorials-index.md
new file mode 100644
index 000000000..ce9238027
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/tutorials-index.md
@@ -0,0 +1,27 @@
+# Issues: docs/tutorials/index.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical
+
+- (none)
+
+## Important
+
+- (none)
+
+## Minor
+
+- **Backtick usage inconsistency.** The prose uses `` `winml-cli` `` (with
+  backtick) in one sentence but refers to `winml` (the CLI binary name) without
+  backtick elsewhere. This is cosmetic only.
+
+## Verified correct
+
+- Table entry `[ConvNeXt on NPU](npu-convnext.md)` — file exists at
+  `docs/tutorials/npu-convnext.md`. Link resolves.
+- "Hardware" column entry "Copilot+PC NPU primary; CPU works as fallback" —
+  consistent with npu-convnext.md content.
+- No command invocations to verify.
+- No `wmk` or `ModelKit` strings in user-facing prose.
+- Page correctly describes tutorials vs samples vs concepts distinctions.
diff --git a/docs/superpowers/2026-05-27-doc-issues/weight-and-activation.md b/docs/superpowers/2026-05-27-doc-issues/weight-and-activation.md
new file mode 100644
index 000000000..0cb7e2c6b
--- /dev/null
+++ b/docs/superpowers/2026-05-27-doc-issues/weight-and-activation.md
@@ -0,0 +1,18 @@
+# Issues: docs/concepts/weight-and-activation.md
+
+Source verified against: microsoft/winml-cli @ 5e25579
+
+## Critical (factually wrong; user would hit error)
+- (none)
+
+## Important (misleading or stale claim)
+- Line 23: States "QNN on NPU pairs uint8 weights with uint8 or uint16 activations." According to `src/winml/modelkit/config/precision.py`, the NPU auto-precision resolves to `w8a16` (`_AUTO_PRECISION = {"npu": "w8a16", ...}`, line 33), which maps to `uint8` weights + `uint16` activations (lines 57-65). The `int8` preset maps to `uint8/uint8` (lines 39-51). So the claim "uint8 or uint16" is technically accurate for the full range of QNN-targeted precisions, but the default (and most prominently documented) NPU precision is `w8a16` (uint8 weight + uint16 activation), not `uint8/uint8`. The framing may lead readers to underweight the `w8a16` default.
+
+## Minor (style, polish, low-impact)
+- Line 19: "The `--weight-type` and `--activation-type` flags on `winml quantize` exist..." — both flags are confirmed at `src/winml/modelkit/commands/quantize.py:67` and `73`.
+- Lines 28-33: Cross-links (`quantization.md`, `eps-and-devices.md`, `../commands/quantize.md`, `graphs-and-ir.md`) all resolve to files that exist on disk.
+
+## Verified correct (anchored claims you checked)
+- Line 7: "winml quantize ... observes the weight distributions in your exported ONNX and bakes the per-tensor scale/zero-point into the QDQ nodes" — matches `src/winml/modelkit/commands/quantize.py` workflow and `src/winml/modelkit/config/precision.py` precision resolution.
+- Lines 19-24: `--weight-type` accepts `uint8, int8, uint16, int16`; `--activation-type` accepts the same — confirmed at `src/winml/modelkit/commands/quantize.py:67-76`.
+- Line 25: `w8a16` described as "8-bit weights, 16-bit activations" — confirmed; resolves to `uint8` weight + `uint16` activation via `_BITS_TO_WEIGHT_TYPE[8]="uint8"` and `_BITS_TO_ACTIVATION_TYPE[16]="uint16"` at `src/winml/modelkit/config/precision.py:57-65`.
diff --git a/docs/superpowers/2026-05-27-validated-issues.md b/docs/superpowers/2026-05-27-validated-issues.md
new file mode 100644
index 000000000..4a2bbfcb3
--- /dev/null
+++ b/docs/superpowers/2026-05-27-validated-issues.md
@@ -0,0 +1,151 @@
+# v3 docs validated issues
+
+Validated against microsoft/winml-cli @ 5e25579 on docs/draft.
+
+## Critical (factually wrong; user would hit error)
+
+### docs/getting-started/installation.md
+- Python version wrong: doc states `3.10` / `requires-python = ">=3.10,<3.11"` but `pyproject.toml:13` reads `requires-python = ">=3.11,<3.12"`. Install step (`uv python install 3.10`) and verify output (`Python Version 3.10.x`) are both wrong.
+
+### docs/getting-started/end-to-end.md
+- DML and CPU artifact filenames are wrong: doc claims GPU produces `convnext_tiny_dml_ctx.onnx` and CPU produces `convnext_tiny.onnx`. `compiler/configs.py:175` (`for_dml`) and `:165` (`for_cpu`) both set `enable_ep_context=False`; `CompileStage.process` only calls `_finalize_output` when `ep_config.enable_ep_context` is True (`compile.py:102`). Neither DML nor CPU produces a `_ctx.onnx` file — the compile step is a no-op for both.
+
+### docs/commands/overview.md
+- `winml hub` command does not exist. Source registers the function as `catalog` (`catalog.py:387`). Every `winml hub` invocation in the doc will fail at the CLI.
+
+### docs/commands/build.md
+- `--random-init` flag does not exist. Source `build.py` has no such option (grep returns no hits). Passing it will produce "No such option".
+- `--config / -c` documented as *(required)* but `build.py:237` sets `required=False`. When omitted, config is auto-generated from `-m`.
+- `--qnn-sdk-root` listed in the flag table but does not exist on `winml build` (zero hits in `build.py`). It is a `winml compile`-only flag. Users will get "No such option".
+
+### docs/commands/compile.md
+- `--device` default documented as `npu` but `compile.py:62` sets `default="auto"`. Users expecting NPU-only targeting without `--device npu` will get auto-detection instead.
+- `--no-quant` flag does not exist in `compile.py` (zero occurrences). Users who pass it get "No such option".
+
+### docs/commands/config.md
+- `--no-compile` default documented as `off` (compile included by default). `config.py:163` shows `default=True` for `no_compile`, meaning compile is *excluded* by default. The framing is entirely backwards — users need `--compile` to include compilation, not `--no-compile` to exclude it.
+
+### docs/commands/eval.md
+- `--device` type column shows `cpu|gpu|npu`, default `cpu`. Source `eval.py:69` defines `click.Choice(["auto", "cpu", "gpu", "npu"])` with `default="auto"`. `auto` is missing and the default is wrong.
+- `-n` listed as a short alias for `--samples`. Source defines `--samples` with no short flag. The `-n` alias does not exist.
+
+### docs/commands/hub.md
+- All `winml hub` invocations fail: source registers the command as `winml catalog` (`catalog.py:387`).
+- `--model / -m` flag does not exist in `catalog.py` (confirmed full read). Users who run `winml hub --model <id>` get "No such option".
+
+### docs/commands/analyze.md
+- `--device` default documented as `NPU`; `analyze.py:644` sets `default="auto"`. Users will not get NPU-specific analysis by default.
+- `--ep` default documented as "all supported EPs analyzed"; `analyze.py:633` sets `default="auto"` (infers from local availability, not "all").
+- `--run-unknown-op` default documented as "enabled"; `analyze.py:668` sets `default=False`. The pitfall note that says "disable when libraries are missing" compounds this by implying it is on.
+
+### docs/commands/optimize.md
+- `--preset / -p` flag does not exist. `optimize.py` command definition (lines 151–187) has no `--preset` option. The entire "Built-in presets" table and all preset-based examples are invalid; users get "Error: no such option: --preset".
+
+### docs/commands/inspect.md
+- `--list-tasks`, `--model-type`, and `--model-class` flags are not documented. All three are defined in source (`inspect.py:98–116`) and functional.
+
+### docs/concepts/quantization.md
+- `int8` row annotated "default for NPU via QNN EP". Actual NPU auto-precision default is `w8a16` (`precision.py:33`: `_AUTO_PRECISION = {"npu": "w8a16", ...}`). `int8` is valid for QNN but is not the default.
+- `auto` row: "Resolves to `int8` (NPU)…" is wrong for NPU; resolves to `w8a16` per `_AUTO_PRECISION["npu"]` (`precision.py:33`).
+- `w4a16` row: "Recognized as a precision string but raises error at quantization time" is wrong. `is_quantized_precision("w4a16")` returns `False` (4 not in `_BITS_TO_WEIGHT_TYPE`, `precision.py:57`) so it is rejected before quantization, not recognized at all.
+
+### docs/concepts/compile-and-epcontext.md
+- `--no-quant` on `winml compile` (line 29): flag does not exist in `compile.py`. Users get "Error: No such option".
+
+### docs/concepts/config-and-build.md
+- JSON `compile` section uses nested `ep_config.provider` structure. `WinMLCompileConfig.to_dict()` (`configs.py:230–245`) serializes flat with `execution_provider`, not nested under `ep_config`. Copy-pasting the example silently uses defaults instead of the specified values.
+
+### docs/samples/bert-config-build.md
+- Final artifact documented as `bert_out/bert-base-uncased_ctx.onnx`. `build.py:714` writes `final_path = resolved_dir / "model.onnx"` for non-cached builds. The file is `bert_out/model.onnx`; the ctx-name variant does not exist. Step 3 `winml perf` reference to `bert-base-uncased_ctx.onnx` will also fail.
+
+### docs/samples/convnext-primitives.md
+- CPU (`--device cpu`) and GPU (`--device gpu`) compile steps documented as producing `_cpu_ctx.onnx` and `_dml_ctx.onnx`. Both are wrong: `for_cpu()` and `for_dml()` set `enable_ep_context=False` (`configs.py:165,175`); no `_ctx.onnx` is written. The GPU perf tab referencing `convnext_int8_dml_ctx.onnx` will fail.
+- Note claims "`--device` does not accept `auto` on `winml eval`". `eval.py:69` lists `auto` as a valid choice with `default="auto"`.
+
+### docs/tutorials/npu-convnext.md
+- CPU artifact named `convnext_int8_cpu_ctx.onnx` (steps 7–8): `for_cpu()` sets `enable_ep_context=False` (`configs.py:165`); no such file is produced. The Step 8 perf command referencing it will fail.
+- Python 3.10 listed in prerequisites; `pyproject.toml:13` requires `>=3.11,<3.12`.
+
+### docs/concepts/perf-and-monitoring.md
+- `--device` described as accepting only `cpu`, `gpu`, `npu`. `perf.py:1113` uses `device_option(include_auto=True, default="auto")`; `auto` is valid and is the default.
+- Default output path stated as `{model_slug}_perf.json` (current directory). Source `perf.py:871` writes to `~/.cache/winml/perf/<slug>/<timestamp>.json`.
+
+---
+
+## Important (misleading or stale)
+
+### docs/getting-started/installation.md
+- "No NPU?" callout claims `winml eval` accepts only `cpu|gpu|npu` (no `auto`). `eval.py:69` defines `click.Choice(["auto", "cpu", "gpu", "npu"])` — `auto` is valid.
+
+### docs/getting-started/end-to-end.md
+- Sample `sys` output shows `QNNExecutionProvider -> NPU`. `get_ep_device_map()` returns `"npu/gpu"` for QNN (`device.py:49`, `constants.py:183`); actual rendered output would be `QNNExecutionProvider -> NPU/GPU`.
+
+### docs/commands/build.md
+- `--no-compile/--compile` documented as a simple `--no-compile` flag; source `build.py:275–282` is a boolean toggle pair; `--compile` (force enable) is undocumented.
+- `--trust-remote-code` absent from flag table; `build.py:312–314` defines it.
+- `--max-optim-iterations` table default shown as `3`; `build.py:309` sets `default=None` (3 is enforced inside pipeline helpers, not at Click layer).
+
+### docs/commands/config.md
+- `--no-compile` framing is backwards (see Critical). The entire usage example `winml config ... --no-compile` implies the flag does work when it is a no-op (already the default).
+
+### docs/commands/hub.md
+- "How it works" describes per-EP latency stats and accuracy verdicts (PASS/AT_RISK/REGRESSION) that do not appear anywhere in `catalog.py`. The rendered catalog shows only Model, Task, Size, Model Type columns.
+- `--ep` and `--device` filter flags (`catalog.py:377–385`) absent from the flag table entirely.
+
+### docs/commands/analyze.md
+- `--ep` valid special values `"all"` and `"auto"` not mentioned; `analyze.py:634` includes both in `Choice`. Related: "Omitting `--ep` analyzes every EP" (pitfall line 82) repeats the incorrect default claim.
+- `--model` short form `-m` shown with empty Short column; `cli.py:68` defines `"--model", "-m"`.
+- `--verbose/-v`, `--quiet/-q`, and `--config/-c` absent from flag table; all defined via decorators (`analyze.py:651–652`).
+
+### docs/commands/optimize.md
+- `--verbose/-v` absent from flag table; `optimize.py:180–185` defines it.
+- `--model` Short column is empty; `optimize.py:167` defines `-m`.
+- "Configuration precedence" describes 4 levels (including "preset"); source has 3 levels (`optimize.py:363–383`). The preset level does not exist.
+
+### docs/commands/inspect.md
+- `-v`/`--verbose` absent from flag table; `inspect.py:78–83` defines it.
+
+### docs/commands/perf.md
+- `--compare-devices` listed as "Not yet implemented" but the flag is not registered at all in `perf.py`. Passing it will error, not silently be ignored.
+- `--op-tracing` documented as a user-facing feature; `perf.py:1183` decorates it `hidden=True`.
+- Default output path documented as `{model_slug}_perf.json`; actual path is `~/.cache/winml/perf/<slug>/<timestamp>.json` (`perf.py:871`).
+
+### docs/commands/sys.md
+- `--verbose/-v` absent from flag table; `sys.py:653–659` defines it. Verbose mode surfaces Backend SDKs and Export Readiness sections (`sys.py:392`).
+
+### docs/concepts/config-and-build.md
+- `WinMLBuildConfig` described as having five sub-configs; `config/build.py:132–138` also has `eval: WinMLEvaluationConfig | None` and `auto: bool`.
+
+### docs/samples/npu-convnext.md
+- Step 7 OpenVINO artifact named `convnext_int8_openvino_ctx.onnx`; `compile.py:230` uses `{stem}_{device}_ctx.onnx` where device is the resolved device string (`"npu"`), not the EP name. Actual filename would be `convnext_int8_npu_ctx.onnx`.
+
+### docs/concepts/perf-and-monitoring.md
+- `--monitor` described as streaming "NPU utilisation". Source resolves from model device at runtime (`perf.py:409`); it monitors whichever device is being benchmarked, not NPU specifically.
+- `--op-tracing` documented as a supported feature; it is `hidden=True` (`perf.py:1183`).
+
+### docs/commands/overview.md
+- `src/winml/modelkit/commands/_options.py` cited as "canonical contract" for global flags. This file does not exist (`_options.py` absent from `commands/` directory). Global flags are in `cli.py`.
+
+---
+
+## Rejected (claimed by an agent but not a real defect)
+
+### docs/concepts/quantization.md
+- ["`_KNOWN_PRECISIONS` from `_options.py`" is fabricated] — REJECTED: The claim itself is being kept as Critical because both `_KNOWN_PRECISIONS` and `_options.py` are absent from the codebase (confirmed `_options.py` not in commands/, and `grep` for `_KNOWN_PRECISIONS` returns nothing). The actual symbol is `_NAMED_PRECISIONS` at `precision.py:71`. The finding is genuine, not a false positive.
+
+### docs/concepts/compile-and-epcontext.md
+- [External EPContext described as "default"] — REJECTED as false positive: `EPConfig.embed_context: bool = False` at `configs.py:46` confirms external is the default. Doc is correct.
+- [`--no-validate` flag] — REJECTED as false positive: `compile.py:72–74` defines `--validate/--no-validate`; the doc's use of `--no-validate` correctly names the negative form of the toggle.
+
+### docs/getting-started/end-to-end.md
+- [`QNN_SDK_ROOT` from environment] — REJECTED: `build.py` has no `--qnn-sdk-root` flag (confirmed zero hits). Reading from environment is the correct description.
+- [`--device auto` priority order "NPU first, then GPU, then CPU"] — REJECTED: `device.py:62` confirms `_DEVICE_PRIORITY = ("npu", "gpu", "cpu")`. Claim is correct.
+
+### docs/commands/compile-and-epcontext.md
+- [`for_vitisai` and `for_qnn` described as interchangeable "QNN-family EPs"] — REJECTED as below threshold: both produce EPContext, the distinction noted by the agent is a simplification, not a user-facing error.
+
+### docs/concepts/quantization.md
+- [`int16` weight dtype listed as `int16`] — REJECTED: `_WEIGHT_TYPE["int16"] = "int16"` at `precision.py:43`. Doc row is correct.
+
+### docs/commands/eval.md
+- [`winml eval` loads via `WinMLAutoModel`] — this claim is kept as Important (class name misrepresents implementation) but the agent's flag about missing flags is correct and retained above.

From e440a594480a1e318aaffb46d61cbdecd61bfd1e Mon Sep 17 00:00:00 2001
From: Zac <1221537+tezheng@users.noreply.github.com>
Date: Wed, 27 May 2026 08:52:04 +0800
Subject: [PATCH 004/143] docs(fix): apply 47 validated fixes from
 microsoft/winml-cli source review

5 parallel fix agents applied the validated-issues list. Net: 25 Critical
+ 22 Important defects resolved across 20 doc files + mkdocs.yml.

Major fixes by area:

Concepts (4 pages):
- quantization.md: NPU auto-precision corrected to w8a16 (was int8);
  w4a16 description corrected (rejected at validation, not 'recognized
  but raises at quantization'); _KNOWN_PRECISIONS/_options.py references
  replaced with the actual _NAMED_PRECISIONS/precision.py
- compile-and-epcontext.md: removed non-existent --no-quant flag mention
- config-and-build.md: JSON 'compile' section flattened to use
  execution_provider (not nested ep_config.provider); table expanded to
  the actual 7 sub-configs (added eval, auto)
- perf-and-monitoring.md: --device documented as accepting auto;
  output path corrected to ~/.cache/winml/perf/<slug>/<timestamp>.json;
  --monitor not NPU-specific; --op-tracing marked hidden

Commands (11 pages):
- overview.md: winml hub renamed to winml catalog throughout;
  _options.py reference replaced with cli.py
- hub.md: H1 and all invocations changed to 'winml catalog'; removed
  non-existent --model/-m flag; rewrote 'How it works' (no per-EP latency
  / accuracy-verdict columns exist); added --ep/--device filter flags
- build.md: --config marked optional (was required); --random-init and
  --qnn-sdk-root removed (don't exist); --no-compile/--compile toggle
  pair documented; --trust-remote-code added; --max-optim-iterations
  default corrected to None
- compile.md: --device default corrected to auto; --no-quant flag
  removed (doesn't exist on compile)
- config.md: --no-compile/--compile framing corrected (compile is
  EXCLUDED by default; users need --compile to include)
- eval.md: --device includes auto (default auto, not cpu); -n short
  alias removed; class reference replaced with actual evaluate function
- analyze.md: --device default corrected to auto; --ep default to
  auto; --run-unknown-op default to False; -m/-v/-q/-c flags added
- optimize.md: --preset/-p flag and entire Built-in presets table
  removed (flag doesn't exist); --verbose added; 'Configuration
  precedence' reduced from 4 levels to 3
- inspect.md: --list-tasks, --model-type, --model-class, --verbose
  flags added
- perf.md: --compare-devices removed (not registered at all); output
  path corrected; --op-tracing marked hidden
- sys.md: --verbose/-v added to flag table

Samples / Tutorials / Getting Started (5 pages):
- installation.md: Python 3.10 corrected to 3.11; 'No NPU?' callout
  no longer claims winml eval rejects auto (it accepts auto on main)
- end-to-end.md: dropped incorrect _ctx.onnx CPU/DML artifacts;
  QNNExecutionProvider mapped to NPU/GPU (not just NPU)
- convnext-primitives.md: CPU/GPU compile clarified (no _ctx.onnx
  produced; uses convnext_int8.onnx directly); winml eval auto reverted
- bert-config-build.md: build final artifact corrected to model.onnx
  (was bert-base-uncased_ctx.onnx)
- npu-convnext.md: Python 3.10 -> 3.11; OpenVINO artifact filename
  corrected to use device string (_npu_ctx.onnx not _openvino_ctx.onnx);
  CPU compile tab dropped (CPU doesn't produce _ctx.onnx)

mkdocs.yml: nav label 'hub' renamed to 'catalog' to match the actual
command name on microsoft/winml-cli main.
---
 docs/commands/analyze.md               | 19 ++++----
 docs/commands/build.md                 | 14 +++---
 docs/commands/compile.md               | 25 +++++------
 docs/commands/config.md                |  6 +--
 docs/commands/eval.md                  |  8 ++--
 docs/commands/hub.md                   | 60 +++++++++++---------------
 docs/commands/inspect.md               |  9 ++--
 docs/commands/optimize.md              | 35 ++++-----------
 docs/commands/overview.md              | 10 ++---
 docs/commands/perf.md                  |  9 ++--
 docs/commands/sys.md                   |  1 +
 docs/concepts/compile-and-epcontext.md |  2 +-
 docs/concepts/config-and-build.md      | 12 +++---
 docs/concepts/perf-and-monitoring.md   |  8 ++--
 docs/concepts/quantization.md          |  8 ++--
 docs/getting-started/end-to-end.md     | 17 +++-----
 docs/getting-started/installation.md   | 14 +++---
 docs/samples/bert-config-build.md      |  6 +--
 docs/samples/convnext-primitives.md    | 12 +++---
 docs/tutorials/npu-convnext.md         | 16 +++----
 mkdocs.yml                             |  2 +-
 21 files changed, 129 insertions(+), 164 deletions(-)

diff --git a/docs/commands/analyze.md b/docs/commands/analyze.md
index 24292e381..6466435dc 100644
--- a/docs/commands/analyze.md
+++ b/docs/commands/analyze.md
@@ -16,13 +16,16 @@ $ winml analyze [options]
 
 | Flag | Short | Type | Default | Description |
 |------|-------|------|---------|-------------|
-| `--model` | | `PATH` | *(required)* | Path to the ONNX model file to analyze. |
-| `--ep` | | choice | *(none)* | Target execution provider. Accepts full names (`QNNExecutionProvider`, `OpenVINOExecutionProvider`, `VitisAIExecutionProvider`) or short aliases (`qnn`, `ov`/`openvino`, `vitis`/`vitisai`). When omitted, all supported EPs are analyzed. |
-| `--device` | | `CPU\|GPU\|NPU` | `NPU` | Target device type. Filters the analysis for the named device class when `--ep` is also supplied. When omitted, defaults to NPU. |
+| `--model` | `-m` | `PATH` | *(required)* | Path to the ONNX model file to analyze. |
+| `--ep` | | choice | `auto` | Target execution provider. Accepts full names (`QNNExecutionProvider`, `OpenVINOExecutionProvider`, `VitisAIExecutionProvider`), short aliases (`qnn`, `ov`/`openvino`, `vitis`/`vitisai`), `all` (all rule-data-backed EPs), or `auto` (infer from local availability). |
+| `--device` | | `cpu\|gpu\|npu\|all\|auto` | `auto` | Target device type. `auto` infers from local availability; `all` evaluates all rule-data-backed devices. |
+| `--verbose` | `-v` | flag | off | Enable verbose output. |
+| `--quiet` | `-q` | flag | off | Suppress non-essential output. |
+| `--config` | `-c` | `PATH` | *(none)* | Build configuration file (YAML/JSON). |
 | `--output` | | `PATH` | *(none)* | Save the full JSON result to a file in addition to printing the console summary. |
 | `--information` / `--no-information` | | flag | enabled | Include detailed per-operator recommendations and remediation hints in the output. Pass `--no-information` for a compact pass/fail summary. |
 | `--htp-metadata` | | `PATH` | *(none)* | Path to an HTP metadata JSON file. Enables enhanced Qualcomm-specific pattern extraction when targeting QNN. |
-| `--run-unknown-op` / `--no-run-unknown-op` | | flag | enabled | Attempt to run operators unknown to the EP locally to infer shape and type information. Disable when the local machine lacks the required libraries. |
+| `--run-unknown-op` / `--no-run-unknown-op` | | flag | disabled | Attempt to run operators unknown to the EP locally to infer shape and type information. Enable when local libraries are available. |
 | `--save-node` | | `partial\|unsupported` | *(none)* | Save partial or unsupported node subgraphs to disk for further investigation. Can be specified multiple times: `--save-node partial --save-node unsupported`. |
 
 ## How it works
@@ -31,14 +34,14 @@ $ winml analyze [options]
 
 ## Examples
 
-Analyze against all supported EPs using the default NPU device:
+Analyze using auto-detected EP and device:
 
 ```bash
 $ winml analyze --model microsoft/resnet-50.onnx
 ```
 
 ```text
-Analyzing microsoft/resnet-50.onnx against all supported EPs...
+Analyzing microsoft/resnet-50.onnx (EP: auto, device: auto)...
 
 QNNExecutionProvider (NPU): FULLY SUPPORTED
   Operators checked : 142
@@ -79,10 +82,10 @@ $ winml analyze --model bert-base-uncased.onnx \
 
 ## Common pitfalls
 
-- **Omitting `--ep` analyzes every EP** — this is slower and may produce confusing output when one EP shows unsupported operators that another handles fine. Specify `--ep` when you know your target hardware.
+- **Omitting `--ep` uses `auto` (inferred from local availability)** — to analyze every EP regardless of what is installed, pass `--ep all`. Specify `--ep <name>` when you know your target hardware.
 - **Exit code 1 is not a hard failure** — it means at least one operator is unsupported, not that the model cannot run at all. Many EPs fall back unsupported nodes to the CPU EP automatically; review the recommendations before deciding to restructure the model.
 - **`--htp-metadata` is QNN-specific** — passing a QNN HTP metadata file while targeting a different EP has no effect. Ensure the EP and metadata file correspond to the same hardware.
-- **`--no-run-unknown-op` may widen the unsupported list** — if local execution is disabled, operators whose support cannot be verified statically are conservatively marked as unsupported.
+- **`--run-unknown-op` is disabled by default** — operators whose support cannot be verified statically are conservatively marked as unsupported unless you explicitly pass `--run-unknown-op`. Enable it only when the required local libraries are present.
 - **The model path must point to an existing `.onnx` file** — symbolic HuggingFace model IDs are not accepted; export the model first with `winml export`.
 
 ## See also
diff --git a/docs/commands/build.md b/docs/commands/build.md
index 45fca553c..857effba8 100644
--- a/docs/commands/build.md
+++ b/docs/commands/build.md
@@ -20,19 +20,19 @@ $ winml build [options]
 
 | Flag | Short | Type | Default | Description |
 |---|---|---|---|---|
-| `--config` | `-c` | path | *(required)* | `WinMLBuildConfig` JSON file, generated by `winml config`. |
-| `--model` | `-m` | string | *(required)* | Hugging Face model ID or path to an existing `.onnx` file. |
+| `--config` | `-c` | path | `None` | `WinMLBuildConfig` JSON file, generated by `winml config`. If omitted, config is auto-generated from `-m`. |
+| `--model` | `-m` | string | `None` | Hugging Face model ID or path to an existing `.onnx` file. |
 | `--output-dir` | `-o` | path | `None` | Directory for all build artifacts. Mutually exclusive with `--use-cache`. |
 | `--use-cache` | | flag | `false` | Store artifacts in the winml-cli global cache (`~/.cache/winml/`). Mutually exclusive with `--output-dir`. |
-| `--random-init` | | flag | `false` | Skip weight download; build with random weights (useful for architecture testing). |
 | `--rebuild` | | flag | `false` | Overwrite existing artifacts and re-run the full pipeline. |
 | `--no-quant` | | flag | `false` | Skip the quantization stage, overriding the config. |
-| `--no-compile` | | flag | `false` | Skip the compilation stage, overriding the config. |
+| `--no-compile` / `--compile` | | flag | `None` | Override compilation. `--compile` forces enable (config must have a compile section). `--no-compile` forces skip. Default: inherit from config. |
 | `--no-optimize` | | flag | `false` | Skip the optimization stage (for pre-quantized ONNX input models). |
 | `--ep` | | string | `None` | Target execution provider for the analyzer (e.g., `qnn`). Falls back to the compile config EP if not set. |
-| `--device` | | string | `None` | Target device for the analyzer (e.g., `NPU`, `GPU`). Default: `NPU`. |
+| `--device` | | string | `auto` | Target device for the analyzer (e.g., `npu`, `gpu`). Default: `auto` (auto-detect). |
 | `--no-analyze` | | flag | `false` | Skip the analyzer loop during build. |
-| `--max-optim-iterations` | | integer | `3` | Maximum autoconf re-optimization rounds. `--no-analyze` implicitly sets this to 0. |
+| `--max-optim-iterations` | | integer | `None` | Maximum autoconf re-optimization rounds (3 enforced internally when not set). `--no-analyze` implicitly sets this to 0. |
+| `--trust-remote-code` | | flag | `false` | Allow executing custom code from model repositories. Use only with trusted sources. |
 | `--help` | `-h` | flag | | Show this message and exit. |
 
 ## How it works
@@ -99,8 +99,6 @@ winml build -c config.json -m microsoft/resnet-50 \
   array (module mode), only `--output-dir` is accepted.
 - **The config file must come from `winml config`.** The schema is strict;
   unknown keys are rejected.
-- **`--random-init` can produce silent failures for some architectures.**
-  Use a real model ID when accuracy matters.
 - **Existing artifacts are reused by default.** Pass `--rebuild` to force a
   fresh run after changing the config.
 
diff --git a/docs/commands/compile.md b/docs/commands/compile.md
index decc0e7d1..46126f1f4 100644
--- a/docs/commands/compile.md
+++ b/docs/commands/compile.md
@@ -20,9 +20,8 @@ $ winml compile [options]
 |---|---|---|---|---|
 | `--model` | `-m` | path | *(required unless `--list`)* | Input ONNX model file. |
 | `--output-dir` | | path | same dir as input | Directory to write compiled output artifacts. |
-| `--device` | `-d` | choice | `npu` | Target device: `auto`, `npu`, `gpu`, or `cpu`. |
+| `--device` | `-d` | choice | `auto` | Target device: `auto`, `npu`, `gpu`, or `cpu`. |
 | `--ep` | | choice | `None` | Force a specific execution provider, overriding device-to-provider mapping. Choices: `cpu`, `cuda`, `dml`, `migraphx`, `openvino`, `qnn`, `tensorrt`, `vitisai`. |
-| `--no-quant` | | flag | `false` | Flag retained for compatibility; quantization is no longer performed during compile. Use `winml quantize` beforehand. |
 | `--no-validate` | | flag | `false` | Skip validation of the compiled model after compilation. |
 | `--compiler` | | choice | `ort` | Compiler backend: `ort` (ONNX Runtime) or `qairt` (Qualcomm AI Runtime Tools). |
 | `--qnn-sdk-root` | | path | `None` | Path to the QAIRT/QNN SDK root directory. Required when `--compiler qairt` is set. |
@@ -34,18 +33,19 @@ $ winml compile [options]
 
 `winml compile` resolves the target execution provider from `--device` and
 `--ep`, then calls the winml-cli compiler API to hand the ONNX graph to the
-EP's offline compilation toolchain. For the default NPU target, ONNX Runtime's
-QNN EP generates a binary `.bin` context file (or embeds it inline with
-`--embed`) that encodes the hardware-optimized execution plan, eliminating
-graph partitioning at load time. When `--compiler qairt` is used, the
-Qualcomm AI Runtime Tools SDK is invoked directly (requires `--qnn-sdk-root`).
+EP's offline compilation toolchain. When `--device auto` (the default), the
+target EP is determined by auto-detecting available hardware. For NPU targets,
+ONNX Runtime's QNN EP generates a binary `.bin` context file (or embeds it
+inline with `--embed`) that encodes the hardware-optimized execution plan,
+eliminating graph partitioning at load time. When `--compiler qairt` is used,
+the Qualcomm AI Runtime Tools SDK is invoked directly (requires `--qnn-sdk-root`).
 An optional post-compilation validation pass runs a forward pass through the
 target EP; skip it with `--no-validate` when the target hardware is absent.
 
 ## Examples
 
 ```bash
-# Compile for NPU (default device and compiler)
+# Compile with auto device detection (default compiler)
 winml compile -m resnet50_qdq.onnx
 ```
 
@@ -88,9 +88,6 @@ winml compile -m facebook_convnext_qdq.onnx \
 
 ## Common pitfalls
 
-- **`--no-quant` is a no-op in the current release.** Quantization is no longer
-  performed during compile; run `winml quantize` on your model first, then pass
-  the QDQ model to this command.
 - **`--compiler qairt` requires `--qnn-sdk-root`.** Without a valid SDK path,
   compilation will fail immediately with a missing-executable error.
 - **`--embed` inflates the `.onnx` file significantly.** Embedding the EP
@@ -99,9 +96,9 @@ winml compile -m facebook_convnext_qdq.onnx \
 - **Validation requires the target hardware.** The post-compilation validation
   step runs an actual inference pass; on a machine without the NPU driver or the
   relevant EP installed, always pass `--no-validate`.
-- **`--device` default is `npu`, not `auto`.** Unlike other commands, compile
-  defaults to NPU targeting. Pass `--device cpu` or `--device gpu` explicitly
-  when targeting other hardware.
+- **`--device auto` auto-detects the best available hardware.** Pass `--device npu`,
+  `--device gpu`, or `--device cpu` explicitly when targeting specific hardware
+  regardless of what is auto-detected.
 
 ## See also
 
diff --git a/docs/commands/config.md b/docs/commands/config.md
index cf1cd1c2b..d3eccb41a 100644
--- a/docs/commands/config.md
+++ b/docs/commands/config.md
@@ -29,7 +29,7 @@ $ winml config [options]
 | `--output` | `-o` | `PATH` | *(stdout)* | Write the generated JSON to this file instead of printing to stdout. |
 | `--library` | | `TEXT` | `transformers` | Source library for `TasksManager` task lookup. Defaults to `transformers`; set to `diffusers` or another Optimum-supported library when needed. |
 | `--no-quant` | | flag | off | Omit quantization from the generated config (sets `quant` to `null`). Equivalent to removing the `quant` section before passing to `winml build`. |
-| `--no-compile` | | flag | off | Omit compilation from the generated config (sets `compile` to `null`). Use this when you want to inspect the optimized ONNX before EP-specific compilation. |
+| `--no-compile` / `--compile` | | flag | `--no-compile` (compile excluded by default) | Controls whether compilation is included in the generated config. By default compilation is **excluded** (`compile: null`). Pass `--compile` to include a compile section. |
 | `--trust-remote-code` | | flag | off | Allow execution of custom model code from the HuggingFace repository. Required for some community models. Only enable for repositories you trust. |
 
 ## How it works
@@ -75,10 +75,10 @@ Generate from a model type alone (no HuggingFace download required at config tim
 $ winml config --model-type bert --task fill-mask
 ```
 
-Generate a config from an already-exported ONNX file, skipping quantization and compilation:
+Generate a config from an already-exported ONNX file, skipping quantization (compilation is already excluded by default):
 
 ```bash
-$ winml config -m facebook/convnext-tiny-224.onnx --no-quant --no-compile -o convnext_optim_only.json
+$ winml config -m facebook/convnext-tiny-224.onnx --no-quant -o convnext_optim_only.json
 ```
 
 ## Common pitfalls
diff --git a/docs/commands/eval.md b/docs/commands/eval.md
index 6c70af952..b27da4b0c 100644
--- a/docs/commands/eval.md
+++ b/docs/commands/eval.md
@@ -21,8 +21,8 @@ $ winml eval [options]
 | `--dataset` | | `TEXT` | task default | HuggingFace dataset path (e.g., `imagenet-1k`, `glue`). If omitted, a default dataset is selected based on the task. |
 | `--dataset-name` | | `TEXT` | — | Dataset configuration name for multi-config datasets (e.g., `mrpc` within `glue`). |
 | `--task` | | `TEXT` | auto-detected | Task name (e.g., `image-classification`). Auto-detected from `--model-id` when not provided. |
-| `--device` | | `cpu\|gpu\|npu` | `cpu` | Device to run inference on during evaluation. |
-| `--samples` | `-n` (alias) | `INTEGER` | `100` | Number of dataset samples to evaluate. |
+| `--device` | | `auto\|cpu\|gpu\|npu` | `auto` | Device to run inference on during evaluation. `auto` selects the best available device. |
+| `--samples` | | `INTEGER` | `100` | Number of dataset samples to evaluate. |
 | `--split` | | `TEXT` | `validation` | Dataset split to use (e.g., `validation`, `test`, `train`). |
 | `--shuffle / --no-shuffle` | | flag | `shuffle` | Shuffle the dataset before sampling. Disable with `--no-shuffle` for reproducible sample ordering. |
 | `--streaming` | | flag | `false` | Stream the dataset from the Hub instead of downloading the full split. Useful for large datasets. |
@@ -33,7 +33,7 @@ $ winml eval [options]
 
 ## How it works
 
-`winml eval` loads the model via `WinMLAutoModel` (supporting both HuggingFace IDs and local ONNX files), then pulls the requested number of samples from a HuggingFace dataset. Each sample is preprocessed using the tokenizer or image processor associated with the model ID, passed through the ONNX Runtime session, and the output is compared against the ground-truth label. Aggregated metrics (accuracy, F1, etc.) are printed to the console and optionally written to a JSON file. When `-m` is an ONNX file, `--model-id` must be provided so the command knows which preprocessor and label vocabulary to use.
+`winml eval` loads the model and runs the evaluation pipeline via the internal `evaluate` function (supporting both HuggingFace IDs and local ONNX files), then pulls the requested number of samples from a HuggingFace dataset. Each sample is preprocessed using the tokenizer or image processor associated with the model ID, passed through the ONNX Runtime session, and the output is compared against the ground-truth label. Aggregated metrics (accuracy, F1, etc.) are printed to the console and optionally written to a JSON file. When `-m` is an ONNX file, `--model-id` must be provided so the command knows which preprocessor and label vocabulary to use.
 
 ## Examples
 
@@ -46,7 +46,7 @@ $ winml eval -m microsoft/resnet-50
 ```text
 Task:     image-classification
 Dataset:  imagenet-1k (validation, 100 samples)
-Device:   cpu
+Device:   auto
 
 Accuracy: 76.00%
 
diff --git a/docs/commands/hub.md b/docs/commands/hub.md
index efba66652..941cf85fd 100644
--- a/docs/commands/hub.md
+++ b/docs/commands/hub.md
@@ -1,10 +1,10 @@
-# winml hub
+# winml catalog
 
 > Browse the curated winml-cli catalog of validated models and benchmarks.
 
 ## When to use this
 
-Use `winml hub` to discover which HuggingFace models have been validated end-to-end
+Use `winml catalog` to discover which HuggingFace models have been validated end-to-end
 by the winml-cli team — exported, quantized, compiled, and benchmarked on real Windows
 ML devices. It is the starting point when you want a model that is known to work
 before investing time in a custom build.
@@ -12,7 +12,7 @@ before investing time in a custom build.
 ## Synopsis
 
 ```bash
-$ winml hub [options]
+$ winml catalog [options]
 ```
 
 ## Flags
@@ -21,29 +21,28 @@ $ winml hub [options]
 |------|-------|------|---------|-------------|
 | `--model-type` | `-t` | string | `null` | Filter the catalog by model architecture (case-insensitive). Examples: `bert`, `roberta`, `vit`. |
 | `--task` | `-k` | string | `null` | Filter by HuggingFace task (case-insensitive). Examples: `text-classification`, `image-segmentation`. |
-| `--model` | `-m` | string | `null` | Show detailed latency and accuracy benchmarks for a specific model ID. Accepts exact ID or an unambiguous substring. |
-| `--output` | `-o` | path | `null` | Save the displayed results to a JSON file. Works for both list and detail views. |
+| `--ep` | | string | `null` | Filter by execution provider (e.g., `qnn`, `dml`). If not specified, shows all EPs. |
+| `--device` | | string | `null` | Filter by target device (e.g., `npu`, `gpu`). If not specified, shows all devices. |
+| `--output` | `-o` | path | `null` | Save the displayed results to a JSON file. |
 | `--help` | `-h` | flag | — | Show help and exit. |
 
-> `winml hub` reads a local catalog bundled with the package — no network access is
-> required. It does not accept `--device`, `--ep`, or `--precision`.
+> `winml catalog` reads a local catalog bundled with the package — no network access is
+> required.
 
 ## How it works
 
 The catalog is stored in `winml/modelkit/data/hub_models.json` and is loaded
 directly from the installed package data without any network call. Each catalog
-entry records the model ID, task, architecture type, per-EP latency statistics
-(avg, P50, P90, P95, P99, min, max, QPS), and per-EP accuracy results compared
-against a floating-point FP32 baseline. The accuracy verdict uses three levels:
-`PASS` (drop within tolerance), `AT_RISK` (borderline), and `REGRESSION` (exceeds
-threshold). When `--output` is provided, the displayed data — whether a filtered
-list or a single model's detail — is written as indented JSON to the specified path.
+entry records the model ID, task, architecture type, and model size. Use
+`--model-type`, `--task`, `--ep`, or `--device` to narrow the displayed list.
+When `--output` is provided, the filtered results are written as indented JSON
+to the specified path.
 
 ## Examples
 
 ```bash
 # List all validated models in the catalog
-$ winml hub
+$ winml catalog
 ```
 
 ```text
@@ -54,32 +53,27 @@ $ winml hub
 │ ├ ProsusAI/finbert                 text-classification     bert           │
 │ └ ...                                                                     │
 ╰────────────────────────────────────────────────────────────────────────────╯
-Use  winml hub --model <id>  to see perf and accuracy details.
+Use  --ep  or  --device  to filter by execution provider or target device.
 ```
 
 ```bash
 # Filter to BERT-family models only
-$ winml hub --model-type bert
+$ winml catalog --model-type bert
 ```
 
 ```bash
 # Filter by task — show only text-classification models
-$ winml hub --task text-classification
+$ winml catalog --task text-classification
 ```
 
 ```bash
 # Combine filters — BERT models for text classification
-$ winml hub --model-type bert --task text-classification
-```
-
-```bash
-# Show latency and accuracy details for a specific model
-$ winml hub --model ProsusAI/finbert
+$ winml catalog --model-type bert --task text-classification
 ```
 
 ```bash
 # Save filtered results to JSON for offline review
-$ winml hub --task image-classification --output results/image_catalog.json
+$ winml catalog --task image-classification --output results/image_catalog.json
 ```
 
 ## Common pitfalls
@@ -87,20 +81,16 @@ $ winml hub --task image-classification --output results/image_catalog.json
 - **`--task` short flag is `-k`, not `-t`.** The `-t` short flag is taken by
   `--model-type`. Using `-t text-classification` will set the architecture filter,
   not the task filter. Use `-k` or the full `--task` flag.
-- **`--model` performs substring matching when no exact match exists.** If the
-  substring matches more than one catalog entry, the command raises an error and
-  lists the candidates. Use the full model ID to avoid ambiguity.
 - **The catalog reflects a point-in-time snapshot.** Models listed in the catalog
   were validated against a specific version of winml-cli, ONNX Runtime, and the
   relevant EP driver. Accuracy and latency may differ on your hardware or with
   updated drivers.
-- **`--output` only saves what was displayed.** Combining `--model` with `--output`
-  saves the single model's detail dict. Combining a filter with `--output` saves the
-  filtered list. There is no flag to dump the entire catalog in one call — omit all
-  filters and add `--output` to do so.
-- **A model not in the hub can still be used with winml-cli.** The catalog covers
+- **`--output` only saves what was displayed.** Combining a filter with `--output`
+  saves the filtered list. There is no flag to dump the entire catalog in one call —
+  omit all filters and add `--output` to do so.
+- **A model not in the catalog can still be used with winml-cli.** The catalog covers
   tested models; `winml inspect` and `winml export` work with any HuggingFace model
-  that has a supported architecture, whether or not it appears in the hub.
+  that has a supported architecture, whether or not it appears in the catalog.
 
 ## See also
 
@@ -109,5 +99,5 @@ $ winml hub --task image-classification --output results/image_catalog.json
 - [sys.md](sys.md) — verify your environment and EP availability before building
 - [How winml-cli Works](../concepts/how-it-works.md) — pipeline overview from export
   to benchmark
-- [Quantization & QDQ](../concepts/quantization.md) — understand accuracy verdicts
-  and what `drop_pct` measures
+- [Quantization & QDQ](../concepts/quantization.md) — understand quantization concepts
+  and precision options
diff --git a/docs/commands/inspect.md b/docs/commands/inspect.md
index c6944108a..2ba2c7edb 100644
--- a/docs/commands/inspect.md
+++ b/docs/commands/inspect.md
@@ -19,10 +19,14 @@ $ winml inspect -m <model_id> [options]
 
 | Flag | Short | Type | Default | Description |
 |------|-------|------|---------|-------------|
-| `--model` | `-m` | string | **required** | HuggingFace model ID (e.g. `openai/clip-vit-base-patch32`). Required unless `--help` is used. |
+| `--model` | `-m` | string | **required** | HuggingFace model ID (e.g. `openai/clip-vit-base-patch32`). Required unless `--list-tasks` or `--help` is used. |
 | `--format` | `-f` | `table` \| `json` | `table` | Output format. `table` renders rich panels; `json` emits a machine-readable object. |
 | `--task` | `-t` | string | `null` | Override the auto-detected task (e.g. `image-classification`, `feature-extraction`). |
 | `--hierarchy` | `-H` | flag | `false` | Print the PyTorch module tree. Instantiates the model with random weights — no weight download required. |
+| `--verbose` | `-v` | flag | `false` | Show full configuration details. |
+| `--list-tasks` | | flag | `false` | List all known tasks and exit. Does not require `--model`. |
+| `--model-type` | | string | `null` | Override model type (e.g. `bert`, `resnet`). Can be used without `--model`. |
+| `--model-class` | | string | `null` | Override model class (e.g. `BertForMaskedLM`). Can be used without `--model`. |
 | `--help` | `-h` | flag | — | Show help and exit. |
 
 > `winml inspect` does not accept `--device`, `--ep`, `--precision`, or `--output`.
@@ -78,8 +82,7 @@ $ winml inspect -m facebook/convnext-tiny-224 -v -H
 
 ## Common pitfalls
 
-- **`--model` is always required.** Unlike some other commands, `winml inspect` has
-  no mode that omits `-m`. The flag is marked required; omitting it returns an error.
+- **`--model` is required for model inspection.** The flag is marked required for model-specific lookups; omitting it returns an error. The only exception is `--list-tasks`, which lists all known tasks and exits without needing a model.
 - **Hierarchy requires a locally installable model config.** If the model config
   references a custom architecture not in the local `transformers` installation,
   `--hierarchy` will fail with an import error. Update `transformers` or omit the flag.
diff --git a/docs/commands/optimize.md b/docs/commands/optimize.md
index 78c7c81a5..877f18731 100644
--- a/docs/commands/optimize.md
+++ b/docs/commands/optimize.md
@@ -18,33 +18,23 @@ $ winml optimize [options]
 |------|-------|------|---------|-------------|
 | `--model` | `-m` | `PATH` | *(required unless listing)* | Input ONNX model file. Not required when `--list-capabilities` or `--list-rewrites` is used. |
 | `--output` | `-o` | `PATH` | `{input}_opt.onnx` | Output path for the optimized model. Defaults to the input filename with `_opt` inserted before the extension. |
-| `--preset` | `-p` | `qnn-compatible\|transformer-optimized\|full\|minimal` | *(none)* | Apply a named optimization preset as a starting configuration. CLI flags override preset values. |
-| `--config` | `-c` | `PATH` | *(none)* | YAML or JSON configuration file. Fields in the file override preset defaults; CLI flags override the file. |
+| `--config` | `-c` | `PATH` | *(none)* | YAML or JSON configuration file. Fields in the file override capability defaults; CLI flags override the file. |
+| `--verbose` | `-v` | flag | off | Enable verbose output. |
 | `--list-capabilities` | `-l` | flag | off | Print all registered optimization capabilities grouped by category and exit. Add `--verbose` for descriptions and ORT names. |
 | `--list-rewrites` | | flag | off | Print all available pattern-rewrite families with their source-to-target mappings and exit. |
 | *(dynamic)* | | flag | *(per capability)* | Each registered capability generates a `--enable-<name>` / `--disable-<name>` pair. Run `--list-capabilities` to see the full current list. Examples: `--enable-gelu-fusion`, `--disable-constant-folding`. Pattern-rewrite flags follow the form `--enable-<source-slug>-<target-slug>`; run `--list-rewrites` to discover all names. |
 
-### Built-in presets
-
-| Preset | Description |
-|--------|-------------|
-| `qnn-compatible` | Disables fusions that produce composite ops unsupported by QNN; sets graph optimization level to 1. |
-| `transformer-optimized` | Enables GELU, LayerNorm, BiasGELU, and Attention fusions — ideal for BERT-family models. |
-| `full` | All fusions in `transformer-optimized` plus MatMul+Add. |
-| `minimal` | Graph optimization level 1 only; no fusions applied. |
-
 ### Configuration precedence
 
 When multiple sources are provided, settings are resolved in this order (highest wins):
 
 1. Explicit CLI flags (`--enable-X` / `--disable-X`)
 2. Config file (`-c`)
-3. Preset (`-p`)
-4. Capability defaults
+3. Capability defaults
 
 ## How it works
 
-`winml optimize` loads the ONNX model, builds a final capability configuration from the resolved precedence chain, and runs all enabled passes through the `Optimizer`. Each capability maps to a named optimization or fusion pipe in the `winml.modelkit.optim` registry. The capability flags are auto-generated at startup from that registry — adding a new optimization to the registry automatically makes it available as a CLI flag without any change to this command's source. After optimization, the command prints the before-and-after node count and percentage reduction so you can quantify the effect.
+`winml optimize` loads the ONNX model, builds a final capability configuration by merging capability defaults, an optional config file, and any explicit CLI flags, then runs all enabled passes through the `Optimizer`. Each capability maps to a named optimization or fusion pipe in the `winml.modelkit.optim` registry. The capability flags are auto-generated at startup from that registry — adding a new optimization to the registry automatically makes it available as a CLI flag without any change to this command's source. After optimization, the command prints the before-and-after node count and percentage reduction so you can quantify the effect.
 
 ## Examples
 
@@ -66,28 +56,21 @@ Success! Model optimized: microsoft/resnet-50_opt.onnx
 Nodes: 312 -> 289 (7.4% reduction)
 ```
 
-Apply the transformer preset to a BERT model:
-
-```bash
-$ winml optimize -m bert-base-uncased.onnx --preset transformer-optimized -o bert_opt.onnx
-```
-
-Enable a specific fusion on top of the minimal preset:
+Enable specific fusions for a BERT model:
 
 ```bash
 $ winml optimize -m bert-base-uncased.onnx \
-    --preset minimal \
     --enable-layer-norm-fusion \
     --enable-attention-fusion \
     -o bert_layernorm_attn.onnx
 ```
 
-Use the QNN-compatible preset and save the result for downstream compilation:
+Use a config file to set capabilities and save the result for downstream compilation:
 
 ```bash
 $ winml optimize -m facebook/convnext-tiny-224.onnx \
-    --preset qnn-compatible \
-    -o convnext_qnn_opt.onnx
+    -c optimize_config.yaml \
+    -o convnext_opt.onnx
 ```
 
 List all available optimization capabilities:
@@ -105,7 +88,7 @@ $ winml optimize --list-rewrites
 ## Common pitfalls
 
 - **`--model` is required for actual optimization** — it can be omitted only when using `--list-capabilities` or `--list-rewrites`. Missing `--model` in any other case raises a usage error.
-- **Preset and CLI flags interact via precedence** — a `--disable-X` CLI flag always wins over a preset that enables the same capability, but omitting the flag entirely leaves the preset value in effect. To turn off a capability set by a preset, you must pass the explicit `--disable-X` flag.
+- **Config file and CLI flags interact via precedence** — a `--disable-X` CLI flag always wins over a config file value that enables the same capability, but omitting the flag leaves the config file value in effect. To turn off a capability set by a config file, pass the explicit `--disable-X` flag.
 - **Config file validation errors abort the run** — if the config file contains keys that fail capability validation or dependency checks, the command prints all errors and exits with code 1 without touching the model. Fix the config before retrying.
 - **The dynamic flag list changes between releases** — new capabilities are added as the optimizer registry grows. Always use `--list-capabilities` to confirm the current set of flags rather than relying on a cached list.
 - **Output path default may overwrite a sibling file** — if you run optimize twice on the same input without specifying `-o`, the second run silently overwrites `{input}_opt.onnx`. Specify an explicit output path in scripts.
diff --git a/docs/commands/overview.md b/docs/commands/overview.md
index ac05e70eb..3e9822b4a 100644
--- a/docs/commands/overview.md
+++ b/docs/commands/overview.md
@@ -5,7 +5,7 @@ journey from model discovery to a deployment-ready artifact. Every subcommand
 shares a consistent invocation style — `winml <command> [flags]` — and the
 same global flags are available on the root `winml` group.
 
-The commands group by user intent. **Discover** (`sys`, `inspect`, `hub`,
+The commands group by user intent. **Discover** (`sys`, `inspect`, `catalog`,
 `analyze`) helps you understand your hardware and model before writing any
 artifacts. **Configure** (`config`, `optimize`) produces a reusable build
 configuration and tunes the ONNX graph. **Build** (`export`, `quantize`,
@@ -13,7 +13,7 @@ configuration and tunes the ONNX graph. **Build** (`export`, `quantize`,
 **Measure** (`perf`, `eval`) benchmarks and validates the result.
 
 The typical workflow follows that order: run `winml sys` to confirm hardware
-and EPs, then `winml inspect` or `winml hub` to verify model support. Use
+and EPs, then `winml inspect` or `winml catalog` to verify model support. Use
 `winml config` to generate a build configuration, then `winml build` to execute
 the full pipeline — or chain `export` → `optimize` → `quantize` → `compile`
 individually for finer control. Close with `winml perf` and `winml eval` to
@@ -25,7 +25,7 @@ measure speed and accuracy.
 |---|---|---|
 | [`sys`](sys.md) | Discover | Inspect your machine — devices, EPs, SDKs, runtime versions at a glance. |
 | [`inspect`](inspect.md) | Discover | Inspect a model's tasks, classes, and hierarchy before committing to an export. |
-| [`hub`](hub.md) | Discover | Browse the curated winml-cli catalog of validated models and benchmarks. |
+| [`catalog`](hub.md) | Discover | Browse the curated winml-cli catalog of validated models and benchmarks. |
 | [`analyze`](analyze.md) | Discover | Verify an ONNX model is compatible with a target execution provider before deployment. |
 | [`config`](config.md) | Configure | Generate a reusable build configuration for a Hugging Face model or ONNX file. |
 | [`optimize`](optimize.md) | Configure | Apply graph optimizations and fusions to an ONNX model to reduce node count and improve inference speed. |
@@ -40,7 +40,7 @@ measure speed and accuracy.
 
 - **I want to see what hardware and EPs I have** → `winml sys`
 - **I want to know if my model is supported** → `winml inspect`
-- **I want to browse validated models with known benchmarks** → `winml hub`
+- **I want to browse validated models with known benchmarks** → `winml catalog`
 - **I want to verify EP operator compatibility before compiling** → `winml analyze`
 - **I want to convert a Hugging Face model to ONNX** → `winml export`
 - **I want to run the whole pipeline in one go** → `winml build`
@@ -52,7 +52,7 @@ measure speed and accuracy.
 `-v` / `--verbose`, `-q` / `--quiet`, `--debug`, `--version`, and `-h` /
 `--help` live on the root `winml` group only. Subcommands access them through
 `ctx.obj` and do not redefine them. See
-`src/winml/modelkit/commands/_options.py` for the canonical contract.
+`src/winml/modelkit/cli.py` for the canonical contract.
 
 ## Shared flags
 
diff --git a/docs/commands/perf.md b/docs/commands/perf.md
index 496e7a422..879f4ba76 100644
--- a/docs/commands/perf.md
+++ b/docs/commands/perf.md
@@ -23,7 +23,7 @@ $ winml perf [options]
 | `--device` | | `auto\|cpu\|gpu\|npu` | `auto` | Device to run the benchmark on. `auto` selects the highest-priority available device. |
 | `--precision` | | `TEXT` | `auto` | Precision mode applied during model build: `auto`, `fp32`, `fp16`, `int8`, `int16`, or compound forms such as `w8a16`. |
 | `--ep` | | `TEXT` | — | Force a specific execution provider (e.g., `qnn`, `dml`, `vitisai`, `openvino`, `cpu`). Overrides the device-to-provider mapping. |
-| `--output` | `-o` | `PATH` | `{model_slug}_perf.json` | Output JSON file path for the benchmark report. |
+| `--output` | `-o` | `PATH` | `~/.cache/winml/perf/<slug>/<timestamp>.json` | Output JSON file path for the benchmark report. |
 | `--batch-size` | | `INTEGER` | `1` | Batch size used when generating synthetic input tensors. |
 | `--shape-config` | | `PATH` | — | Path to a JSON file containing shape overrides (e.g., `{"height": 480, "width": 480}`). Ignored for pre-exported ONNX files and in `--module` mode. |
 | `--no-quantize` | | flag | `false` | Skip quantization during model build. Useful for measuring the fp32 baseline. |
@@ -31,8 +31,7 @@ $ winml perf [options]
 | `--ignore-cache` | | flag | `false` | Build from scratch in a temporary folder and discard the artifact after benchmarking. Implies `--rebuild`. |
 | `--module` | | `TEXT` | — | PyTorch module class name for per-module benchmarking (e.g., `BertAttention`). Builds and times each matching instance separately. See [Load and export](../concepts/load-and-export.md). |
 | `--monitor` | | flag | `false` | Show a live NPU/CPU utilization chart while the benchmark runs and include hardware metrics in the JSON report. |
-| `--op-tracing` | | `basic\|detail` | — | Enable operator-level profiling. Requires `onnxruntime-qnn`. |
-| `--compare-devices` | | `TEXT` | — | Not yet implemented. Run benchmarks separately and compare the JSON outputs instead. |
+| `--op-tracing` | | `basic\|detail` | — | *(Advanced, hidden=True)* Enable operator-level profiling. Requires `onnxruntime-qnn`. Hidden from `--help` by design; gated on QNN-only profiling support. |
 
 ## How it works
 
@@ -59,7 +58,7 @@ Latency (ms)
 
 Throughput: 467.29 samples/sec
 
-Results saved to: microsoft_resnet-50_perf.json
+Results saved to: ~/.cache/winml/perf/microsoft_resnet-50/2026-05-27T120000.json
 ```
 
 Benchmark a pre-exported ONNX file on CPU with more iterations:
@@ -92,7 +91,7 @@ $ winml perf -m bert-base-uncased --module BertAttention --iterations 200
 - **`--shape-config` is silently ignored in two cases.** It has no effect on pre-exported ONNX files (shapes are baked into the graph) and is ignored in `--module` mode. The command prints a warning in both situations.
 - **`--op-tracing` requires `onnxruntime-qnn`.** The flag activates the QNN profiler, which is only present in the `onnxruntime-qnn` package. If that package is not installed, the benchmark still runs but the op-trace step exits with an error.
 - **Random inputs do not represent real data distributions.** Latency numbers are accurate, but memory access patterns may differ from production because the generated tensors are uniform random values. For memory-bandwidth-sensitive models this can understate real-world latency.
-- **`--compare-devices` is not yet implemented.** Use separate `winml perf` invocations and compare the resulting JSON files manually.
+- **Cross-device comparison.** There is no `--compare-devices` flag. To compare performance across devices, run `winml perf` separately with different `--device` values and compare the resulting JSON files.
 
 ## See also
 
diff --git a/docs/commands/sys.md b/docs/commands/sys.md
index 71d23e81c..8d95e9797 100644
--- a/docs/commands/sys.md
+++ b/docs/commands/sys.md
@@ -21,6 +21,7 @@ $ winml sys [options]
 | `--format` | `-f` | `text` \| `json` \| `compact` | `text` | Output format. `text` renders rich tables, `json` emits machine-readable JSON, `compact` prints a single-line summary. |
 | `--list-device` | — | flag | `false` | List available compute devices (NPU, GPU, CPU) in priority order instead of showing the full system report. |
 | `--list-ep` | — | flag | `false` | List available ONNX Runtime execution providers instead of showing the full system report. Can be combined with `--list-device`. |
+| `--verbose` | `-v` | flag | `false` | Surface additional diagnostic sections: Backend SDKs and Export Readiness. |
 | `--help` | `-h` | flag | — | Show help and exit. |
 
 > `winml sys` takes no `--model`, `--device`, `--ep`, `--task`, or `--precision`
diff --git a/docs/concepts/compile-and-epcontext.md b/docs/concepts/compile-and-epcontext.md
index 3a2529277..a4227e2c3 100644
--- a/docs/concepts/compile-and-epcontext.md
+++ b/docs/concepts/compile-and-epcontext.md
@@ -26,7 +26,7 @@ The first time an ONNX Runtime session is created for a model on a hardware EP,
 
 A model produced by `winml compile` has already paid that cost. The EP context blob is the result of compilation, not its input. When the application loads the compiled model the EP reads the pre-built binary and the session is ready almost immediately. Shipping a compiled model is therefore the standard pattern for production deployments on QNN hardware.
 
-If you are iterating on quantization settings or ONNX graphs and want to check whether the model compiles at all, `winml compile` also accepts `--no-quant` to skip the quantization pass for already-quantized (QDQ) models.
+If you are iterating on quantization settings or ONNX graphs and want to check whether the model compiles at all, pass an already-quantized (QDQ) model directly — `winml compile` compiles whatever ONNX file you supply and does not have a separate quantization pass to skip.
 
 ## Skipping validation
 
diff --git a/docs/concepts/config-and-build.md b/docs/concepts/config-and-build.md
index 164a94207..8c9265d66 100644
--- a/docs/concepts/config-and-build.md
+++ b/docs/concepts/config-and-build.md
@@ -44,8 +44,8 @@ directly before being passed to `winml build`.
 ## What's in a config
 
 A `WinMLBuildConfig` is a dataclass defined in
-`src/winml/modelkit/config/build.py`. It holds five nested sub-configs, one per
-pipeline stage:
+`src/winml/modelkit/config/build.py`. It holds five nested sub-configs for the
+pipeline stages, plus an evaluation config and an auto flag:
 
 | Field | Type | Purpose |
 |---|---|---|
@@ -54,6 +54,8 @@ pipeline stage:
 | `optim` | `WinMLOptimizationConfig` | Graph fusion flags (GeLU, LayerNorm, MatMul+Add). |
 | `quant` | `WinMLQuantizationConfig` | Precision types (`weight_type`, `activation_type`), calibration samples and method (`null` to skip). |
 | `compile` | `WinMLCompileConfig` | Target EP provider, EPContext options, compiler backend (`null` to skip). |
+| `eval` | `WinMLEvaluationConfig \| null` | Evaluation settings run after the build (`null` to skip). |
+| `auto` | `bool` | When `true` (default), auto-fills missing fields from model introspection. |
 
 Setting `quant` or `compile` to `null` tells the pipeline to skip that stage
 entirely, equivalent to passing `--no-quant` or `--no-compile` on the command
@@ -82,10 +84,8 @@ A generated config looks similar to:
     "samples": 10
   },
   "compile": {
-    "ep_config": {
-      "provider": "qnn",
-      "enable_ep_context": true
-    }
+    "execution_provider": "qnn",
+    "enable_ep_context": true
   }
 }
 ```
diff --git a/docs/concepts/perf-and-monitoring.md b/docs/concepts/perf-and-monitoring.md
index 3cb6d8110..511fb791f 100644
--- a/docs/concepts/perf-and-monitoring.md
+++ b/docs/concepts/perf-and-monitoring.md
@@ -8,21 +8,21 @@ Because `winml perf` accepts both HuggingFace model IDs and local `.onnx` files,
 
 At its core, `winml perf` runs a configurable number of inference iterations and reports latency statistics: p50, p90, and mean latency in milliseconds, plus throughput in inferences per second. Warmup iterations (controlled by `--warmup`, defaulting to 10) are excluded from the statistics so that JIT and cache effects do not skew the numbers.
 
-You can control the run length with `--iterations` and the input shape with `--batch-size` or a `--shape-config` JSON file for models with dynamic axes. The `--device` flag selects the target EP — `cpu`, `gpu`, or `npu` — allowing you to collect numbers on each target with the same command and compare them directly. For fine-grained EP control, `--ep` lets you name a specific provider such as `qnn` or `dml`.
+You can control the run length with `--iterations` and the input shape with `--batch-size` or a `--shape-config` JSON file for models with dynamic axes. The `--device` flag selects the target EP — `cpu`, `gpu`, `npu`, or `auto` (default) — allowing you to collect numbers on each target with the same command and compare them directly. For fine-grained EP control, `--ep` lets you name a specific provider such as `qnn` or `dml`.
 
-The results are written to a JSON file (defaulting to `{model_slug}_perf.json`) so they can be archived and compared across builds.
+The results are written to a JSON file at `~/.cache/winml/perf/<slug>/<timestamp>.json` so they can be archived and compared across builds.
 
 ## Live monitoring
 
 Latency numbers alone do not tell you whether the hardware is actually being used. A slow NPU inference could mean the model is running on the NPU and hitting a memory bottleneck, or it could mean the EP silently fell back to CPU and is not using the NPU at all.
 
-The `--monitor` flag adds a live terminal chart that streams NPU utilisation while the benchmark runs. The chart updates in place during the iteration loop so you can see whether utilisation is sustained, bursty, or absent. This is particularly useful when commissioning a new model on QNN hardware, where EP fallback can be hard to detect from latency numbers alone. If the chart stays near zero while the benchmark runs, the model is not executing on the NPU as expected.
+The `--monitor` flag adds a live terminal chart that streams hardware utilisation for whichever device is being benchmarked. The chart updates in place during the iteration loop so you can see whether utilisation is sustained, bursty, or absent. This is particularly useful when commissioning a new model on QNN or DirectML hardware, where EP fallback can be hard to detect from latency numbers alone. If the chart stays near zero while the benchmark runs, the model is not executing on the expected device.
 
 `--monitor` has no effect on the measured latency statistics — it is a passive observer.
 
 ## Per-operator tracing
 
-When end-to-end latency is higher than expected, `--op-tracing` lets you find the operators that are responsible. Two levels are available:
+When end-to-end latency is higher than expected, per-operator tracing lets you find the operators that are responsible. This capability is available via a hidden `--op-tracing` flag (not shown in `--help`) intended for advanced diagnostics. Two levels are available:
 
 `--op-tracing basic` collects cumulative time per operator type and reports a ranked list. This is usually enough to identify whether, say, a sequence of Attention nodes or a large MatMul is dominating the runtime.
 
diff --git a/docs/concepts/quantization.md b/docs/concepts/quantization.md
index de92e7702..4b5dc88de 100644
--- a/docs/concepts/quantization.md
+++ b/docs/concepts/quantization.md
@@ -6,18 +6,18 @@ Quantization is the headline use of datatypes in winml-cli. By replacing `float3
 
 ## Datatypes
 
-winml-cli exposes a precision shorthand on the `--precision` flag that encodes the weight/activation dtype pair as a single string. The table below lists every precision from `_KNOWN_PRECISIONS` in `_options.py`, together with the resolved quantization types from `config/precision.py`. Float precisions (`fp32`, `fp16`) carry no quantization types because weights and activations remain in floating point throughout.
+winml-cli exposes a precision shorthand on the `--precision` flag that encodes the weight/activation dtype pair as a single string. The table below lists every precision from `_NAMED_PRECISIONS` in `config/precision.py`, together with the resolved quantization types. Float precisions (`fp32`, `fp16`) carry no quantization types because weights and activations remain in floating point throughout.
 
 | Precision | Weight dtype | Activation dtype | Notes |
 |-----------|-------------|-----------------|-------|
-| `auto` | device-dependent | device-dependent | Resolves to `int8` (NPU), `fp16` (GPU/CPU) at runtime |
+| `auto` | device-dependent | device-dependent | Resolves to `w8a16` (NPU), `fp16` (GPU/CPU) at runtime |
 | `fp32` | float32 | float32 | No quantization; baseline accuracy |
 | `fp16` | float16 | float16 | Half-precision float; no QDQ nodes inserted |
-| `int8` | uint8 | uint8 | Static quantization; default for NPU via QNN EP |
+| `int8` | uint8 | uint8 | Static quantization; valid for QNN EP |
 | `int16` | int16 | uint16 | Higher-accuracy quantization; larger model than int8 |
 | `w8a8` | uint8 | uint8 | Equivalent to `int8`; explicit mixed-precision notation |
 | `w8a16` | uint8 | uint16 | Mixed: compact weights, wider activations for accuracy |
-| `w4a16` | n/a | n/a | **Planned — not yet supported.** Recognized as a precision string but raises an error at quantization time; no 4-bit weight dtype mapping exists in `precision.py` yet. |
+| `w4a16` | n/a | n/a | **Not supported.** Rejected at validation — `is_quantized_precision("w4a16")` returns `False` because 4-bit weight types are absent from `_BITS_TO_WEIGHT_TYPE` in `precision.py`. The string is not a recognized precision. |
 
 The `--weight-type` and `--activation-type` flags on `winml quantize` accept `uint8`, `int8`, `uint16`, or `int16` and override whatever the `--precision` shorthand would have resolved. This is useful when you need an unsigned weight type for QNN compatibility but a signed activation type for a specific operator constraint. See [Weight and Activation](weight-and-activation.md) for why the two need separate flags in the first place.
 
diff --git a/docs/getting-started/end-to-end.md b/docs/getting-started/end-to-end.md
index 9c50f1f26..44052030c 100644
--- a/docs/getting-started/end-to-end.md
+++ b/docs/getting-started/end-to-end.md
@@ -51,7 +51,7 @@ then GPU, then CPU.
                  Cores: 12 | Threads: 12 | Architecture: ARM64
 
     Available Execution Providers
-      QNNExecutionProvider              -> NPU
+      QNNExecutionProvider              -> NPU/GPU
       DmlExecutionProvider              -> GPU
       CPUExecutionProvider              -> CPU
     ```
@@ -119,10 +119,11 @@ uv run winml perf -m convnext_out/<artifact>.onnx --device auto --iterations 50
 ```
 
 Replace `<artifact>` with the filename written to `convnext_out/` by the build.
-The name reflects the device the build targeted — for example,
-`convnext_tiny_qnn_ctx.onnx` on NPU, `convnext_tiny_dml_ctx.onnx` on
-DirectML, or `convnext_tiny.onnx` on CPU. You can check the directory listing
-or read the compiled artifact path from the build output to get the exact name.
+For NPU builds the compiled artifact is named `model.onnx` in the output
+directory (the `_npu_ctx.onnx` suffix applies only when the compile stage
+produces an EPContext file, which requires `enable_ep_context=True` in the
+compile config). You can check the directory listing or read the compiled
+artifact path from the build output to get the exact name.
 
 === "NPU (QNN)"
 
@@ -139,7 +140,7 @@ or read the compiled artifact path from the build output to get the exact name.
 
     Throughput: 258.14 samples/sec
 
-    Results saved to: convnext_tiny_qnn_ctx_perf.json
+    Results saved to: model_perf.json
     ```
 
 === "GPU (DirectML)"
@@ -156,8 +157,6 @@ or read the compiled artifact path from the build output to get the exact name.
     12.43  12.18  13.74  14.11  15.02  11.27  16.55   0.89
 
     Throughput: 80.45 samples/sec
-
-    Results saved to: convnext_tiny_dml_ctx_perf.json
     ```
 
 === "CPU"
@@ -174,8 +173,6 @@ or read the compiled artifact path from the build output to get the exact name.
     48.31  47.85  52.14  53.77  57.40  44.62  61.23   2.94
 
     Throughput: 20.70 samples/sec
-
-    Results saved to: convnext_tiny_perf.json
     ```
 
 The `--monitor` flag opens a live chart of device utilization while the
diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md
index 07dd8aa9c..d045a8c7c 100644
--- a/docs/getting-started/installation.md
+++ b/docs/getting-started/installation.md
@@ -1,6 +1,6 @@
 # Installation
 
-winml-cli is a Python toolkit for converting and optimizing PyTorch models to ONNX format, targeting deployment on the [Windows ML](https://learn.microsoft.com/en-us/windows/ai/windows-ml/) runtime. It supports multiple hardware backends including QNN (Qualcomm NPU), OpenVINO (Intel CPU/GPU), DirectML, and ONNX Runtime. To get started you need a Windows machine, Python 3.10, and the `uv` package manager.
+winml-cli is a Python toolkit for converting and optimizing PyTorch models to ONNX format, targeting deployment on the [Windows ML](https://learn.microsoft.com/en-us/windows/ai/windows-ml/) runtime. It supports multiple hardware backends including QNN (Qualcomm NPU), OpenVINO (Intel CPU/GPU), DirectML, and ONNX Runtime. To get started you need a Windows machine, Python 3.11, and the `uv` package manager.
 
 ## Prerequisites
 
@@ -8,23 +8,23 @@ winml-cli is a Python toolkit for converting and optimizing PyTorch models to ON
 |---|---|
 | Windows | Windows 11 24H2 or later (required for NPU support) |
 | Hardware | Copilot+PC with NPU (40+ TOPS recommended for NPU acceleration; CPU/DirectML works without an NPU) |
-| Python | 3.10 (the project pins `requires-python = ">=3.10,<3.11"`) |
+| Python | 3.11 (the project pins `requires-python = ">=3.11,<3.12"`) |
 | Package manager | [`uv`](https://github.com/astral-sh/uv) |
 | Version control | `git` |
 
 !!! note "No NPU?"
-    You can follow most of these docs without NPU hardware. Most winml-cli commands (`build`, `compile`, `perf`, `analyze`) accept `--device auto` and fall back to CPU or DirectML automatically. `winml eval` accepts only `cpu|gpu|npu` (no `auto`), so pass `--device cpu` explicitly there. The end-to-end tutorial documents an explicit CPU fallback path.
+    You can follow most of these docs without NPU hardware. All winml-cli commands accept `--device auto` and fall back to CPU or DirectML automatically. The end-to-end tutorial documents an explicit CPU fallback path.
 
 ## Install
 
 ```bash
 git clone https://github.com/microsoft/winml-cli.git
 cd winml-cli
-uv python install 3.10
+uv python install 3.11
 uv sync
 ```
 
-Cloning the repository pulls down all source code and configuration. `uv python install 3.10` downloads and pins the exact Python version the project requires. `uv sync` creates an isolated virtual environment and installs all declared dependencies from `pyproject.toml` in a single step. No separate `pip install` or manual venv activation is needed.
+Cloning the repository pulls down all source code and configuration. `uv python install 3.11` downloads and pins the exact Python version the project requires. `uv sync` creates an isolated virtual environment and installs all declared dependencies from `pyproject.toml` in a single step. No separate `pip install` or manual venv activation is needed.
 
 ## Verify
 
@@ -40,7 +40,7 @@ Expected output (abbreviated):
 ╰──────────────────────────────────╯
 
 Environment
-  Python Version    3.10.x
+  Python Version    3.11.x
   OS                Windows 11
   Machine           AMD64
 
@@ -67,7 +67,7 @@ This command enumerates available compute devices and execution providers on you
 Two optional dependency groups are available for hardware-specific backends:
 
 - `--extra openvino` — installs [OpenVINO](https://docs.openvino.ai/) for inference on Intel CPU and GPU targets.
-- `--extra qnn` — installs `onnxruntime-qnn` for Qualcomm NPU support. Note: the `onnxruntime-qnn` package requires Python 3.11 or later, so this extra will not install any packages under the project's default Python 3.10 environment. It is reserved for future use when the project broadens its Python version support.
+- `--extra qnn` — installs `onnxruntime-qnn` for Qualcomm NPU support.
 
 To install an extra:
 
diff --git a/docs/samples/bert-config-build.md b/docs/samples/bert-config-build.md
index 62e57d7b8..3b391e323 100644
--- a/docs/samples/bert-config-build.md
+++ b/docs/samples/bert-config-build.md
@@ -75,7 +75,7 @@ winml build
   compile      done  (21.4s)
 
   Build complete in 88.5s
-  Final artifact: bert_out/bert-base-uncased_ctx.onnx
+  Final artifact: bert_out/model.onnx
 ```
 
 !!! note
@@ -84,7 +84,7 @@ winml build
 ## Step 3: Benchmark
 
 ```bash
-winml perf -m bert_out/bert-base-uncased_ctx.onnx --iterations 50
+winml perf -m bert_out/model.onnx --iterations 50
 ```
 
 After a short warm-up, `winml perf` reports latency percentiles and throughput:
@@ -101,7 +101,7 @@ Latency (ms)
 
 Throughput: 206.99 samples/sec
 
-Results saved to: bert-base-uncased_ctx_perf.json
+Results saved to: model_perf.json
 ```
 
 ## Customizing the config
diff --git a/docs/samples/convnext-primitives.md b/docs/samples/convnext-primitives.md
index 5e0576703..70f4deb48 100644
--- a/docs/samples/convnext-primitives.md
+++ b/docs/samples/convnext-primitives.md
@@ -107,14 +107,14 @@ Compilation pre-bakes an EP-specific binary cache into the ONNX graph so the run
 !!! note "NPU requires the QNN SDK"
     Compilation for `--device npu` invokes the Qualcomm QNN offline compiler, which must be installed separately. Pass `--qnn-sdk-root` pointing at the root of your QAIRT SDK installation, or set the `QNN_SDK_ROOT` environment variable to the same path. If the SDK is absent, compile for CPU or GPU instead. For a full explanation of how EPs relate to device targets see [ONNX & Execution Providers](../concepts/eps-and-devices.md).
 
-Each invocation writes a compiled ONNX file to the output directory: `convnext_int8_cpu_ctx.onnx` for CPU, `convnext_int8_dml_ctx.onnx` for GPU (DirectML), and `convnext_int8_qnn_ctx.onnx` for NPU (QNN). The GPU and NPU variants contain an EPContext node that embeds the pre-compiled binary.
+Only the NPU invocation writes a new compiled artifact — `convnext_int8_npu_ctx.onnx` — which contains an EPContext node embedding the pre-compiled Hexagon binary. CPU and GPU compile with `enable_ep_context=False` by default: the compile step validates the model against the target EP but does not produce a new file. For CPU and GPU perf benchmarks (Step 6), use the quantized `convnext_int8.onnx` directly.
 
 ## Step 6: Benchmark
 
 Measure latency and throughput on each device. Pass the compiled ONNX directly so the benchmark uses the pre-compiled artifact.
 
 ```bash
-winml perf -m convnext_int8_cpu_ctx.onnx --device cpu --iterations 200
+winml perf -m convnext_int8.onnx --device cpu --iterations 200
 ```
 
 ```text
@@ -132,8 +132,8 @@ Throughput: 118.91 samples/sec
 ```
 
 ```bash
-winml perf -m convnext_int8_dml_ctx.onnx --device gpu --iterations 200
-winml perf -m convnext_int8_qnn_ctx.onnx --device npu --iterations 200
+winml perf -m convnext_int8.onnx --device gpu --iterations 200
+winml perf -m convnext_int8_npu_ctx.onnx --device npu --iterations 200
 ```
 
 The NPU variant typically delivers the lowest latency and highest power efficiency on Qualcomm Snapdragon hardware. Use the JSON output written by `--output` to compare runs programmatically.
@@ -157,7 +157,7 @@ Accuracy: 81.00%
 Results saved to: convnext_int8_eval.json
 ```
 
-Note that `--device` accepts only `cpu`, `gpu`, or `npu` — it does not accept `auto`. To compare quantized accuracy against the floating-point baseline, run the same command with `convnext.onnx` and compare the two JSON outputs.
+To compare quantized accuracy against the floating-point baseline, run the same command with `convnext.onnx` and compare the two JSON outputs.
 
 ## What you learned
 
@@ -165,7 +165,7 @@ Note that `--device` accepts only `cpu`, `gpu`, or `npu` — it does not accept
 - `winml config` captures the full pipeline configuration as a reproducible JSON file.
 - `winml export` converts the PyTorch model to a portable ONNX graph and embeds hierarchy metadata for downstream analysis.
 - `winml quantize` inserts QDQ nodes using calibration data; `--precision int8` and `--samples` control the precision and calibration budget.
-- `winml compile` pre-bakes an EP-specific binary cache per device; the same quantized ONNX feeds all three targets.
+- `winml compile` pre-bakes an EP-specific binary cache for NPU (producing `convnext_int8_npu_ctx.onnx`); CPU and GPU compile steps validate EP compatibility but produce no new artifact — use the quantized `convnext_int8.onnx` for those devices.
 - `winml perf` and `winml eval` consume the final artifact without modifying it — benchmark first, then validate accuracy before shipping.
 
 ## See also
diff --git a/docs/tutorials/npu-convnext.md b/docs/tutorials/npu-convnext.md
index 4359e239e..8eda5a7d3 100644
--- a/docs/tutorials/npu-convnext.md
+++ b/docs/tutorials/npu-convnext.md
@@ -19,7 +19,7 @@ The tutorial is split into two sections. Section A runs through eight primitive
 
 - **Windows 11 24H2** — required for NPU stack support
 - **Copilot+PC with NPU** — 40+ TOPS recommended; CPU and DirectML work as fallback throughout
-- **Python 3.10** and **uv** installed (`pip install uv` or follow [astral.sh/uv](https://astral.sh/uv))
+- **Python 3.11** and **uv** installed (`pip install uv` or follow [astral.sh/uv](https://astral.sh/uv))
 - **winml-cli** installed — see [Installation](../getting-started/installation.md)
 - **For QNN (Snapdragon NPU):** QAIRT SDK installed and `QNN_SDK_ROOT` set to its root directory
 - **For OpenVINO (Intel CPU/GPU/NPU):** OpenVINO runtime installed and registered as an ONNX Runtime EP
@@ -155,13 +155,7 @@ Compilation converts the portable quantized ONNX into an EP-specific binary form
     uv run winml compile -m convnext_int8.onnx --device npu --ep openvino
     ```
 
-=== "CPU fallback"
-
-    ```bash
-    uv run winml compile -m convnext_int8.onnx --device cpu
-    ```
-
-The compiled output file appears in the same directory as the input model. For QNN, the file name follows the pattern `convnext_int8_qnn_ctx.onnx` and an accompanying `.bin` context binary is written alongside it. For OpenVINO, the compiled artifact is named `convnext_int8_openvino_ctx.onnx`. For CPU, the output is `convnext_int8_cpu_ctx.onnx`.
+The compiled output file appears in the same directory as the input model. For QNN, the file name follows the pattern `convnext_int8_npu_ctx.onnx` (using the resolved device string `npu`, not the EP name) and an accompanying `.bin` context binary is written alongside it. For OpenVINO targeting the NPU, the compiled artifact is also named `convnext_int8_npu_ctx.onnx`. CPU builds do not produce a new artifact — the compile step validates EP compatibility but writes no output file; use `convnext_int8.onnx` directly for CPU inference.
 
 !!! note "What we just did"
     Compilation embeds EP context — the compiled binary — inside or alongside the ONNX file using the `EPContext` node convention. At inference time the runtime loads the pre-compiled binary directly rather than re-compiling from the ONNX graph, eliminating the 15–60 second JIT penalty on first load. winml-cli locates the QAIRT SDK libraries needed for QNN compilation through `QNN_SDK_ROOT` (set as an environment variable, or passed with `--qnn-sdk-root` on `winml compile`). `winml build` reads only the env var. See [Concepts → Compile and EPContext](../concepts/compile-and-epcontext.md) for the full picture of what gets embedded and how the context is consumed at runtime.
@@ -175,19 +169,19 @@ Measure inference latency and throughput with the `--monitor` flag to see live N
 === "QNN NPU"
 
     ```bash
-    uv run winml perf -m convnext_int8_qnn_ctx.onnx --device npu --iterations 50 --monitor
+    uv run winml perf -m convnext_int8_npu_ctx.onnx --device npu --iterations 50 --monitor
     ```
 
 === "OpenVINO NPU"
 
     ```bash
-    uv run winml perf -m convnext_int8_openvino_ctx.onnx --device npu --ep openvino --iterations 50 --monitor
+    uv run winml perf -m convnext_int8_npu_ctx.onnx --device npu --ep openvino --iterations 50 --monitor
     ```
 
 === "CPU"
 
     ```bash
-    uv run winml perf -m convnext_int8_cpu_ctx.onnx --device cpu --iterations 50
+    uv run winml perf -m convnext_int8.onnx --device cpu --iterations 50
     ```
 
 A representative run on a Snapdragon X Elite NPU produces output like the following:
diff --git a/mkdocs.yml b/mkdocs.yml
index 651a80d8d..e0f7b1d79 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -96,7 +96,7 @@ nav:
       - Discover:
           - sys: commands/sys.md
           - inspect: commands/inspect.md
-          - hub: commands/hub.md
+          - catalog: commands/hub.md
           - analyze: commands/analyze.md
       - Configure:
           - config: commands/config.md

From ed28a408994e8b3754bc39683ca6a4afdc46df59 Mon Sep 17 00:00:00 2001
From: Zac <1221537+tezheng@users.noreply.github.com>
Date: Wed, 27 May 2026 11:27:34 +0800
Subject: [PATCH 005/143] docs(install): drop redundant tagline paragraph (and
 its stale EP enumeration)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The opening paragraph re-stated the project tagline (already on the
home page one click above) and enumerated 4 EPs (QNN, OpenVINO, DML,
ONNX Runtime) — which goes stale; the canonical list in
concepts/eps-and-devices.md has 7. Removing the paragraph; the page
now starts with the Prereqs table. Matches the convention used by
quickstart.md and end-to-end.md (neither re-states the tagline).
---
 docs/getting-started/installation.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md
index d045a8c7c..364c4791e 100644
--- a/docs/getting-started/installation.md
+++ b/docs/getting-started/installation.md
@@ -1,7 +1,5 @@
 # Installation
 
-winml-cli is a Python toolkit for converting and optimizing PyTorch models to ONNX format, targeting deployment on the [Windows ML](https://learn.microsoft.com/en-us/windows/ai/windows-ml/) runtime. It supports multiple hardware backends including QNN (Qualcomm NPU), OpenVINO (Intel CPU/GPU), DirectML, and ONNX Runtime. To get started you need a Windows machine, Python 3.11, and the `uv` package manager.
-
 ## Prerequisites
 
 | Component | Details |

From 612d692e5233a2270219f87e8eb6adc23e53a040 Mon Sep 17 00:00:00 2001
From: "Qiong Wu (qiowu)" <qiowu@microsoft.com>
Date: Wed, 27 May 2026 18:28:48 +0800
Subject: [PATCH 006/143] =?UTF-8?q?docs:=20expand=20analyze/optimize=20con?=
 =?UTF-8?q?tent,=20rename=20hub=E2=86=92catalog=20(#769)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

- Rewrote `docs/concepts/analyze-and-optimize.md` with source-verified
content: SupportLevel classification table, lint vs autoconf outputs,
analysis modes, optimizer pipe architecture (4 pipes, 43 capabilities, 5
rewrite groups / 12 rules), and autoconf loop SVG diagram
- Updated `docs/commands/analyze.md` with corrected EP aliases,
exit-code table, and additional CLI examples
- Renamed `hub.md` → `catalog.md` and updated all cross-references
(inspect, overview, sys, mkdocs.yml)
- Fixed `check-yaml` pre-commit hook to support `!!python/name` tags in
mkdocs.yml (`--unsafe`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
---
 .pre-commit-config.yaml               |   1 +
 docs/assets/optimize-analyze-loop.svg |  95 ++++++++++++++++
 docs/commands/analyze.md              |  69 ++++++++----
 docs/commands/{hub.md => catalog.md}  |   0
 docs/commands/inspect.md              |   2 +-
 docs/commands/overview.md             |   2 +-
 docs/commands/sys.md                  |   2 +-
 docs/concepts/analyze-and-optimize.md | 155 ++++++++++++++++++++++++--
 mkdocs.yml                            |   2 +-
 9 files changed, 293 insertions(+), 35 deletions(-)
 create mode 100644 docs/assets/optimize-analyze-loop.svg
 rename docs/commands/{hub.md => catalog.md} (100%)

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index d189b0585..ade35422b 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -6,6 +6,7 @@ repos:
       - id: trailing-whitespace
         args: [--markdown-linebreak-ext=md]
       - id: check-yaml
+        args: [--unsafe]
 
   - repo: https://github.com/Lucas-C/pre-commit-hooks
     rev: v1.5.5
diff --git a/docs/assets/optimize-analyze-loop.svg b/docs/assets/optimize-analyze-loop.svg
new file mode 100644
index 000000000..85298f599
--- /dev/null
+++ b/docs/assets/optimize-analyze-loop.svg
@@ -0,0 +1,95 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 630 306" width="630" height="306" font-family="'Segoe UI', system-ui, -apple-system, sans-serif">
+  <defs>
+    <marker id="arr" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
+      <path d="M 0 1 L 9 5 L 0 9 z" fill="#757575"/>
+    </marker>
+    <marker id="arr-g" viewBox="0 0 10 10" refX="9" refY="5" markerWidth="7" markerHeight="7" orient="auto-start-reverse">
+      <path d="M 0 1 L 9 5 L 0 9 z" fill="#4caf50"/>
+    </marker>
+  </defs>
+
+  <!-- Background -->
+  <rect width="630" height="306" rx="6" fill="#fafafa"/>
+
+  <!-- ===== config.optim pill ===== -->
+  <rect x="72" y="16" width="96" height="28" rx="14" fill="#ffe0b2" stroke="#ffb74d" stroke-width="1"/>
+  <text x="120" y="35" text-anchor="middle" font-size="10.5" font-weight="500" fill="#e65100">config.optim</text>
+
+  <!-- Arrow: config → Optimize -->
+  <line x1="120" y1="44" x2="120" y2="87" stroke="#757575" stroke-width="1" marker-end="url(#arr)"/>
+
+  <!-- ===== Autoconf Loop boundary ===== -->
+  <rect x="30" y="62" width="418" height="148" rx="10" fill="none" stroke="#4caf50" stroke-width="1.5" stroke-dasharray="7,4"/>
+  <text x="239" y="80" text-anchor="middle" font-size="10" fill="#4caf50" font-weight="600" font-style="italic">Autoconf loop</text>
+
+  <!-- ===== Optimize ===== -->
+  <rect x="68" y="100" width="104" height="44" rx="6" fill="#c8e6c9" stroke="#66bb6a" stroke-width="1.2"/>
+  <text x="120" y="127" text-anchor="middle" font-size="13" font-weight="600" fill="#1a2332">Optimize</text>
+
+  <!-- Arrow: Optimize → Analyze -->
+  <line x1="172" y1="122" x2="206" y2="122" stroke="#757575" stroke-width="1" marker-end="url(#arr)"/>
+
+  <!-- ===== Analyze ===== -->
+  <rect x="210" y="100" width="90" height="44" rx="5" fill="#bbdefb" stroke="#64b5f6" stroke-width="1.2"/>
+  <text x="255" y="127" text-anchor="middle" font-size="13" font-weight="600" fill="#1a2332">Analyze</text>
+
+  <!-- Arrow: Analyze → diamond -->
+  <line x1="300" y1="122" x2="328" y2="122" stroke="#757575" stroke-width="1" marker-end="url(#arr)"/>
+
+  <!-- ===== Diamond: new flags? ===== -->
+  <polygon points="328,122 358,102 388,122 358,142" fill="#fff9c4" stroke="#fdd835" stroke-width="1.2"/>
+  <text x="358" y="120" text-anchor="middle" font-size="9" font-weight="600" fill="#1a2332">new</text>
+  <text x="358" y="131" text-anchor="middle" font-size="9" font-weight="600" fill="#1a2332">flags?</text>
+
+  <!-- ===== YES path: loop back to Optimize ===== -->
+  <path d="M 358 142 L 358 192 L 120 192 L 120 144"
+        fill="none" stroke="#4caf50" stroke-width="1.2" stroke-dasharray="6,3" marker-end="url(#arr-g)"/>
+  <text x="366" y="170" font-size="9" fill="#4caf50" font-weight="600">yes</text>
+
+  <!-- ===== NO path: exit loop to the right → Final Analyze ===== -->
+  <line x1="388" y1="122" x2="484" y2="122" stroke="#757575" stroke-width="1" marker-end="url(#arr)"/>
+  <text x="410" y="114" font-size="9" fill="#78909c" font-weight="600">no</text>
+
+  <!-- ===== Final Analyze ===== -->
+  <rect x="488" y="100" width="76" height="44" rx="5" fill="#bbdefb" stroke="#64b5f6" stroke-width="1.2"/>
+  <text x="526" y="119" text-anchor="middle" font-size="11" font-weight="600" fill="#1a2332">Final</text>
+  <text x="526" y="134" text-anchor="middle" font-size="11" font-weight="600" fill="#1a2332">Analyze</text>
+
+  <!-- Arrow: Final Analyze → has_errors? -->
+  <line x1="526" y1="144" x2="526" y2="172" stroke="#757575" stroke-width="1" marker-end="url(#arr)"/>
+
+  <!-- ===== Diamond: has_errors? ===== -->
+  <polygon points="494,196 526,176 558,196 526,216" fill="#fff9c4" stroke="#fdd835" stroke-width="1.2"/>
+  <text x="526" y="194" text-anchor="middle" font-size="9" font-weight="600" fill="#1a2332">has</text>
+  <text x="526" y="205" text-anchor="middle" font-size="9" font-weight="600" fill="#1a2332">errors?</text>
+
+  <!-- YES → RuntimeError -->
+  <line x1="526" y1="216" x2="526" y2="240" stroke="#757575" stroke-width="1" marker-end="url(#arr)"/>
+  <text x="536" y="232" font-size="9" fill="#78909c" font-weight="600">yes</text>
+
+  <rect x="486" y="244" width="80" height="30" rx="5" fill="#ffcdd2" stroke="#ef5350" stroke-width="1.2"/>
+  <text x="526" y="264" text-anchor="middle" font-size="11" font-weight="600" fill="#c62828">RuntimeError</text>
+
+  <!-- NO → done -->
+  <line x1="558" y1="196" x2="586" y2="196" stroke="#757575" stroke-width="1" marker-end="url(#arr)"/>
+  <text x="572" y="189" text-anchor="middle" font-size="9" fill="#78909c" font-weight="600">no</text>
+
+  <rect x="590" y="183" width="16" height="26" rx="3" fill="#c8e6c9" stroke="#66bb6a" stroke-width="1"/>
+  <text x="598" y="200" text-anchor="middle" font-size="7" fill="#2e7d32">▸</text>
+
+  <!-- ===== Legend ===== -->
+  <rect x="20" y="286" width="12" height="12" rx="3" fill="#c8e6c9" stroke="#66bb6a" stroke-width="0.8"/>
+  <text x="38" y="296" font-size="8.5" fill="#616161">Transform</text>
+
+  <rect x="106" y="286" width="12" height="12" rx="3" fill="#bbdefb" stroke="#64b5f6" stroke-width="0.8"/>
+  <text x="124" y="296" font-size="8.5" fill="#616161">Analysis</text>
+
+  <polygon points="192,292 202,286 212,292 202,298" fill="#fff9c4" stroke="#fdd835" stroke-width="0.8"/>
+  <text x="218" y="296" font-size="8.5" fill="#616161">Decision</text>
+
+  <rect x="270" y="286" width="12" height="12" rx="3" fill="#ffcdd2" stroke="#ef5350" stroke-width="0.8"/>
+  <text x="288" y="296" font-size="8.5" fill="#616161">Error</text>
+
+  <line x1="322" y1="292" x2="348" y2="292" stroke="#4caf50" stroke-width="1.2" stroke-dasharray="5,3"/>
+  <text x="354" y="296" font-size="8.5" fill="#616161">Autoconf loop</text>
+</svg>
diff --git a/docs/commands/analyze.md b/docs/commands/analyze.md
index 6466435dc..1f6298f67 100644
--- a/docs/commands/analyze.md
+++ b/docs/commands/analyze.md
@@ -17,20 +17,33 @@ $ winml analyze [options]
 | Flag | Short | Type | Default | Description |
 |------|-------|------|---------|-------------|
 | `--model` | `-m` | `PATH` | *(required)* | Path to the ONNX model file to analyze. |
-| `--ep` | | choice | `auto` | Target execution provider. Accepts full names (`QNNExecutionProvider`, `OpenVINOExecutionProvider`, `VitisAIExecutionProvider`), short aliases (`qnn`, `ov`/`openvino`, `vitis`/`vitisai`), `all` (all rule-data-backed EPs), or `auto` (infer from local availability). |
+| `--ep` | | choice | `auto` | Target execution provider. Accepts full names (e.g., `QNNExecutionProvider`) or short aliases (`qnn`, `openvino`, `vitisai`, `cpu`, `cuda`, `dml`, `nvtensorrtrtx`, `migraphx`). Use `all` for every rule-data-backed EP, or `auto` to infer from local availability. |
 | `--device` | | `cpu\|gpu\|npu\|all\|auto` | `auto` | Target device type. `auto` infers from local availability; `all` evaluates all rule-data-backed devices. |
 | `--verbose` | `-v` | flag | off | Enable verbose output. |
 | `--quiet` | `-q` | flag | off | Suppress non-essential output. |
 | `--config` | `-c` | `PATH` | *(none)* | Build configuration file (YAML/JSON). |
 | `--output` | | `PATH` | *(none)* | Save the full JSON result to a file in addition to printing the console summary. |
 | `--information` / `--no-information` | | flag | enabled | Include detailed per-operator recommendations and remediation hints in the output. Pass `--no-information` for a compact pass/fail summary. |
-| `--htp-metadata` | | `PATH` | *(none)* | Path to an HTP metadata JSON file. Enables enhanced Qualcomm-specific pattern extraction when targeting QNN. |
-| `--run-unknown-op` / `--no-run-unknown-op` | | flag | disabled | Attempt to run operators unknown to the EP locally to infer shape and type information. Enable when local libraries are available. |
+| `--htp-metadata` | | `PATH` | *(none)* | Path to an HTP metadata JSON file (produced by `winml export`). Enriches subgraph pattern extraction by mapping nodes back to their source module hierarchy. Benefits all target EPs. |
+| `--run-unknown-op` / `--no-run-unknown-op` | | flag | disabled | For operators not in the rule database, build a minimal ONNX graph and run it on the target EP locally to determine support. Enable when local EP libraries are available. |
 | `--save-node` | | `partial\|unsupported` | *(none)* | Save partial or unsupported node subgraphs to disk for further investigation. Can be specified multiple times: `--save-node partial --save-node unsupported`. |
+| `--optim-config` | | `PATH` | *(none)* | Save the auto-discovered optimization config (merged across all analyzed EPs) to a JSON file. |
 
 ## How it works
 
-`winml analyze` loads the ONNX model and runs a static analysis pass via `ONNXStaticAnalyzer`. It checks each operator in the graph against the EP's capability list, classifies nodes as fully supported, partially supported, or unsupported, and optionally runs unknown operators locally to infer missing shape information. The command exits with code `0` when all operators are supported, `1` when at least one operator is unsupported or only partially supported, and `2` on any input or runtime error — making it safe to use in CI pipelines with exit-code checks.
+`winml analyze` loads the ONNX model and runs a static analysis pass via `ONNXStaticAnalyzer`. For each operator (and recognized subgraph pattern), the analyzer consults the target EP's rule database. For operators not in the database, it can optionally probe them locally when `--run-unknown-op` is enabled. The combined answer classifies each node as supported, partial, unsupported, or unknown (see [Analyze and optimize](../concepts/analyze-and-optimize.md) for definitions).
+
+The analysis always produces a **lint** result — the pass/fail verdict. When `--information` is enabled (the default), it additionally produces an **autoconf** result: a set of fusion-flag suggestions that, if applied in the optimize stage, would resolve partial or unsupported patterns. Pass `--no-information` to skip autoconf and get just the lint verdict.
+
+### Exit codes
+
+| Code | Meaning |
+|------|---------|
+| `0`  | All operators are fully supported on the target EP. |
+| `1`  | At least one operator is unsupported, partially supported, or unknown. |
+| `2`  | Input or configuration error (bad path, unknown EP, etc.). |
+
+Exit codes make `winml analyze` safe to use as a CI gate with `set -e` or `$?` checks.
 
 ## Examples
 
@@ -40,19 +53,7 @@ Analyze using auto-detected EP and device:
 $ winml analyze --model microsoft/resnet-50.onnx
 ```
 
-```text
-Analyzing microsoft/resnet-50.onnx (EP: auto, device: auto)...
-
-QNNExecutionProvider (NPU): FULLY SUPPORTED
-  Operators checked : 142
-  Unsupported       : 0
-  Partial           : 0
-
-OpenVINOExecutionProvider (NPU): FULLY SUPPORTED
-  Operators checked : 142
-  Unsupported       : 0
-  Partial           : 0
-```
+The output shows a live progress table per EP followed by an `ANALYSIS SUMMARY` section. Each EP line displays support counts in `S/P/U/Unk` format (Supported / Partial / Unsupported / Unknown) with color-coded indicators.
 
 Check QNN NPU support using the short alias:
 
@@ -63,7 +64,7 @@ $ winml analyze --model bert-base-uncased.onnx --ep qnn --device NPU
 Check Intel OpenVINO GPU support and print operator-level recommendations:
 
 ```bash
-$ winml analyze --model bert-base-uncased.onnx --ep ov --device GPU --information
+$ winml analyze --model bert-base-uncased.onnx --ep openvino --device GPU --information
 ```
 
 Save the full JSON result for offline inspection while still printing the console summary:
@@ -72,24 +73,46 @@ Save the full JSON result for offline inspection while still printing the consol
 $ winml analyze --model facebook/convnext-tiny-224.onnx --output results.json
 ```
 
-Use QNN with HTP metadata for enhanced Qualcomm pattern extraction:
+Use HTP metadata for enhanced subgraph pattern extraction:
 
 ```bash
 $ winml analyze --model bert-base-uncased.onnx \
-    --ep QNNExecutionProvider --device NPU \
-    --htp-metadata htp_metadata.json
+    --ep qnn --device NPU \
+    --htp-metadata bert-base-uncased_htp_metadata.json
+```
+
+Run a lint-only pass (no recommendations) for a CI gate:
+
+```bash
+$ winml analyze --model model.onnx --ep qnn --device NPU --no-information
+echo "Exit code: $?"  # 0 = clean, 1 = issues, 2 = input error
+```
+
+Dump unsupported subgraphs to disk for debugging:
+
+```bash
+$ winml analyze --model model.onnx --ep qnn \
+    --save-node partial --save-node unsupported \
+    --output result.json
+```
+
+Enable local execution for operators not in the rule database:
+
+```bash
+$ winml analyze --model model.onnx --ep qnn --device NPU --run-unknown-op
 ```
 
 ## Common pitfalls
 
 - **Omitting `--ep` uses `auto` (inferred from local availability)** — to analyze every EP regardless of what is installed, pass `--ep all`. Specify `--ep <name>` when you know your target hardware.
 - **Exit code 1 is not a hard failure** — it means at least one operator is unsupported, not that the model cannot run at all. Many EPs fall back unsupported nodes to the CPU EP automatically; review the recommendations before deciding to restructure the model.
-- **`--htp-metadata` is QNN-specific** — passing a QNN HTP metadata file while targeting a different EP has no effect. Ensure the EP and metadata file correspond to the same hardware.
-- **`--run-unknown-op` is disabled by default** — operators whose support cannot be verified statically are conservatively marked as unsupported unless you explicitly pass `--run-unknown-op`. Enable it only when the required local libraries are present.
+- **`--htp-metadata` is EP-agnostic** — HTP metadata enriches pattern extraction before any EP-specific checks, so it benefits all target EPs equally. You do not need separate metadata files per EP.
+- **`--run-unknown-op` is disabled by default** — operators not covered by the rule database are classified as `UNKNOWN` (not unsupported) unless you explicitly pass `--run-unknown-op` to probe them locally. Enable it only when the target EP's libraries are available on the local machine.
 - **The model path must point to an existing `.onnx` file** — symbolic HuggingFace model IDs are not accepted; export the model first with `winml export`.
 
 ## See also
 
+- [Analyze and optimize](../concepts/analyze-and-optimize.md) — conceptual deep dive on classifications, lint vs autoconf, and the analyzer/optimizer loop
 - [eps-and-devices.md](../concepts/eps-and-devices.md) — background on ONNX operators and execution providers
 - [export.md](export.md) — convert a HuggingFace model to ONNX before analyzing
 - [compile.md](compile.md) — compile the model for the target EP after analysis passes
diff --git a/docs/commands/hub.md b/docs/commands/catalog.md
similarity index 100%
rename from docs/commands/hub.md
rename to docs/commands/catalog.md
diff --git a/docs/commands/inspect.md b/docs/commands/inspect.md
index 2ba2c7edb..863c78947 100644
--- a/docs/commands/inspect.md
+++ b/docs/commands/inspect.md
@@ -98,7 +98,7 @@ $ winml inspect -m facebook/convnext-tiny-224 -v -H
 
 ## See also
 
-- [hub.md](hub.md) — browse the curated catalog and check accuracy verdicts before
+- [catalog.md](catalog.md) — browse the curated catalog and check accuracy verdicts before
   inspecting
 - [Load and export concept](../concepts/load-and-export.md) — how `winml.hierarchy.tag`
   metadata is written and what you can do with the module tree
diff --git a/docs/commands/overview.md b/docs/commands/overview.md
index 3e9822b4a..77cfacf31 100644
--- a/docs/commands/overview.md
+++ b/docs/commands/overview.md
@@ -25,7 +25,7 @@ measure speed and accuracy.
 |---|---|---|
 | [`sys`](sys.md) | Discover | Inspect your machine — devices, EPs, SDKs, runtime versions at a glance. |
 | [`inspect`](inspect.md) | Discover | Inspect a model's tasks, classes, and hierarchy before committing to an export. |
-| [`catalog`](hub.md) | Discover | Browse the curated winml-cli catalog of validated models and benchmarks. |
+| [`catalog`](catalog.md) | Discover | Browse the curated winml-cli catalog of validated models and benchmarks. |
 | [`analyze`](analyze.md) | Discover | Verify an ONNX model is compatible with a target execution provider before deployment. |
 | [`config`](config.md) | Configure | Generate a reusable build configuration for a Hugging Face model or ONNX file. |
 | [`optimize`](optimize.md) | Configure | Apply graph optimizations and fusions to an ONNX model to reduce node count and improve inference speed. |
diff --git a/docs/commands/sys.md b/docs/commands/sys.md
index 8d95e9797..3cbdd1e08 100644
--- a/docs/commands/sys.md
+++ b/docs/commands/sys.md
@@ -115,5 +115,5 @@ $ winml sys --list-ep --format json
 - [ONNX & Execution Providers](../concepts/eps-and-devices.md) — background on EPs and
   how `--device` / `--ep` flags interact
 - [inspect.md](inspect.md) — inspect a specific HuggingFace model's compatibility
-- [hub.md](hub.md) — browse the curated catalog of validated models
+- [catalog.md](catalog.md) — browse the curated catalog of validated models
 - [How winml-cli Works](../concepts/how-it-works.md) — end-to-end pipeline overview
diff --git a/docs/concepts/analyze-and-optimize.md b/docs/concepts/analyze-and-optimize.md
index f49d63593..879a24ca1 100644
--- a/docs/concepts/analyze-and-optimize.md
+++ b/docs/concepts/analyze-and-optimize.md
@@ -4,31 +4,170 @@ Not every ONNX graph runs efficiently on every execution provider. An operator t
 
 ## What analyze does
 
-`winml analyze` performs static analysis on an ONNX file and reports how well it will run on a target EP. It checks operator coverage, runs shape inference to catch missing or inconsistent tensor shapes, and performs runtime checks that probe actual support on the local machine.
+`winml analyze` performs static analysis on an ONNX graph to answer one question: **will this model run end-to-end on my target execution provider, and if not, what needs to change?**
 
-Specify a target EP with `--ep` (e.g., `--ep qnn` or `--ep openvino`) and a device with `--device` (CPU, GPU, or NPU). Omit `--ep` to analyze against all supported EPs. Results print to the console by default; add `--output results.json` to save the report as JSON for scripting or archiving.
+Unlike profiling, static analysis does not require executing the full model on the target device. It inspects each operator (and recognized subgraph pattern) against a rule database of known EP capabilities, classifies every node, and emits actionable recommendations. The same analyzer also drives the autoconf feedback loop inside `winml build`, so understanding how it works is useful even when you never invoke `winml analyze` directly.
 
-Exit codes carry the verdict: zero is full support, one is partial support with unsupported operators, two is a configuration error. This makes `winml analyze` suitable as a CI gate. Pass `--information` (enabled by default) to include recommendations alongside each flagged operator. Use `--save-node unsupported` or `--save-node partial` to persist node lists for further work.
+Specify a target EP with `--ep` (e.g., `--ep qnn` or `--ep openvino`) and a device with `--device` (CPU, GPU, or NPU). The default `--ep auto` infers from locally available EPs; pass `--ep all` to evaluate every rule-data-backed EP regardless of local availability. Results print to the console by default; add `--output results.json` to save the report as JSON for scripting or archiving.
+
+### How operators are classified
+
+For each operator (and matched subgraph pattern) the analyzer follows a two-step process:
+
+1. **Rule-database lookup** — does the target EP claim to support this pattern?
+2. **Local probe (fallback)** — if the pattern is absent from the rule database and `--run-unknown-op` is enabled, the analyzer builds a minimal ONNX graph for the op and runs it on the target EP locally to determine support (see [Local op execution](#local-op-execution) below).
+
+The combined answer is recorded as a `SupportLevel`:
+
+| Level         | Compile on target EP | Runs (possibly via CPU fallback) | CLI label          | Exit code contribution |
+| ------------- | -------------------- | -------------------------------- | ------------------ | ---------------------- |
+| `SUPPORTED`   | yes                  | yes                              | `Fully Supported`  | 0                      |
+| `PARTIAL`     | no                   | yes                              | `Partial Support`  | 1 (warning)            |
+| `UNSUPPORTED` | no                   | no                               | `Not Supported`    | 1 (error)              |
+| `UNKNOWN`     | n/a                  | n/a                              | `Unknown Support`  | 1                      |
+
+A `PARTIAL` classification means the operator cannot be dispatched to the requested EP but the ONNX Runtime can still execute the model by falling back to CPU. This is technically a working model, but the latency and power-efficiency goals of NPU deployment are not met. `UNSUPPORTED` means even the CPU fallback path fails, so the model will not run at all. `UNKNOWN` appears only when the analyzer lacks both rule-database data and the ability to test locally.
+
+### Two key outputs: lint and autoconf
+
+Every analysis produces a **lint** result; the default (full) mode additionally produces an **autoconf** result. Understanding these two outputs separately is the easiest way to understand what `winml analyze` is for and how to consume it.
+
+**Lint** is the analyzer's verdict on the model as it stands today. It classifies every operator and recognized pattern against the target EP and rolls the classifications up into:
+
+- `errors` — count of `UNSUPPORTED` patterns. **The model will not run.**
+- `warnings` — count of `PARTIAL` patterns. The model runs, but these nodes fall back to CPU.
+- `passed` — `True` iff `errors == 0 and warnings == 0`.
+
+Lint always runs. It is deterministic and sufficient for a yes/no CI gate — the CLI's exit code is derived from it.
+
+**Autoconf** is the analyzer's _suggestion_ for how to fix the current model. It lists the fusion flags which, if enabled in the optimize stage, would convert one or more `PARTIAL`/`UNSUPPORTED` patterns into `SUPPORTED` ones.
+
+Autoconf is what powers the build pipeline's [re-optimization loop](#the-analyzeroptimizer-loop): when the analyzer says "`gelu_fusion` would resolve these warnings", the build re-runs optimize with that flag and re-analyzes — until no further suggestions remain or the iteration limit is hit. Autoconf is _advisory_; nothing else in the system flips fusion flags automatically.
+
+### Analysis modes
+
+`winml analyze` can run in two modes which differ only in whether autoconf is computed:
+
+| Mode               | How to enable                                              | Output                                                                 | When to use                                     |
+| ------------------ | ---------------------------------------------------------- | ---------------------------------------------------------------------- | ----------------------------------------------- |
+| **Lint-only**      | `--no-information` (CLI) or `autoconf=False` (Python)      | Lint only. `optimization_config` is `None`.                            | CI gate; pass/fail only                         |
+| **Full** (default) | `--information` (CLI, default) or `autoconf=True` (Python) | Lint **plus** autoconf and recommendations | Local debugging; build pipeline's autoconf loop |
+
+The only difference between the two modes is whether autoconf and the human-readable recommendations are computed. Skipping them gives a faster, leaner run. The lint result is identical either way.
+
+### Three classes of finding
+
+Every analysis emits findings in three buckets. Each bucket maps to a different remediation pattern.
+
+**Errors (`UNSUPPORTED` patterns)** block deployment. Either the operator does not exist on the target EP at all, or it does not handle the specific input shape/dtype the model uses. Typical remediations:
+
+- Rewrite the model to use an equivalent pattern the EP does support.
+- Lower the opset version of the offending op if the EP supports an older opset.
+- Insert pre/post-processing to massage shapes into a supported configuration.
+
+Each error pattern includes a recommendation that identifies the current pattern and the target pattern the EP does support, so the optimizer (or a manual rewrite) can apply the fix.
+
+**Warnings (`PARTIAL` patterns)** mean the model will run, but the target EP cannot dispatch this pattern. Inference falls back to the CPU EP, breaking the deployment goal (e.g., NPU offload) without breaking correctness. Warnings are usually fusion opportunities — the analyzer recognized a sub-pattern that, if fused, would become a single EP-native op. The fix is to enable the relevant fusion flag in the optimize stage — this is exactly what the autoconf loop does automatically.
+
+**Info (`Information` items)** are lower-priority insights: a hint that an alternative pattern exists, a QDQ-equivalent that could be used after quantization, or a description of why a node was classified as it was. Info entries never affect exit code.
+
+### Local op execution
+
+The static rule database does not cover every operator and every shape/dtype combination. When `--run-unknown-op` is enabled and the analyzer encounters a pattern not present in the database, it builds a tiny ONNX graph containing just that op (with the model's actual input metadata) and runs it on the target EP locally. The compile/run result becomes the classification. Without `--run-unknown-op` (the default), such patterns are classified as `UNKNOWN`.
+
+Leave `--run-unknown-op` disabled when:
+
+- The local machine does not have the target EP available (e.g., analyzing a QNN model from a non-Snapdragon machine).
+- You want bit-for-bit reproducible analysis across machines. Local execution can produce different results depending on driver versions.
+
+### Save-node: debugging unsupported subgraphs
+
+When a pattern is unsupported and the recommendation does not immediately tell you what is wrong, use `--save-node` to dump the offending subgraph to disk as a self-contained, runnable `.onnx` file. You can then open it in [Netron](https://netron.app/), re-analyze it in isolation, or attach it to a bug report as a minimal reproducer. See the [analyze command reference](../commands/analyze.md) for usage examples.
+
+### HTP metadata enhancement
+
+When a model is exported with hierarchy-preserving tags (HTP), the export produces a sidecar `_htp_metadata.json` that maps each ONNX node back to its source module (e.g., `encoder.layer.0.attention.self.GELUActivation`). Passing this file via `--htp-metadata` lets the `PatternExtractor` use the module hierarchy to match subgraph patterns more accurately than operator-level heuristics alone.
+
+HTP metadata is consumed at the pattern extraction stage — before any EP-specific runtime checking — so the enriched patterns benefit all target EPs equally (QNN, OpenVINO, VitisAI, etc.). Without HTP metadata, the analyzer falls back to attribute-based tag matching and then the general-purpose `PatternMatcher`; with it, the analyzer can correctly identify fused patterns (GELU, LayerNorm, Attention) that are difficult to detect from the raw operator graph. See the [analyze command reference](../commands/analyze.md) for usage examples.
+
+### What runs internally
+
+The analyzer is composed of five stages that run in order. You normally do not need to think about them, but they are worth knowing when reading recommendations or extending the analyzer:
+
+| Stage               | Job                                                                                                                                                           |
+| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `ONNXLoader`        | Load the ONNX file (or `ModelProto`), record metadata.                                                                                                        |
+| `PatternExtractor`  | Walk the graph, match operator and subgraph patterns from the rule catalog. Optionally consume HTP metadata.                                                   |
+| `RuntimeChecker`    | For each pattern, consult the rule database; if no rule applies, run the op locally (when allowed).                                                            |
+| `InformationEngine` | Turn classifications into human-readable `Information` items; also runs model validators (constant folding, dynamic input, pattern matching, QDQ validation, shape inference). |
+| `OutputAggregator`  | Assemble the final `AnalysisOutput` (the JSON you get from `--output`).                                                                                       |
+
+The model validators run regardless of whether there are runtime check results — they are model-level sanity checks (e.g., is shape inference complete? are QDQ pairs well-formed?) and can surface issues even when every operator looks fine in isolation.
 
 ## What optimize does
 
-`winml optimize` rewrites the ONNX graph by applying fusions and structural simplifications. Fusions such as GELU, LayerNorm, and MatMul+Add collapse multi-node sequences into single operators that EPs can map to efficient kernels. Layout transformations like the NHWC transformer rearrange tensor memory order to match GPU access patterns.
+`winml optimize` rewrites the ONNX graph by applying fusions and structural simplifications. Internally the optimizer runs four pipes in sequence:
+
+| Pipe              | What it does                                                                 |
+| ----------------- | ---------------------------------------------------------------------------- |
+| **ORTGraphPipe**  | ORT C++ graph optimizer (level 2): fusions, eliminations, layout transforms  |
+| **RewritePipe**   | JSON-driven pattern matcher that replaces subgraph patterns with equivalent alternatives |
+| **ORTFusionPipe** | ORT Python transformer optimizer: attention, LayerNorm, and RMSNorm fusions  |
+| **SurgeryPipe**   | Post-optimization model surgery (constant clamping, NaN guard removal)       |
 
-Every optimization is a named capability toggled via `--enable-<name>` and `--disable-<name>` flags. Run `--list-capabilities` to see all registered optimizations and their defaults. This granularity matters when a specific fusion breaks a downstream step or when you need an exact optimization profile for a given EP.
+Every optimization is a named **capability** toggled via `--enable-<name>` and `--disable-<name>` flags. Run `--list-capabilities` to see all registered optimizations and their defaults. The optimizer currently ships 43 static capabilities across 13 categories:
 
-The pattern-rewrite family is a complementary mechanism: instead of folding nodes, rewrites replace one subgraph pattern with a structurally equivalent alternative. Run `--list-rewrites` to discover available families and their flag names. Flags follow the form `--enable-<source-slug>-<target-slug>`.
+| Category     | Capabilities | Examples                                        |
+| ------------ | :----------: | ----------------------------------------------- |
+| GELU         | 5            | gelu-fusion, fast-gelu-fusion, quick-gelu-fusion |
+| LayerNorm    | 6            | layer-norm-fusion, skip-layer-norm-fusion, fuse-rmsnorm |
+| MatMul       | 6            | matmul-add-fusion, matmul-activation-fusion      |
+| Conv         | 4            | conv-bn-fusion, conv-activation-fusion           |
+| Layout       | 4            | nhwc-transformer, transpose-optimizer            |
+| GEMM         | 3            | gemm-activation-fusion, gemm-transpose-fusion    |
+| Elimination  | 3            | slice-elimination, expand-elimination            |
+| Graph        | 3            | constant-folding, double-qdq-pairs-remover       |
+| Activation   | 2            | bias-softmax-fusion, bias-dropout-fusion         |
+| Attention    | 1            | attention-fusion                                 |
+| Misc         | 4            | pad-fusion, gather-to-slice-fusion               |
+| Surgery      | 2            | clamp-constant-values, remove-isnan-in-attention-mask |
 
-Use presets (`--preset transformer-optimized`, `--preset qnn-compatible`) as a starting point, and commit a specific combination to a `--config` file for reproducible builds.
+This granularity matters when a specific fusion breaks a downstream step or when you need an exact optimization profile for a given EP. Some capabilities declare dependencies (e.g., `bias-gelu-fusion` requires `gelu-fusion`); the optimizer resolves these automatically when you enable a flag.
+
+**Pattern rewrites** are a complementary mechanism: instead of folding nodes, rewrites replace one subgraph pattern with a structurally equivalent alternative. Rules are defined in JSON files (`default.json` for general rewrites, `qnn.json` for QNN-specific rewrites). The optimizer currently ships 5 rewrite groups containing 12 individual rules — for example, four GELU source variants can each be rewritten to a single `Gelu` op, and a MatMul+Add pattern can be rewritten to a GEMM or to a Conv2D for Qualcomm NPU targets. Run `--list-rewrites` to discover available families and their flag names. Flags follow the form `--enable-<source-slug>-<target-slug>`.
+
+Commit a specific combination of flags to a `--config` file for reproducible builds.
 
 ## The analyzer/optimizer loop
 
 A single optimize pass may create fusion opportunities that were not present before, and a freshly fused graph may surface new operator compatibility issues. This is why `winml build` runs analyze and optimize in an alternating loop rather than once each.
 
-The loop repeats up to `--max-optim-iterations` rounds (default: three), which covers most transformer and vision architectures. Convergence is checked after each round; the loop exits early when the analysis result no longer improves. Use `--no-analyze` to skip the loop and run a single optimization pass — useful for deterministic rebuilds from a fixed ONNX checkpoint where the graph is already known good.
+The flow inside `winml build` (implemented in `run_optimize_analyze_loop`) is:
+
+![Optimize-analyze loop](../assets/optimize-analyze-loop.svg)
+
+The initial optimize pass applies the flags from `config.optim`. The analyzer then inspects the result; if autoconf discovers fusion flags that were not yet enabled, the optimizer re-runs with those flags and the analyzer re-checks. This repeats up to `--max-optim-iterations` rounds (default: three). The loop exits early when autoconf suggests no further changes. After the loop, a final analysis validates the result — if unsupported patterns still exist, the build raises a `RuntimeError`.
+
+Use `--no-analyze` to skip the loop and run a single optimization pass — useful for deterministic rebuilds from a fixed ONNX checkpoint where the graph is already known good.
+
+## When to use which entry point
+
+| You want to...                                | Use                                               |
+| --------------------------------------------- | ------------------------------------------------- |
+| Gate a CI pipeline on EP compatibility        | `winml analyze` (CLI) — exit code is the contract |
+| Embed analysis in a build script or notebook  | `analyze_onnx(model, ep=...)` (flat Python API)   |
+| Post-process the full result programmatically | `ONNXStaticAnalyzer().analyze(...)` (class API)   |
+| Analyze an in-memory `ModelProto`             | `ONNXStaticAnalyzer().analyze_from_proto(...)`    |
+| Optimize with full control over fusions       | `winml optimize` (CLI) with `--enable-` / `--disable-` flags |
+| Reproducible build from a config file         | `winml build -c config.json` (pipeline wrapper)   |
+
+The CLI and the flat Python API are sufficient for the vast majority of cases. The class-based API is only needed when you want to call `is_fully_supported(ep)`, `get_unsupported_operators(ep)`, or `get_optimization_opportunities(ep)` on the full result.
 
 ## See also
 
 - [Compile and EPContext](compile-and-epcontext.md)
 - [Primitives and pipeline](primitives-and-pipeline.md)
+- [How winml-cli works](how-it-works.md) — where the analyzer sits in the build pipeline
+- [EPs and devices](eps-and-devices.md) — background on EPs and operator support
 - [analyze command](../commands/analyze.md)
 - [optimize command](../commands/optimize.md)
diff --git a/mkdocs.yml b/mkdocs.yml
index e0f7b1d79..b6d9ae3bd 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -96,7 +96,7 @@ nav:
       - Discover:
           - sys: commands/sys.md
           - inspect: commands/inspect.md
-          - catalog: commands/hub.md
+          - catalog: commands/catalog.md
           - analyze: commands/analyze.md
       - Configure:
           - config: commands/config.md

From 9f182c85234015acd8a56df02b9bf3b489b33698 Mon Sep 17 00:00:00 2001
From: xieofxie <xieofxie@126.com>
Date: Wed, 27 May 2026 11:18:51 +0800
Subject: [PATCH 007/143] Release/v0.1.0 back to main (#758)

Co-authored-by: Zhipeng Wang <zhiwang@microsoft.com>
Co-authored-by: Qiong Wu (qiowu) <qiowu@microsoft.com>
Co-authored-by: hualxie <hualxie@microsoft.com>
Co-authored-by: Charles Zhang <zhangchao@microsoft.com>
Co-authored-by: Zhenchao Ni <zhenni@microsoft.com>
---
 CHANGELOG.md                                  |  39 +
 pyproject.toml                                |   9 +-
 src/winml/modelkit/cli.py                     |   2 +-
 src/winml/modelkit/commands/inspect.py        |  24 +-
 src/winml/modelkit/commands/perf.py           |  22 +-
 src/winml/modelkit/data/hub_models.json       | 817 +++++++++++++-----
 src/winml/modelkit/inspect/resolver.py        | 137 ++-
 src/winml/modelkit/session/monitor/_pdh.py    |   4 +-
 src/winml/modelkit/session/session.py         |  10 +-
 tests/cli/test_catalog_cli.py                 |   6 +-
 tests/e2e/test_config_e2e.py                  |   2 +
 tests/e2e/test_perf_e2e.py                    |   2 +
 .../inspect/test_resolve_processor_gating.py  | 230 ++++-
 tests/unit/session/test_winml_session.py      |  68 +-
 14 files changed, 1099 insertions(+), 273 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index e44731b39..7e6ab75d9 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,45 @@ All notable changes to this project are documented in this file.
 
 The format is loosely based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 
+## WinML CLI v0.1.0
+
+First **public preview** release. With the Windows ML 2.0 baseline now in place, this release shifts focus to polishing the CLI surface: faster `winml inspect` / `winml eval`, more accurate device & EP resolution, a real PyPI release pipeline, and a meaningful pass over sysinfo and quantization behavior.
+
+### 🎉 Public preview
+
+- Promoted to `Development Status :: 4 - Beta` in `pyproject.toml`.
+- First release published to PyPI via the new ESRP-signed release pipeline (#473).
+
+### ✨ Improvements
+
+- `winml inspect`: banner + spinner during HF metadata fetch (#718, hidden in JSON mode #745); `--list-tasks` <500 ms (#717); processor `Auto*` lookups gated (#719, #746).
+- `winml eval`: lazy module loading drops cold-start latency (#711); inputs validated up-front with friendlier errors and a structured `--schema` output (#694).
+- `winml export`: `model-id` and `task` validated before the export runs (#714).
+- `winml analyze`: cleaner EP/device selection, clearer "op-check skipped" UI, merged optimization config (#702).
+- `winml perf`: estimated model precision (QDQ / block-wise quant / dominant float dtype) is now reported by `WinMLSession` (#706); expanded perf e2e coverage across EPs and devices (#698).
+- `winml monitor`: queries all NPU/GPU engines and reports the max utilization (#716).
+- CLI-wide: did-you-mean suggestions on mistyped subcommands (#699); consistent option-vs-config-file value priority across commands (#720); `op_tracing` hidden from the public surface (#738).
+- Adopted the official `windowsml` usage example — removed the redundant `WinML` singleton, fixing a benign "library already registered" traceback on `winml perf --device npu` (#729).
+
+### 🐛 Fixes
+
+- **Quantization (P0)** — `--precision` now rejects invalid values instead of silently falling back to `uint8/uint8`; default image calibration dataset streams rather than downloading ~5 GB; DETR-family object detection supports `pixel_mask` padding (#680).
+- **`winml eval`** — pinned `pyarrow <24` to avoid an EP DLL load-order crash (#750).
+- **`winml perf`** — QDQ precision detection fix (#753); NPU monitoring adds `3D` engine, device line shows requested vs. actual (#747).
+- **EP / device resolution** — `resolve_device`/`resolve_eps` now use `get_registered_ep_devices` (#712); dropped misleading `ov`/`vitis`/`trtrtx` aliases (#690); `winml sys` raises when an EP isn't available on the host (#686); per-provider `ensure_ready` failures demoted to debug (#703); analyze regression caught during compile e2e (#740).
+- **Native ORT / WinML** — suppressed ORT native stderr, fixed a HANDLE leak (#709); nulled the EP catalog handle after enumeration to prevent a QNN NPU crash on exit (#701); fixed the `onnxruntime` DLL search path (#689).
+- **`winml sys`** — diagnostic sections gated behind `-v`, json-mode logs routed to stderr (#737); CPU/Mem scoped to the current process and PDH percent counters no longer artificially capped (#715); host arch reported via `IsWow64Process2` on Windows ARM64 (#705).
+- **OpenVINO** — `is_npu` detection updated (#722).
+
+### 🔧 Internals & CI
+
+- Added a `winml-cli` Copilot skill (#733).
+
+### 📦 Assets
+
+- `winml_cli-0.1.0-py3-none-any.whl`
+- `rules-v0.1.0.zip`
+
 ## WinML CLI v0.0.4
 
 ### 🚀 Platform upgrades
diff --git a/pyproject.toml b/pyproject.toml
index 4e116aeb9..7c8f1bfd2 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,7 +4,7 @@ requires = [ "setuptools>=61", "wheel" ]
 
 [project]
 name = "winml-cli"
-version = "0.0.4"
+version = "0.1.0"
 description = "Accelerate Model Deployment on WinML"
 readme = "README.md"
 keywords = [ "onnx", "winml" ]
@@ -12,7 +12,7 @@ license = { text = "MIT" }
 authors = [ { name = "WinML Team" } ]
 requires-python = ">=3.11,<3.12"
 classifiers = [
-  "Development Status :: 2 - Pre-Alpha",
+  "Development Status :: 4 - Beta",
   "Intended Audience :: Developers",
   "Intended Audience :: Science/Research",
   "License :: OSI Approved :: MIT License",
@@ -25,6 +25,11 @@ dependencies = [
   "click>=8.4",
   "colorama>=0.4.6",
   "datasets>=4.1.1",
+  # pyarrow 24.0.0 crashes (0xC0000005) on Windows when its C extension is
+  # loaded after the WinML/ORT EP DLLs (see _preload_bundled_onnxruntime_dll
+  # in winml.modelkit.__init__). Cap to <24 until upstream is fixed; re-test
+  # 24.x patches / 25.x before relaxing.
+  "pyarrow>=21,<24",
   "diffusers>=0.36",
   "evaluate>=0.4.6",
   "fastapi>=0.135.3",
diff --git a/src/winml/modelkit/cli.py b/src/winml/modelkit/cli.py
index 03e14b221..4504856b7 100644
--- a/src/winml/modelkit/cli.py
+++ b/src/winml/modelkit/cli.py
@@ -264,7 +264,7 @@ def format_commands(self, ctx: click.Context, formatter: click.HelpFormatter) ->
 def main(ctx: click.Context, verbose: int, quiet: bool, debug: bool) -> None:
     """WinML CLI - Accelerate Model Deployment on WinML.
 
-    Universal ONNX export with QNN and OpenVINO backend support.
+    Universal ONNX export with various WinML execution providers support.
     """
     # --debug is a backward-compat alias for -vv
     if debug:
diff --git a/src/winml/modelkit/commands/inspect.py b/src/winml/modelkit/commands/inspect.py
index bbc5fb3c8..d52b9d33c 100644
--- a/src/winml/modelkit/commands/inspect.py
+++ b/src/winml/modelkit/commands/inspect.py
@@ -186,10 +186,13 @@ def inspect(
     # Print a banner BEFORE the heavy import chain / network calls so users
     # see immediate feedback instead of ~14 s of silence and assume the
     # command hung (see #543). Banner + spinner go to stderr so `--format
-    # json` consumers still get clean stdout. Suppressed in --quiet mode.
+    # json` consumers still get clean stdout. Suppressed in --quiet mode
+    # and in JSON mode (Click 8.4 mixes stderr into CliRunner.result.output,
+    # and JSON consumers expect clean stdout regardless).
     quiet = bool(ctx.obj and ctx.obj.get("quiet"))
+    json_mode = output_format.lower() == "json"
     target = model_id or model_type or model_class
-    if not quiet:
+    if not quiet and not json_mode:
         _stderr_console.print(f"[dim]Inspecting [bold]{target}[/bold] …[/dim]")
 
     from ..inspect import InspectError, ModelNotFoundError, NetworkError
@@ -204,7 +207,7 @@ def inspect(
         logging.getLogger("winml.modelkit").setLevel(logging.DEBUG)
 
     try:
-        if quiet:
+        if quiet or json_mode:
             result = _inspect_model_v2(
                 model_id=model_id,
                 task_override=task,
@@ -479,9 +482,20 @@ def _inspect_model_v2(
         task=task,
     )
 
+    # Use the top-level model_type for the user-facing result.  For multimodal
+    # models (CLIP, etc.) `loader_config.model_type` is the narrowed sub-config
+    # type (e.g. "clip_text_model"), but users expect the top-level type ("clip").
+    #
+    # Precedence:
+    #   1. model_type_override  — user explicitly passed --model-type
+    #   2. parent_hf_config     — pre-narrowing config (only when model_id was
+    #                             provided and AutoConfig succeeded in step 1)
+    #   3. model_type           — narrowed loader_config.model_type (fallback)
+    display_model_type = model_type_override or getattr(parent_hf_config, "model_type", model_type)
+
     return InspectResult(
-        model_id=model_id or model_type or model_class_override or "unknown",
-        model_type=model_type,
+        model_id=model_id or display_model_type or model_class_override or "unknown",
+        model_type=display_model_type,
         architectures=architectures,
         task=task,
         task_source=task_source,
diff --git a/src/winml/modelkit/commands/perf.py b/src/winml/modelkit/commands/perf.py
index 4ad292d77..7fb4337ea 100644
--- a/src/winml/modelkit/commands/perf.py
+++ b/src/winml/modelkit/commands/perf.py
@@ -300,7 +300,8 @@ def run(self) -> BenchmarkResult:
         _print_model_info(
             self._model.io_config,
             task=self._model.task or self.config.task,
-            device=self._model.device,
+            req_device=self.config.device,
+            act_device=self._model.device,
             ep_name=self._model.ep_name,
         )
 
@@ -755,6 +756,12 @@ def _perf_modules(
 # Report Generation
 # =============================================================================
 
+def _device_string(req_device: str, act_device: str, ep_name: EPName | None) -> str:
+    device_str = f"{req_device} ({act_device})" if req_device != act_device else act_device
+    if ep_name:
+        device_str = f"{device_str} / {ep_name}"
+    return device_str
+
 
 def display_console_report(result: BenchmarkResult, console: Console) -> None:
     """Display benchmark results in formatted console output."""
@@ -763,9 +770,7 @@ def display_console_report(result: BenchmarkResult, console: Console) -> None:
 
     req_device = result.config.device
     act_device = result.actual_device
-    device_str = f"{req_device} ({act_device})" if req_device != act_device else act_device
-    if result.actual_ep:
-        device_str = f"{device_str} / {result.actual_ep}"
+    device_str = _device_string(req_device, act_device, result.actual_ep)
     console.print(f"[dim]Device:[/dim]      {device_str}")
 
     # TODO: show resolved precision once WinMLPreTrainedModel.precision
@@ -885,13 +890,14 @@ def _print_model_info(
     io_config: dict,
     *,
     task: str | None = None,
-    device: str = "auto",
+    req_device: str = "auto",
+    act_device: str = "auto",
     ep_name: EPName | None = None,
 ) -> None:
     """Print model I/O metadata before the benchmark starts."""
     console = Console(stderr=True)
     console.print()
-    device_line = f"{device} / {ep_name}" if ep_name else device
+    device_line = _device_string(req_device, act_device, ep_name)
     console.print(f"[dim]Device:[/dim]      {device_line}")
     if task:
         console.print(f"[dim]Task:[/dim]        {task}")
@@ -1011,7 +1017,7 @@ def _run_onnx_benchmark(
     session.compile()
 
     # Print model info before benchmark starts
-    _print_model_info(io_cfg, device=session.device, ep_name=session.ep_name)
+    _print_model_info(io_cfg, req_device=device, act_device=session.device, ep_name=session.ep_name)
 
     # Run benchmark
     total_iterations = warmup + iterations
@@ -1044,7 +1050,7 @@ def _run_onnx_benchmark(
                 total_iterations=total_iterations,
                 warmup=warmup,
                 model_id=str(onnx_path.name),
-                device=device,
+                device=session.device or device,
             )
             hw_metrics = hw.to_dict()
     else:
diff --git a/src/winml/modelkit/data/hub_models.json b/src/winml/modelkit/data/hub_models.json
index 470839065..201ed03b1 100644
--- a/src/winml/modelkit/data/hub_models.json
+++ b/src/winml/modelkit/data/hub_models.json
@@ -2,13 +2,16 @@
   "version": "1.0",
   "models": [
     {
-      "model_id": "BAAI/bge-base-en-v1.5",
-      "task": "feature-extraction",
-      "model_type": "bert",
+      "model_id": "AdamCodd/vit-base-nsfw-detector",
+      "task": "image-classification",
+      "model_type": "vit",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -17,18 +20,27 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 104.4
+      "size_mb": 83.4
     },
     {
-      "model_id": "BAAI/bge-base-en-v1.5",
+      "model_id": "BAAI/bge-large-en-v1.5",
       "task": "sentence-similarity",
       "model_type": "bert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -37,18 +49,27 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 104.4
+      "size_mb": 351.8
     },
     {
-      "model_id": "BAAI/bge-large-en-v1.5",
+      "model_id": "BAAI/bge-small-en-v1.5",
       "task": "sentence-similarity",
       "model_type": "bert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -60,18 +81,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 319.6
+      "size_mb": 43.9
     },
     {
-      "model_id": "BAAI/bge-small-en-v1.5",
-      "task": "feature-extraction",
-      "model_type": "bert",
+      "model_id": "FacebookAI/roberta-base",
+      "task": "fill-mask",
+      "model_type": "roberta",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -80,18 +107,27 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 31.8
+      "size_mb": 194.7
     },
     {
-      "model_id": "BAAI/bge-small-en-v1.5",
-      "task": "sentence-similarity",
-      "model_type": "bert",
+      "model_id": "FacebookAI/xlm-roberta-base",
+      "task": "fill-mask",
+      "model_type": "xlm-roberta",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -100,18 +136,27 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 31.8
+      "size_mb": 634.4
     },
     {
-      "model_id": "Babelscape/wikineural-multilingual-ner",
-      "task": "token-classification",
-      "model_type": "bert",
+      "model_id": "Falconsai/nsfw_image_detection",
+      "task": "image-classification",
+      "model_type": "vit",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -120,18 +165,27 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 169.1
+      "size_mb": 82.8
     },
     {
-      "model_id": "FacebookAI/roberta-base",
-      "task": "fill-mask",
-      "model_type": "roberta",
+      "model_id": "Intel/dpt-hybrid-midas",
+      "task": "depth-estimation",
+      "model_type": "dpt",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -143,18 +197,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 118.9
+      "size_mb": 117.9
     },
     {
-      "model_id": "FacebookAI/roberta-large",
-      "task": "fill-mask",
-      "model_type": "roberta",
+      "model_id": "Isotonic/distilbert_finetuned_ai4privacy_v2",
+      "task": "token-classification",
+      "model_type": "distilbert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -166,18 +226,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 338.9
+      "size_mb": 86.6
     },
     {
-      "model_id": "FacebookAI/xlm-roberta-base",
-      "task": "fill-mask",
-      "model_type": "xlm-roberta",
+      "model_id": "Jean-Baptiste/camembert-ner-with-dates",
+      "task": "token-classification",
+      "model_type": "camembert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -189,39 +255,53 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 266.0
+      "size_mb": 130.4
     },
     {
-      "model_id": "FacebookAI/xlm-roberta-large",
-      "task": "fill-mask",
-      "model_type": "xlm-roberta",
+      "model_id": "ahotrod/electra_large_discriminator_squad2_512",
+      "task": "question-answering",
+      "model_type": "electra",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
-        "qnn": [
+        "dml": [
           "GPU"
         ],
+        "qnn": [
+          "GPU",
+          "NPU"
+        ],
         "openvino": [
+          "CPU",
           "GPU",
           "NPU"
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 535.2
+      "size_mb": 350.9
     },
     {
-      "model_id": "Intel/bert-base-uncased-mrpc",
-      "task": "feature-extraction",
-      "model_type": "bert",
+      "model_id": "amunchet/rorshark-vit-base",
+      "task": "image-classification",
+      "model_type": "vit",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -230,18 +310,27 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 104.4
+      "size_mb": 82.8
     },
     {
-      "model_id": "Intel/bert-base-uncased-mrpc",
-      "task": "text-classification",
-      "model_type": "bert",
+      "model_id": "apple/mobilevit-small",
+      "task": "image-classification",
+      "model_type": "mobilevit",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -250,18 +339,27 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 104.4
+      "size_mb": 6.1
     },
     {
-      "model_id": "ProsusAI/finbert",
+      "model_id": "cardiffnlp/twitter-roberta-base-sentiment-latest",
       "task": "text-classification",
-      "model_type": "bert",
+      "model_type": "roberta",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -270,19 +368,29 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 104.4
+      "size_mb": 157.7
     },
     {
-      "model_id": "StanfordAIMI/dinov2-base-xray-224",
-      "task": "image-feature-extraction",
-      "model_type": "dinov2",
+      "model_id": "cross-encoder/ms-marco-MiniLM-L4-v2",
+      "task": "text-classification",
+      "model_type": "bert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
+          "GPU",
           "NPU"
         ],
         "openvino": [
@@ -292,18 +400,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 81.8
+      "size_mb": 29.9
     },
     {
-      "model_id": "cardiffnlp/twitter-roberta-base-sentiment-latest",
+      "model_id": "cross-encoder/ms-marco-MiniLM-L6-v2",
       "task": "text-classification",
-      "model_type": "roberta",
+      "model_type": "bert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -315,18 +429,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 119.5
+      "size_mb": 33.4
     },
     {
-      "model_id": "dbmdz/bert-large-cased-finetuned-conll03-english",
-      "task": "token-classification",
+      "model_id": "deepset/bert-large-uncased-whole-word-masking-squad2",
+      "task": "question-answering",
       "model_type": "bert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -338,40 +458,53 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 318.1
+      "size_mb": 350.9
     },
     {
-      "model_id": "deepset/bert-large-uncased-whole-word-masking-squad2",
+      "model_id": "deepset/roberta-base-squad2",
       "task": "question-answering",
-      "model_type": "bert",
+      "model_type": "roberta",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
         ],
         "openvino": [
           "CPU",
+          "GPU",
           "NPU"
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 319.6
+      "size_mb": 157.2
     },
     {
-      "model_id": "deepset/roberta-base-squad2",
+      "model_id": "deepset/tinyroberta-squad2",
       "task": "question-answering",
       "model_type": "roberta",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -383,18 +516,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 118.3
+      "size_mb": 116.2
     },
     {
-      "model_id": "deepset/tinyroberta-squad2",
-      "task": "question-answering",
-      "model_type": "roberta",
+      "model_id": "dima806/fairface_age_image_detection",
+      "task": "image-classification",
+      "model_type": "vit",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -406,18 +545,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 77.8
+      "size_mb": 82.8
     },
     {
-      "model_id": "dslim/bert-base-NER",
-      "task": "token-classification",
-      "model_type": "bert",
+      "model_id": "distilbert/distilbert-base-cased-distilled-squad",
+      "task": "question-answering",
+      "model_type": "distilbert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -426,18 +571,27 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 103.3
+      "size_mb": 84.2
     },
     {
-      "model_id": "facebook/convnext-tiny-224",
-      "task": "image-classification",
-      "model_type": "convnext",
+      "model_id": "distilbert/distilbert-base-uncased",
+      "task": "fill-mask",
+      "model_type": "distilbert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -446,18 +600,27 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 27.3
+      "size_mb": 109.5
     },
     {
-      "model_id": "facebook/dino-vitb16",
-      "task": "image-feature-extraction",
-      "model_type": "vit",
+      "model_id": "distilbert/distilbert-base-uncased-distilled-squad",
+      "task": "question-answering",
+      "model_type": "distilbert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -469,37 +632,53 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 81.8
+      "size_mb": 86.5
     },
     {
-      "model_id": "facebook/dino-vits16",
-      "task": "image-feature-extraction",
-      "model_type": "vit",
+      "model_id": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
+      "task": "text-classification",
+      "model_type": "distilbert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
         ],
         "openvino": [
           "CPU",
+          "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 20.7
+      "size_mb": 87.0
     },
     {
-      "model_id": "facebook/dinov2-base",
+      "model_id": "facebook/dino-vitb16",
       "task": "image-feature-extraction",
-      "model_type": "dinov2",
+      "model_type": "vit",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -511,40 +690,53 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 82.6
+      "size_mb": 83.4
     },
     {
-      "model_id": "facebook/dinov2-large",
+      "model_id": "facebook/dino-vits16",
       "task": "image-feature-extraction",
-      "model_type": "dinov2",
+      "model_type": "vit",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
         ],
         "openvino": [
           "CPU",
+          "GPU",
           "NPU"
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 290.3
+      "size_mb": 21.6
     },
     {
-      "model_id": "facebook/dinov2-small",
+      "model_id": "facebook/dinov2-base",
       "task": "image-feature-extraction",
       "model_type": "dinov2",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -556,18 +748,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 21.0
+      "size_mb": 82.8
     },
     {
-      "model_id": "google-bert/bert-base-multilingual-cased",
-      "task": "feature-extraction",
-      "model_type": "bert",
+      "model_id": "facebook/dinov2-large",
+      "task": "image-feature-extraction",
+      "model_type": "dinov2",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -576,21 +774,31 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 170.3
+      "size_mb": 291.4
     },
     {
-      "model_id": "google-bert/bert-base-multilingual-uncased",
-      "task": "fill-mask",
-      "model_type": "bert",
+      "model_id": "facebook/dinov2-small",
+      "task": "image-feature-extraction",
+      "model_type": "dinov2",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
-        "qnn": [
+        "dml": [
           "GPU"
         ],
+        "qnn": [
+          "GPU",
+          "NPU"
+        ],
         "openvino": [
           "CPU",
           "GPU",
@@ -598,18 +806,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 160.3
+      "size_mb": 21.4
     },
     {
-      "model_id": "google-bert/bert-base-uncased",
+      "model_id": "google-bert/bert-base-multilingual-cased",
       "task": "fill-mask",
       "model_type": "bert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -621,21 +835,28 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 105.0
+      "size_mb": 346.5
     },
     {
-      "model_id": "google-bert/bert-large-uncased-whole-word-masking-finetuned-squad",
-      "task": "question-answering",
+      "model_id": "google-bert/bert-base-multilingual-cased",
+      "task": "masked-lm",
       "model_type": "bert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
-        "qnn": [
+        "dml": [
           "GPU"
         ],
+        "qnn": [
+          "GPU",
+          "NPU"
+        ],
         "openvino": [
           "CPU",
           "GPU",
@@ -643,18 +864,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 319.6
+      "size_mb": 346.5
     },
     {
-      "model_id": "google/vit-base-patch16-224",
-      "task": "image-classification",
-      "model_type": "vit",
+      "model_id": "google-bert/bert-base-multilingual-uncased",
+      "task": "fill-mask",
+      "model_type": "bert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -666,18 +893,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 82.5
+      "size_mb": 316.4
     },
     {
-      "model_id": "google/vit-base-patch16-224-in21k",
-      "task": "image-feature-extraction",
-      "model_type": "vit",
+      "model_id": "google-bert/bert-base-uncased",
+      "task": "fill-mask",
+      "model_type": "bert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -689,18 +922,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 82.4
+      "size_mb": 150.5
     },
     {
-      "model_id": "laion/CLIP-ViT-B-32-laion2B-s34B-b79K",
-      "task": "feature-extraction",
-      "model_type": "clip",
+      "model_id": "google/vit-base-patch16-224",
+      "task": "image-classification",
+      "model_type": "vit",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -712,62 +951,82 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 144.3
+      "size_mb": 83.6
     },
     {
-      "model_id": "laion/CLIP-ViT-B-32-laion2B-s34B-b79K",
-      "task": "zero-shot-image-classification",
-      "model_type": "clip",
+      "model_id": "google/vit-base-patch16-224-in21k",
+      "task": "image-feature-extraction",
+      "model_type": "vit",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
         ],
         "openvino": [
           "CPU",
+          "GPU",
           "NPU"
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 144.3
+      "size_mb": 83.4
     },
     {
-      "model_id": "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
-      "task": "zero-shot-image-classification",
-      "model_type": "clip",
+      "model_id": "hustvl/yolos-small",
+      "task": "object-detection",
+      "model_type": "yolos",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
         ],
         "openvino": [
           "CPU",
+          "GPU",
           "NPU"
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 940.4
+      "size_mb": 38.1
     },
     {
-      "model_id": "mattmdjaga/segformer_b2_clothes",
-      "task": "image-segmentation",
-      "model_type": "segformer",
+      "model_id": "kredor/punctuate-all",
+      "task": "token-classification",
+      "model_type": "xlm-roberta",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -779,18 +1038,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 26.1
+      "size_mb": 449.7
     },
     {
-      "model_id": "microsoft/rad-dino",
-      "task": "image-feature-extraction",
-      "model_type": "dinov2",
+      "model_id": "laion/CLIP-ViT-B-32-laion2B-s34B-b79K",
+      "task": "feature-extraction",
+      "model_type": "clip",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -802,18 +1067,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 82.6
+      "size_mb": 85.4
     },
     {
-      "model_id": "microsoft/resnet-50",
-      "task": "image-classification",
-      "model_type": "resnet",
+      "model_id": "laion/CLIP-ViT-B-32-laion2B-s34B-b79K",
+      "task": "zero-shot-image-classification",
+      "model_type": "clip",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -825,18 +1096,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 24.4
+      "size_mb": 170.1
     },
     {
-      "model_id": "microsoft/swin-large-patch4-window7-224",
-      "task": "image-classification",
-      "model_type": "swin",
+      "model_id": "lxyuan/distilbert-base-multilingual-cased-sentiments-student",
+      "task": "zero-shot-classification",
+      "model_type": "distilbert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -848,18 +1125,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 187.6
+      "size_mb": 217.5
     },
     {
-      "model_id": "microsoft/table-transformer-detection",
-      "task": "object-detection",
-      "model_type": "table-transformer",
+      "model_id": "microsoft/rad-dino",
+      "task": "image-feature-extraction",
+      "model_type": "dinov2",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -868,18 +1151,27 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 27.5
+      "size_mb": 84.4
     },
     {
-      "model_id": "nvidia/segformer-b1-finetuned-ade-512-512",
-      "task": "image-segmentation",
-      "model_type": "segformer",
+      "model_id": "microsoft/resnet-18",
+      "task": "image-classification",
+      "model_type": "resnet",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -891,18 +1183,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 13.1
+      "size_mb": 11.2
     },
     {
-      "model_id": "nvidia/segformer-b2-finetuned-ade-512-512",
-      "task": "image-segmentation",
-      "model_type": "segformer",
+      "model_id": "monologg/koelectra-small-v2-distilled-korquad-384",
+      "task": "question-answering",
+      "model_type": "electra",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -914,18 +1212,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 26.2
+      "size_mb": 17.7
     },
     {
-      "model_id": "nvidia/segformer-b5-finetuned-ade-640-640",
-      "task": "image-segmentation",
-      "model_type": "segformer",
+      "model_id": "openai/clip-vit-base-patch16",
+      "task": "feature-extraction",
+      "model_type": "clip",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -937,18 +1241,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 80.9
+      "size_mb": 85.5
     },
     {
-      "model_id": "openai/clip-vit-base-patch16",
+      "model_id": "openai/clip-vit-base-patch32",
       "task": "feature-extraction",
       "model_type": "clip",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -960,18 +1270,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 142.7
+      "size_mb": 85.5
     },
     {
-      "model_id": "openai/clip-vit-base-patch16",
-      "task": "zero-shot-image-classification",
-      "model_type": "clip",
+      "model_id": "rizvandwiki/gender-classification",
+      "task": "image-classification",
+      "model_type": "vit",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -980,18 +1296,27 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 142.7
+      "size_mb": 82.8
     },
     {
-      "model_id": "openai/clip-vit-base-patch32",
+      "model_id": "sentence-transformers/all-MiniLM-L6-v2",
       "task": "feature-extraction",
-      "model_type": "clip",
+      "model_type": "bert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -1003,18 +1328,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 144.3
+      "size_mb": 33.2
     },
     {
-      "model_id": "openai/clip-vit-base-patch32",
-      "task": "zero-shot-image-classification",
-      "model_type": "clip",
+      "model_id": "sentence-transformers/all-MiniLM-L6-v2",
+      "task": "sentence-similarity",
+      "model_type": "bert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -1023,18 +1354,27 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 144.3
+      "size_mb": 33.4
     },
     {
-      "model_id": "openai/clip-vit-large-patch14",
-      "task": "zero-shot-image-classification",
-      "model_type": "clip",
+      "model_id": "sentence-transformers/all-mpnet-base-v2",
+      "task": "feature-extraction",
+      "model_type": "mpnet",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -1046,18 +1386,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 407.8
+      "size_mb": 133.4
     },
     {
-      "model_id": "openai/clip-vit-large-patch14-336",
-      "task": "zero-shot-image-classification",
-      "model_type": "clip",
+      "model_id": "sentence-transformers/all-mpnet-base-v2",
+      "task": "fill-mask",
+      "model_type": "mpnet",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -1069,18 +1415,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 408.2
+      "size_mb": 156.5
     },
     {
-      "model_id": "patrickjohncyh/fashion-clip",
-      "task": "zero-shot-image-classification",
-      "model_type": "clip",
+      "model_id": "sentence-transformers/all-mpnet-base-v2",
+      "task": "sentence-similarity",
+      "model_type": "mpnet",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -1089,18 +1441,27 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 144.3
+      "size_mb": 134.0
     },
     {
-      "model_id": "rizvandwiki/gender-classification",
-      "task": "image-classification",
-      "model_type": "vit",
+      "model_id": "sentence-transformers/multi-qa-mpnet-base-dot-v1",
+      "task": "feature-extraction",
+      "model_type": "mpnet",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -1112,18 +1473,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 81.8
+      "size_mb": 133.4
     },
     {
-      "model_id": "sentence-transformers/all-MiniLM-L6-v2",
-      "task": "feature-extraction",
-      "model_type": "bert",
+      "model_id": "sentence-transformers/multi-qa-mpnet-base-dot-v1",
+      "task": "fill-mask",
+      "model_type": "mpnet",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -1135,18 +1502,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 21.6
+      "size_mb": 156.5
     },
     {
-      "model_id": "sentence-transformers/all-MiniLM-L6-v2",
+      "model_id": "sentence-transformers/multi-qa-mpnet-base-dot-v1",
       "task": "sentence-similarity",
-      "model_type": "bert",
+      "model_type": "mpnet",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -1158,18 +1531,24 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 21.6
+      "size_mb": 134.0
     },
     {
       "model_id": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
-      "task": "feature-extraction",
+      "task": "sentence-similarity",
       "model_type": "bert",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -1178,18 +1557,27 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 112.2
+      "size_mb": 204.7
     },
     {
-      "model_id": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
+      "model_id": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
       "task": "sentence-similarity",
-      "model_type": "bert",
+      "model_type": "xlm-roberta",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -1198,18 +1586,27 @@
           "CPU",
           "GPU",
           "NPU"
+        ],
+        "vitisai": [
+          "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 112.2
+      "size_mb": 450.3
     },
     {
-      "model_id": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
-      "task": "sentence-similarity",
-      "model_type": "xlm-roberta",
+      "model_id": "valentinafeve/yolos-fashionpedia",
+      "task": "object-detection",
+      "model_type": "yolos",
       "supported_eps": {
         "cpu": [
           "CPU"
         ],
+        "dml": [
+          "GPU"
+        ],
         "qnn": [
           "GPU",
           "NPU"
@@ -1221,9 +1618,12 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 265.2
+      "size_mb": 38.1
     },
     {
       "model_id": "w11wo/indonesian-roberta-base-posp-tagger",
@@ -1233,9 +1633,13 @@
         "cpu": [
           "CPU"
         ],
-        "qnn": [
+        "dml": [
           "GPU"
         ],
+        "qnn": [
+          "GPU",
+          "NPU"
+        ],
         "openvino": [
           "CPU",
           "GPU",
@@ -1243,9 +1647,12 @@
         ],
         "vitisai": [
           "NPU"
+        ],
+        "trtrtx": [
+          "GPU"
         ]
       },
-      "size_mb": 118.3
+      "size_mb": 157.2
     }
   ]
 }
diff --git a/src/winml/modelkit/inspect/resolver.py b/src/winml/modelkit/inspect/resolver.py
index a55edf12b..51a6a9ccd 100644
--- a/src/winml/modelkit/inspect/resolver.py
+++ b/src/winml/modelkit/inspect/resolver.py
@@ -10,7 +10,7 @@
 from __future__ import annotations
 
 import logging
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, NamedTuple
 
 from ..loader.task import (
     HF_TASK_DEFAULTS,
@@ -869,20 +869,22 @@ def resolve_processor(
     # This is fast and doesn't require downloading/instantiating processors
     # NOTE: These JSON keys (processor_class, image_processor_type, etc.) are
     # standard HuggingFace config conventions, not model-specific hardcoding.
+    has_preprocessor_config = True
     try:
-        hub_proc, hub_tok, hub_img, hub_fe = _resolve_processor_from_hub_configs(model_id)
-        if hub_proc and processor_class is None:
-            processor_class = hub_proc
+        hub_result = _resolve_processor_from_hub_configs(model_id)
+        if hub_result.processor_class and processor_class is None:
+            processor_class = hub_result.processor_class
             processor_source = "hub_config"
-        if hub_tok and tokenizer_class is None:
-            tokenizer_class = hub_tok
+        if hub_result.tokenizer_class and tokenizer_class is None:
+            tokenizer_class = hub_result.tokenizer_class
             tokenizer_source = "hub_config"
-        if hub_img and image_processor_class is None:
-            image_processor_class = hub_img
+        if hub_result.image_processor_class and image_processor_class is None:
+            image_processor_class = hub_result.image_processor_class
             image_processor_source = "hub_config"
-        if hub_fe and feature_extractor_class is None:
-            feature_extractor_class = hub_fe
+        if hub_result.feature_extractor_class and feature_extractor_class is None:
+            feature_extractor_class = hub_result.feature_extractor_class
             feature_extractor_source = "hub_config"
+        has_preprocessor_config = hub_result.has_preprocessor_config
     except Exception as e:
         logger.debug("Failed to resolve processors from hub configs: %s", e)
 
@@ -890,10 +892,14 @@ def resolve_processor(
     # Skip entirely when Strategies 0 + 1 already populated every field —
     # each Auto* instantiation does its own HF Hub I/O plus class init
     # (AutoProcessor and AutoFeatureExtractor are several seconds each).
+    #
+    # When ``preprocessor_config.json`` is missing on the hub, the model
+    # has neither an image processor nor a feature extractor; skip those
+    # two Auto* round-trips (they would each spend ~1s confirming a 404).
     need_processor = processor_class is None
     need_tokenizer = tokenizer_class is None
-    need_image_processor = image_processor_class is None
-    need_feature_extractor = feature_extractor_class is None
+    need_image_processor = image_processor_class is None and has_preprocessor_config
+    need_feature_extractor = feature_extractor_class is None and has_preprocessor_config
 
     if need_processor or need_tokenizer or need_image_processor or need_feature_extractor:
         try:
@@ -938,9 +944,21 @@ def resolve_processor(
     )
 
 
-def _resolve_processor_from_hub_configs(
-    model_id: str,
-) -> tuple[str | None, str | None, str | None, str | None]:
+class _HubConfigResult(NamedTuple):
+    """Result of ``_resolve_processor_from_hub_configs``.
+
+    A NamedTuple rather than a plain tuple so the trailing boolean cannot be
+    silently swapped with the four ``str | None`` fields at the call site.
+    """
+
+    processor_class: str | None
+    tokenizer_class: str | None
+    image_processor_class: str | None
+    feature_extractor_class: str | None
+    has_preprocessor_config: bool
+
+
+def _resolve_processor_from_hub_configs(model_id: str) -> _HubConfigResult:
     """Resolve processor classes by fetching config files from HuggingFace Hub.
 
     This approach is fast because it only downloads small JSON config files,
@@ -950,7 +968,12 @@ def _resolve_processor_from_hub_configs(
         model_id: HuggingFace model identifier
 
     Returns:
-        Tuple of (processor_class, tokenizer_class, image_processor_class, feature_extractor_class)
+        A ``_HubConfigResult`` whose ``has_preprocessor_config`` reports
+        whether ``preprocessor_config.json`` actually exists on the hub —
+        the authoritative signal that the model has no image processor or
+        feature extractor, so the caller can skip the corresponding
+        ``AutoImageProcessor`` / ``AutoFeatureExtractor`` round-trips
+        (which would each spend ~1s confirming a 404 on text-only models).
     """
     import json
     from pathlib import Path
@@ -962,6 +985,7 @@ def _resolve_processor_from_hub_configs(
     tokenizer_class: str | None = None
     image_processor_class: str | None = None
     feature_extractor_class: str | None = None
+    has_preprocessor_config = False
 
     # Try to download and parse preprocessor_config.json
     # This file contains image_processor_type or processor_class
@@ -970,6 +994,11 @@ def _resolve_processor_from_hub_configs(
             repo_id=model_id,
             filename="preprocessor_config.json",
         )
+        # Set the flag as soon as the file exists on the hub, *before* parsing.
+        # A corrupt JSON is still proof that the model ships preprocessor
+        # config — fall back to Auto* lookups rather than declaring the model
+        # text-only and silently dropping its image/feature processor.
+        has_preprocessor_config = True
         with Path(preprocessor_config_path).open(encoding="utf-8") as f:
             preprocessor_config = json.load(f)
 
@@ -1009,7 +1038,34 @@ def _resolve_processor_from_hub_configs(
     except json.JSONDecodeError as e:
         logger.debug("Failed to parse tokenizer_config.json for %s: %s", model_id, e)
 
-    return processor_class, tokenizer_class, image_processor_class, feature_extractor_class
+    return _HubConfigResult(
+        processor_class=processor_class,
+        tokenizer_class=tokenizer_class,
+        image_processor_class=image_processor_class,
+        feature_extractor_class=feature_extractor_class,
+        has_preprocessor_config=has_preprocessor_config,
+    )
+
+
+def _is_tokenizer_class_name(name: str) -> bool:
+    """Heuristic: does this transformers class name look like a tokenizer?
+
+    Tokenizer classes follow the ``*Tokenizer`` / ``*TokenizerFast`` naming
+    convention (e.g. ``RobertaTokenizer``, ``BertTokenizerFast``). Used to
+    detect when ``AutoProcessor.from_pretrained`` returned a leaf tokenizer
+    rather than a multimodal ``ProcessorMixin`` wrapper.
+    """
+    return name.endswith(("Tokenizer", "TokenizerFast"))
+
+
+def _is_image_processor_class_name(name: str) -> bool:
+    """Heuristic: does this transformers class name look like an image processor?"""
+    return name.endswith(("ImageProcessor", "ImageProcessorFast"))
+
+
+def _is_feature_extractor_class_name(name: str) -> bool:
+    """Heuristic: does this transformers class name look like a feature extractor?"""
+    return name.endswith("FeatureExtractor")
 
 
 def _resolve_processor_from_auto_classes(
@@ -1058,28 +1114,31 @@ def _resolve_processor_from_auto_classes(
             processor = AutoProcessor.from_pretrained(model_id, use_fast=True)
             processor_class = type(processor).__name__
 
-            # AutoProcessor may wrap tokenizer and image_processor
-            if (
-                try_tokenizer
-                and hasattr(processor, "tokenizer")
-                and processor.tokenizer is not None
-            ):
-                tokenizer_class = type(processor.tokenizer).__name__
-
-            if (
-                try_image_processor
-                and hasattr(processor, "image_processor")
-                and processor.image_processor is not None
-            ):
-                image_processor_class = type(processor.image_processor).__name__
-
-            # Some older models use feature_extractor instead of image_processor
-            if (
-                try_feature_extractor
-                and hasattr(processor, "feature_extractor")
-                and processor.feature_extractor is not None
-            ):
-                feature_extractor_class = type(processor.feature_extractor).__name__
+            # AutoProcessor may wrap tokenizer / image_processor / feature_extractor
+            # as a multimodal `ProcessorMixin`.  For single-modality models it
+            # often returns the leaf class directly (e.g. RoBERTa →
+            # `RobertaTokenizerFast`), which has none of those attributes.
+            # Pattern-match the returned class name so the standalone Auto*
+            # calls below can be skipped — otherwise we pay for a second,
+            # redundant load (~2s for AutoTokenizer on warm cache).
+            wrapped_tokenizer = getattr(processor, "tokenizer", None)
+            wrapped_image_processor = getattr(processor, "image_processor", None)
+            wrapped_feature_extractor = getattr(processor, "feature_extractor", None)
+
+            if try_tokenizer and wrapped_tokenizer is not None:
+                tokenizer_class = type(wrapped_tokenizer).__name__
+            elif try_tokenizer and _is_tokenizer_class_name(processor_class):
+                tokenizer_class = processor_class
+
+            if try_image_processor and wrapped_image_processor is not None:
+                image_processor_class = type(wrapped_image_processor).__name__
+            elif try_image_processor and _is_image_processor_class_name(processor_class):
+                image_processor_class = processor_class
+
+            if try_feature_extractor and wrapped_feature_extractor is not None:
+                feature_extractor_class = type(wrapped_feature_extractor).__name__
+            elif try_feature_extractor and _is_feature_extractor_class_name(processor_class):
+                feature_extractor_class = processor_class
 
         except Exception as e:
             logger.debug("AutoProcessor failed for %s: %s", model_id, e)
diff --git a/src/winml/modelkit/session/monitor/_pdh.py b/src/winml/modelkit/session/monitor/_pdh.py
index a9ceccd8a..0e0364dc4 100644
--- a/src/winml/modelkit/session/monitor/_pdh.py
+++ b/src/winml/modelkit/session/monitor/_pdh.py
@@ -329,8 +329,8 @@ def build_npu_query(npu_luid: str, pid: int | None = None) -> PdhQuery:
     Returns:
         An opened PdhQuery configured for NPU monitoring.
     """
-    # Neural: OpenVINO NPU
-    return build_adapter_query(npu_luid, engine_types=("Compute", "Neural"), pid=pid)
+    # Neural / 3D: OpenVINO NPU
+    return build_adapter_query(npu_luid, engine_types=("Compute", "Neural", "3D"), pid=pid)
 
 
 def build_gpu_query(gpu_luid: str, pid: int | None = None) -> PdhQuery:
diff --git a/src/winml/modelkit/session/session.py b/src/winml/modelkit/session/session.py
index 472063645..dc2bca5f4 100644
--- a/src/winml/modelkit/session/session.py
+++ b/src/winml/modelkit/session/session.py
@@ -647,7 +647,7 @@ def _get_precision(model_proto: onnx.ModelProto) -> str | None:
         }
 
         def _label(w_bits: int, a_bits: int) -> str:
-            return f"int{w_bits}" if w_bits == a_bits else f"w{w_bits}a{a_bits}"
+            return f"w{w_bits}a{a_bits}"
 
         # (1) QDQ — dominant zero_point bit width per side.
         if op_types & {"QuantizeLinear", "DequantizeLinear"}:
@@ -664,7 +664,13 @@ def _label(w_bits: int, a_bits: int) -> str:
                 bits = int_bits.get(zp_dtype)
                 if bits is None:
                     continue
-                target = weight_counts if node.input[0] in init_names else act_counts
+                is_weight_side = node.input[0] in init_names
+                # 32-bit zero_points on initializer-input DQs are bias
+                # accumulators (standard for INT8 QDQ: INT8 weights, INT32
+                # bias). They shouldn't drive the weight precision label.
+                if is_weight_side and bits >= 32:
+                    continue
+                target = weight_counts if is_weight_side else act_counts
                 target[bits] = target.get(bits, 0) + 1
 
             if weight_counts or act_counts:
diff --git a/tests/cli/test_catalog_cli.py b/tests/cli/test_catalog_cli.py
index 6b416dcda..5b3cd9b8f 100644
--- a/tests/cli/test_catalog_cli.py
+++ b/tests/cli/test_catalog_cli.py
@@ -316,11 +316,11 @@ def test_ep_vitisai_returns_only_vitisai_models(self, tmp_path: Path) -> None:
         for m in models:
             assert "vitisai" in _ep_keys(m)
 
-    def test_ep_vitisai_is_strict_subset_of_full_catalog(self, tmp_path: Path) -> None:
-        """vitisai is not universally supported — filtered list must be smaller."""
+    def test_ep_vitisai_is_subset_of_full_catalog(self, tmp_path: Path) -> None:
+        """vitisai filtered list must be a non-empty subset of the full catalog."""
         all_models = _run_json(tmp_path / "all.json")
         vitisai = _run_json(tmp_path / "vitisai.json", "--ep", "vitisai")
-        assert 0 < len(vitisai) < len(all_models)
+        assert 0 < len(vitisai) <= len(all_models)
 
     def test_ep_alias_openvino_equals_openvinoexecutionprovider(self, tmp_path: Path) -> None:
         ov = _run_json(tmp_path / "ov.json", "--ep", "openvinoexecutionprovider")
diff --git a/tests/e2e/test_config_e2e.py b/tests/e2e/test_config_e2e.py
index ca2e62572..60d10c958 100644
--- a/tests/e2e/test_config_e2e.py
+++ b/tests/e2e/test_config_e2e.py
@@ -32,6 +32,7 @@
 import pytest
 from click.testing import CliRunner
 
+from tests.e2e.require_ep import require_ep
 from winml.modelkit.commands.config import config
 
 
@@ -372,6 +373,7 @@ def test_no_compile_default(self) -> None:
 
     def test_compile_enabled(self) -> None:
         """--compile (negated default) should produce a compile section."""
+        require_ep("qnn")
         data = _run_config(
             "-m",
             self.MODEL,
diff --git a/tests/e2e/test_perf_e2e.py b/tests/e2e/test_perf_e2e.py
index 013433167..8952b5c7d 100644
--- a/tests/e2e/test_perf_e2e.py
+++ b/tests/e2e/test_perf_e2e.py
@@ -484,6 +484,7 @@ def test_benchmark_ep_cpu(self, ep: str, tmp_path: Path, model_arg: str):
     def test_benchmark_ep_gpu(self, ep: str, tmp_path: Path, model_arg: str):
         """Benchmark with --ep <ep>."""
         require_ep(ep)
+        _require_gpu()
 
         output_file = tmp_path / f"perf_hf_{ep}_gpu.json"
 
@@ -507,6 +508,7 @@ def test_benchmark_ep_gpu(self, ep: str, tmp_path: Path, model_arg: str):
     def test_benchmark_ep_npu(self, ep: str, tmp_path: Path, model_arg: str):
         """Benchmark with --ep <ep>."""
         require_ep(ep)
+        _require_npu()
 
         output_file = tmp_path / f"perf_hf_{ep}_npu.json"
 
diff --git a/tests/unit/inspect/test_resolve_processor_gating.py b/tests/unit/inspect/test_resolve_processor_gating.py
index 02e115c31..5a27dcd0f 100644
--- a/tests/unit/inspect/test_resolve_processor_gating.py
+++ b/tests/unit/inspect/test_resolve_processor_gating.py
@@ -20,14 +20,21 @@
 from unittest.mock import MagicMock, patch
 
 from winml.modelkit.inspect.resolver import (
+    _HubConfigResult,
     _resolve_processor_from_auto_classes,
     resolve_processor,
 )
 
 
-def _all_filled_hub_result() -> tuple[str, str, str, str]:
-    """Strategy 1 returns all four processor types."""
-    return ("BertProcessor", "BertTokenizer", "BertImageProcessor", "BertFeatureExtractor")
+def _all_filled_hub_result() -> _HubConfigResult:
+    """Strategy 1 returns all four processor types + preprocessor_config present."""
+    return _HubConfigResult(
+        processor_class="BertProcessor",
+        tokenizer_class="BertTokenizer",
+        image_processor_class="BertImageProcessor",
+        feature_extractor_class="BertFeatureExtractor",
+        has_preprocessor_config=True,
+    )
 
 
 class TestResolveProcessorStrategy2Gating:
@@ -56,7 +63,13 @@ def test_strategy2_skipped_when_all_fields_filled_by_strategy1(self) -> None:
     def test_strategy2_called_with_per_field_flags(self) -> None:
         """Only the fields still missing after Strategy 1 should have try_*=True."""
         # Strategy 1 fills only image_processor and feature_extractor.
-        hub_result = (None, None, "ConvNextImageProcessor", "ConvNextFeatureExtractor")
+        hub_result = _HubConfigResult(
+            processor_class=None,
+            tokenizer_class=None,
+            image_processor_class="ConvNextImageProcessor",
+            feature_extractor_class="ConvNextFeatureExtractor",
+            has_preprocessor_config=True,
+        )
 
         with (
             patch(
@@ -79,10 +92,17 @@ def test_strategy2_called_with_per_field_flags(self) -> None:
 
     def test_strategy2_runs_when_nothing_filled(self) -> None:
         """Empty Strategy-1 result → Strategy 2 runs with every flag True."""
+        empty_hub_result = _HubConfigResult(
+            processor_class=None,
+            tokenizer_class=None,
+            image_processor_class=None,
+            feature_extractor_class=None,
+            has_preprocessor_config=True,
+        )
         with (
             patch(
                 "winml.modelkit.inspect.resolver._resolve_processor_from_hub_configs",
-                return_value=(None, None, None, None),
+                return_value=empty_hub_result,
             ),
             # Block Strategy 0 (HF registry) by passing no model_type below
             patch(
@@ -101,6 +121,46 @@ def test_strategy2_runs_when_nothing_filled(self) -> None:
         assert info.processor_class == "P"
         assert info.feature_extractor_class == "F"
 
+    def test_missing_preprocessor_config_skips_image_and_feature(self) -> None:
+        """preprocessor_config.json absent → skip AutoImageProcessor & AutoFeatureExtractor.
+
+        Text-only models (RoBERTa, BERT, GPT, ...) don't ship a
+        preprocessor_config.json. Without this gate, Strategy 2 spends
+        ~2s confirming 404s for both AutoImageProcessor and
+        AutoFeatureExtractor. The hub_configs helper now reports the
+        file's existence so the caller can skip those lookups.
+        """
+        hub_result = _HubConfigResult(
+            processor_class=None,
+            tokenizer_class=None,
+            image_processor_class=None,
+            feature_extractor_class=None,
+            has_preprocessor_config=False,
+        )
+
+        with (
+            patch(
+                "winml.modelkit.inspect.resolver._resolve_processor_from_hub_configs",
+                return_value=hub_result,
+            ),
+            patch(
+                "winml.modelkit.inspect.resolver._resolve_processor_from_auto_classes",
+                return_value=(None, None, None, None),
+            ) as mock_auto,
+        ):
+            resolve_processor("text/model")
+
+        assert mock_auto.call_count == 1
+        kwargs = mock_auto.call_args.kwargs
+        assert kwargs["try_processor"] is True
+        assert kwargs["try_tokenizer"] is True
+        assert kwargs["try_image_processor"] is False, (
+            "Must skip AutoImageProcessor when preprocessor_config.json is absent"
+        )
+        assert kwargs["try_feature_extractor"] is False, (
+            "Must skip AutoFeatureExtractor when preprocessor_config.json is absent"
+        )
+
 
 class TestAutoProcessorGatedOnTryProcessor:
     """When ``try_processor=False`` we skip AutoProcessor entirely.
@@ -164,3 +224,163 @@ def test_try_processor_true_still_calls_autoprocessor(self) -> None:
             )
 
         assert mock_ap.call_count == 1
+
+
+class TestAutoProcessorLeafClassDetection:
+    """``AutoProcessor.from_pretrained`` may return a leaf processor.
+
+    For text-only models (RoBERTa, BERT, ...) ``AutoProcessor`` returns
+    the tokenizer directly — e.g. ``RobertaTokenizerFast``. Without
+    recognising this we would re-load the tokenizer via the standalone
+    ``AutoTokenizer.from_pretrained`` below at ~2s of extra cost.
+    """
+
+    @staticmethod
+    def _make_leaf_instance(class_name: str) -> object:
+        """Build an instance whose ``type(obj).__name__`` is ``class_name``.
+
+        Plain instance — no ``.tokenizer`` / ``.image_processor`` /
+        ``.feature_extractor`` attributes — so the leaf-class detection
+        branch is what matches.
+        """
+        return type(class_name, (), {})()
+
+    def test_autoprocessor_returns_tokenizer_fills_tokenizer_class(self) -> None:
+        """When AutoProcessor returns a *Tokenizer*, tokenizer_class is populated
+        and standalone AutoTokenizer is NOT called.
+        """
+        fake = self._make_leaf_instance("RobertaTokenizerFast")
+
+        with (
+            patch("transformers.AutoProcessor.from_pretrained", return_value=fake),
+            patch("transformers.AutoTokenizer.from_pretrained") as mock_at,
+            patch("transformers.AutoImageProcessor.from_pretrained"),
+            patch("transformers.AutoFeatureExtractor.from_pretrained"),
+        ):
+            proc, tok, _img, _feat = _resolve_processor_from_auto_classes(
+                "some/text-model",
+                try_processor=True,
+                try_tokenizer=True,
+                try_image_processor=False,
+                try_feature_extractor=False,
+            )
+
+        assert proc == "RobertaTokenizerFast"
+        assert tok == "RobertaTokenizerFast"
+        assert mock_at.call_count == 0, (
+            "Standalone AutoTokenizer must be skipped when AutoProcessor "
+            "already returned a *Tokenizer* leaf class"
+        )
+
+    def test_autoprocessor_returns_image_processor_fills_image_class(self) -> None:
+        """AutoProcessor returning a *ImageProcessor* fills image_processor_class."""
+        fake = self._make_leaf_instance("ConvNextImageProcessor")
+
+        with (
+            patch("transformers.AutoProcessor.from_pretrained", return_value=fake),
+            patch("transformers.AutoImageProcessor.from_pretrained") as mock_aip,
+        ):
+            proc, _, img, _ = _resolve_processor_from_auto_classes(
+                "some/vision-model",
+                try_processor=True,
+                try_tokenizer=False,
+                try_image_processor=True,
+                try_feature_extractor=False,
+            )
+
+        assert proc == "ConvNextImageProcessor"
+        assert img == "ConvNextImageProcessor"
+        assert mock_aip.call_count == 0
+
+    def test_autoprocessor_returns_feature_extractor_fills_feature_class(self) -> None:
+        """AutoProcessor returning a *FeatureExtractor* fills feature_extractor_class."""
+        fake = self._make_leaf_instance("Wav2Vec2FeatureExtractor")
+
+        with (
+            patch("transformers.AutoProcessor.from_pretrained", return_value=fake),
+            patch("transformers.AutoFeatureExtractor.from_pretrained") as mock_afe,
+        ):
+            proc, _, _, feat = _resolve_processor_from_auto_classes(
+                "some/audio-model",
+                try_processor=True,
+                try_tokenizer=False,
+                try_image_processor=False,
+                try_feature_extractor=True,
+            )
+
+        assert proc == "Wav2Vec2FeatureExtractor"
+        assert feat == "Wav2Vec2FeatureExtractor"
+        assert mock_afe.call_count == 0
+
+    def test_autoprocessor_with_wrapped_pieces_uses_attributes(self) -> None:
+        """Multimodal AutoProcessor (real ProcessorMixin) wins over name suffix."""
+
+        class CLIPTokenizer:
+            pass
+
+        class CLIPProcessor:
+            def __init__(self) -> None:
+                self.tokenizer = CLIPTokenizer()
+
+        with (
+            patch(
+                "transformers.AutoProcessor.from_pretrained",
+                return_value=CLIPProcessor(),
+            ),
+            patch("transformers.AutoTokenizer.from_pretrained") as mock_at,
+        ):
+            proc, tok, _, _ = _resolve_processor_from_auto_classes(
+                "openai/clip-vit-base-patch32",
+                try_processor=True,
+                try_tokenizer=True,
+                try_image_processor=False,
+                try_feature_extractor=False,
+            )
+
+        assert proc == "CLIPProcessor"
+        assert tok == "CLIPTokenizer"
+        assert mock_at.call_count == 0
+
+    def test_unrecognized_leaf_falls_through_to_standalone_autos(self) -> None:
+        """Leaf class with an unrecognized suffix → standalone Auto* fills the gaps.
+
+        ``SpeechT5Processor`` ends in ``Processor`` but none of the
+        ``_is_tokenizer_class_name`` / ``_is_image_processor_class_name`` /
+        ``_is_feature_extractor_class_name`` heuristics match it. The
+        leaf-class shortcut leaves ``tokenizer_class`` / ``image_processor_class``
+        / ``feature_extractor_class`` as ``None``, and the standalone
+        ``Auto*`` calls below fill them in — documenting the graceful
+        fallback when the suffix heuristic does not match.
+        """
+        fake = self._make_leaf_instance("SpeechT5Processor")
+
+        fake_tok = type("SomeTokenizer", (), {})()
+        fake_feat = type("SomeFeatureExtractor", (), {})()
+
+        with (
+            patch("transformers.AutoProcessor.from_pretrained", return_value=fake),
+            patch(
+                "transformers.AutoTokenizer.from_pretrained",
+                return_value=fake_tok,
+            ) as mock_at,
+            patch("transformers.AutoImageProcessor.from_pretrained") as mock_aip,
+            patch(
+                "transformers.AutoFeatureExtractor.from_pretrained",
+                return_value=fake_feat,
+            ) as mock_afe,
+        ):
+            proc, tok, _img, feat = _resolve_processor_from_auto_classes(
+                "microsoft/speecht5_tts",
+                try_processor=True,
+                try_tokenizer=True,
+                try_image_processor=False,
+                try_feature_extractor=True,
+            )
+
+        assert proc == "SpeechT5Processor"
+        # Suffix didn't match any leaf-class heuristic → standalone Auto* must run.
+        assert tok == "SomeTokenizer"
+        assert feat == "SomeFeatureExtractor"
+        assert mock_at.call_count == 1
+        assert mock_aip.call_count == 0  # gated off by try_image_processor=False
+        assert mock_afe.call_count == 1
diff --git a/tests/unit/session/test_winml_session.py b/tests/unit/session/test_winml_session.py
index a260208c2..30901c40e 100644
--- a/tests/unit/session/test_winml_session.py
+++ b/tests/unit/session/test_winml_session.py
@@ -296,7 +296,7 @@ def test_precision_int8_from_qdq(self, tmp_path: Path):
         path = self._save(model, tmp_path / "qdq_int8.onnx")
 
         session = WinMLSession(onnx_path=path, device="auto")
-        assert session.io_config["precision"] == "int8"
+        assert session.io_config["precision"] == "w8a8"
 
     def test_precision_w8a16_mixed_qdq(self, tmp_path: Path):
         """Activation quantized to uint16 + weight to int8 → 'w8a16'."""
@@ -350,6 +350,72 @@ def test_precision_w8a16_mixed_qdq(self, tmp_path: Path):
         session = WinMLSession(onnx_path=path, device="auto")
         assert session.io_config["precision"] == "w8a16"
 
+    def test_precision_int8_ignores_int32_bias_zp(self, tmp_path: Path):
+        """INT32 bias DQ on the weight side must not poison the label.
+
+        Mirrors the NPU-quantized ResNet-50 case: every Conv has an
+        INT8-weight DQ alongside an INT32-bias DQ. The bias is a quant
+        accumulator, not a weight, so it must be excluded from weight-side
+        bit-width counting; otherwise the result becomes 'w32a8'.
+        """
+        import numpy as np
+        from onnx import TensorProto, helper
+
+        a = helper.make_tensor_value_info("A", TensorProto.FLOAT, [1, 4])
+        c = helper.make_tensor_value_info("C", TensorProto.FLOAT, [1, 4])
+
+        # Activation Q→DQ with UINT8 zero_point
+        a_scale = helper.make_tensor("A_scale", TensorProto.FLOAT, [], [0.05])
+        a_zp = helper.make_tensor(
+            "A_zp", TensorProto.UINT8, [], np.array([0], dtype=np.uint8).tobytes(), raw=True
+        )
+        q_act = helper.make_node("QuantizeLinear", ["A", "A_scale", "A_zp"], ["A_q"], name="q_act")
+        dq_act = helper.make_node(
+            "DequantizeLinear", ["A_q", "A_scale", "A_zp"], ["A_d"], name="dq_act"
+        )
+
+        # Weight DQ with INT8 zero_point (initializer → weight side)
+        w_q = helper.make_tensor(
+            "W_q",
+            TensorProto.INT8,
+            [4, 4],
+            np.zeros((4, 4), dtype=np.int8).tobytes(),
+            raw=True,
+        )
+        w_scale = helper.make_tensor("W_scale", TensorProto.FLOAT, [], [0.1])
+        w_zp = helper.make_tensor(
+            "W_zp", TensorProto.INT8, [], np.array([0], dtype=np.int8).tobytes(), raw=True
+        )
+        dq_w = helper.make_node("DequantizeLinear", ["W_q", "W_scale", "W_zp"], ["W"], name="dq_w")
+
+        # Bias DQ with INT32 zero_point (initializer → would be classified
+        # weight-side; this is the node that previously poisoned the label).
+        b_q = helper.make_tensor(
+            "B_q", TensorProto.INT32, [4], np.zeros(4, dtype=np.int32).tobytes(), raw=True
+        )
+        b_scale = helper.make_tensor("B_scale", TensorProto.FLOAT, [], [0.005])
+        b_zp = helper.make_tensor(
+            "B_zp", TensorProto.INT32, [], np.array([0], dtype=np.int32).tobytes(), raw=True
+        )
+        dq_b = helper.make_node("DequantizeLinear", ["B_q", "B_scale", "B_zp"], ["B"], name="dq_b")
+
+        matmul = helper.make_node("MatMul", ["A_d", "W"], ["MM"], name="mm")
+        add = helper.make_node("Add", ["MM", "B"], ["C"], name="add_bias")
+
+        graph = helper.make_graph(
+            [q_act, dq_act, dq_w, dq_b, matmul, add],
+            "qdq_with_int32_bias",
+            [a],
+            [c],
+            [a_scale, a_zp, w_q, w_scale, w_zp, b_q, b_scale, b_zp],
+        )
+        model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
+        model.ir_version = 7
+        path = self._save(model, tmp_path / "qdq_int32_bias.onnx")
+
+        session = WinMLSession(onnx_path=path, device="auto")
+        assert session.io_config["precision"] == "w8a8"
+
     def test_precision_matmulnbits_w4a16(self, tmp_path: Path):
         """MatMulNBits with bits=4 + fp16 initializers → 'w4a16'."""
         import numpy as np

From a30d5a9cb61962c970b36b2aaa107502412b8653 Mon Sep 17 00:00:00 2001
From: fangyangci <133664123+fangyangci@users.noreply.github.com>
Date: Wed, 27 May 2026 14:48:34 +0800
Subject: [PATCH 008/143] fix integration test(only unit test
 _skip_winml_ep_init) (#760)

only unit test _skip_winml_ep_init
---
 tests/conftest.py                                      | 10 +++++++---
 .../integration/analyze/runtime_checker/test_helper.py |  2 ++
 .../analyze/runtime_checker/test_reshape_openvino.py   |  2 --
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/tests/conftest.py b/tests/conftest.py
index 5ebf7250d..d46583756 100644
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -27,7 +27,9 @@
 # winml.modelkit.models.winml.base imports WinMLSession at module level,
 # which triggers WinMLEPRegistry._discover_eps() → WinML SDK runtime init.
 # This can hang on CI environments without the SDK installed.
-# Mock it globally for non-e2e tests; e2e tests use real initialization.
+# Apply this guard to CI-covered fast suites (unit/regression/cli) so
+# WinML SDK runtime initialization does not add environment-dependent
+# hangs/flakiness. Integration and e2e tests use real initialization.
 
 
 @pytest.fixture(autouse=True)
@@ -67,8 +69,10 @@ def _reset_telemetry_singleton():
 
 @pytest.fixture(autouse=True)
 def _skip_winml_ep_init(request: pytest.FixtureRequest, monkeypatch: pytest.MonkeyPatch) -> None:
-    """Mock WinML EP initialization for non-e2e tests."""
-    if "e2e" in {m.name for m in request.node.iter_markers()}:
+    """Mock WinML EP initialization for unit/regression/cli tests."""
+    nodeid = request.node.nodeid.replace("\\", "/")
+    # Keep CI fast/stable for the matrix groups that run by default.
+    if not nodeid.startswith(("tests/unit/", "tests/regression/", "tests/cli/")):
         return
     try:
         from winml.modelkit.session import ep_registry as ep_registry_mod
diff --git a/tests/integration/analyze/runtime_checker/test_helper.py b/tests/integration/analyze/runtime_checker/test_helper.py
index 11d5a19ea..90d6de830 100644
--- a/tests/integration/analyze/runtime_checker/test_helper.py
+++ b/tests/integration/analyze/runtime_checker/test_helper.py
@@ -119,6 +119,8 @@ def should_run_ep_test(ep_name: str, device_type, skip_message: str | None = Non
     """Determine if EP test should run."""
     # Run if hardware is available
     try:
+        from winml.modelkit import winml
+        winml.register_execution_providers(ort=True)
         import onnxruntime as ort
 
         ep_devices = ort.get_ep_devices()
diff --git a/tests/integration/analyze/runtime_checker/test_reshape_openvino.py b/tests/integration/analyze/runtime_checker/test_reshape_openvino.py
index a0623f019..943126283 100644
--- a/tests/integration/analyze/runtime_checker/test_reshape_openvino.py
+++ b/tests/integration/analyze/runtime_checker/test_reshape_openvino.py
@@ -10,13 +10,11 @@
     reshape_quick_helper,
     should_run_ep_test,
 )
-from winml.modelkit import winml
 from winml.modelkit.analyze.runtime_checker.ep_checker import EPChecker
 
 
 def _require_openvino_device(device_type: ort.OrtHardwareDeviceType, skip_message: str) -> None:
     should_run_ep_test("OpenVINOExecutionProvider", device_type, skip_message)
-    winml.register_execution_providers(ort=True)
 
 
 # don't use EPChecker directly as there is a bug with pytest in subprocess

From 04b74aeef3be7489da458682f1ec49ffdd772415 Mon Sep 17 00:00:00 2001
From: Zhipeng Wang <zhiwang@microsoft.com>
Date: Wed, 27 May 2026 15:48:02 +0800
Subject: [PATCH 009/143] Remove WindowsAppRuntimeVersion from SysInfo (#761)

## Summary

- Drop the `WindowsAppRuntimeVersion` class, attribute, property, and
`windowsAppRuntimeVersion` field in `SysInfo.to_dict()` from
`src/winml/modelkit/sysinfo/sysinfo.py`.
- Remove the now-unused `import re`.

Nothing else in the codebase referenced these symbols. Integration
`runtime_checker` fixtures still contain the field inside their stored
`sys_info` blob, but the test helper ignores `sys_info` during
comparison, and the field will disappear naturally next time those
fixtures are regenerated.
---
 src/winml/modelkit/sysinfo/sysinfo.py | 38 ---------------------------
 1 file changed, 38 deletions(-)

diff --git a/src/winml/modelkit/sysinfo/sysinfo.py b/src/winml/modelkit/sysinfo/sysinfo.py
index f29a505c5..f77336480 100644
--- a/src/winml/modelkit/sysinfo/sysinfo.py
+++ b/src/winml/modelkit/sysinfo/sysinfo.py
@@ -2,41 +2,10 @@
 # Copyright (c) Microsoft Corporation. All rights reserved.
 # Licensed under the MIT License.
 # --------------------------------------------------------------------------
-import re
-
 from .hardware import CPU, GPU, NPU, RAM
 from .software import OS, EPPackage, PipPackage, PythonRuntime
 
 
-class WindowsAppRuntimeVersion:
-    """Represents the Windows App Runtime version from pip packages."""
-
-    _package_name_suffix = "-Microsoft.Windows.ApplicationModel.DynamicDependency.Bootstrap"
-
-    def __init__(self, pip_packages: list[PipPackage]) -> None:
-        """Initialize Windows App Runtime version from pip packages."""
-        version = None
-        for package in pip_packages:
-            if package.name.endswith(self._package_name_suffix):
-                version = package.version
-                break
-        if version is None:
-            raise ValueError(
-                f"Package ending with '{self._package_name_suffix}' not found in pip packages."
-            )
-        version = re.sub(r"^\d+!", "", version)
-        version = re.sub(r"\.dev(\d+)", r"-experimental\1", version)
-        # .dev0 are converted from -experimental packages instead of -experimental0
-        if version.endswith("-experimental0"):
-            version = version[:-1]
-        self._version = version
-
-    @property
-    def version(self) -> str:
-        """Windows App Runtime version."""
-        return self._version
-
-
 class SysInfo:
     """Comprehensive system information collector."""
 
@@ -50,7 +19,6 @@ def __init__(self) -> None:
         self._python_runtime = PythonRuntime.get()
         self._pip_packages = PipPackage.get_all()
         self._ep_packages = EPPackage.get_all()
-        self._windows_app_runtime_version = WindowsAppRuntimeVersion(self._pip_packages)
 
     @property
     def cpu_list(self) -> list[CPU]:
@@ -92,11 +60,6 @@ def ep_packages(self) -> list[EPPackage]:
         """List of execution provider packages."""
         return self._ep_packages
 
-    @property
-    def windows_app_runtime_version(self) -> WindowsAppRuntimeVersion:
-        """Windows App Runtime version."""
-        return self._windows_app_runtime_version
-
     def to_dict(self) -> dict:
         """Convert all system information to a dictionary."""
         return {
@@ -108,5 +71,4 @@ def to_dict(self) -> dict:
             "pythonRuntime": self._python_runtime.to_dict(),
             "pipPackages": [pkg.to_dict() for pkg in self._pip_packages],
             "epPackages": [pkg.to_dict() for pkg in self._ep_packages],
-            "windowsAppRuntimeVersion": self._windows_app_runtime_version.version,
         }

From 9feca073684805595b16f95821bd7154af6fe855 Mon Sep 17 00:00:00 2001
From: "Qiong Wu (qiowu)" <qiowu@microsoft.com>
Date: Wed, 27 May 2026 16:25:37 +0800
Subject: [PATCH 010/143] fix: move VitisAI EP to last in ordering and fit
 catalog table width (#763)

## Summary
- **VitisAI EP ordering**: Move `VitisAIExecutionProvider` to end of
`EP_SUPPORTED_DEVICES` so it appears last in `analyze --ep all` output,
since it is not yet fully supported.
- **Catalog table width**: Set `expand=False` on both `Table` and
`Panel` in `_build_list_renderable` so the catalog table fits its
content width instead of stretching to the full terminal width.
---
 src/winml/modelkit/commands/catalog.py         | 11 ++++++-----
 src/winml/modelkit/utils/constants.py          |  7 ++++---
 tests/unit/analyze/test_static_analyzer_cli.py |  2 +-
 3 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/src/winml/modelkit/commands/catalog.py b/src/winml/modelkit/commands/catalog.py
index 661bb564f..4c0001f88 100644
--- a/src/winml/modelkit/commands/catalog.py
+++ b/src/winml/modelkit/commands/catalog.py
@@ -275,14 +275,14 @@ def _build_list_renderable(
         header_style="bold",
         padding=(0, 2),
         show_edge=False,
-        expand=True,
+        expand=False,
     )
-    table.add_column("Model", ratio=4, overflow="fold")
-    table.add_column("Task", ratio=2, overflow="fold")
+    table.add_column("Model", overflow="fold")
+    table.add_column("Task", overflow="fold")
     table.add_column("Size", no_wrap=True, justify="right", width=5)
-    table.add_column("Model Type", ratio=1, overflow="fold")
+    table.add_column("Model Type", overflow="fold")
     if ep_col_header is not None:
-        table.add_column(ep_col_header, ratio=2, overflow="fold")
+        table.add_column(ep_col_header, overflow="fold")
 
     for m in models:
         color = _type_color(m["model_type"])
@@ -302,6 +302,7 @@ def _build_list_renderable(
         f"[bold cyan]{len(models)}[/bold cyan] validated model(s)",
         border_style="blue",
         padding=(0, 1),
+        expand=False,
     )
     return Group(panel)
 
diff --git a/src/winml/modelkit/utils/constants.py b/src/winml/modelkit/utils/constants.py
index 1ddc9c149..b62e9e4c1 100644
--- a/src/winml/modelkit/utils/constants.py
+++ b/src/winml/modelkit/utils/constants.py
@@ -173,17 +173,18 @@ def extract_ep_options(kwargs: dict) -> dict[str, str]:
 #
 # Iteration order also feeds ``sysinfo.device._DEVICE_EP_MAP`` (and therefore
 # ``resolve_eps``): the per-device priority is **IHV-first, native-last**
-# (Nvidia -> AMD -> Qualcomm -> Intel -> Microsoft -> CPU), so the keys are
-# listed in that order rather than alphabetically.
+# (Nvidia -> AMD -> Qualcomm -> Intel -> Microsoft -> CPU -> Vitis), so the
+# keys are listed in that order rather than alphabetically.
+# VitisAI is placed last because it is not yet fully supported.
 EP_SUPPORTED_DEVICES: dict[EPName, tuple[str, ...]] = {
     "NvTensorRTRTXExecutionProvider": ("gpu",),
     "CUDAExecutionProvider": ("gpu",),
     "MIGraphXExecutionProvider": ("gpu",),
-    "VitisAIExecutionProvider": ("npu",),
     "QNNExecutionProvider": ("npu", "gpu"),
     "OpenVINOExecutionProvider": ("npu", "gpu", "cpu"),
     "DmlExecutionProvider": ("gpu",),
     "CPUExecutionProvider": ("cpu",),
+    "VitisAIExecutionProvider": ("npu",),
 }
 
 # Device string to ORT device type mapping
diff --git a/tests/unit/analyze/test_static_analyzer_cli.py b/tests/unit/analyze/test_static_analyzer_cli.py
index 24546f5b9..6227a82ff 100644
--- a/tests/unit/analyze/test_static_analyzer_cli.py
+++ b/tests/unit/analyze/test_static_analyzer_cli.py
@@ -1059,7 +1059,6 @@ class TestAnalyzeEPDeviceSelectionMatrix:
                     ("NvTensorRTRTXExecutionProvider", "GPU"),
                     ("CUDAExecutionProvider", "GPU"),
                     ("MIGraphXExecutionProvider", "GPU"),
-                    ("VitisAIExecutionProvider", "NPU"),
                     ("QNNExecutionProvider", "NPU"),
                     ("QNNExecutionProvider", "GPU"),
                     ("OpenVINOExecutionProvider", "NPU"),
@@ -1067,6 +1066,7 @@ class TestAnalyzeEPDeviceSelectionMatrix:
                     ("OpenVINOExecutionProvider", "CPU"),
                     ("DmlExecutionProvider", "GPU"),
                     ("CPUExecutionProvider", "CPU"),
+                    ("VitisAIExecutionProvider", "NPU"),
                 ],
                 None,
             ),

From 5bdb1fba03892e235448658550eefeb56e3290d7 Mon Sep 17 00:00:00 2001
From: xieofxie <xieofxie@126.com>
Date: Thu, 28 May 2026 15:32:24 +0800
Subject: [PATCH 011/143] example: add readme and example.py for
 microsoft/table-transformer-detection (#779)

Also update scripts/e2e_eval/run_pytorch_baseline.py to include pytorch
model latency

---------

Co-authored-by: hualxie <hualxie@microsoft.com>
---
 .../README.md                                 | 152 ++++++++++++
 .../example.py                                | 216 ++++++++++++++++++
 scripts/e2e_eval/run_pytorch_baseline.py      | 120 ++++++++++
 3 files changed, 488 insertions(+)
 create mode 100644 examples/microsoft-table-transformer-detection/README.md
 create mode 100644 examples/microsoft-table-transformer-detection/example.py

diff --git a/examples/microsoft-table-transformer-detection/README.md b/examples/microsoft-table-transformer-detection/README.md
new file mode 100644
index 000000000..2df45d3b9
--- /dev/null
+++ b/examples/microsoft-table-transformer-detection/README.md
@@ -0,0 +1,152 @@
+# microsoft/table-transformer-detection
+
+End-to-end build + accuracy walkthrough for `microsoft/table-transformer-detection`
+(task: `object-detection`) on the NPU, using the
+PubTables-1M detection validation split as the dataset.
+
+Run all commands from the `ModelKit` repo root.
+
+---
+
+## 1. Build the model on NPU
+
+Two steps: `winml config` generates a build config JSON, then `winml build`
+consumes it. `--precision w8a16` is the default NPU precision; the build
+produces a QDQ-quantized ONNX that executes on the NPU.
+
+```powershell
+winml config `
+  -m microsoft/table-transformer-detection `
+  --task object-detection `
+  --device npu `
+  --ep openvino `
+  --precision w8a16 `
+  -o build_config.json
+```
+
+```powershell
+winml build `
+  -c build_config.json `
+  -m microsoft/table-transformer-detection `
+  --device npu `
+  --ep openvino `
+  --use-cache
+```
+
+Artifacts land under
+`~/.cache/winml/artifacts/microsoft_table-transformer-detection/` — the file
+to evaluate is `objdet_*_quantized.onnx`.
+
+---
+
+## 2. Evaluate on NPU with `winml eval`
+
+The PubTables-1M dataset must exist on disk first. Build it once:
+
+```powershell
+uv run python scripts/e2e_eval/datasets/build_pubtables1m_detection.py `
+  --output $HOME/.cache/winml/eval_datasets/build_pubtables1m_detection
+```
+
+Then run `winml eval` against the quantized ONNX produced in step 1. Pass the
+ONNX file to `-m` and the HuggingFace model ID to `--model-id` (needed for
+the preprocessor / postprocessor). `--output` writes a JSON file containing
+the parsed metrics:
+
+```powershell
+winml eval `
+  -m $HOME/.cache/winml/artifacts/microsoft_table-transformer-detection/objdet_<hash>_quantized.onnx `
+  --model-id microsoft/table-transformer-detection `
+  --task object-detection `
+  --device npu `
+  --ep openvino `
+  --dataset $HOME/.cache/winml/eval_datasets/build_pubtables1m_detection `
+  --split validation `
+  --samples 1000 `
+  --column annotation_column=objects `
+  --column bbox_key=bbox `
+  --column category_key=category `
+  --column box_format=xyxy `
+  --output winml_eval_output.json
+```
+
+Replace `<hash>` with the actual filename produced by step 1.
+
+The mAP value is `metrics.map` inside `winml_eval_output.json`.
+
+---
+
+## 3. Measure latency with `winml perf`
+
+`winml perf` benchmarks the quantized ONNX directly using random inputs
+derived from the model's I/O configuration. Point `-m` at the same
+`*_quantized.onnx` produced in step 1. `--warmup` iterations are excluded
+from the statistics; `--iterations` is the measured sample count.
+
+```powershell
+winml perf `
+  -m $HOME/.cache/winml/artifacts/microsoft_table-transformer-detection/objdet_<hash>_quantized.onnx `
+  --device npu `
+  --ep openvino `
+  --warmup 10 `
+  --iterations 100 `
+  -o winml_perf_output.json
+```
+
+The output JSON contains `latency_ms` (`mean`, `min`, `max`, `p50`, `p90`,
+`p95`, `p99`, `std`) and `throughput` (`samples_per_sec`, `batches_per_sec`).
+Mean and p50 latency are the headline numbers; report them alongside the
+device and precision used.
+
+---
+
+## 4. Evaluate the original PyTorch model
+
+`run_pytorch_baseline.py` loads the HuggingFace checkpoint with native PyTorch
+on CPU and emits the same metric so the two runs are directly comparable. The
+last stdout line is a single JSON object:
+`{"metric": "map", "value": <float>, "num_samples": <int>}`.
+
+Pass `--perf-iterations N` (and optionally `--perf-warmup K`, default `10`) to
+also measure PyTorch inference latency. When `N > 0`, the script reuses the
+HuggingFace pipeline on the first dataset sample, runs `K` untimed warmup
+iterations, then `N` timed iterations, and emits a latency JSON line on
+stdout immediately before the metric line. The metric line is still the
+final stdout line.
+
+```powershell
+$columnsMapping = '{"annotation_column":"objects","bbox_key":"bbox","category_key":"category","box_format":"xyxy"}'
+
+uv run python scripts/e2e_eval/run_pytorch_baseline.py `
+  --model microsoft/table-transformer-detection `
+  --task object-detection `
+  --device cpu `
+  --num-samples 1000 `
+  --dataset $HOME/.cache/winml/eval_datasets/build_pubtables1m_detection `
+  --split validation `
+  --columns-mapping $columnsMapping `
+  --winml-metric-key map `
+  --perf-warmup 10 `
+  --perf-iterations 100
+```
+
+The latency JSON line has the same `mean_ms` / `min_ms` / `max_ms` /
+`p50_ms` / `p90_ms` / `p95_ms` / `p99_ms` keys as `winml perf` so the two
+runs can be compared directly.
+
+---
+
+## 5. Comparing the results
+
+For WinML, the accuracy value comes from `metrics.map` in
+`winml_eval_output.json` while for the PyTorch baseline, it comes from the
+last stdout line. Latency comes from `latency_ms` in `winml_perf_output.json`
+for WinML and from the latency JSON line on stdout for the PyTorch baseline.
+
+Result on CPU Intel(R) Core(TM) Ultra 7 258V:
+
+| Model | Device | Precision | mAP | mean latency (ms) | p50 latency (ms) | Size (MB) |
+|---|---|---|---|---|---|---|
+| PyTorch | CPU | fp32 | 0.988714 | 620.859 | 600.336 | 115 |
+| WinML (ONNX) | OpenVINO NPU | w8a16 (QDQ) | 0.9822 | 44.09 | 41.60 | 58 |
+| WinML (ONNX) | OpenVINO CPU | fp32 | 0.9814 | 33.99 | 30.38 | 110 |
diff --git a/examples/microsoft-table-transformer-detection/example.py b/examples/microsoft-table-transformer-detection/example.py
new file mode 100644
index 000000000..cea67b448
--- /dev/null
+++ b/examples/microsoft-table-transformer-detection/example.py
@@ -0,0 +1,216 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""Run one inference with the WinML-built ONNX and print detections.
+
+Mirrors the HuggingFace ``TableTransformerForObjectDetection`` example
+(https://huggingface.co/docs/transformers/main/en/model_doc/table-transformer)
+but loads the quantized ONNX produced by ``winml build`` (step 1 of the
+README) via :class:`WinMLAutoModel` instead of the original PyTorch
+checkpoint.
+
+Usage::
+
+    uv run python examples/microsoft-table-transformer-detection/example.py `
+      --onnx $HOME/.cache/winml/artifacts/microsoft_table-transformer-detection/`
+            `objdet_<hash>_quantized.onnx
+"""
+
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+import torch
+from huggingface_hub import hf_hub_download
+from PIL import Image, ImageDraw, ImageFont
+from transformers import AutoConfig, AutoImageProcessor
+
+from winml.modelkit import WinMLAutoModel
+
+
+HF_MODEL_ID = "microsoft/table-transformer-detection"
+
+
+def parse_args() -> argparse.Namespace:
+    """Parse command-line arguments."""
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument(
+        "--onnx",
+        required=True,
+        type=Path,
+        help="Path to the quantized ONNX produced by step 1 of the README "
+        "(e.g. objdet_<hash>_quantized.onnx).",
+    )
+    parser.add_argument(
+        "--device",
+        default="npu",
+        choices=["auto", "npu", "gpu", "cpu"],
+        help="Target device (default: npu).",
+    )
+    parser.add_argument(
+        "--ep",
+        default="openvino",
+        help="Execution provider alias (default: openvino).",
+    )
+    parser.add_argument(
+        "--threshold",
+        type=float,
+        default=0.9,
+        help="Detection confidence threshold (default: 0.9).",
+    )
+    parser.add_argument(
+        "--image",
+        type=Path,
+        default=None,
+        help="Local image path. If omitted, downloads the example PDF page "
+        "from the nielsr/example-pdf HuggingFace dataset (same image as "
+        "the HF docs example).",
+    )
+    parser.add_argument(
+        "--output",
+        type=Path,
+        default=Path("detections.png"),
+        help="Where to write the annotated image (default: detections.png "
+        "in the current directory).",
+    )
+    return parser.parse_args()
+
+
+def draw_detections(
+    image: Image.Image,
+    results: dict,
+    id2label: dict[int, str],
+) -> Image.Image:
+    """Draw bounding boxes and labels on a copy of ``image``."""
+    annotated = image.copy()
+    draw = ImageDraw.Draw(annotated)
+
+    try:
+        font = ImageFont.truetype("arial.ttf", size=max(12, annotated.height // 60))
+    except OSError:
+        font = ImageFont.load_default()
+
+    palette = [
+        (220, 38, 38), (34, 197, 94), (59, 130, 246), (234, 179, 8),
+        (168, 85, 247), (236, 72, 153), (20, 184, 166), (249, 115, 22),
+    ]
+
+    for score, label, box in zip(
+        results["scores"], results["labels"], results["boxes"], strict=True,
+    ):
+        x0, y0, x1, y1 = (round(v, 2) for v in box.tolist())
+        label_id = label.item()
+        color = palette[label_id % len(palette)]
+
+        draw.rectangle([(x0, y0), (x1, y1)], outline=color, width=3)
+
+        caption = f"{id2label[label_id]} {score.item():.2f}"
+        text_bbox = draw.textbbox((x0, y0), caption, font=font)
+        tx0, ty0, tx1, ty1 = text_bbox
+        # Anchor the caption above the box; flip below if it would clip.
+        height = ty1 - ty0
+        if ty0 - height < 0:
+            ty0, ty1 = y0, y0 + height
+        else:
+            ty0, ty1 = y0 - height, y0
+        draw.rectangle([(tx0, ty0), (tx1, ty1)], fill=color)
+        draw.text((tx0, ty0), caption, fill="white", font=font)
+
+    return annotated
+
+
+def load_image(image_arg: Path | None) -> Image.Image:
+    """Load the input image from disk or download the HF docs sample."""
+    if image_arg is not None:
+        return Image.open(image_arg.expanduser()).convert("RGB")
+    sample_path = hf_hub_download(
+        repo_id="nielsr/example-pdf",
+        repo_type="dataset",
+        filename="example_pdf.png",
+    )
+    return Image.open(sample_path).convert("RGB")
+
+
+def main() -> None:
+    """Load the quantized ONNX, run one inference, print detections."""
+    args = parse_args()
+
+    image = load_image(args.image)
+
+    # HF processor handles resize/normalize and supplies post-processing.
+    image_processor = AutoImageProcessor.from_pretrained(HF_MODEL_ID)
+
+    # skip_build=True uses the ONNX as-is; it has already been optimized
+    # and quantized by `winml build`. use_cache=False avoids touching the
+    # winml artifact cache for this read-only example.
+    model = WinMLAutoModel.from_pretrained(
+        args.onnx.expanduser(),
+        task="object-detection",
+        device=args.device,
+        ep=args.ep,
+        skip_build=True,
+        use_cache=False,
+    )
+
+    # Match the processor's output size to the ONNX's static input shape so
+    # pixel_values matches (B, C, H, W) exactly. Mirrors the same handling
+    # in the WinML object-detection evaluator.
+    input_shapes = (model.io_config.get("input_shapes") or [[]])[0]
+    input_names = model.io_config.get("input_names", [])
+    if len(input_shapes) == 4:
+        _, _, h, w = input_shapes
+        if "pixel_mask" in input_names:
+            image_processor.size = {
+                "shortest_edge": min(h, w),
+                "longest_edge": max(h, w),
+            }
+            if hasattr(image_processor, "pad_size"):
+                image_processor.pad_size = {"height": h, "width": w}
+            if hasattr(image_processor, "do_pad"):
+                image_processor.do_pad = True
+        else:
+            image_processor.size = {"height": h, "width": w}
+            if hasattr(image_processor, "do_pad"):
+                image_processor.do_pad = False
+
+    inputs = image_processor(images=image, return_tensors="pt")
+    outputs = model(
+        pixel_values=inputs["pixel_values"],
+        pixel_mask=inputs.get("pixel_mask"),
+    )
+
+    # post_process_object_detection expects outputs.logits and
+    # outputs.pred_boxes (both torch tensors), which ObjectDetectionOutput
+    # provides. target_sizes is (H, W) per image.
+    target_sizes = torch.tensor([image.size[::-1]])
+    results = image_processor.post_process_object_detection(
+        outputs,
+        threshold=args.threshold,
+        target_sizes=target_sizes,
+    )[0]
+
+    # WinML's bare-ONNX path doesn't attach an HF config to the model, so
+    # pull id2label from the HF hub for human-readable label names.
+    id2label = AutoConfig.from_pretrained(HF_MODEL_ID).id2label
+
+    for score, label, box in zip(
+        results["scores"], results["labels"], results["boxes"], strict=True,
+    ):
+        box = [round(v, 2) for v in box.tolist()]
+        print(
+            f"Detected {id2label[label.item()]} "
+            f"with confidence {round(score.item(), 3)} at location {box}",
+        )
+
+    annotated = draw_detections(image, results, id2label)
+    output_path = args.output.expanduser()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    annotated.save(output_path)
+    print(f"Annotated image written to {output_path}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/e2e_eval/run_pytorch_baseline.py b/scripts/e2e_eval/run_pytorch_baseline.py
index f7e77f693..cf5fd4d3a 100644
--- a/scripts/e2e_eval/run_pytorch_baseline.py
+++ b/scripts/e2e_eval/run_pytorch_baseline.py
@@ -27,6 +27,7 @@
 import json
 import sys
 from pathlib import Path
+from typing import Any
 
 
 # Ensure utils/ and modelkit package are importable when invoked as a subprocess
@@ -51,6 +52,98 @@ def _emit_result(metric: str, value: float, num_samples: int) -> None:
     print(json.dumps({"metric": metric, "value": round(value, 6), "num_samples": num_samples}))
 
 
+def _emit_latency(latency: dict) -> None:
+    """Print latency JSON to stdout (emitted before ``_emit_result``)."""
+    print(json.dumps(latency))
+
+
+def _extract_pipeline_input(sample: dict, columns_mapping: dict) -> Any:
+    """Pick a single raw input from a dataset sample to feed the HF pipeline.
+
+    Looks up common column-role keys first, then falls back to common column
+    names. Returns the value verbatim — PIL.Image for vision tasks, ``str``
+    for text tasks, etc.
+    """
+    for key in ("image_column", "text_column", "input_column", "question_column"):
+        col = columns_mapping.get(key)
+        if col and col in sample:
+            return sample[col]
+    for col in ("image", "text", "input", "question", "sentence"):
+        if col in sample:
+            return sample[col]
+    return None
+
+
+def _measure_pytorch_latency(task_evaluator: Any, warmup: int, iterations: int) -> dict:
+    """Time HF pipeline calls on one dataset sample and return summary stats.
+
+    Mirrors ``winml perf``'s ``latency_ms`` structure so the two outputs can
+    be compared directly. Includes preprocess + forward + postprocess in the
+    measurement (full user-perceived call).
+    """
+    import time
+
+    if len(task_evaluator.data) == 0:
+        raise RuntimeError("Dataset is empty; cannot measure pytorch latency")
+
+    sample = task_evaluator.data[0]
+    columns_mapping = task_evaluator.config.dataset.columns_mapping or {}
+    raw_input = _extract_pipeline_input(sample, columns_mapping)
+    if raw_input is None:
+        raise RuntimeError(
+            "Could not determine pipeline input column for latency measurement; "
+            "looked for columns_mapping keys (image_column/text_column/"
+            "input_column/question_column) and defaults (image/text/input/"
+            "question/sentence)."
+        )
+
+    pipe = task_evaluator.pipe
+    needs_cuda_sync = False
+    try:
+        import torch
+
+        model_device = next(pipe.model.parameters()).device
+        needs_cuda_sync = model_device.type == "cuda"
+    except Exception:
+        torch = None  # type: ignore[assignment]
+
+    _out(f"PyTorch latency: warming up ({warmup} iter)...")
+    for _ in range(warmup):
+        pipe(raw_input)
+        if needs_cuda_sync:
+            torch.cuda.synchronize()
+
+    _out(f"PyTorch latency: measuring ({iterations} iter)...")
+    samples_ms: list[float] = []
+    for _ in range(iterations):
+        if needs_cuda_sync:
+            torch.cuda.synchronize()
+        t0 = time.perf_counter()
+        pipe(raw_input)
+        if needs_cuda_sync:
+            torch.cuda.synchronize()
+        samples_ms.append((time.perf_counter() - t0) * 1000.0)
+
+    samples_ms.sort()
+    n = len(samples_ms)
+    mean_ms = sum(samples_ms) / n
+    p50 = samples_ms[n // 2]
+    p90 = samples_ms[min(int(n * 0.9), n - 1)]
+    p95 = samples_ms[min(int(n * 0.95), n - 1)]
+    p99 = samples_ms[min(int(n * 0.99), n - 1)]
+    return {
+        "mean_ms": round(mean_ms, 3),
+        "min_ms": round(samples_ms[0], 3),
+        "max_ms": round(samples_ms[-1], 3),
+        "p50_ms": round(p50, 3),
+        "p90_ms": round(p90, 3),
+        "p95_ms": round(p95, 3),
+        "p99_ms": round(p99, 3),
+        "warmup": warmup,
+        "iterations": iterations,
+    }
+
+
 # ---------------------------------------------------------------------------
 # Model and dataset helpers
 # ---------------------------------------------------------------------------
@@ -151,6 +244,21 @@ def parse_args() -> argparse.Namespace:
         "``dataset_config.winml_metric_key`` (or ``dataset_config.metric`` when "
         "the former is absent).",
     )
+    parser.add_argument(
+        "--perf-iterations",
+        type=int,
+        default=0,
+        help="Number of timed iterations for pytorch latency measurement. "
+        "When >0, runs the pytorch model on one dataset sample repeatedly and "
+        "emits a latency JSON line before the metric line. Default: 0 (disabled).",
+    )
+    parser.add_argument(
+        "--perf-warmup",
+        type=int,
+        default=10,
+        help="Number of warmup iterations excluded from latency statistics "
+        "(only used when --perf-iterations > 0). Default: 10.",
+    )
     return parser.parse_args()
 
 
@@ -236,6 +344,18 @@ def main() -> None:
 
         metrics = task_evaluator.compute()
 
+        if args.perf_iterations > 0:
+            latency = _measure_pytorch_latency(
+                task_evaluator,
+                warmup=args.perf_warmup,
+                iterations=args.perf_iterations,
+            )
+            _out(
+                f"PyTorch latency: mean={latency['mean_ms']}ms "
+                f"p50={latency['p50_ms']}ms p90={latency['p90_ms']}ms"
+            )
+            _emit_latency(latency)
+
         value = float(metrics[winml_metric_key])
         # Emit result as last stdout line (parsed by run_eval.py accuracy phase)
         _emit_result(winml_metric_key, value, num_samples)

From c8f7c559b4172b9145770209922e9a168871ba75 Mon Sep 17 00:00:00 2001
From: Brenda Bai <yiba@microsoft.com>
Date: Thu, 28 May 2026 15:52:02 +0800
Subject: [PATCH 012/143] docs: restructure README into 5-section layout (#770)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

- Reorganized README into 5 sections: Title + Description, Features /
Scope, Getting Started, Commands, Contributing + License
- Updated status badge to `preview`, rewrote description and Features (✅
bullets)
- Scope section: added supported EPs, built-in model catalog reference,
accepted inputs; removed verbose LLM/not-supported block
- Getting Started: consolidated Prerequisites + Installation + Quick
Start; added Config-Build Pipeline and Step-by-step through primitive
commands walkthroughs
- Commands: BYOM workflow with pipeline diagram, command table +
collapsible details, comparison table (Config-Driven first)
- Reference tables at end: Supported Hardware, Supported Tasks,
Supported Model Types, Built-in Models

---------

Co-authored-by: Qiong Wu (qiowu) <qiowu@microsoft.com>
Co-authored-by: Zhipeng Wang <zhiwang@microsoft.com>
---
 README.md | 511 ++++++++++++++++++++++--------------------------------
 1 file changed, 209 insertions(+), 302 deletions(-)

diff --git a/README.md b/README.md
index f8371b13b..180832504 100644
--- a/README.md
+++ b/README.md
@@ -1,85 +1,66 @@
 # WinML CLI
 
 [![WinML CLI CI](https://github.com/microsoft/winml-cli/actions/workflows/modelkit-ci.yml/badge.svg)](https://github.com/microsoft/winml-cli/actions/workflows/modelkit-ci.yml)
-![Status](https://img.shields.io/badge/status-early%20access-blue)
-![Python](https://img.shields.io/badge/python-3.11%2B-blue?logo=python&logoColor=white)
+![Status](https://img.shields.io/badge/status-preview-blue)
+[![PyPI release](https://img.shields.io/pypi/v/winml-cli)](https://pypi.org/project/winml-cli/)
 ![License](https://img.shields.io/badge/license-MIT-green)
 
-**WinML CLI** is a CLI toolkit to build **portable, performant, and high-quality** models for Windows ML. It covers the entire journey from pretrained model to on-device inference — export, optimization, quantization, compilation, and benchmarking — across **all execution providers**, regardless of silicon.
+**Windows ML CLI** is a command line tool for building **portable, performant, and high-quality** AI models for Windows ML. It takes you from a source model — whether from Hugging Face or your own pipeline — to a hardware-optimized artifact in a reproducible workflow.
 
----
-
-## :dart: WinML CLI Is Right for You If
-
-- [x] You want to build models that run on **any Windows device** — Qualcomm, Intel, AMD, NVIDIA, or CPU
-- [x] You want to benchmark a model with **one command** — latency, throughput, and live hardware utilization
-- [x] You want to catch compatibility issues **ahead of time** — unsupported ops, shape mismatches, EP gaps
-- [x] You want **deep insights** into your model — I/O shapes, task mapping, operator coverage per EP
-- [x] You want a **repeatable and traceable** model building process — config-driven, inspectable at every stage
-- [x] You want **AI agents** to build and profile models for you — agent-ready skills for coding assistants
+Purpose-built for Windows hardware diversity, the CLI handles conversion, graph optimization, and compilation across AMD, Intel, NVIDIA, and Qualcomm targets. The CLI fits naturally into CI/CD pipelines so teams can validate and ship models easily.
 
 ---
 
-## :desktop_computer: Supported Hardware
+## :dart: Features
 
-| Execution Provider | Hardware | Status | EP Flag | Device Flag |
-|:-------------------|:---------|:------:|:--------|:------------|
-| **QNN** | Qualcomm NPU (Snapdragon X Elite) | 🟢 Ready | `--ep qnn` | `--device npu` |
-| **OpenVINO** | Intel NPU (Meteor Lake / Lunar Lake) | 🟢 Ready | `--ep openvino` | `--device npu` |
-| **VitisAI** | AMD NPU (Ryzen AI) | 🟢 Ready | `--ep vitisai` | `--device npu` |
-| **NvTensorRTRTX** | NVIDIA discrete GPUs | 🔶 Planned | `--ep nv_tensorrt_rtx` | `--device gpu` |
-| **MIGraphX** | AMD discrete GPUs | 🔶 Planned | `--ep migraphx` | `--device gpu` |
-| **Dml** | Hardware-agnostic GPU backend | 🔶 Planned | `--ep dml` | `--device gpu` |
-| **CPU** | Cross-platform fallback | ⚪ Always available | `--ep cpu` | `--device cpu` |
+✅ **You want to build models that run with Windows ML on any device** — seamlessly across CPU, GPU, and NPU
 
-> **Tip:** Use `--device auto` and WinML CLI picks the best available device — NPU first, then GPU, then CPU.
+✅ **You want to benchmark models with one command** — get latency, throughput, and live hardware utilization
 
----
-
-## :clipboard: Prerequisites
+✅ **You want to optimize models out of the box** — with built-in graph optimizations, quantization, and EP-aware tuning
 
-### Required Software
+✅ **You want deep insights into your model** — including unsupported operators, shape mismatches, and execution provider gaps
 
-| **Component** | **How to Get It** |
-|-----------|--------------|
-| **Windows 11** (x64 or ARM64) | Windows 11 24H2+ required for NPU support |
-| **UV** | Install [UV](https://github.com/astral-sh/uv) |
-| **WinML CLI** (Python wheel) | [Releases](https://github.com/microsoft/winml-cli/releases) |
+✅ **You want a repeatable and traceable workflow** — with config-driven pipelines that are inspectable at every stage
 
-### Required Hardware
+✅ **You want AI agents to build and profile models for you** — with agent-ready skills for automation via coding assistants
 
-**WinML CLI targets NPU.** We recommend testing on one of the following NPU devices:
+### :compass: Scope
 
-| Device | EP | Flag |
-|--------|-----|------|
-| Snapdragon X Elite (Qualcomm) | QNN | `--ep qnn --device npu` |
-| Intel AI Boost (Meteor Lake / Lunar Lake) | OpenVINO | `--ep openvino --device npu` |
-| AMD Ryzen AI (Phoenix / Hawk Point / Strix) | VitisAI | `--ep vitisai --device npu` |
+WinML CLI supports **classic deep learning models** for now — LLM support is on the way.
 
-**No NPU?** Use `--device auto` — WinML CLI will fall back to the best available device (GPU → CPU). Note that `winml compile` requires NPU and cannot run without one.
+**Supported execution providers:** QNN · OpenVINO · VitisAI · NvTensorRTRTX · Dml · CPU — covering NPU, GPU, and CPU across Windows ML. See the [Supported Hardware](#supported-hardware) reference table for the full EP-to-device mapping.
 
-### Accepted Inputs
+The [built-in model catalog](#built-in-models) includes verified models that run across all EPs supported by Windows ML and serve as a reliable starting point. WinML CLI is not limited to these — you can bring **any model** you have:
 
 - **HuggingFace model ID** (e.g., `microsoft/resnet-50`) — weights are downloaded on first run
 - **Local ONNX file** (e.g., `model.onnx`) — from `winml export`, `winml build`, or any ONNX you already have
 
-### The Golden Rule: Inspect First
+See the [Supported Tasks](#supported-tasks) and [Supported Model Types](#supported-model-types) reference tables for the full list.
 
-Before running any pipeline command, always verify the model is supported:
+**Known constraints:**
 
-```bash
-winml inspect -m <model-id>
-```
-
-If `inspect` prints an error or shows `Unsupported`, **skip that model**. Only models that pass inspect are valid inputs for export, analyze, build, perf, and eval.
+- Some models may export successfully but fail during optimization or quantization due to unsupported operator patterns. The analyzer will flag these issues.
+- Performance numbers vary by device, driver version, and EP version. Always benchmark on your target hardware.
 
 ---
 
-## :package: Installation
+## :rocket: Getting Started
+
+### Prerequisites
+
+| Component | Details |
+|---|---|
+| Windows | Windows 11 24H2 or later (required for NPU support; earlier versions work for CPU/GPU) |
+| Python | 3.11 |
+| Package manager | [`uv`](https://github.com/astral-sh/uv) |
+| **WinML CLI** (Python wheel) | [Releases](https://github.com/microsoft/winml-cli/releases) |
+| **WinML CLI** (Python wheel) | [Releases](https://github.com/microsoft/winml-cli/releases) |
+### Installation
 
 WinML CLI requires **Python 3.11** and is distributed as a Python wheel. We recommend [uv](https://docs.astral.sh/uv/) for fast, reproducible environment setup.
 
-**1. Create a Python 3.11 environment**
+**1. Create an environment**
 
 ```bash
 uv venv --python 3.11
@@ -104,158 +85,109 @@ uv pip install winml_cli-<version>-py3-none-any.whl
 **3. Verify your environment**
 
 ```bash
-winml sys --list-device --list-ep
+uv run winml sys --list-device --list-ep
 ```
 
-Confirm that your target device and EP appear in the output:
-
-- **Snapdragon X Elite** — look for `QNNExecutionProvider`
-- **Intel AI Boost** — look for `OpenVINOExecutionProvider`
-- **AMD Ryzen AI** — look for `VitisAIExecutionProvider`
-
-If no NPU is detected, you can still use WinML CLI with `--device auto` for most commands. The only exception is `winml compile`, which requires an NPU device.
-
----
-
-## :wrench: Commands
-
-| Category | Commands | Purpose |
-|:---------|:---------|:--------|
-| **Primitives** | `inspect` `export` `optimize` `quantize` `compile` | Single-stage building blocks |
-| **Pipeline** | `config` `build` `perf` `eval` `run`\* | End-to-end orchestration |
-| **Insights** | `analyze` `debug`\* | Diagnostics and compatibility |
-| **Utilities** | `hub` `cache`\* `doctor`\* `setting`\* `sys` | Catalog, cache, and environment |
-
-\* = coming soon
-
-<details>
-<summary><strong>Primitives</strong> — one stage at a time</summary>
-
-**`winml inspect`** — Discover model metadata. Prints the task, model class, input/output tensor names and shapes, and execution provider compatibility. No weights are loaded — this reads only the model configuration, making it fast and lightweight. Always run inspect first to verify a model is supported.
-
-**`winml export`** — Convert a source model to ONNX. Takes a Hugging Face model ID (or local checkpoint) and produces a standards-compliant ONNX file with hierarchy-preserving metadata.
-
-**`winml optimize`** — Fuse operators, simplify graphs, and prepare for target EPs. Takes an ONNX model and an optimization config (typically generated by `winml analyze`) and applies graph-level transformations: operator fusion, constant folding, shape inference, and EP-specific rewrites.
-
-**`winml quantize`** — Compress to low-bit precision. Reduces model size and inference latency by converting weights and activations from FP32 to INT8 (or other low-bit formats). After quantization, the model is portable — it can run on any ONNX Runtime backend.
-
-**`winml compile`** — Generate device-specific binaries. Takes a quantized ONNX model and produces EP-specific compiled artifacts (for example, QNN context binaries for Qualcomm NPU). This step locks the model to a specific device but delivers the lowest possible inference latency.
-
-</details>
-
-<details>
-<summary><strong>Pipeline</strong> — orchestrated workflows</summary>
-
-**`winml config`** — Auto-detect optimal settings into a JSON config. Inspects the model and generates a complete build specification: task, I/O shapes, optimization flags, quantization parameters, and target EP settings. The config file is reviewable, editable, and version-controllable — the single source of truth for your build.
-
-**`winml build`** — Orchestrate the full pipeline. Takes a config file and executes every stage in sequence: export, analyze, optimize, quantize, and compile. Two commands (`config` + `build`) replace eight manual steps.
-
-**`winml perf`** — Benchmark latency, throughput, and hardware utilization. Runs inference on the target device and reports latency percentiles (p50, p90, p99), throughput (inferences per second), and optionally live hardware monitoring (CPU, RAM, NPU utilization) with the `--monitor` flag. Can accept a local ONNX file or a Hugging Face model ID.
+`--list-device` and `--list-ep` print only the hardware and EP inventory, skipping SDK versions and Python environment details that plain `winml sys` would include. If the command exits without error, your winml-cli install is ready.
 
-**`winml eval`** — Measure model accuracy against reference datasets. Compares the output of your optimized/quantized model against the original to quantify any accuracy loss introduced by the pipeline.
+### Quick Start
 
-**`winml run`** — End-to-end inference with pre/post processing. *(Coming soon.)*
+WinML CLI supports two ways to build a model — choose the one that fits your workflow:
 
-</details>
+- [**Config-Build Driven Pipeline**](#config-build-pipeline) — generate a config file first, then run a single build command. Best for reproducible, CI/CD-friendly workflows.
+- [**Primitive Commands**](#step-by-step-through-primitive-commands) — run each pipeline stage individually. Best for exploring, debugging, or custom workflows.
 
-<details>
-<summary><strong>Insights</strong> — understand what is happening inside</summary>
+This walkthrough uses `facebook/convnext-tiny-224` as an example model.
 
-**`winml analyze`** — Lint operators, check EP compatibility, and generate optimization config. The analyzer has two components: the **Linter** (like ESLint for ONNX) checks every operator against target EPs and classifies each as supported, partial, or unsupported. **AutoConf** detects suboptimal patterns and generates the optimization config that the optimizer consumes. Together they form the analyze-optimize loop.
+#### Config-Build Pipeline
 
-**`winml debug`** — Interactive model debugging and layer-by-layer inspection. *(Coming soon.)*
+##### Step 0: Check model readiness
 
-</details>
+Before running any pipeline command, verify the model is supported:
 
-<details>
-<summary><strong>Utilities</strong> — catalog, cache, and environment</summary>
-
-**`winml catalog`** — Browse the curated built-in model catalog.
+```bash
+uv run winml inspect -m facebook/convnext-tiny-224
+```
 
-**`winml cache`** — Manage built model artifacts and pipeline outputs. View, clean, or selectively remove cached models and intermediate files.
+This prints the model's task, model class, input/output tensor names and shapes, and execution provider compatibility — without downloading weights. If inspect succeeds, the model is supported and you can proceed.
 
-**`winml doctor`** — Diagnose environment issues. Checks runtimes, execution providers, and dependencies to identify configuration problems.
+##### Step 1: Generate the build config
 
-**`winml setting`** — Configure WinML CLI preferences. Set default EPs, output directories, and other global options.
+```bash
+uv run winml config -m facebook/convnext-tiny-224 --device auto -o convnext_config.json
+```
 
-**`winml sys`** — System information and capability reporting. Prints detected hardware, available EPs, Python version, and installed package versions.
+`winml config` queries Hugging Face, auto-detects the task and model type, and produces a WinMLBuildConfig JSON. Passing `--device auto` tells the config generator to resolve the target device at generation time — it inspects your hardware and writes the winning device (NPU, GPU, or CPU) together with matching precision and compile settings into `convnext_config.json`. You can open the file to see exactly what was picked before committing to a full build.
 
-</details>
+##### Step 2: Run the build
 
----
+```bash
+uv run winml build -c convnext_config.json -m facebook/convnext-tiny-224 -o convnext_out/
+```
 
-## :rocket: Quick Start
+This single command runs all four pipeline stages in sequence — export, optimize, quantize, and compile — reading the device and precision settings recorded in `convnext_config.json`. The compile stage targets whichever device the config captured: it calls the QNN backend and embeds a pre-compiled Hexagon binary on NPU, or it compiles a DirectML graph on GPU, or it produces a standard optimized ONNX for CPU. All intermediate artifacts land in `convnext_out/`, so you can inspect or reuse any stage independently.
 
-### Inspect a Model
+You can also pass `--no-quant` or `--no-compile` to stop the pipeline early, or `--rebuild` to force re-running even when cached artifacts exist.
 
-The fastest way to get started is to inspect a model. Let's look at ResNet-50:
+##### Step 3: Benchmark on your device
 
 ```bash
-winml inspect -m microsoft/resnet-50
+uv run winml perf -m convnext_out/<artifact>.onnx --device auto --iterations 50 --monitor
 ```
 
-This prints the model's metadata without downloading weights:
+Replace `<artifact>` with the filename written to `convnext_out/` by the build. For NPU builds the compiled artifact is named `model.onnx` in the output directory (the `_npu_ctx.onnx` suffix applies only when the compile stage produces an EPContext file, which requires `enable_ep_context=True` in the compile config). You can check the directory listing or read the compiled artifact path from the build output to get the exact name.
 
-- **Task**: `image-classification` — what the model does
-- **Model class**: `ResNetForImageClassification` — the architecture
-- **Input tensors**: names, data types, and shapes (e.g., `pixel_values: float32 [1, 3, 224, 224]`)
-- **Output tensors**: names, data types, and shapes (e.g., `logits: float32 [1, 1000]`)
+#### Step-by-step through primitive commands
 
-If inspect succeeds, the model is supported and you can proceed with the rest of the pipeline.
+This walkthrough builds **ConvNeXT** (`facebook/convnext-base-224`) step by step using primitive commands.
 
-> **Golden rule: always inspect first.** Before running export, build, perf, or any other pipeline command, verify the model is supported with `winml inspect`.
-
-### Build with Primitive Commands
-
-This walkthrough builds **ConvNeXT** (`facebook/convnext-base-224`) step by step using primitive commands. ConvNeXT is a family of CNN models inspired by Vision Transformers, introduced by Meta in 2022 — it offers high accuracy while retaining the efficiency of CNNs.
-
-#### Phase 1: Inspect
+##### Step 1: Inspect
 
 ```bash
 winml inspect -m facebook/convnext-base-224
 ```
 
-#### Phase 2: Build a Portable Model
+##### Step 2: Build a Portable Model
 
-**Export** from PyTorch to ONNX:
+Export from PyTorch to ONNX:
 
 ```bash
 winml export -m facebook/convnext-base-224 -o convnext/model.onnx -v
 ```
 
-**Analyze** for EP compatibility:
+Analyze for EP compatibility:
 
 ```bash
 winml analyze -m convnext/model.onnx --optim-config optim.json
 ```
 
-**Optimize** the graph using the analyzer's config:
+Optimize the graph using the analyzer's config:
 
 ```bash
 winml optimize -m convnext/model.onnx -c optim.json -o convnext/model_opt.onnx
 ```
 
-**Quantize** to INT8:
+Quantize to w8a16:
 
 ```bash
-winml quantize -m convnext/model_opt.onnx -o convnext/model_opt_int8.onnx
+winml quantize -m convnext/model_opt.onnx --precision w8a16 -o convnext/model_opt_w8a16.onnx
 ```
 
-#### Phase 3: Benchmark on Device
+##### Step 3: Benchmark on Device
 
-**Compile** for NPU (generates device-specific binaries):
+Compile for NPU (generates device-specific binaries):
 
 ```bash
-winml compile -m convnext/model_opt_int8.onnx --ep qnn -o convnext/model_compiled.onnx
+winml compile -m convnext/model_opt_w8a16.onnx --ep qnn -o convnext/model_compiled.onnx
 ```
 
-**Benchmark on NPU** — note the latency:
+Benchmark on NPU — note the latency:
 
 ```bash
 winml perf -m convnext/model_compiled.onnx --ep qnn --iterations 100
 ```
 
-**Benchmark on CPU** for comparison:
+Benchmark on CPU for comparison:
 
 ```bash
 winml perf -m convnext/model_opt.onnx --ep cpu --iterations 100
@@ -263,55 +195,15 @@ winml perf -m convnext/model_opt.onnx --ep cpu --iterations 100
 
 Compare the two numbers to see the performance difference between NPU and CPU inference.
 
-### Build with Config + Build
-
-Same model, different approach. Instead of running each command manually, use the config-driven pipeline. Think of it like CMake: `config` generates a build plan, `build` executes it.
-
-**Generate the build config:**
-
-```bash
-winml config -m facebook/convnext-base-224 -o convnext_config.json
-```
-
-This creates a JSON file containing all settings for every pipeline step — task, I/O shapes, optimization flags, quantization parameters — all auto-detected from the model.
-
-**Build the model:**
-
-```bash
-winml build -c convnext_config.json -m facebook/convnext-base-224 -o convnext_build/
-```
-
-This orchestrates the full pipeline — export, analyze, optimize, quantize, compile — all in one go. Same result as the manual steps above, but in two commands.
-
-**Benchmark the result:**
-
-```bash
-winml perf -m convnext_build/model.onnx --ep qnn --iterations 100
-```
-
-The config file is the single source of truth for your build. Version-control it, share it with teammates, edit it to override settings, and replay builds deterministically on any machine.
-
-### Benchmark in One Command
-
-The simplest way to evaluate a model — one command, zero setup:
-
-```bash
-winml perf -m facebook/convnext-base-224 --device npu --monitor
-```
-
-WinML CLI handles everything behind the scenes: download the model from Hugging Face, export to ONNX, optimize the graph, and run the benchmark on your NPU. The `--monitor` flag enables live hardware monitoring — real-time CPU utilization, RAM usage, and NPU activity alongside the latency results.
-
-This is ideal for quick smoke tests: does the model run on this device, and how fast is it?
-
 ---
 
-## :arrows_counterclockwise: The BYOM Workflow
+## :wrench: Commands
 
-The **Build Your Own Model** (BYOM) workflow is the philosophy behind WinML CLI. It defines how a source model becomes a production-ready, device-optimized artifact.
+### The BYOM Workflow
 
-### The Pipeline
+The **Build Your Own Model** (BYOM) workflow is the philosophy behind WinML CLI. It defines how a source model becomes a production-ready, device-optimized artifact.
 
-```
+```text
 Source Model --> Export --> Analyze --> Optimize --> Quantize --> Compile --> Benchmark
 ```
 
@@ -319,165 +211,180 @@ Source Model --> Export --> Analyze --> Optimize --> Quantize --> Compile --> Be
 
 Each arrow is a WinML CLI command. You can enter the pipeline at any stage (for example, start with a local ONNX file and skip export), exit early (stop after optimization if you do not need quantization), or loop back to repeat a stage with different settings.
 
-### Primitive Commands vs. Config-Driven Pipeline
+| Category | Commands | Purpose |
+| --- | --- | --- |
+| **Primitives** | `inspect` `export` `optimize` `quantize` `compile` | Single-stage building blocks |
+| **Pipeline** | `config` `build` `perf` `eval` | End-to-end orchestration |
+| **Insights** | `analyze`| Diagnostics and compatibility |
+| **Utilities** | `catalog` `sys` | Catalog, and environment |
 
-|  | **Primitive Commands** | **Config-Driven Pipeline** |
-|:--|:--|:--|
-| **Steps** | One command **per stage** | Two steps: **config** + **build** |
-| **Control** | Start from any stage; try different settings to fix errors or tweak performance | Repeatable, tweakable, version-controllable |
-| **Best for** | **Flexible** workflow | Production-ready **delivery** |
-| **When to use** | Exploring, debugging, prototyping | CI/CD, batch builds, team workflows |
-| **Lifecycle** | "Coding" phase | Polish |
+<details>
+<summary><strong>Primitives</strong> — one stage at a time</summary>
 
----
+**`winml inspect`** — Discover model metadata. Prints the task, model class, input/output tensor names and shapes, and execution provider compatibility. No weights are loaded — this reads only the model configuration, making it fast and lightweight. Always run inspect first to verify a model is supported.
 
-## :clipboard: Built-in Models
+**`winml export`** — Convert a source model to ONNX. Takes a Hugging Face model ID (or local checkpoint) and produces a standards-compliant ONNX file with hierarchy-preserving metadata.
 
-Run `winml catalog` to browse the full catalog interactively.
+**`winml optimize`** — Fuse operators, simplify graphs, and prepare for target EPs. Takes an ONNX model and an optimization config (typically generated by `winml analyze`) and applies graph-level transformations: operator fusion, constant folding, shape inference, and EP-specific rewrites.
 
-<details>
-<summary><strong>Click to expand the full model catalog</strong></summary>
+**`winml quantize`** — Compress to low-bit precision. Reduces model size and inference latency by converting weights and activations from FP32 to INT8 (or other low-bit formats). After quantization, the model is portable — it can run on any ONNX Runtime backend.
 
-| Model ID | Task | Architecture |
-|:---------|:-----|:-------------|
-| `microsoft/resnet-50` | image-classification | ResNet |
-| `google/vit-base-patch16-224` | image-classification | ViT |
-| `microsoft/swin-large-patch4-window7-224` | image-classification | Swin |
-| `facebook/convnext-tiny-224` | image-classification | ConvNeXT |
-| `rizvandwiki/gender-classification` | image-classification | ViT |
-| `ProsusAI/finbert` | text-classification | BERT |
-| `Intel/bert-base-uncased-mrpc` | text-classification | BERT |
-| `cardiffnlp/twitter-roberta-base-sentiment-latest` | text-classification | RoBERTa |
-| `dslim/bert-base-NER` | token-classification | BERT |
-| `dbmdz/bert-large-cased-finetuned-conll03-english` | token-classification | BERT |
-| `Babelscape/wikineural-multilingual-ner` | token-classification | BERT |
-| `w11wo/indonesian-roberta-base-posp-tagger` | token-classification | RoBERTa |
-| `microsoft/table-transformer-detection` | object-detection | Table Transformer |
-| `mattmdjaga/segformer_b2_clothes` | image-segmentation | SegFormer |
-| `nvidia/segformer-b1-finetuned-ade-512-512` | image-segmentation | SegFormer |
-| `nvidia/segformer-b2-finetuned-ade-512-512` | image-segmentation | SegFormer |
-| `nvidia/segformer-b5-finetuned-ade-640-640` | image-segmentation | SegFormer |
+**`winml compile`** — Generate device-specific binaries. Takes a quantized ONNX model and produces EP-specific compiled artifacts (for example, QNN context binaries for Qualcomm NPU). This step locks the model to a specific device but delivers the lowest possible inference latency.
 
 </details>
 
-These models are verified against WinML CLI's full pipeline and serve as reliable starting points. You are not limited to this list — any Hugging Face model that passes `winml inspect` is a valid input.
+<details>
+<summary><strong>Pipeline</strong> — orchestrated workflows</summary>
 
-For models not in this table, run `winml inspect -m <model-id>` to verify support before proceeding.
+**`winml config`** — Auto-detect optimal settings into a JSON config. Inspects the model and generates a complete build specification: task, I/O shapes, optimization flags, quantization parameters, and target EP settings. The config file is reviewable, editable, and version-controllable — the single source of truth for your build.
 
----
+**`winml build`** — Orchestrate the full pipeline. Takes a config file and executes every stage in sequence: export, analyze, optimize, quantize, and compile. Two commands (`config` + `build`) replace eight manual steps.
 
-## :warning: Scope & Limitations
+**`winml perf`** — Benchmark latency, throughput, and hardware utilization. Runs inference on the target device and reports latency percentiles (p50, p90, p99), throughput (inferences per second), and optionally live hardware monitoring (CPU, RAM, NPU utilization) with the `--monitor` flag. Can accept a local ONNX file or a Hugging Face model ID.
 
-### What WinML CLI supports
+**`winml eval`** — Measure model accuracy against reference datasets. Compares the output of your optimized/quantized model against the original to quantify any accuracy loss introduced by the pipeline.
 
-WinML CLI targets **classic deep learning models** — CNNs, encoders, vision transformers, NLP classifiers, token classifiers, object detection models, and segmentation models.
+</details>
 
-Supported tasks include:
-- Image classification (ResNet, ViT, Swin, ConvNeXT)
-- Text classification (BERT, RoBERTa)
-- Token classification / NER (BERT, RoBERTa)
-- Object detection (Table Transformer)
-- Image segmentation (SegFormer)
+<details>
+<summary><strong>Insights</strong> — understand what is happening inside</summary>
 
-### What WinML CLI does not support
+**`winml analyze`** — Lint operators, check EP compatibility, and generate optimization config. The analyzer has two components: the **Linter** (like ESLint for ONNX) checks every operator against target EPs and classifies each as supported, partial, or unsupported. **AutoConf** detects suboptimal patterns and generates the optimization config that the optimizer consumes. Together they form the analyze-optimize loop.
 
-**LLMs and generative models are not in scope.** Do not use WinML CLI with GPT, LLaMA, Phi, Mistral, Stable Diffusion, or any model with a decoder-only or sequence-to-sequence generative architecture. LLM support (with LoRA) is planned for Q3-Q4 2026.
+</details>
 
-### Known constraints
+<details>
+<summary><strong>Utilities</strong> — catalog, and environment</summary>
 
-- `winml compile` requires an NPU device. If no NPU is available, skip the compile step and use `--device auto` for benchmarking.
-- Some models may export successfully but fail during optimization or quantization due to unsupported operator patterns. The analyzer will flag these issues.
-- Performance numbers vary by device, driver version, and EP version. Always benchmark on your target hardware.
+**`winml catalog`** — Browse the curated built-in model catalog.
 
----
+**`winml sys`** — System information and capability reporting. Prints detected hardware, available EPs, Python version, and installed package versions.
 
-## :world_map: Roadmap
+</details>
 
-| Milestone | Target | Highlights |
-|:----------|:-------|:-----------|
-| 🟡 **Kickoff** | Q4 2025 | Internal prototype, core primitive commands |
-| 🟢 **Early Access** | Q1 2026 | First external testers, config + build pipeline, hub catalog |
-| 🔵 **Public Beta** | Q2 2026 | Open source, agent skills, Foundry Toolkit integration |
-| 🟣 **RC** | Q3-Q4 2026 | **LLM support** (with LoRA), broader device coverage, MLIR |
+|  | **Config-Driven Pipeline** | **Primitive Commands** |
+|:--|:--|:--|
+| **Steps** | Two steps: **config** + **build** | One command **per stage** |
+| **Control** | Repeatable, tweakable, version-controllable | Start from any stage; try different settings to fix errors or tweak performance |
+| **Best for** | Production-ready **delivery** | **Flexible** workflow |
+| **When to use** | CI/CD, batch builds, team workflows | Exploring, debugging, prototyping |
+| **Lifecycle** | Polish | "Coding" phase |
 
-<details>
-<summary><strong>Click to expand roadmap details</strong></summary>
-
-**Q4 2025 — Kickoff**
-- Primitive commands: `inspect`, `export`, `optimize`, `quantize`, `compile`
-- QNN, OpenVINO, and VitisAI execution provider support
-- Internal validation with ResNet, BERT, ViT, SegFormer families
-
-**Q1 2026 — Early Access**
-- Pipeline commands: `config`, `build`, `perf`, `eval`
-- Analyzer with auto-configuration loop
-- Built-in model catalog (`winml catalog`)
-- Live hardware monitoring (`--monitor`)
-
-**Q2 2026 — Public Beta**
-- Open source release
-- Agent-ready skills for coding assistants (Claude Code, Cursor, Copilot)
-- Foundry Toolkit for VS Code integration
-
-**Q3-Q4 2026 — Release Candidate**
-- LLM support (decoder-only architectures with LoRA adapters)
-- NvTensorRTRTX, MIGraphX, and Dml execution providers
-- MLIR-based optimization backend
-- Public SDK and framework APIs
+---
 
-</details>
+## :handshake: Contributing
 
----
+We welcome contributions! Please see the [contribution guidelines](CONTRIBUTING.md).
 
-## :lock: Data / Telemetry
+For feature requests or bug reports, please file a [GitHub Issue](https://github.com/microsoft/winml-cli/issues).
 
-Official WinML CLI releases can collect anonymous usage telemetry to
-help improve the product. Telemetry is classified as **Optional**. A
-one-time prompt on your first run asks for consent (default: accept —
-press Enter to enable, type `n` to decline).
+### Code of Conduct
 
-Dev installs (`pip install -e .` or running from a source checkout)
-never send telemetry.
+See [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md).
 
-**Control** — edit `%USERPROFILE%\.winml\config.json`:
+### License
 
-- Set `telemetry.consent` to `"disabled"` to opt out
-- Set `telemetry.consent` to `"enabled"` to opt in
-- Delete the file to re-show the first-run prompt on the next run
+This project is licensed under the [MIT License](LICENSE.txt).
 
-Telemetry is automatically disabled in CI / non-TTY environments
-regardless of the stored decision.
+### Trademarks
 
-See [docs/Privacy.md](docs/Privacy.md) for the full list of what is and
-is not collected, event schemas, CI auto-disable behavior, and storage
-locations.
+This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
+trademarks or logos is subject to and must follow
+[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
+Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft
+sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
 
 ---
 
-## :handshake: Contributions and Feedback
+## Supported Hardware
 
-We welcome contributions! Please see the [contribution guidelines](CONTRIBUTING.md).
-
-For feature requests or bug reports, please file a [GitHub Issue](https://github.com/microsoft/winml-cli/issues).
+| Execution Provider | Hardware | Status | EP Flag | Device Flag |
+| --- | --- | --- | --- | --- |
+| **QNN** | Qualcomm NPU & GPU (Snapdragon X Elite) | 🟢 Ready | `--ep qnn` | `--device npu` or `--device gpu` |
+| **OpenVINO** | Intel NPU, GPU & CPU (Meteor Lake / Lunar Lake) | 🟢 Ready | `--ep openvino` | `--device npu`, `--device gpu`, or `--device cpu` |
+| **VitisAI** | AMD NPU — Ryzen AI (Phoenix / Hawk Point / Strix) | 🟢 Ready | `--ep vitisai` | `--device npu` |
+| **NvTensorRTRTX** | NVIDIA discrete GPUs | 🟢 Ready | `--ep nv_tensorrt_rtx` | `--device gpu` |
+| **MIGraphX** | AMD discrete GPUs | ⚠️ Coming soon | `--ep migraphx` | `--device gpu` |
+| **Dml** | Hardware-agnostic GPU backend | 🟢 Ready | `--ep dml` | `--device gpu` |
+| **CPU** | Cross-platform fallback | 🟢 Ready | `--ep cpu` | `--device cpu` |
+
+> **Tip:**
+>
+> - For scenarios where you want to benchmark a model, if no `--device` is specified, WinML CLI defaults to `--device auto` and picks the best available device on your machine — NPU first, then GPU, then CPU.
+> - For scenarios where you want to get insights across all EPs, use `--device all` to cover all WinML EPs, or specify a target like `--device npu` to focus on a particular device class.
 
 ---
 
-## :balance_scale: Code of Conduct
-
-See [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md).
+## Supported Tasks
+
+| Task | Category |
+| --- | --- |
+| `image-classification` | Vision |
+| `image-segmentation` / `semantic-segmentation` | Vision |
+| `image-feature-extraction` | Vision |
+| `image-to-image` / `image-to-text` / `image-text-to-text` | Vision |
+| `object-detection` | Vision |
+| `depth-estimation` | Vision |
+| `keypoint-detection` | Vision |
+| `mask-generation` / `masked-im` / `inpainting` | Vision |
+| `zero-shot-image-classification` / `zero-shot-object-detection` | Vision |
+| `text-classification` | NLP |
+| `token-classification` | NLP |
+| `question-answering` / `document-question-answering` | NLP |
+| `text-generation` / `text2text-generation` | NLP |
+| `fill-mask` / `feature-extraction` / `text-to-image` | NLP |
+| `multiple-choice` / `next-sentence-prediction` | NLP |
+| `sentence-similarity` | NLP |
+| `audio-classification` / `audio-frame-classification` / `audio-xvector` | Audio |
+| `automatic-speech-recognition` | Audio |
+| `text-to-audio` | Audio |
+| `visual-question-answering` | Multimodal |
+| `time-series-forecasting` | Other |
+| `reinforcement-learning` | Other |
 
 ---
 
-## :page_facing_up: License
-
-This project is licensed under the [MIT License](LICENSE.txt).
+## Supported Model Types
+
+| Model Type | Category | Supported Tasks |
+| --- | --- | --- |
+| `convnext` | Vision | image-classification |
+| `detr` | Vision | object-detection |
+| `depth_anything`, `depth_pro`, `zoedepth` | Vision | depth-estimation |
+| `segformer` | Vision | image-segmentation |
+| `swin2sr` | Vision | image-to-image |
+| `sam`, `sam2`, `sam2-video` | Vision | mask-generation, image-segmentation |
+| `bert` | NLP / Encoder | text-classification, token-classification, question-answering, and more |
+| `roberta`, `camembert`, `xlm-roberta` | NLP / Encoder | text-classification, token-classification, and more |
+| `bart`, `marian`, `t5` | NLP / Encoder | text2text-generation, feature-extraction |
+| `blip` | Multimodal | image-to-text, image-text-to-text |
+| `clip`, `clip-text-model`, `clip-vision-model` | Multimodal | feature-extraction, image-feature-extraction |
+| `siglip`, `siglip-text-model`, `siglip-vision-model` | Multimodal | feature-extraction, image-feature-extraction |
+| `vision-encoder-decoder` | Multimodal | image-to-text, text2text-generation |
+| `mu2`, `qwen3` | Generative | text2text-generation |
 
 ---
 
-## Trademarks
+## Built-in Models
 
-This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
-trademarks or logos is subject to and must follow
-[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
-Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft
-sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
+Run `winml catalog` to browse the full catalog interactively.
+
+| Model ID | Task | Architecture |
+| --- | --- | --- |
+| `microsoft/resnet-50` | image-classification | ResNet |
+| `google/vit-base-patch16-224` | image-classification | ViT |
+| `microsoft/swin-large-patch4-window7-224` | image-classification | Swin |
+| `facebook/convnext-tiny-224` | image-classification | ConvNeXT |
+| `rizvandwiki/gender-classification` | image-classification | ViT |
+| `ProsusAI/finbert` | text-classification | BERT |
+| `Intel/bert-base-uncased-mrpc` | text-classification | BERT |
+| `cardiffnlp/twitter-roberta-base-sentiment-latest` | text-classification | RoBERTa |
+| `dslim/bert-base-NER` | token-classification | BERT |
+| `dbmdz/bert-large-cased-finetuned-conll03-english` | token-classification | BERT |
+| `Babelscape/wikineural-multilingual-ner` | token-classification | BERT |
+| `w11wo/indonesian-roberta-base-posp-tagger` | token-classification | RoBERTa |
+| `microsoft/table-transformer-detection` | object-detection | Table Transformer |
+| `mattmdjaga/segformer_b2_clothes` | image-segmentation | SegFormer |
+| `nvidia/segformer-b1-finetuned-ade-512-512` | image-segmentation | SegFormer |
+| `nvidia/segformer-b2-finetuned-ade-512-512` | image-segmentation | SegFormer |
+| `nvidia/segformer-b5-finetuned-ade-640-640` | image-segmentation | SegFormer |

From fd0e44ae8260e9bc66d0f6e740fa4ce7b5b33d47 Mon Sep 17 00:00:00 2001
From: Zhipeng Wang <zhiwang@microsoft.com>
Date: Thu, 28 May 2026 16:07:23 +0800
Subject: [PATCH 013/143] docs: fix duplicate Releases row and switch install
 to pip (#784)

## Summary

- Removed the duplicated `WinML CLI (Python wheel) | [Releases]` row in
the Prerequisites table.
- Updated the install step from `uv pip install
winml_cli-<version>-py3-none-any.whl` to `pip install winml-cli`.
- Updated the Prerequisites entry to point at PyPI instead of GitHub
Releases, keeping the table and install instructions consistent.
---
 README.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 180832504..db90d3d99 100644
--- a/README.md
+++ b/README.md
@@ -53,9 +53,9 @@ See the [Supported Tasks](#supported-tasks) and [Supported Model Types](#support
 |---|---|
 | Windows | Windows 11 24H2 or later (required for NPU support; earlier versions work for CPU/GPU) |
 | Python | 3.11 |
-| Package manager | [`uv`](https://github.com/astral-sh/uv) |
-| **WinML CLI** (Python wheel) | [Releases](https://github.com/microsoft/winml-cli/releases) |
-| **WinML CLI** (Python wheel) | [Releases](https://github.com/microsoft/winml-cli/releases) |
+| Package manager | [uv](https://github.com/astral-sh/uv) |
+| WinML CLI | [PyPI](https://pypi.org/project/winml-cli/) |
+
 ### Installation
 
 WinML CLI requires **Python 3.11** and is distributed as a Python wheel. We recommend [uv](https://docs.astral.sh/uv/) for fast, reproducible environment setup.
@@ -76,10 +76,10 @@ Activate it:
 source .venv/Scripts/activate
 ```
 
-**2. Install from wheel**
+**2. Install winml-cli**
 
 ```bash
-uv pip install winml_cli-<version>-py3-none-any.whl
+uv pip install winml-cli
 ```
 
 **3. Verify your environment**

From 7319628ea4a51da3897decd49cedce9557ff8c72 Mon Sep 17 00:00:00 2001
From: xieofxie <xieofxie@126.com>
Date: Thu, 28 May 2026 17:46:33 +0800
Subject: [PATCH 014/143] fix: config validates device/EP combination without
 system check (#780)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

- Adds `resolve_check_device_ep` helper that validates a (device, EP)
combination without requiring the device/EP to actually exist on the
system. Closes #765.
- `commands/config.py` and `config/build.py` now use
`resolve_check_device_ep` instead of `resolve_device` so `winml config`
no longer hard-fails on hosts where the requested EP isn't installed.
- When `device=auto` or `ep=None`, the helper delegates to the existing
`resolve_device` + `resolve_eps` flow (system-aware behavior preserved).
When both `device` and `ep` are explicit, it only validates against the
static `EP_SUPPORTED_DEVICES` mapping.
- CLI cleanup: `-m/--model`, `-c/--config`, `--device` for the config
command now use the shared `cli_utils.*_option` decorators.

## Tests

- New `TestResolveCheckDeviceEp` class in
`tests/unit/sysinfo/test_device.py` covering both code paths (delegation
and static-only) plus error cases (unknown EP, unsupported device,
case-insensitivity).
- Existing config-test mocks updated from `resolve_device` to
`resolve_check_device_ep` (`tests/unit/config/conftest.py`,
`tests/unit/config/test_build.py`,
`tests/unit/config/test_build_onnx.py`,
`tests/unit/commands/test_config_cli.py`) so the lazy import in
`config/build.py` is intercepted.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: hualxie <hualxie@microsoft.com>
---
 src/winml/modelkit/commands/analyze.py   |   2 +-
 src/winml/modelkit/commands/compile.py   |   2 +-
 src/winml/modelkit/commands/config.py    |  46 +++----
 src/winml/modelkit/commands/eval.py      |   5 +-
 src/winml/modelkit/commands/export.py    |   3 +-
 src/winml/modelkit/commands/perf.py      |   3 +-
 src/winml/modelkit/commands/quantize.py  |   2 +-
 src/winml/modelkit/config/build.py       |  12 +-
 src/winml/modelkit/sysinfo/__init__.py   |   9 +-
 src/winml/modelkit/sysinfo/device.py     |  40 ++++++
 src/winml/modelkit/utils/cli.py          |  20 ++-
 tests/cli/test_catalog_cli.py            |   2 +-
 tests/unit/commands/test_build.py        |  32 ++++-
 tests/unit/commands/test_config_cli.py   |  11 +-
 tests/unit/config/conftest.py            |  18 ++-
 tests/unit/config/test_build.py          | 150 ++++++++++++----------
 tests/unit/config/test_build_onnx.py     | 104 +++++++--------
 tests/unit/models/auto/test_auto_onnx.py |   4 +-
 tests/unit/sysinfo/test_device.py        | 155 +++++++++++++++++++++++
 19 files changed, 427 insertions(+), 193 deletions(-)

diff --git a/src/winml/modelkit/commands/analyze.py b/src/winml/modelkit/commands/analyze.py
index 460ed7b5f..bc60f1752 100644
--- a/src/winml/modelkit/commands/analyze.py
+++ b/src/winml/modelkit/commands/analyze.py
@@ -649,7 +649,7 @@ def _ep_name_device_display_name(ep_name: str, device_name: str) -> str:
     ),
 )
 @cli_utils.verbosity_options
-@cli_utils.build_config_option
+@cli_utils.build_config_option()
 @cli_utils.output_option("Save JSON output to file")
 @click.option(
     "--information/--no-information",
diff --git a/src/winml/modelkit/commands/compile.py b/src/winml/modelkit/commands/compile.py
index c5828a94f..7cdbfa819 100644
--- a/src/winml/modelkit/commands/compile.py
+++ b/src/winml/modelkit/commands/compile.py
@@ -104,7 +104,7 @@
     default=False,
     help="List available compilers for the selected device and exit",
 )
-@cli_utils.build_config_option
+@cli_utils.build_config_option()
 @click.pass_context
 def compile(
     ctx: click.Context,
diff --git a/src/winml/modelkit/commands/config.py b/src/winml/modelkit/commands/config.py
index 25fc88b7a..4172c6573 100644
--- a/src/winml/modelkit/commands/config.py
+++ b/src/winml/modelkit/commands/config.py
@@ -64,14 +64,7 @@ def _is_onnx_file(model_input: str) -> bool:
 
 
 @click.command("config")
-@click.option(
-    "-m",
-    "--model",
-    "hf_model",
-    default=None,
-    help="HuggingFace model ID (e.g., microsoft/resnet-50) or path to .onnx file. "
-    "Optional when --model-type is provided.",
-)
+@cli_utils.model_option(required=False, optional_message="Optional when --model-type is provided.")
 @click.option(
     "-t",
     "--task",
@@ -97,12 +90,7 @@ def _is_onnx_file(model_input: str) -> bool:
     default=None,
     help="Generate configs for submodules matching this class name (e.g., ResNetConvLayer)",
 )
-@click.option(
-    "-c",
-    "--config",
-    "config_file",
-    type=click.Path(exists=True),
-    default=None,
+@cli_utils.build_config_option(
     help="JSON config file with overrides (WinMLBuildConfig format)",
 )
 @click.option(
@@ -115,13 +103,11 @@ def _is_onnx_file(model_input: str) -> bool:
     "vision: height, width, num_channels; "
     "audio: feature_size, nb_max_frames, audio_sequence_length.",
 )
-@click.option(
-    "-d",
-    "--device",
-    "device",
-    type=click.Choice(["auto", "npu", "gpu", "cpu"], case_sensitive=False),
+@cli_utils.device_option(
+    required=False,
+    optional_message="Affects quant/compile config.",
     default="auto",
-    help="Target device (affects quant/compile config). Default: auto (no changes to config).",
+    include_auto=True,
 )
 @cli_utils.ep_option(
     required=False,
@@ -165,7 +151,7 @@ def _is_onnx_file(model_input: str) -> bool:
 )
 @cli_utils.trust_remote_code_option()
 def config(
-    hf_model: str | None,
+    model: str | None,
     task: str | None,
     model_class: str | None,
     model_type: str | None,
@@ -191,6 +177,9 @@ def config(
 
     Requires at least one of -m/--model, --model-type, or --model-class.
 
+    If device is auto or EP is None, they are inferred from the system configuration.
+    If both are specified, the combination is only validated but not against the system.
+
     \b
     Examples:
         # Basic usage - auto-detect everything
@@ -226,6 +215,7 @@ def config(
     if verbose:
         logging.basicConfig(level=logging.DEBUG)
 
+    hf_model = model  # rename for clarity in this function
     # Validate: at least one of -m, --model-type, or --model-class is required
     if hf_model is None and model_type is None and model_class is None:
         # Show header even for errors
@@ -443,18 +433,12 @@ def config(
 
             console.print("   \u2699\ufe0f  [bold]Resolution:[/bold]")
 
-            # Fix #4: Device from resolve_device (existing API)
-            from ..sysinfo import resolve_device as _rd
+            # Use the same resolution logic as the config generation to determine what to display
+            from ..sysinfo import resolve_check_device_ep
 
-            _resolved_dev, _ = _rd(device, ep=ep)
+            _resolved_dev, _, _resolved_eps = resolve_check_device_ep(device=device, ep=ep)
             console.print(f"      Device:     [cyan]{_resolved_dev.upper()}[/cyan]")
-
-            # EP — only shown when user explicitly passed --ep
-            if ep:
-                from ..utils.constants import normalize_ep_name
-
-                _ep_full = normalize_ep_name(ep) or ep
-                console.print(f"      EP:         [cyan]{_ep_full}[/cyan]")
+            console.print(f"      EP:         [cyan]{_resolved_eps[0]}[/cyan]")
 
             # Quant types — display exactly what config contains
             if _quant:
diff --git a/src/winml/modelkit/commands/eval.py b/src/winml/modelkit/commands/eval.py
index 8ff385c99..1ac039b4a 100644
--- a/src/winml/modelkit/commands/eval.py
+++ b/src/winml/modelkit/commands/eval.py
@@ -140,7 +140,7 @@
     default=False,
     help="Print expected dataset schema for the given --task and exit.",
 )
-@cli_utils.build_config_option
+@cli_utils.build_config_option()
 @click.pass_context
 def eval(
     ctx: click.Context,
@@ -191,8 +191,7 @@ def eval(
         if schema is None:
             supported = ", ".join(sorted(TASK_SCHEMAS))
             raise click.UsageError(
-                f"Task '{task_arg}' is not supported by `winml eval`. "
-                f"Supported tasks: {supported}."
+                f"Task '{task_arg}' is not supported by `winml eval`. Supported tasks: {supported}."
             )
         _print_schema(task_arg, schema)
         return
diff --git a/src/winml/modelkit/commands/export.py b/src/winml/modelkit/commands/export.py
index c4de564cf..b58c9c84c 100644
--- a/src/winml/modelkit/commands/export.py
+++ b/src/winml/modelkit/commands/export.py
@@ -128,7 +128,7 @@ def _delete_onnx_with_external_data(onnx_path: Path) -> None:
     default=None,
     help='JSON with shape overrides (e.g., {"sequence_length": 2048, "height": 640}).',
 )
-@cli_utils.build_config_option
+@cli_utils.build_config_option()
 @click.pass_context
 def export(
     ctx: click.Context,
@@ -267,6 +267,7 @@ def export(
 
     # Always auto-resolve input/output tensors via loader + Optimum
     from ..export import ONNXConfigNotFoundError
+
     try:
         from ..export import resolve_export_config as resolve_cfg
 
diff --git a/src/winml/modelkit/commands/perf.py b/src/winml/modelkit/commands/perf.py
index 7fb4337ea..ac511fcb6 100644
--- a/src/winml/modelkit/commands/perf.py
+++ b/src/winml/modelkit/commands/perf.py
@@ -756,6 +756,7 @@ def _perf_modules(
 # Report Generation
 # =============================================================================
 
+
 def _device_string(req_device: str, act_device: str, ep_name: EPName | None) -> str:
     device_str = f"{req_device} ({act_device})" if req_device != act_device else act_device
     if ep_name:
@@ -1195,7 +1196,7 @@ def _run_onnx_benchmark(
     default=False,
     help="Enable verbose output",
 )
-@cli_utils.build_config_option
+@cli_utils.build_config_option()
 @click.pass_context
 def perf(
     ctx: click.Context,
diff --git a/src/winml/modelkit/commands/quantize.py b/src/winml/modelkit/commands/quantize.py
index 759b57b98..9b4539165 100644
--- a/src/winml/modelkit/commands/quantize.py
+++ b/src/winml/modelkit/commands/quantize.py
@@ -107,7 +107,7 @@
     default=False,
     help="Enable verbose output",
 )
-@cli_utils.build_config_option
+@cli_utils.build_config_option()
 @click.pass_context
 def quantize(
     ctx: click.Context,
diff --git a/src/winml/modelkit/config/build.py b/src/winml/modelkit/config/build.py
index b8276c2e0..7636829bc 100644
--- a/src/winml/modelkit/config/build.py
+++ b/src/winml/modelkit/config/build.py
@@ -326,10 +326,10 @@ def resolve_quant_compile_config(
         Tuple of (quant_config, compile_config). Either may be None when the
         policy does not require that stage (e.g., CPU with fp32).
     """
-    from ..sysinfo import resolve_device
+    from ..sysinfo import resolve_check_device_ep
     from .precision import resolve_precision
 
-    resolved_device, available_devices = resolve_device(device=device, ep=ep)
+    resolved_device, available_devices, resolved_eps = resolve_check_device_ep(device=device, ep=ep)
     logger.info(
         "Device resolved: %s (available: %s)",
         resolved_device,
@@ -339,7 +339,7 @@ def resolve_quant_compile_config(
     policy = resolve_precision(
         device=resolved_device,
         precision=precision,
-        ep=ep,
+        ep=resolved_eps[0],
         available_devices=available_devices,
         task=task,
     )
@@ -625,12 +625,12 @@ class name. Uses torchinfo to discover submodules and infer
     # =========================================================================
     # STEP 4.5: Apply device/precision policy (affects quant + compile only)
     # =========================================================================
-    from ..sysinfo import resolve_device
+    from ..sysinfo import resolve_check_device_ep
     from .precision import resolve_precision
 
     # ALWAYS detect hardware — even when device="auto" — so we don't
     # blindly default to QNN on machines without an NPU (#412).
-    resolved_device, available_devices = resolve_device(device=device, ep=ep)
+    resolved_device, available_devices, resolved_eps = resolve_check_device_ep(device=device, ep=ep)
     logger.info(
         "Device resolved: %s (available: %s)",
         resolved_device,
@@ -640,7 +640,7 @@ class name. Uses torchinfo to discover submodules and infer
     policy = resolve_precision(
         device=resolved_device,
         precision=precision,
-        ep=ep,
+        ep=resolved_eps[0],
         available_devices=available_devices,
         task=parent_config.loader.task,
     )
diff --git a/src/winml/modelkit/sysinfo/__init__.py b/src/winml/modelkit/sysinfo/__init__.py
index 1cbbef058..71e8916ff 100644
--- a/src/winml/modelkit/sysinfo/__init__.py
+++ b/src/winml/modelkit/sysinfo/__init__.py
@@ -2,7 +2,13 @@
 # Copyright (c) Microsoft Corporation. All rights reserved.
 # Licensed under the MIT License.
 # --------------------------------------------------------------------------
-from .device import get_device_ep_map, get_ep_device_map, resolve_device, resolve_eps
+from .device import (
+    get_device_ep_map,
+    get_ep_device_map,
+    resolve_check_device_ep,
+    resolve_device,
+    resolve_eps,
+)
 from .hardware import CPU, GPU, NPU
 from .software import OS
 from .sysinfo import SysInfo
@@ -16,6 +22,7 @@
     "SysInfo",
     "get_device_ep_map",
     "get_ep_device_map",
+    "resolve_check_device_ep",
     "resolve_device",
     "resolve_eps",
 ]
diff --git a/src/winml/modelkit/sysinfo/device.py b/src/winml/modelkit/sysinfo/device.py
index 97cd6d16e..96bd6fb8a 100644
--- a/src/winml/modelkit/sysinfo/device.py
+++ b/src/winml/modelkit/sysinfo/device.py
@@ -223,3 +223,43 @@ def resolve_eps(resolved_device: str) -> list[EPName]:
     device = resolved_device.lower()
     available_eps = set(_get_device_ep_map_from_ort().get(device, ()))
     return [ep for ep in _DEVICE_EP_MAP.get(device, []) if ep in available_eps]
+
+
+def resolve_check_device_ep(
+    *, device: str = "auto", ep: EPNameOrAlias | None = None
+) -> tuple[str, list[str], list[EPName]]:
+    """Resolve or check that the requested device and/or EP combination is valid, raising if not.
+
+    Ideal for commands that do not need the device + ep actually exists on the system.
+
+    Args:
+        device: "auto", "npu", "gpu", or "cpu".
+        ep: Optional EP short name (e.g., "qnn", "dml"). When set,
+            availability is checked and an error is raised if no compatible EP
+            is found.
+
+    Raises:
+        ValueError: If the requested device or EP combination is not valid.
+
+    Returns:
+    Tuple of (resolved_device, available_devices, available_eps) where:
+    - resolved_device: The device that should be used based on the input parameters.
+    - available_devices: List of devices that are compatible with the first in available_eps
+    - available_eps: List of EPs that are compatible with the resolved device.
+    """
+    ep_name = normalize_ep_name(ep)
+    if device == "auto" or ep_name is None:
+        resolved_device, _ = resolve_device(device=device, ep=ep_name)
+        available_eps: list[EPName] = resolve_eps(resolved_device) if ep_name is None else [ep_name]
+        supported_devices = EP_SUPPORTED_DEVICES[available_eps[0]]
+        return resolved_device, list(supported_devices), available_eps
+
+    if ep_name not in EP_SUPPORTED_DEVICES:
+        raise ValueError(f"Unknown EP '{ep}'. Expected one of: {sorted(EP_SUPPORTED_DEVICES)}")
+    supported_devices = EP_SUPPORTED_DEVICES[ep_name]
+    if device.lower() not in supported_devices:
+        raise ValueError(
+            f"EP '{ep}' does not support device '{device}'. "
+            f"Supported devices for {ep_name}: {', '.join(supported_devices)}."
+        )
+    return device.lower(), list(supported_devices), [ep_name]
diff --git a/src/winml/modelkit/utils/cli.py b/src/winml/modelkit/utils/cli.py
index 61216f4fd..b6967c54c 100644
--- a/src/winml/modelkit/utils/cli.py
+++ b/src/winml/modelkit/utils/cli.py
@@ -74,7 +74,7 @@ def model_path_option(required=True):
     )
 
 
-def model_option(required=True):
+def model_option(required=True, optional_message=None):
     """Add --model option that accepts any model reference.
 
     Accepts a HuggingFace model ID, build output directory, or .onnx file path.
@@ -86,12 +86,15 @@ def model_option(required=True):
     Returns:
         Decorator function
     """
+    help = "Model: HF model ID, build output directory, or .onnx file path"
+    if optional_message:
+        help = f"{help}. {optional_message}"
     return click.option(
         "--model",
         "-m",
         required=required,
         default=None,
-        help="Model: HF model ID, build output directory, or .onnx file path",
+        help=help,
     )
 
 
@@ -169,6 +172,7 @@ def device_option(required=True, optional_message=None, default="NPU", include_a
         help_text = f"{help_text}. {optional_message}"
 
     return click.option(
+        "-d",
         "--device",
         required=required,
         default=default if not required else None,
@@ -209,17 +213,21 @@ def verbosity_options(f):
     return f  # noqa: RET504
 
 
-def build_config_option(func):
+def build_config_option(help: str | None = None):
     """Add -c/--config option for WinMLBuildConfig JSON file."""
+    if help is None:
+        help = (
+            "WinMLBuildConfig JSON file (from winml config). "
+            "Provides defaults; explicit CLI options take precedence."
+        )
     return click.option(
         "-c",
         "--config",
         "config_file",
         type=click.Path(exists=True, path_type=Path),
         default=None,
-        help="WinMLBuildConfig JSON file (from winml config). "
-        "Provides defaults; explicit CLI options take precedence.",
-    )(func)
+        help=help,
+    )
 
 
 def trust_remote_code_option(optional_message: str | None = None):
diff --git a/tests/cli/test_catalog_cli.py b/tests/cli/test_catalog_cli.py
index 5b3cd9b8f..ad3dad146 100644
--- a/tests/cli/test_catalog_cli.py
+++ b/tests/cli/test_catalog_cli.py
@@ -182,7 +182,7 @@ def test_invalid_ep_choice_exits_two(self) -> None:
     def test_invalid_device_choice_exits_two(self) -> None:
         result = _run("--device", "TPU")
         assert result.exit_code == 2
-        assert "Invalid value for '--device'" in result.output
+        assert "Invalid value for '-d' / '--device'" in result.output
 
     def test_short_flags_accepted(self, type_task_pair: tuple[str, str], tmp_path: Path) -> None:
         """-t / -k short aliases are accepted by the parser."""
diff --git a/tests/unit/commands/test_build.py b/tests/unit/commands/test_build.py
index 9bd1e6a72..a444b3463 100644
--- a/tests/unit/commands/test_build.py
+++ b/tests/unit/commands/test_build.py
@@ -30,15 +30,29 @@
 }
 
 
+def _fake_resolve_check_device_ep(*, device: str = "auto", ep: str | None = None):
+    """Side effect for resolve_check_device_ep that honours the requested device.
+
+    The build command's --device path calls resolve_quant_compile_config which
+    in turn calls resolve_check_device_ep. Tests pass explicit devices like
+    "npu", "gpu", "cpu" -- echo them back with a canonical EP so the downstream
+    precision policy resolves deterministically.
+    """
+    resolved = device.lower() if device != "auto" else "npu"
+    eps = _DEVICE_TO_EPS.get(resolved, ["CPUExecutionProvider"])
+    return resolved, ["npu", "gpu", "cpu"], eps
+
+
 @pytest.fixture(autouse=True)
 def mock_resolve_device():
-    """Mock resolve_device / resolve_eps to avoid hardware detection.
-
-    The build command calls resolve_device() / resolve_eps() to auto-select
-    an EP when ``--ep`` is not specified. Both must be mocked to avoid
-    slow DLL scanning and WinML SDK discovery on CI runners without WinML
-    installed. WinMLEPRegistry.get_instance is also patched defensively
-    for any downstream code path that may touch it.
+    """Mock device/EP resolution to avoid hardware detection.
+
+    The build command calls ``resolve_device`` / ``resolve_eps`` to auto-select
+    an EP when ``--ep`` is not specified, and ``resolve_check_device_ep`` (via
+    ``resolve_quant_compile_config``) when ``--device`` is explicit. All three
+    must be mocked to avoid slow DLL scanning and WinML SDK discovery on CI
+    runners without WinML installed. ``WinMLEPRegistry.get_instance`` is also
+    patched defensively for any downstream code path that may touch it.
     """
     mock_registry = MagicMock()
     mock_registry.is_ep_available.return_value = False
@@ -52,6 +66,10 @@ def mock_resolve_device():
             "winml.modelkit.sysinfo.resolve_eps",
             side_effect=lambda device: list(_DEVICE_TO_EPS.get(device, [])),
         ),
+        patch(
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            side_effect=_fake_resolve_check_device_ep,
+        ),
         patch(
             "winml.modelkit.session.ep_registry.WinMLEPRegistry.get_instance",
             return_value=mock_registry,
diff --git a/tests/unit/commands/test_config_cli.py b/tests/unit/commands/test_config_cli.py
index 841a2fa8d..c993b0e32 100644
--- a/tests/unit/commands/test_config_cli.py
+++ b/tests/unit/commands/test_config_cli.py
@@ -25,14 +25,15 @@
 
 @pytest.fixture(autouse=True)
 def mock_resolve_device():
-    """Mock resolve_device to avoid hardware detection in all config CLI tests.
+    """Mock resolve_check_device_ep to avoid hardware detection in CLI tests.
 
-    The config command may call resolve_device() for device/precision resolution.
-    We mock it at the source module since it's a lazy import.
+    The config command calls resolve_check_device_ep() (lazy import) for
+    device/EP resolution and display. We mock at the source module since the
+    import happens at call time.
     """
     with patch(
-        "winml.modelkit.sysinfo.resolve_device",
-        return_value=("npu", ["npu", "gpu", "cpu"]),
+        "winml.modelkit.sysinfo.resolve_check_device_ep",
+        return_value=("npu", ["npu", "gpu", "cpu"], ["QNNExecutionProvider"]),
     ):
         yield
 
diff --git a/tests/unit/config/conftest.py b/tests/unit/config/conftest.py
index 40cd23450..7328488f0 100644
--- a/tests/unit/config/conftest.py
+++ b/tests/unit/config/conftest.py
@@ -4,10 +4,11 @@
 # --------------------------------------------------------------------------
 """Shared fixtures for config tests.
 
-Mocks ``resolve_device`` and ``resolve_eps`` to avoid slow EP discovery
-in CI and to keep ``compile_provider`` resolution deterministic regardless
-of which EPs the test host has installed (e.g., OpenVINO would otherwise
-out-rank QNN/DML/CPU under the dynamic resolution in ``resolve_precision``).
+Mocks ``resolve_check_device_ep`` and ``resolve_eps`` to avoid slow EP
+discovery in CI and to keep ``compile_provider`` resolution deterministic
+regardless of which EPs the test host has installed (e.g., OpenVINO would
+otherwise out-rank QNN/DML/CPU under the dynamic resolution in
+``resolve_precision``).
 """
 
 from unittest.mock import patch
@@ -26,7 +27,10 @@
 def mock_resolve_device():
     """Mock device + EP resolution globally for all config tests.
 
-    - ``resolve_device``: stubbed so EP discovery via WinML doesn't slow CI.
+    - ``resolve_check_device_ep``: stubbed so EP discovery via WinML doesn't
+      slow CI. ``build.py`` calls this (not ``resolve_device`` directly), so
+      patching the higher-level entry point is what intercepts the lazy
+      import in ``generate_hf_build_config`` / ``resolve_quant_compile_config``.
     - ``resolve_eps``: returns a canonical single-EP list per device so
       ``resolve_precision`` produces deterministic ``compile_provider``
       values (QNN for npu, DML for gpu, CPU→None for cpu) independent of
@@ -34,8 +38,8 @@ def mock_resolve_device():
     """
     with (
         patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("npu", ["npu", "gpu", "cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("npu", ["npu", "gpu", "cpu"], ["QNNExecutionProvider"]),
         ),
         patch(
             "winml.modelkit.config.precision.resolve_eps",
diff --git a/tests/unit/config/test_build.py b/tests/unit/config/test_build.py
index fb1835254..66efdcd76 100644
--- a/tests/unit/config/test_build.py
+++ b/tests/unit/config/test_build.py
@@ -1975,10 +1975,18 @@ def test_config_gen_device_precision(
             ),
             patch("winml.modelkit.models.hf.MODEL_BUILD_CONFIGS", {}),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
                 return_value=(
                     "npu" if device == "auto" else device,
                     ["npu", "gpu", "cpu"],
+                    [
+                        {
+                            "npu": "QNNExecutionProvider",
+                            "gpu": "DmlExecutionProvider",
+                            "cpu": "CPUExecutionProvider",
+                            "auto": "QNNExecutionProvider",
+                        }[device]
+                    ],
                 ),
             ),
         ):
@@ -2043,8 +2051,8 @@ def test_auto_auto_is_noop(self) -> None:
         # Default compile provider is "qnn" (from WinMLCompileConfig -> EPConfig)
         assert result.compile.ep_config.provider == "qnn"
 
-    def test_auto_auto_still_calls_resolve_device(self) -> None:
-        """device='auto' + precision='auto' DOES call resolve_device (#412).
+    def test_auto_auto_still_calls_resolve_check_device_ep(self) -> None:
+        """device='auto' + precision='auto' DOES call resolve_check_device_ep (#412).
 
         Previously this was skipped, causing EPConfig to default to 'qnn'
         on machines without an NPU. Now we always detect hardware.
@@ -2064,9 +2072,9 @@ def test_auto_auto_still_calls_resolve_device(self) -> None:
             ),
             patch("winml.modelkit.models.hf.MODEL_BUILD_CONFIGS", {}),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("npu", ["npu", "gpu", "cpu"]),
-            ) as mock_rd,
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("npu", ["npu", "gpu", "cpu"], ["QNNExecutionProvider"]),
+            ) as mock_rcde,
         ):
             generate_build_config(
                 "bert-base-uncased",
@@ -2074,10 +2082,10 @@ def test_auto_auto_still_calls_resolve_device(self) -> None:
                 precision="auto",
             )
 
-        mock_rd.assert_called_once_with(device="auto", ep=None)
+        mock_rcde.assert_called_once_with(device="auto", ep=None)
 
-    def test_explicit_precision_triggers_resolve_device(self) -> None:
-        """device='auto' + precision='int8' DOES call resolve_device."""
+    def test_explicit_precision_triggers_resolve_check_device_ep(self) -> None:
+        """device='auto' + precision='int8' DOES call resolve_check_device_ep."""
         with (
             patch(
                 "winml.modelkit.config.build.resolve_loader_config",
@@ -2093,9 +2101,9 @@ def test_explicit_precision_triggers_resolve_device(self) -> None:
             ),
             patch("winml.modelkit.models.hf.MODEL_BUILD_CONFIGS", {}),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("npu", ["npu", "gpu", "cpu"]),
-            ) as mock_rd,
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("npu", ["npu", "gpu", "cpu"], ["QNNExecutionProvider"]),
+            ) as mock_rcde,
         ):
             generate_build_config(
                 "bert-base-uncased",
@@ -2103,7 +2111,7 @@ def test_explicit_precision_triggers_resolve_device(self) -> None:
                 precision="int8",
             )
 
-        mock_rd.assert_called_once()
+        mock_rcde.assert_called_once()
 
 
 # =============================================================================
@@ -2138,8 +2146,8 @@ def _mock_deps(
             ),
             "registry": patch("winml.modelkit.models.hf.MODEL_BUILD_CONFIGS", {}),
             "device": patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("npu", ["npu", "gpu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("npu", ["npu", "gpu", "cpu"], ["QNNExecutionProvider"]),
             ),
         }
 
@@ -2176,8 +2184,8 @@ def test_device_npu_produces_qnn(self, tmp_path) -> None:
     def test_device_gpu_precision_fp16(self, tmp_path) -> None:
         """--device gpu --precision fp16 → no quant, compile.provider=dml."""
         self._patches["device"] = patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("gpu", ["gpu", "cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("gpu", ["gpu", "cpu"], ["DmlExecutionProvider"]),
         )
         result, output_file = self._invoke(
             tmp_path,
@@ -2192,8 +2200,8 @@ def test_device_gpu_precision_fp16(self, tmp_path) -> None:
     def test_device_cpu_precision_fp32(self, tmp_path) -> None:
         """--device cpu --precision fp32 → no quant, no compile."""
         self._patches["device"] = patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("cpu", ["cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
         )
         result, output_file = self._invoke(
             tmp_path,
@@ -2271,8 +2279,8 @@ def test_config_onnx_with_device_precision(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("npu", ["npu", "gpu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("npu", ["npu", "gpu", "cpu"], ["QNNExecutionProvider"]),
             ),
         ):
             runner = CliRunner()
@@ -2349,8 +2357,8 @@ def test_raw_onnx_full_pipeline(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("npu", ["npu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("npu", ["npu", "cpu"], ["QNNExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(str(onnx_file), device="npu")
@@ -2371,8 +2379,8 @@ def test_raw_onnx_cpu(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("cpu", ["cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(str(onnx_file), device="cpu")
@@ -2390,8 +2398,8 @@ def test_quantized_onnx_skips_quant(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=True),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("npu", ["npu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("npu", ["npu", "cpu"], ["QNNExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(str(onnx_file), device="npu")
@@ -2409,8 +2417,8 @@ def test_quantized_onnx_cpu(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=True),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("cpu", ["cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(str(onnx_file), device="cpu")
@@ -2469,8 +2477,8 @@ def test_export_always_none(self, tmp_path) -> None:
                 patch("winml.modelkit.onnx.is_compiled_onnx", return_value=is_compiled),
                 patch("winml.modelkit.onnx.is_quantized_onnx", return_value=is_quantized),
                 patch(
-                    "winml.modelkit.sysinfo.resolve_device",
-                    return_value=("cpu", ["cpu"]),
+                    "winml.modelkit.sysinfo.resolve_check_device_ep",
+                    return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
                 ),
             ):
                 config = generate_onnx_build_config(str(onnx_file))
@@ -2492,8 +2500,8 @@ def test_optim_always_present(self, tmp_path) -> None:
                 patch("winml.modelkit.onnx.is_compiled_onnx", return_value=is_compiled),
                 patch("winml.modelkit.onnx.is_quantized_onnx", return_value=is_quantized),
                 patch(
-                    "winml.modelkit.sysinfo.resolve_device",
-                    return_value=("cpu", ["cpu"]),
+                    "winml.modelkit.sysinfo.resolve_check_device_ep",
+                    return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
                 ),
             ):
                 config = generate_onnx_build_config(str(onnx_file))
@@ -2511,8 +2519,8 @@ def test_task_stored_in_loader(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("cpu", ["cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(
@@ -2531,8 +2539,8 @@ def test_task_none_by_default(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("cpu", ["cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(str(onnx_file))
@@ -2559,8 +2567,8 @@ def test_override_applied_last(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("npu", ["npu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("npu", ["npu", "cpu"], ["QNNExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(
@@ -2588,8 +2596,8 @@ def test_override_quant_none_on_raw(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("npu", ["npu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("npu", ["npu", "cpu"], ["QNNExecutionProvider"]),
             ),
             patch(
                 "winml.modelkit.config.build.merge_config",
@@ -2650,8 +2658,8 @@ def test_override_none_is_noop(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("npu", ["npu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("npu", ["npu", "cpu"], ["QNNExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(
@@ -2677,8 +2685,8 @@ def test_onnx_path_as_string(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("cpu", ["cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(str(onnx_file))
@@ -2696,8 +2704,8 @@ def test_onnx_path_as_pathlib(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("cpu", ["cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(Path(onnx_file))
@@ -2717,8 +2725,8 @@ def test_auto_device_auto_precision_defaults(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("auto", ["npu", "gpu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("auto", ["npu", "gpu", "cpu"], ["CPUExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(str(onnx_file))
@@ -2752,8 +2760,8 @@ def test_raw_onnx_with_gpu(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("gpu", ["gpu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("gpu", ["gpu", "cpu"], ["DmlExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(str(onnx_file), device="gpu")
@@ -2771,8 +2779,8 @@ def test_ep_override_forwarded(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("gpu", ["gpu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("gpu", ["gpu", "cpu"], ["DmlExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(
@@ -2800,8 +2808,8 @@ class TestResolveQuantCompileConfig:
     def test_auto_auto_returns_none_none(self) -> None:
         """device=auto + precision=auto returns (None, None)."""
         with patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("auto", ["npu", "gpu", "cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("auto", ["npu", "gpu", "cpu"], ["CPUExecutionProvider"]),
         ):
             quant, compile_cfg = resolve_quant_compile_config()
 
@@ -2811,8 +2819,8 @@ def test_auto_auto_returns_none_none(self) -> None:
     def test_npu_returns_quant_and_compile(self) -> None:
         """device=npu returns (WinMLQuantizationConfig, WinMLCompileConfig)."""
         with patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("npu", ["npu", "cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("npu", ["npu", "cpu"], ["QNNExecutionProvider"]),
         ):
             quant, compile_cfg = resolve_quant_compile_config(device="npu")
 
@@ -2825,8 +2833,8 @@ def test_npu_returns_quant_and_compile(self) -> None:
     def test_gpu_returns_none_quant_and_none_compile(self) -> None:
         """device=gpu returns (None, None) — DML has no offline compile step."""
         with patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("gpu", ["gpu", "cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("gpu", ["gpu", "cpu"], ["DmlExecutionProvider"]),
         ):
             quant, compile_cfg = resolve_quant_compile_config(device="gpu")
 
@@ -2836,8 +2844,8 @@ def test_gpu_returns_none_quant_and_none_compile(self) -> None:
     def test_cpu_returns_none_none(self) -> None:
         """device=cpu returns (None, None) since CPU has no compile provider."""
         with patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("cpu", ["cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
         ):
             quant, compile_cfg = resolve_quant_compile_config(device="cpu")
 
@@ -2851,9 +2859,13 @@ def test_ep_override_changes_provider(self) -> None:
         Device is stored in ep_config.device (not provider_options) to avoid
         crashes when trtrtx gets device_type in add_provider_for_devices.
         """
+        # The mock must echo the requested ep back as available_eps[0] —
+        # resolve_quant_compile_config forwards available_eps[0] to
+        # resolve_precision, so the ep argument to resolve_precision needs to
+        # match what the user passed (nv_tensorrt_rtx).
         with patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("gpu", ["gpu", "cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("gpu", ["gpu", "cpu"], ["NvTensorRTRTXExecutionProvider"]),
         ):
             _quant, compile_cfg = resolve_quant_compile_config(
                 device="gpu",
@@ -2872,8 +2884,8 @@ def test_task_forwarded_to_resolve_precision(self) -> None:
         """
         with (
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("gpu", ["gpu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("gpu", ["gpu", "cpu"], ["DmlExecutionProvider"]),
             ),
             patch(
                 "winml.modelkit.config.precision.resolve_precision",
@@ -2890,8 +2902,8 @@ def test_task_forwarded_to_resolve_precision(self) -> None:
     def test_explicit_int8_precision_on_npu(self) -> None:
         """Explicit precision=int8 on npu produces uint8 quant."""
         with patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("npu", ["npu", "cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("npu", ["npu", "cpu"], ["QNNExecutionProvider"]),
         ):
             quant, _compile_cfg = resolve_quant_compile_config(
                 device="npu",
@@ -2905,8 +2917,8 @@ def test_explicit_int8_precision_on_npu(self) -> None:
     def test_explicit_fp32_precision_no_quant(self) -> None:
         """Explicit precision=fp32 produces no quantization."""
         with patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("gpu", ["gpu", "cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("gpu", ["gpu", "cpu"], ["DmlExecutionProvider"]),
         ):
             quant, _compile_cfg = resolve_quant_compile_config(
                 device="gpu",
diff --git a/tests/unit/config/test_build_onnx.py b/tests/unit/config/test_build_onnx.py
index 7510774df..50440069b 100644
--- a/tests/unit/config/test_build_onnx.py
+++ b/tests/unit/config/test_build_onnx.py
@@ -124,8 +124,8 @@ def test_config_onnx_with_device_precision(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("npu", ["npu", "gpu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("npu", ["npu", "gpu", "cpu"], ["QNNExecutionProvider"]),
             ),
         ):
             runner = CliRunner()
@@ -202,8 +202,8 @@ def test_raw_onnx_full_pipeline(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("npu", ["npu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("npu", ["npu", "cpu"], ["QNNExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(str(onnx_file), device="npu")
@@ -224,8 +224,8 @@ def test_raw_onnx_cpu(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("cpu", ["cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(str(onnx_file), device="cpu")
@@ -243,8 +243,8 @@ def test_quantized_onnx_skips_quant(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=True),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("npu", ["npu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("npu", ["npu", "cpu"], ["QNNExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(str(onnx_file), device="npu")
@@ -262,8 +262,8 @@ def test_quantized_onnx_cpu(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=True),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("cpu", ["cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(str(onnx_file), device="cpu")
@@ -322,8 +322,8 @@ def test_export_always_none(self, tmp_path) -> None:
                 patch("winml.modelkit.onnx.is_compiled_onnx", return_value=is_compiled),
                 patch("winml.modelkit.onnx.is_quantized_onnx", return_value=is_quantized),
                 patch(
-                    "winml.modelkit.sysinfo.resolve_device",
-                    return_value=("cpu", ["cpu"]),
+                    "winml.modelkit.sysinfo.resolve_check_device_ep",
+                    return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
                 ),
             ):
                 config = generate_onnx_build_config(str(onnx_file))
@@ -345,8 +345,8 @@ def test_optim_always_present(self, tmp_path) -> None:
                 patch("winml.modelkit.onnx.is_compiled_onnx", return_value=is_compiled),
                 patch("winml.modelkit.onnx.is_quantized_onnx", return_value=is_quantized),
                 patch(
-                    "winml.modelkit.sysinfo.resolve_device",
-                    return_value=("cpu", ["cpu"]),
+                    "winml.modelkit.sysinfo.resolve_check_device_ep",
+                    return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
                 ),
             ):
                 config = generate_onnx_build_config(str(onnx_file))
@@ -364,8 +364,8 @@ def test_task_stored_in_loader(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("cpu", ["cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(
@@ -384,8 +384,8 @@ def test_task_none_by_default(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("cpu", ["cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(str(onnx_file))
@@ -412,8 +412,8 @@ def test_override_applied_last(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("npu", ["npu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("npu", ["npu", "cpu"], ["QNNExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(
@@ -441,8 +441,8 @@ def test_override_quant_none_on_raw(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("npu", ["npu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("npu", ["npu", "cpu"], ["QNNExecutionProvider"]),
             ),
             patch(
                 "winml.modelkit.config.build.merge_config",
@@ -503,8 +503,8 @@ def test_override_none_is_noop(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("npu", ["npu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("npu", ["npu", "cpu"], ["QNNExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(
@@ -530,8 +530,8 @@ def test_onnx_path_as_string(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("cpu", ["cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(str(onnx_file))
@@ -549,8 +549,8 @@ def test_onnx_path_as_pathlib(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("cpu", ["cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(Path(onnx_file))
@@ -570,8 +570,8 @@ def test_auto_device_auto_precision_defaults(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("auto", ["npu", "gpu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("auto", ["npu", "gpu", "cpu"], ["CPUExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(str(onnx_file))
@@ -609,8 +609,8 @@ def test_raw_onnx_with_gpu(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("gpu", ["gpu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("gpu", ["gpu", "cpu"], ["DmlExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(str(onnx_file), device="gpu")
@@ -631,8 +631,8 @@ def test_ep_override_forwarded(self, tmp_path) -> None:
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("gpu", ["gpu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("gpu", ["gpu", "cpu"], ["DmlExecutionProvider"]),
             ),
         ):
             config = generate_onnx_build_config(
@@ -659,8 +659,8 @@ class TestResolveQuantCompileConfig:
     def test_auto_auto_returns_none_none(self) -> None:
         """device=auto + precision=auto returns (None, None)."""
         with patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("auto", ["npu", "gpu", "cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("auto", ["npu", "gpu", "cpu"], ["CPUExecutionProvider"]),
         ):
             quant, compile_cfg = resolve_quant_compile_config()
 
@@ -670,8 +670,8 @@ def test_auto_auto_returns_none_none(self) -> None:
     def test_npu_returns_quant_and_compile(self) -> None:
         """device=npu returns (WinMLQuantizationConfig, WinMLCompileConfig)."""
         with patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("npu", ["npu", "cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("npu", ["npu", "cpu"], ["QNNExecutionProvider"]),
         ):
             quant, compile_cfg = resolve_quant_compile_config(device="npu")
 
@@ -684,8 +684,8 @@ def test_npu_returns_quant_and_compile(self) -> None:
     def test_gpu_returns_none_quant_and_none_compile(self) -> None:
         """device=gpu returns (None, None) — DML has no EPContext step."""
         with patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("gpu", ["gpu", "cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("gpu", ["gpu", "cpu"], ["DmlExecutionProvider"]),
         ):
             quant, compile_cfg = resolve_quant_compile_config(device="gpu")
 
@@ -695,8 +695,8 @@ def test_gpu_returns_none_quant_and_none_compile(self) -> None:
     def test_cpu_returns_none_none(self) -> None:
         """device=cpu returns (None, None) since CPU has no compile provider."""
         with patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("cpu", ["cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("cpu", ["cpu"], ["CPUExecutionProvider"]),
         ):
             quant, compile_cfg = resolve_quant_compile_config(device="cpu")
 
@@ -710,9 +710,13 @@ def test_ep_override_changes_provider(self) -> None:
         Device is stored in ep_config.device (not provider_options) to avoid
         crashes when trtrtx gets device_type in add_provider_for_devices.
         """
+        # The mock must echo the requested ep back as available_eps[0] —
+        # resolve_quant_compile_config forwards available_eps[0] to
+        # resolve_precision, so the ep argument to resolve_precision needs to
+        # match what the user passed (nv_tensorrt_rtx).
         with patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("gpu", ["gpu", "cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("gpu", ["gpu", "cpu"], ["NvTensorRTRTXExecutionProvider"]),
         ):
             _quant, compile_cfg = resolve_quant_compile_config(
                 device="gpu",
@@ -731,8 +735,8 @@ def test_task_forwarded_to_resolve_precision(self) -> None:
         """
         with (
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("gpu", ["gpu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("gpu", ["gpu", "cpu"], ["DmlExecutionProvider"]),
             ),
             patch(
                 "winml.modelkit.config.precision.resolve_precision",
@@ -749,8 +753,8 @@ def test_task_forwarded_to_resolve_precision(self) -> None:
     def test_explicit_int8_precision_on_npu(self) -> None:
         """Explicit precision=int8 on npu produces uint8 quant."""
         with patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("npu", ["npu", "cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("npu", ["npu", "cpu"], ["QNNExecutionProvider"]),
         ):
             quant, _compile_cfg = resolve_quant_compile_config(
                 device="npu",
@@ -764,8 +768,8 @@ def test_explicit_int8_precision_on_npu(self) -> None:
     def test_explicit_fp32_precision_no_quant(self) -> None:
         """Explicit precision=fp32 produces no quantization."""
         with patch(
-            "winml.modelkit.sysinfo.resolve_device",
-            return_value=("gpu", ["gpu", "cpu"]),
+            "winml.modelkit.sysinfo.resolve_check_device_ep",
+            return_value=("gpu", ["gpu", "cpu"], ["DmlExecutionProvider"]),
         ):
             quant, _compile_cfg = resolve_quant_compile_config(
                 device="gpu",
diff --git a/tests/unit/models/auto/test_auto_onnx.py b/tests/unit/models/auto/test_auto_onnx.py
index 5766916ac..3daf2d6fc 100644
--- a/tests/unit/models/auto/test_auto_onnx.py
+++ b/tests/unit/models/auto/test_auto_onnx.py
@@ -50,8 +50,8 @@ def test_auto_generates_config_when_none(self, fake_onnx: Path, tmp_path: Path):
             patch("winml.modelkit.onnx.is_compiled_onnx", return_value=False),
             patch("winml.modelkit.onnx.is_quantized_onnx", return_value=False),
             patch(
-                "winml.modelkit.sysinfo.resolve_device",
-                return_value=("npu", ["npu", "cpu"]),
+                "winml.modelkit.sysinfo.resolve_check_device_ep",
+                return_value=("npu", ["npu", "cpu"], ["QNNExecutionProvider"]),
             ),
             patch(
                 "winml.modelkit.config.precision.resolve_eps",
diff --git a/tests/unit/sysinfo/test_device.py b/tests/unit/sysinfo/test_device.py
index bd92348ff..2429d4905 100644
--- a/tests/unit/sysinfo/test_device.py
+++ b/tests/unit/sysinfo/test_device.py
@@ -14,6 +14,7 @@
     _DEVICE_EP_MAP,
     _EP_DEVICE_MAP,
     _get_available_devices,
+    resolve_check_device_ep,
     resolve_device,
     resolve_eps,
 )
@@ -493,3 +494,157 @@ def test_no_eps_available_returns_empty(self) -> None:
             assert resolve_eps("npu") == []
             assert resolve_eps("gpu") == []
             assert resolve_eps("cpu") == []
+
+
+class TestResolveCheckDeviceEp:
+    """Tests for resolve_check_device_ep().
+
+    The function has two distinct code paths:
+
+    - **Path A** — ``device == "auto"`` OR ``ep is None``. Resolves the
+      concrete device via :func:`resolve_device` (system-aware: raises if the
+      device/EP is not present) and the EP list via :func:`resolve_eps`. The
+      returned ``available_devices`` is the *static* device set for the first
+      available EP (``EP_SUPPORTED_DEVICES[available_eps[0]]``), not the
+      runtime-available list — keeps the contract symmetric with Path B.
+    - **Path B** — explicit device AND explicit ep. Validates only against
+      ``EP_SUPPORTED_DEVICES``. Does **not** consult ORT, so it succeeds on
+      hosts with no EPs installed — for callers that just want to validate a
+      (device, ep) pair without running it.
+    """
+
+    def test_auto_no_ep_delegates_to_system(self) -> None:
+        """device='auto', ep=None -> Path A: device + EPs come from system probe."""
+        with _patch_device_ep_map(
+            {
+                "npu": ("QNNExecutionProvider",),
+                "gpu": ("DmlExecutionProvider",),
+                "cpu": ("CPUExecutionProvider",),
+            }
+        ):
+            device, available_devices, available_eps = resolve_check_device_ep(
+                device="auto", ep=None
+            )
+
+        assert device == "npu"
+        # available_devices is EP_SUPPORTED_DEVICES[available_eps[0]] (static),
+        # not the runtime device list. QNN supports npu/gpu, so cpu is absent
+        # even though the mocked system has it.
+        assert available_devices == ["npu", "gpu"]
+        # When ep=None, available_eps comes from resolve_eps -- the full list
+        # of EPs that target the resolved device, not a single explicit ep.
+        assert available_eps == ["QNNExecutionProvider"]
+
+    def test_auto_with_ep_returns_single_ep(self) -> None:
+        """device='auto', ep='qnn' -> Path A: available_eps narrows to [ep_name]."""
+        with _patch_device_ep_map(
+            {
+                "npu": ("QNNExecutionProvider",),
+                "gpu": ("QNNExecutionProvider", "DmlExecutionProvider"),
+                "cpu": ("CPUExecutionProvider",),
+            }
+        ):
+            device, available_devices, available_eps = resolve_check_device_ep(
+                device="auto", ep="qnn"
+            )
+
+        assert device == "npu"
+        assert available_devices == ["npu", "gpu"]
+        # Even though gpu also advertises DML, the EP filter pins this to qnn.
+        assert available_eps == ["QNNExecutionProvider"]
+
+    def test_explicit_device_no_ep_delegates(self) -> None:
+        """device='npu', ep=None -> Path A (ep_name is None): goes through resolve_device."""
+        with _patch_device_ep_map(
+            {
+                "npu": ("QNNExecutionProvider",),
+                "cpu": ("CPUExecutionProvider",),
+            }
+        ):
+            device, available_devices, available_eps = resolve_check_device_ep(
+                device="npu", ep=None
+            )
+
+        assert device == "npu"
+        # available_devices reflects the static EP_SUPPORTED_DEVICES for QNN
+        # (npu, gpu) rather than what the mocked system advertises.
+        assert available_devices == ["npu", "gpu"]
+        assert available_eps == ["QNNExecutionProvider"]
+
+    def test_explicit_device_and_ep_uses_static_mapping(self) -> None:
+        """device='npu', ep='qnn' -> Path B: returns from static EP_SUPPORTED_DEVICES.
+
+        The available_devices is the EP's supported set ('npu', 'gpu' for QNN),
+        not what the system actually exposes.
+        """
+        with _patch_device_ep_map(
+            {
+                "npu": ("QNNExecutionProvider",),
+                "cpu": ("CPUExecutionProvider",),
+            }
+        ):
+            device, available_devices, available_eps = resolve_check_device_ep(
+                device="npu", ep="qnn"
+            )
+
+        assert device == "npu"
+        assert sorted(available_devices) == ["gpu", "npu"]
+        assert available_eps == ["QNNExecutionProvider"]
+
+    def test_path_b_does_not_require_system_eps(self) -> None:
+        """Path B succeeds when the system has no EPs registered at all.
+
+        This is the key contract for callers that only need to *validate* a
+        (device, ep) pair without running it.
+        """
+        with _patch_device_ep_map({}):
+            device, available_devices, available_eps = resolve_check_device_ep(
+                device="npu", ep="qnn"
+            )
+
+        assert device == "npu"
+        assert "npu" in available_devices
+        assert available_eps == ["QNNExecutionProvider"]
+
+    def test_explicit_device_unsupported_by_ep_raises(self) -> None:
+        """device='cpu' + ep='qnn' -> ValueError: QNN does not support CPU."""
+        with (
+            _patch_device_ep_map({}),
+            pytest.raises(ValueError, match="does not support device 'cpu'"),
+        ):
+            resolve_check_device_ep(device="cpu", ep="qnn")
+
+    def test_explicit_unknown_ep_raises(self) -> None:
+        """device='npu' + ep='tpu' -> ValueError: 'Unknown EP'."""
+        with (
+            _patch_device_ep_map({}),
+            pytest.raises(ValueError, match="Unknown EP 'tpu'"),
+        ):
+            resolve_check_device_ep(device="npu", ep="tpu")
+
+    def test_auto_unknown_ep_raises_from_delegate(self) -> None:
+        """device='auto' + ep='tpu' -> Path A delegates to resolve_device, which raises.
+
+        Confirms the error message is consistent across paths so users get the
+        same diagnostic regardless of whether they passed an explicit device.
+        """
+        with (
+            _patch_device_ep_map(
+                {
+                    "npu": ("QNNExecutionProvider",),
+                    "cpu": ("CPUExecutionProvider",),
+                }
+            ),
+            pytest.raises(ValueError, match="Unknown EP 'tpu'"),
+        ):
+            resolve_check_device_ep(device="auto", ep="tpu")
+
+    def test_case_insensitive(self) -> None:
+        """Device and EP arguments are case-insensitive."""
+        with _patch_device_ep_map({}):
+            device, _available_devices, available_eps = resolve_check_device_ep(
+                device="NPU", ep="QNN"
+            )
+
+        assert device == "npu"
+        assert available_eps == ["QNNExecutionProvider"]

From 9ec03451d52b1895ebb2c24ebe90880a8668746a Mon Sep 17 00:00:00 2001
From: xieofxie <xieofxie@126.com>
Date: Thu, 28 May 2026 17:55:42 +0800
Subject: [PATCH 015/143] chore: enable checking types and fix analyze folder
 (#768)

Co-authored-by: hualxie <hualxie@microsoft.com>
---
 .github/workflows/lint.yml                    |  9 +++++
 pyproject.toml                                | 17 ++++++++
 src/winml/modelkit/analyze/analyzer.py        |  7 ++--
 src/winml/modelkit/analyze/console_writer.py  |  2 +-
 .../analyze/core/doc_constraint_checker.py    |  8 ++--
 .../analyze/core/information_engine.py        | 18 ++++++---
 .../constant_folding_validator.py             |  2 +-
 .../model_validator_manager.py                |  1 -
 .../pattern_matching_validator.py             |  2 +-
 .../node_checkers/ep_context_node_checker.py  |  1 +
 .../analyze/core/node_checkers/registry.py    |  2 +-
 .../modelkit/analyze/core/onnx_loader.py      |  3 +-
 .../analyze/core/output_aggregator.py         |  8 ++--
 .../analyze/core/pattern_extractor.py         | 15 ++++---
 .../modelkit/analyze/core/runtime_checker.py  |  2 +-
 .../analyze/core/runtime_checker_query.py     | 40 ++++++++++---------
 .../analyze/doc_checker/mapping_checkers.py   | 36 ++++++++++++++---
 .../modelkit/analyze/models/information.py    |  2 +-
 .../modelkit/analyze/models/runtime_checks.py |  2 +-
 .../analyze/pattern/check_patterns.py         |  7 +++-
 .../analyze/runtime_checker/check_ops.py      |  6 ++-
 .../runtime_checker/result_processor.py       | 21 ++++++----
 .../modelkit/analyze/utils/model_utils.py     | 38 ++++++++++++------
 .../modelkit/analyze/utils/node_key_utils.py  |  7 ++++
 src/winml/modelkit/analyze/utils/op_utils.py  |  6 +--
 src/winml/modelkit/pattern/config.py          |  7 ++--
 src/winml/modelkit/pattern/match.py           |  7 ++--
 27 files changed, 191 insertions(+), 85 deletions(-)

diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml
index c5a1dcb37..13b25b92c 100644
--- a/.github/workflows/lint.yml
+++ b/.github/workflows/lint.yml
@@ -33,3 +33,12 @@ jobs:
 
       - name: Lint
         run: uv run ruff check src/ tests/
+
+      # Type-checking is currently advisory: src/ has many pre-existing
+      # mypy errors against the strict config in pyproject.toml. The job
+      # surfaces type issues in CI logs without blocking PRs while the
+      # backlog is worked down. Flip continue-on-error to false (or drop
+      # it) once src/ is clean.
+      - name: Type check (advisory)
+        continue-on-error: true
+        run: uv run mypy -p winml.modelkit
diff --git a/pyproject.toml b/pyproject.toml
index 7c8f1bfd2..6ce286c71 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -110,9 +110,13 @@ scripts.winml = "winml.modelkit.cli:main"
 [dependency-groups]
 dev = [
   "jupyter>=1.1.1",
+  "pandas-stubs>=3.0.0.260204",
   "pre-commit>=4.5.1",
   "pytest-cov>=7",
   "pytest-timeout>=2.4.0",
+  "types-jsonschema>=4.26.0.20260518",
+  "types-protobuf>=7.34.1.20260518",
+  "types-tqdm>=4.67.3.20260518",
 ]
 
 [tool.setuptools.packages.find]
@@ -430,6 +434,11 @@ exclude_lines = [
 [tool.mypy]
 python_version = "3.11"
 
+# Pydantic v2 plugin: teaches mypy about Field() defaults, so models with
+# Field(None, ...) or Field(default_factory=...) aren't reported as having
+# required call-args.
+plugins = [ "pydantic.mypy" ]
+
 # === Strict Mode ===
 strict = true
 
@@ -523,6 +532,14 @@ disable_error_code = [
   "has-type",
 ]
 
+# Fully skipped: vendored/generated ONNX opset definitions
+[[tool.mypy.overrides]]
+module = [
+  "winml.modelkit.analyze.onnx_opset",
+  "winml.modelkit.analyze.onnx_opset.*",
+]
+ignore_errors = true
+
 # =============================================================================
 # PYTEST - Testing Configuration
 # =============================================================================
diff --git a/src/winml/modelkit/analyze/analyzer.py b/src/winml/modelkit/analyze/analyzer.py
index 36ccb34e7..5525bf061 100644
--- a/src/winml/modelkit/analyze/analyzer.py
+++ b/src/winml/modelkit/analyze/analyzer.py
@@ -31,6 +31,7 @@
 
     from .models.information import Action
     from .models.output import AnalysisOutput
+    from .models.runtime_checks import PatternRuntime
 
 
 @dataclass
@@ -390,7 +391,7 @@ def get_optimization_config(self, ep: EPNameOrAlias | None = None) -> WinMLOptim
         actions = self.get_optimization_opportunities(ep=ep)
 
         # Collect all optimization options from action items
-        optim_options = {}
+        optim_options: dict[str, bool] = {}
         for action in actions:
             for action_item in action.action_items:
                 # Only process GraphOptimization type
@@ -727,8 +728,8 @@ def analyze_from_proto(
         extraction_ms = int((time.perf_counter() - extraction_start) * 1000)
 
         # Step 2: Check runtime support for each EP
-        check_op_results = {}
-        information_list = {}
+        check_op_results: dict[EPName, list[PatternRuntime]] = {}
+        information_list: dict[EPName, list[Information]] = {}
         ep_runtime_timing: dict[str, int] = {}
         ep_info_timing: dict[str, int] = {}
         for current_ep in eps_to_analyze:
diff --git a/src/winml/modelkit/analyze/console_writer.py b/src/winml/modelkit/analyze/console_writer.py
index 6aa8d36aa..c8be13a3f 100644
--- a/src/winml/modelkit/analyze/console_writer.py
+++ b/src/winml/modelkit/analyze/console_writer.py
@@ -52,7 +52,7 @@ def _bright_cyan(self, text: str | int | float) -> str:
         """Format text in bright cyan."""
         return f"[bold cyan]{text}[/bold cyan]"
 
-    def _bright_green(self, text: str) -> str:
+    def _bright_green(self, text: str | int | float) -> str:
         """Format text in bright green."""
         return f"[bold green]{text}[/bold green]"
 
diff --git a/src/winml/modelkit/analyze/core/doc_constraint_checker.py b/src/winml/modelkit/analyze/core/doc_constraint_checker.py
index f1d25a88b..3b4757d83 100644
--- a/src/winml/modelkit/analyze/core/doc_constraint_checker.py
+++ b/src/winml/modelkit/analyze/core/doc_constraint_checker.py
@@ -141,7 +141,7 @@ def _load_constraints(self) -> dict[str, pd.DataFrame]:
 
         return op_dfs
 
-    def _load_mapping_config(self) -> dict:
+    def _load_mapping_config(self) -> dict[Any, Any]:
         """Load ONNX to target EP operator mapping configuration.
 
         Returns:
@@ -162,7 +162,7 @@ def _load_mapping_config(self) -> dict:
             return {}
 
         with mapping_path.open(encoding="utf-8") as f:
-            mapping_config = json.load(f)
+            mapping_config: dict[Any, Any] = json.load(f)
 
         logger.debug(f"Loaded operator mapping config for {self.ep_name}")
         return mapping_config
@@ -301,7 +301,9 @@ def _execute_checker(
         index = checker_info.get("index", 0)
 
         # Get the checker function
-        checker_func = self.CHECKER_FUNCTIONS.get(checker_name)
+        checker_func = (
+            self.CHECKER_FUNCTIONS.get(checker_name) if checker_name is not None else None
+        )
         if checker_func is None:
             logger.debug(f"Unknown checker function: {checker_name}")
             return False, f"Unknown checker function: {checker_name}"
diff --git a/src/winml/modelkit/analyze/core/information_engine.py b/src/winml/modelkit/analyze/core/information_engine.py
index 92d478181..d4c574b16 100644
--- a/src/winml/modelkit/analyze/core/information_engine.py
+++ b/src/winml/modelkit/analyze/core/information_engine.py
@@ -9,11 +9,15 @@
 validation checks.
 """
 
+# Defensive None-checks here are unreachable per the type annotations but kept
+# as runtime safety nets, so silence mypy's [unreachable] for this file only.
+# mypy: disable-error-code="unreachable"
+
 from __future__ import annotations
 
 import logging
 import time
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Any
 
 
 if TYPE_CHECKING:
@@ -24,7 +28,7 @@
     from ...utils.constants import EPName
     from ..models.information import Action, Information
     from ..models.onnx_model import ONNXModel
-    from ..models.runtime_checks import PatternRuntime
+    from ..models.runtime_checks import PatternAlternative, PatternRuntime
 
 from ..models.information import ActionLevel
 from ..models.support_level import SupportLevel
@@ -639,6 +643,9 @@ def _query_doc_constraints(self, runtime_result: PatternRuntime, pattern_id: str
                 return None
 
             # Query doc checker
+            if self._doc_checker is None:
+                logger.debug("Doc checker not initialized, skipping doc constraints query")
+                return None
             logger.debug("Running doc checker for node: %s", node.name)
             checker_start = time.perf_counter()
             doc_result = self._doc_checker.run_for_node(node)
@@ -723,7 +730,7 @@ def _determine_action_level_and_status(
         self,
         current_classification: SupportLevel,
         alternative_classification: SupportLevel,
-    ) -> tuple[ActionLevel, SupportLevel | None]:
+    ) -> tuple[ActionLevel | None, SupportLevel | None]:
         """Determine action level and status based on classification transition.
 
         Args:
@@ -854,11 +861,10 @@ def _create_action(
                     f"and no alternatives are available. "
                     f"Manual replacement or removal required."
                 )
-
         else:
             details = f"Pattern '{pattern_from_id}' status requires review."
 
-        action_kwargs = {
+        action_kwargs: dict[str, Any] = {
             "pattern_from_id": pattern_from_id,
             "pattern_to_id": pattern_to_id,
             "level": level,
@@ -1118,7 +1124,7 @@ def _process_predefined_information(
         classification = runtime_result.result.classification
 
         # Build a lookup map for alternatives by pattern_id
-        alternatives_map: dict[str, PatternRuntime] = {
+        alternatives_map: dict[str, PatternAlternative] = {
             alt.pattern_id: alt for alt in runtime_result.alternatives
         }
 
diff --git a/src/winml/modelkit/analyze/core/model_validators/constant_folding_validator.py b/src/winml/modelkit/analyze/core/model_validators/constant_folding_validator.py
index ebd463792..327de6153 100644
--- a/src/winml/modelkit/analyze/core/model_validators/constant_folding_validator.py
+++ b/src/winml/modelkit/analyze/core/model_validators/constant_folding_validator.py
@@ -66,7 +66,7 @@ def _collect_constant_nodes_from_runtime_results(self) -> list[dict]:
         Returns:
             List of dicts with node info: {name, op_type}
         """
-        constant_nodes = []
+        constant_nodes: list[dict] = []
 
         for runtime_result in self.op_runtime_results:
             # Check if this result has the ALL_INPUTS_CONSTANT tag
diff --git a/src/winml/modelkit/analyze/core/model_validators/model_validator_manager.py b/src/winml/modelkit/analyze/core/model_validators/model_validator_manager.py
index 5056cbcb9..29dd0235c 100644
--- a/src/winml/modelkit/analyze/core/model_validators/model_validator_manager.py
+++ b/src/winml/modelkit/analyze/core/model_validators/model_validator_manager.py
@@ -92,7 +92,6 @@ def __init__(
         self.model_proto = model.get_model()
         self.op_runtime_results = op_runtime_results or []
         self.device = device or "NPU"
-        self.device = device
         self.enabled_validators = enabled_validators or list(self.VALIDATORS.keys())
 
         # Instantiate enabled validators
diff --git a/src/winml/modelkit/analyze/core/model_validators/pattern_matching_validator.py b/src/winml/modelkit/analyze/core/model_validators/pattern_matching_validator.py
index faa478304..afdb59a2f 100644
--- a/src/winml/modelkit/analyze/core/model_validators/pattern_matching_validator.py
+++ b/src/winml/modelkit/analyze/core/model_validators/pattern_matching_validator.py
@@ -151,7 +151,7 @@ def _detect_error(self) -> tuple[PatternErrorConfig, str] | tuple[None, None]:
         return None, None
 
     def _create_information(
-        self, error_config: PatternErrorConfig, actual_error_msg: str
+        self, error_config: PatternErrorConfig, actual_error_msg: str | None
     ) -> Information:
         """Create Information object with pattern matching error details.
 
diff --git a/src/winml/modelkit/analyze/core/node_checkers/ep_context_node_checker.py b/src/winml/modelkit/analyze/core/node_checkers/ep_context_node_checker.py
index 65aea9a25..57bf0def6 100644
--- a/src/winml/modelkit/analyze/core/node_checkers/ep_context_node_checker.py
+++ b/src/winml/modelkit/analyze/core/node_checkers/ep_context_node_checker.py
@@ -45,6 +45,7 @@ def check(
     ) -> "PatternRuntime":
         """Check EPContext node partition_name against the execution provider name."""
         ep_name = kwargs.get("ep_name")
+        assert ep_name is not None, "ep_name must be provided for EPContextNodeChecker"
         partition_name = self.get_attribute_value(node, "partition_name")
         # Suffix every pattern_id with the EP label derived from
         # partition_name so multi-EP analysis reports stay disambiguated
diff --git a/src/winml/modelkit/analyze/core/node_checkers/registry.py b/src/winml/modelkit/analyze/core/node_checkers/registry.py
index fc6b7be8a..2cea0e7e1 100644
--- a/src/winml/modelkit/analyze/core/node_checkers/registry.py
+++ b/src/winml/modelkit/analyze/core/node_checkers/registry.py
@@ -8,7 +8,7 @@
 from .base import NodeChecker
 
 
-T = TypeVar("T")
+T = TypeVar("T", bound=type[NodeChecker])
 
 
 class NodeCheckerRegistry:
diff --git a/src/winml/modelkit/analyze/core/onnx_loader.py b/src/winml/modelkit/analyze/core/onnx_loader.py
index 5945f9508..7794f00d0 100644
--- a/src/winml/modelkit/analyze/core/onnx_loader.py
+++ b/src/winml/modelkit/analyze/core/onnx_loader.py
@@ -135,7 +135,8 @@ def load(self) -> ONNXModel:
         # Load model proto if not already loaded from memory
         if self._from_memory:
             # Type checker verified via constructor that _model_proto is not None here
-            model_proto = self._model_proto  # type: ignore[assignment]
+            assert self._model_proto is not None
+            model_proto = self._model_proto
             logger.info("Using ONNX model from memory")
         else:
             # Validate file extension (warning only, not blocking)
diff --git a/src/winml/modelkit/analyze/core/output_aggregator.py b/src/winml/modelkit/analyze/core/output_aggregator.py
index 64da3ce6f..3c93ea4fd 100644
--- a/src/winml/modelkit/analyze/core/output_aggregator.py
+++ b/src/winml/modelkit/analyze/core/output_aggregator.py
@@ -15,9 +15,11 @@
 
 
 if TYPE_CHECKING:
+    from ...utils.constants import EPName
     from ..models.information import Information
     from ..models.runtime_checks import PatternRuntime
 
+
 from ..models.output import AnalysisOutput, EPSupport, ModelStats
 from ..models.support_level import SupportLevel
 from ..utils import infer_ihv_from_ep_name
@@ -47,8 +49,8 @@ def __init__(self) -> None:
     def aggregate(
         self,
         metadata: ModelStats,
-        check_results: dict[str, list[PatternRuntime]],  # EP name -> check results
-        information_list: dict[str, list[Information]],  # EP name -> information
+        check_results: dict[EPName, list[PatternRuntime]],  # EP name -> check results
+        information_list: dict[EPName, list[Information]],  # EP name -> information
         device: str | None = None,  # Device type
     ) -> AnalysisOutput:
         """Aggregate all analysis results.
@@ -143,7 +145,7 @@ def build_ep_support(
         self,
         check_results: list[PatternRuntime],
         information_list: list[Information],
-        ep_type: str,
+        ep_type: EPName,
         device_type: str | None = None,
         ep_version: str | None = None,
         driver_version: str | None = None,
diff --git a/src/winml/modelkit/analyze/core/pattern_extractor.py b/src/winml/modelkit/analyze/core/pattern_extractor.py
index e2dd35279..43d188c8d 100644
--- a/src/winml/modelkit/analyze/core/pattern_extractor.py
+++ b/src/winml/modelkit/analyze/core/pattern_extractor.py
@@ -11,11 +11,11 @@
 
 import logging
 import time
-from typing import TYPE_CHECKING, TypedDict
+from typing import TYPE_CHECKING, TypedDict, cast
 
 from ...pattern.base import InvalidPatternMatcherModelError, PatternMatcher
 from ...pattern.config import UnifiedPatternConfig
-from ..models.onnx_model import ONNXModel
+from ..models.onnx_model import ModelTag, ONNXModel
 from ..models.output import extract_model_stats
 from ..utils.timing_utils import make_timing_logger
 
@@ -305,7 +305,7 @@ def extract_subgraph_patterns_with_pattern_matcher(self) -> list[PatternMatchRes
             # Model is invalid for pattern matching (e.g., nodes with empty names)
             logger.warning("Model validation failed for pattern matching: %s", str(e))
             # Mark model with the exception's associated tag and error message
-            self._model.model_tags[e.error_tag] = str(e)
+            self._model.model_tags[ModelTag(e.error_tag)] = str(e)
             _log_timing(
                 "pattern_extractor.pattern_matcher",
                 model=self._model.model_path,
@@ -500,6 +500,7 @@ def _match_subgraph_pattern_from_model_tags(
             return []
 
         pattern_label = pattern.semantic_label
+        assert pattern_label is not None  # ensured by _validate_pattern_for_matching
 
         # Get ONNX model
         model_proto = self._model.get_model()
@@ -577,10 +578,14 @@ def _match_subgraph_pattern_from_htp_metadata(
             return []
 
         pattern_label = pattern.semantic_label
+        assert pattern_label is not None  # ensured by _validate_pattern_for_matching
+
+        # The 'nodes' section of HTP metadata maps node names to traced tags (str -> str).
+        nodes_mapping = cast("dict[str, str]", htp_metadata["nodes"])
 
         # Group nodes by traced_tag that contains pattern_label
         grouped_nodes = self._group_nodes_by_traced_tag(
-            nodes_mapping=htp_metadata["nodes"],
+            nodes_mapping=nodes_mapping,
             pattern_label=pattern_label,
         )
 
@@ -666,7 +671,7 @@ def get_subgraph_patterns(self) -> list[SubgraphPattern]:
     def model_summary(
         self,
         detected_pattern_count: dict[str, int] | None = None,
-    ) -> ModelStats:  # type: ignore[name-defined]
+    ) -> ModelStats:
         """Get model metadata and statistics.
 
         Args:
diff --git a/src/winml/modelkit/analyze/core/runtime_checker.py b/src/winml/modelkit/analyze/core/runtime_checker.py
index 2f734da9b..19ae78dbb 100644
--- a/src/winml/modelkit/analyze/core/runtime_checker.py
+++ b/src/winml/modelkit/analyze/core/runtime_checker.py
@@ -484,7 +484,7 @@ def _get_node_key(self, op_runtime: PatternRuntime) -> str:
         """Extract stable node key from an op-level PatternRuntime."""
         pm = op_runtime.pattern_match
         if pm and hasattr(pm, "skeleton_match_result"):
-            node_keys = pm.skeleton_match_result.matched_node_keys
+            node_keys: list[str] = pm.skeleton_match_result.matched_node_keys
             if node_keys:
                 return node_keys[0]
         return ""
diff --git a/src/winml/modelkit/analyze/core/runtime_checker_query.py b/src/winml/modelkit/analyze/core/runtime_checker_query.py
index c59ec7575..d5962a036 100644
--- a/src/winml/modelkit/analyze/core/runtime_checker_query.py
+++ b/src/winml/modelkit/analyze/core/runtime_checker_query.py
@@ -289,9 +289,6 @@ def _read_and_sanitize_parquet_table(parquet_path: Path) -> pd.DataFrame | None:
 
     try:
         table_df = pd.read_parquet(parquet_path)
-        if not isinstance(table_df, pd.DataFrame):
-            return None
-
         table_df = table_df.replace({np.nan: None})
         return _sanitize_df(table_df)
     except Exception as e:
@@ -625,7 +622,7 @@ def get_query_conditions_for_node(
         runtime_checker_op = get_runtime_checker_op(node.op_type, domain=domain.value)(schema)
     except KeyError:
         raise OpUnsupportedError(f"Node {node.op_type} is not supported") from None
-    type_vars = {}
+    type_vars: dict[str, Any] = {}
 
     # fill missing attrs with default values; set None for optional attrs without defaults
     for k, v in schema.attributes.items():
@@ -680,7 +677,7 @@ def update_conditions_(
         is_constant: bool,
         shape: tuple[Any, ...] | list[Any] | None = None,
         value: Any | None = None,
-    ):
+    ) -> None:
         dyn_axes = _compute_dynamic_axes(shape, is_constant)
         if is_variadic:
             cond[f"{input_name}_is_constant"] = (
@@ -721,7 +718,7 @@ def _tensor_to_array_with_fallback(tensor: onnx.TensorProto) -> np.ndarray:
             try:
                 np_dtype = onnx.helper.tensor_dtype_to_np_dtype(tensor.data_type)
             except Exception:
-                np_dtype = np.float32
+                np_dtype = np.dtype(np.float32)
 
             shape = tuple(int(d) for d in tensor.dims)
             logger.warning(
@@ -829,9 +826,10 @@ def _tensor_to_array_with_fallback(tensor: onnx.TensorProto) -> np.ndarray:
                 type_vars[type_annotation] = dtype
         else:
             vi = valueinfo.get(inp_name)
-            shape, dtype = (None, None)
+            shape_seq: list | tuple[int, ...] | None = None
+            dtype = None
             if vi is not None:
-                shape, dtype = shape_and_dtype_from_valueinfo(vi)
+                shape_seq, dtype = shape_and_dtype_from_valueinfo(vi)
             else:
                 # Input is provided but valueinfo not found
                 # This commonly happens in quantized models where DequantizeLinear outputs
@@ -854,7 +852,7 @@ def _tensor_to_array_with_fallback(tensor: onnx.TensorProto) -> np.ndarray:
                 type_vars[type_annotation] = dtype
 
             is_constant = False  # QDQ doesn't care about constant status
-            update_conditions_(conditions, input_name, is_variadic, is_constant, shape, None)
+            update_conditions_(conditions, input_name, is_variadic, is_constant, shape_seq, None)
             conditions[f"{input_name}_is_none"] = False
 
     conditions["n_outputs"] = len(node.output)
@@ -873,9 +871,9 @@ def _tensor_to_array_with_fallback(tensor: onnx.TensorProto) -> np.ndarray:
             f"derive_properties: {e}"
         ) from e
 
-    for k, v in runtime_checker_op.type_var_dtypes_to_test.items():
-        if k not in type_vars:
-            type_vars[k] = v[0].annotation  # use first dtype as default
+    for tvar_name, dtypes in runtime_checker_op.type_var_dtypes_to_test.items():
+        if tvar_name not in type_vars:
+            type_vars[tvar_name] = dtypes[0].annotation  # use first dtype as default
     conditions.update(type_vars)
 
     qdq_conditions = _get_qdq_query_conditions_for_node(node, schema, input_to_dq, output_to_q)
@@ -950,7 +948,11 @@ def _compute_dynamic_axes(shape: tuple | None, is_constant: bool) -> tuple[int,
         conditions[f"attr_{attr_name}"] = attr_value
         conditions[f"attr_{attr_name}_is_none"] = attr_value is None
 
-    conditions["n_outputs"] = len(pattern_match.skeleton_match_result.pattern.get_schema().outputs)
+    pattern_obj = pattern_match.skeleton_match_result.pattern
+    assert hasattr(pattern_obj, "get_schema"), (
+        f"Pattern {type(pattern_obj).__name__} does not provide get_schema()"
+    )
+    conditions["n_outputs"] = len(pattern_obj.get_schema().outputs)
 
     # Derive additional properties via pattern input generator
     if gen is not None:
@@ -1493,6 +1495,7 @@ def _generate_model_inputs(self, model: onnx.ModelProto) -> dict[str, np.ndarray
 
             np_dtype = SupportedONNXType.from_annotation(dtype_str).np_type
 
+            concrete_shape: tuple[int, ...]
             if shape is None:
                 concrete_shape = (default_dim_size,)
             else:
@@ -1536,7 +1539,7 @@ def _generate_node_inputs(self, node: onnx.NodeProto) -> dict[str, np.ndarray]:
                 try:
                     np_dtype = onnx.helper.tensor_dtype_to_np_dtype(init.data_type)
                 except Exception:
-                    np_dtype = np.float32
+                    np_dtype = np.dtype(np.float32)
 
                 shape = tuple(int(d) for d in init.dims)
                 input_feed[inp_name] = np.zeros(shape, dtype=np_dtype)
@@ -1552,7 +1555,7 @@ def _generate_node_inputs(self, node: onnx.NodeProto) -> dict[str, np.ndarray]:
                     f"not found in valueinfo"
                 )
 
-            shape, dtype_str = shape_and_dtype_from_valueinfo(vi)
+            vi_shape, dtype_str = shape_and_dtype_from_valueinfo(vi)
             if dtype_str is None:
                 raise ValueError(
                     f"Input '{inp_name}' for node '{node.name}' ({node.op_type}) "
@@ -1562,13 +1565,14 @@ def _generate_node_inputs(self, node: onnx.NodeProto) -> dict[str, np.ndarray]:
             # Convert dtype string to numpy dtype
             np_dtype = SupportedONNXType.from_annotation(dtype_str).np_type
 
-            if shape is None:
+            concrete_shape: tuple[int, ...]
+            if vi_shape is None:
                 # No shape info at all - use a simple 1D array
                 concrete_shape = (default_dim_size,)
             else:
                 # Replace dynamic dimensions (strings or None) with default size
                 concrete_shape = tuple(
-                    d if isinstance(d, int) and d > 0 else default_dim_size for d in shape
+                    d if isinstance(d, int) and d > 0 else default_dim_size for d in vi_shape
                 )
 
             input_feed[inp_name] = np.zeros(concrete_shape, dtype=np_dtype)
@@ -2409,7 +2413,7 @@ def _finish(result: PatternRuntime, outcome: str) -> PatternRuntime:
         # Phase 1: Extract conditions to determine if node is QDQ
         is_qdq = False
 
-        def get_pattern_id(is_qdq):
+        def get_pattern_id(is_qdq: bool) -> str:
             return (
                 pattern_match.pattern.pattern_id + QDQ_SUFFIX
                 if is_qdq
diff --git a/src/winml/modelkit/analyze/doc_checker/mapping_checkers.py b/src/winml/modelkit/analyze/doc_checker/mapping_checkers.py
index eb0e3fa17..64cad5bd5 100644
--- a/src/winml/modelkit/analyze/doc_checker/mapping_checkers.py
+++ b/src/winml/modelkit/analyze/doc_checker/mapping_checkers.py
@@ -16,14 +16,28 @@
 removed in final design optimization (see ADR-002).
 """
 
-from typing import Any
+from __future__ import annotations
+
+from typing import TYPE_CHECKING, Any
 
 # Import from same op_checker package
 from .shape_checker import ShapeConstraintChecker
 from .value_checker import ValueConstraintChecker
 
 
-def check_input_rank(node, input_index: int, rank: int, valueinfo=None, **kwargs) -> bool:
+if TYPE_CHECKING:
+    from collections.abc import Callable
+
+    import onnx
+
+
+def check_input_rank(
+    node: onnx.NodeProto,
+    input_index: int,
+    rank: int,
+    valueinfo: dict[str, onnx.ValueInfoProto] | None = None,
+    **kwargs: Any,
+) -> bool:
     """Check if input at specified index has expected rank.
 
     Used to distinguish between 2D and 3D operator variants:
@@ -90,7 +104,12 @@ def check_input_rank(node, input_index: int, rank: int, valueinfo=None, **kwargs
     return success
 
 
-def check_attribute(node, attribute_name: str, expected_value: Any | list[Any], **kwargs) -> bool:
+def check_attribute(
+    node: onnx.NodeProto,
+    attribute_name: str,
+    expected_value: Any | list[Any],
+    **kwargs: Any,
+) -> bool:
     """Check if attribute matches expected value(s).
 
     Supports both single value comparison and multi-value matching.
@@ -139,6 +158,7 @@ def check_attribute(node, attribute_name: str, expected_value: Any | list[Any],
     for attr in node.attribute:
         if attr.name == attribute_name:
             # Extract attribute value based on type
+            value: Any = None
             if hasattr(attr, "s"):  # String
                 value = attr.s.decode("utf-8") if isinstance(attr.s, bytes) else attr.s
             elif hasattr(attr, "i"):  # Integer
@@ -161,13 +181,19 @@ def check_attribute(node, attribute_name: str, expected_value: Any | list[Any],
 
 # Checker function registry for operator mapping (only conditional mappings)
 # Note: Direct mappings don't need checkers - they map 1:1 without conditions
-CHECKER_REGISTRY = {
+# Callable[..., bool] because the two functions have different signatures but
+# are invoked uniformly via **kwargs at the call site.
+CHECKER_REGISTRY: dict[str, Callable[..., bool]] = {
     "check_input_rank": check_input_rank,
     "check_attribute": check_attribute,
 }
 
 
-def get_qnn_op_for_onnx_node(onnx_node, mapping_config, valueinfo=None):
+def get_qnn_op_for_onnx_node(
+    onnx_node: onnx.NodeProto,
+    mapping_config: dict[str, Any],
+    valueinfo: dict[str, onnx.ValueInfoProto] | None = None,
+) -> tuple[str | None, str]:
     """Determine QNN operator for an ONNX node using conditional checker framework.
 
     This is the main entry point for operator mapping resolution. It handles:
diff --git a/src/winml/modelkit/analyze/models/information.py b/src/winml/modelkit/analyze/models/information.py
index afd6d0d5a..481b9419d 100644
--- a/src/winml/modelkit/analyze/models/information.py
+++ b/src/winml/modelkit/analyze/models/information.py
@@ -35,7 +35,7 @@ class ActionItem(BaseModel):
     type: str = Field(
         ..., description="Type of transformation or optimization, e.g. Olive pass name"
     )
-    optimization_options: dict[str, object] | None = Field(
+    optimization_options: dict[str, bool] | None = Field(
         default=None, description="Configuration options"
     )
 
diff --git a/src/winml/modelkit/analyze/models/runtime_checks.py b/src/winml/modelkit/analyze/models/runtime_checks.py
index 5356534f3..ecd60ebaa 100644
--- a/src/winml/modelkit/analyze/models/runtime_checks.py
+++ b/src/winml/modelkit/analyze/models/runtime_checks.py
@@ -52,7 +52,7 @@ class RuntimeTestResult(BaseModel):
         default_factory=list, description="List of NodeTag enums for classifying this node"
     )
     debug_details: Any | None = Field(
-        None, description="Optional debug information for runtime checks"
+        default=None, description="Optional debug information for runtime checks"
     )
 
     @property
diff --git a/src/winml/modelkit/analyze/pattern/check_patterns.py b/src/winml/modelkit/analyze/pattern/check_patterns.py
index 771ff67fa..e5594f6fc 100644
--- a/src/winml/modelkit/analyze/pattern/check_patterns.py
+++ b/src/winml/modelkit/analyze/pattern/check_patterns.py
@@ -31,6 +31,9 @@
 
 
 if TYPE_CHECKING:
+    import argparse
+    from collections.abc import Callable
+
     import onnxruntime as ort
 
     from ...utils.constants import EPName
@@ -245,7 +248,7 @@ def get_ep_checker(ep_name: EPName, device: str) -> EPChecker:
         ValueError: If the execution provider name is not supported.
     """
     device_type = constants.DEVICE_TO_DEVICE_TYPE[device]
-    ep_name_to_checker: dict[str, Any] = {
+    ep_name_to_checker: dict[str, Callable[..., EPChecker]] = {
         "QNNExecutionProvider": QNNNPUChecker,
         "OpenVINOExecutionProvider": OpenVINONPUChecker,
         # Add other EPChecker subclasses here as needed
@@ -259,7 +262,7 @@ def get_ep_checker(ep_name: EPName, device: str) -> EPChecker:
     return ep_name_to_checker[ep_name](device_type=device_type)
 
 
-def build_parser():
+def build_parser() -> argparse.ArgumentParser:
     """Build argument parser for check_patterns-style commands."""
     import argparse
 
diff --git a/src/winml/modelkit/analyze/runtime_checker/check_ops.py b/src/winml/modelkit/analyze/runtime_checker/check_ops.py
index 907908032..b4fbcc022 100644
--- a/src/winml/modelkit/analyze/runtime_checker/check_ops.py
+++ b/src/winml/modelkit/analyze/runtime_checker/check_ops.py
@@ -33,6 +33,8 @@
 
 
 if TYPE_CHECKING:
+    import argparse
+
     from ...utils.constants import EPName
 from ...pattern.op_input_gen.qdq_gen import QDQGenerator
 from ...sysinfo import SysInfo
@@ -67,7 +69,7 @@ def check_ops(
     not_run_start_id: int = 1,
     case_index: str | list[str] | None = None,
     conflict_file: str | Path | None = None,
-):
+) -> None:
     """Run operators on execution provider.
 
     Args:
@@ -312,7 +314,7 @@ def get_ep_checker(ep_name: EPName, device: str) -> EPChecker:
     return ep_name_to_checker[ep_name](device_type=device_type)
 
 
-def build_parser():
+def build_parser() -> argparse.ArgumentParser:
     """Build argument parser for check_ops-style commands."""
     import argparse
 
diff --git a/src/winml/modelkit/analyze/runtime_checker/result_processor.py b/src/winml/modelkit/analyze/runtime_checker/result_processor.py
index e9614719f..0a48b8463 100644
--- a/src/winml/modelkit/analyze/runtime_checker/result_processor.py
+++ b/src/winml/modelkit/analyze/runtime_checker/result_processor.py
@@ -6,7 +6,7 @@
 
 import json
 from pathlib import Path
-from typing import TYPE_CHECKING, Any
+from typing import TYPE_CHECKING, Any, cast
 
 import numpy as np
 import pandas as pd
@@ -132,7 +132,7 @@ def _get_all_attr_names(check_results: list[dict[str, Any]]) -> set[str]:
 
 def item_to_row(
     item: dict[str, Any],
-    input_constraint_types: dict[str, str] | None = None,
+    input_constraint_types: dict[str, str],
     all_attr_names: set[str] | None = None,
     replace_float_with_dummy: bool = True,
     use_qdq: bool = False,
@@ -176,7 +176,7 @@ def item_to_row(
     # TODO: add _dyanmic_axes and _is_fixed_shape for QDQ?
     dynamic_axes = item.get("dynamic_axes", {})
 
-    def set_properties_for_dynamic_axes(input_name: str, is_constant: bool):
+    def set_properties_for_dynamic_axes(input_name: str, is_constant: bool) -> None:
         if "__" not in input_name:  # skip variadic inputs
             axes = dynamic_axes.get(input_name, ())
             res[f"{input_name}_is_constant"] = is_constant
@@ -398,7 +398,7 @@ def np_to_python_value(value: Any) -> Any:
 
 def extract_single_negative_rules(
     df: pd.DataFrame, result_col: str, ignored_cols: list[str]
-) -> dict[str, list[dict[str, Any]]]:
+) -> tuple[list[dict[str, list[dict[str, Any]]]], list[Any]]:
     """Extract single negative rules from DataFrame.
 
     A negative rule identifies property values that always lead to failure.
@@ -417,9 +417,9 @@ def extract_single_negative_rules(
 
     target_cols = [c for c in df.columns if c not in ignored_cols and c != result_col]
     if not target_cols:
-        return {}
+        return [], []
 
-    n_results = df[result_col][0].__len__()  # type: ignore[attr-defined]
+    n_results = df[result_col][0].__len__()
     assert n_results == 2
     all_negative_rules = []
     all_failed = []
@@ -735,7 +735,9 @@ def _json_safe_records(df: pd.DataFrame) -> list[dict[str, Any]]:
     """Convert dataframe records to JSON-safe Python objects."""
     if df.empty:
         return []
-    return json.loads(df.to_json(orient="records", force_ascii=False))
+    # df.to_json(orient="records") returns the records JSON array string;
+    # json.loads returns Any, so cast to the expected shape.
+    return cast("list[dict[str, Any]]", json.loads(df.to_json(orient="records", force_ascii=False)))
 
 
 def _encode_condition_columns_for_parquet(
@@ -875,8 +877,11 @@ def _encode_condition_columns_for_parquet(
                     schema, qdq_generator=qdq_generator if is_qdq else None
                 )
             except SchemaError:
+                # Use the already-converted ONNXDomain so the dict keys have
+                # one concrete type; mixing raw `op_domain: str` with the enum
+                # widens the inferred key type and breaks PatternInputGenerator.
                 domain_versions = {
-                    op_domain: opset_version,
+                    domain: opset_version,
                     ONNXDomain.COM_MICROSOFT: 1,
                 }
                 input_generator = get_pattern_input_generator(op_name)(domain_versions)
diff --git a/src/winml/modelkit/analyze/utils/model_utils.py b/src/winml/modelkit/analyze/utils/model_utils.py
index c48f8c8bd..a0eb49240 100644
--- a/src/winml/modelkit/analyze/utils/model_utils.py
+++ b/src/winml/modelkit/analyze/utils/model_utils.py
@@ -19,7 +19,7 @@
 from ...pattern.models import OperatorPattern, PatternType
 
 # Re-export shared utilities from pattern.utils
-from ...pattern.utils import (  # noqa: F401
+from ...pattern.utils import (
     DUMMY_FLOAT,
     collect_initializers,
     collect_valueinfo_dict,
@@ -32,6 +32,21 @@
 )
 
 
+__all__ = [
+    "DUMMY_FLOAT",
+    "collect_initializers",
+    "collect_valueinfo_dict",
+    "dtype_from_tensorproto_enum",
+    "encode_rule_condition_value_for_parquet",
+    "get_attribute_proto_value",
+    "get_op_input_properties",
+    "get_op_since_version",
+    "make_hashable",
+    "node_to_pattern_match",
+    "shape_and_dtype_from_valueinfo",
+]
+
+
 if TYPE_CHECKING:
     from onnx import NodeProto
 
@@ -46,25 +61,24 @@ def _normalize_for_parquet_encoding(value: object) -> object:
     """
     if isinstance(value, np.generic):
         return _normalize_for_parquet_encoding(value.item())
-
-    val_type = type(value)
     if value is None:
         return {"t": "none"}
-    if val_type is bool:
+    # bool must be checked before int (bool is an int subclass).
+    if isinstance(value, bool):
         return {"t": "bool", "v": value}
-    if val_type is int:
+    if isinstance(value, int):
         return {"t": "int", "v": value}
-    if val_type is float:
+    if isinstance(value, float):
         return {"t": "float", "v": repr(value)}
-    if val_type is str:
+    if isinstance(value, str):
         return {"t": "str", "v": value}
-    if val_type is bytes:
+    if isinstance(value, bytes):
         return {"t": "bytes", "v": base64.b64encode(value).decode("ascii")}
-    if val_type is tuple:
+    if isinstance(value, tuple):
         return {"t": "tuple", "v": [_normalize_for_parquet_encoding(v) for v in value]}
-    if val_type is list:
+    if isinstance(value, list):
         return {"t": "list", "v": [_normalize_for_parquet_encoding(v) for v in value]}
-    if val_type is dict:
+    if isinstance(value, dict):
         items = sorted(value.items(), key=lambda kv: str(kv[0]))
         return {
             "t": "dict",
@@ -75,7 +89,7 @@ def _normalize_for_parquet_encoding(value: object) -> object:
         }
 
     # Fallback: keep determinism and type visibility for unknown objects.
-    return {"t": "repr", "type": val_type.__name__, "v": repr(value)}
+    return {"t": "repr", "type": type(value).__name__, "v": repr(value)}
 
 
 def encode_rule_condition_value_for_parquet(value: object) -> str:
diff --git a/src/winml/modelkit/analyze/utils/node_key_utils.py b/src/winml/modelkit/analyze/utils/node_key_utils.py
index 0c2aaf52f..8e3abedb5 100644
--- a/src/winml/modelkit/analyze/utils/node_key_utils.py
+++ b/src/winml/modelkit/analyze/utils/node_key_utils.py
@@ -22,6 +22,13 @@
 from ...pattern.utils import make_stable_node_key
 
 
+__all__ = [
+    "build_node_key_by_node_id",
+    "make_stable_node_key",
+    "resolve_stable_node_key",
+]
+
+
 def build_node_key_by_node_id(graph_nodes: Sequence[onnx.NodeProto]) -> dict[int, str]:
     """Build id(node) -> stable-key map for a graph snapshot."""
     return {id(node): make_stable_node_key(node, index) for index, node in enumerate(graph_nodes)}
diff --git a/src/winml/modelkit/analyze/utils/op_utils.py b/src/winml/modelkit/analyze/utils/op_utils.py
index edf88ab58..d1a365b52 100644
--- a/src/winml/modelkit/analyze/utils/op_utils.py
+++ b/src/winml/modelkit/analyze/utils/op_utils.py
@@ -77,7 +77,7 @@ def compute_case_signature(case: dict, *, namespace: str) -> str:
         sig_parts.append(f"ns:{namespace}")
 
     def _safe_dump(obj: Any) -> str:
-        def _default(o: Any):
+        def _default(o: Any) -> Any:
             if isinstance(o, onnx.TensorProto):
                 return json.loads(json_format.MessageToJson(o))
             if isinstance(o, np.ndarray):
@@ -165,7 +165,7 @@ def __init__(
         self.file_path = Path(file_path)
         self.sys_info = sys_info
         self.save_per_cases = save_per_cases
-        self.results = []
+        self.results: list[dict[str, Any]] = []
         self.pending_count = 0
         self.rerun_failed = rerun_failed
         self.delta_only = delta_only
@@ -430,7 +430,7 @@ def _save(self) -> None:
             key=lambda x: x.get("case_index", "") if isinstance(x, dict) else "",
         )
 
-        def json_default(obj):
+        def json_default(obj: Any) -> Any:
             if isinstance(obj, onnx.TensorProto):
                 return json.loads(json_format.MessageToJson(obj))
             if isinstance(obj, np.ndarray):
diff --git a/src/winml/modelkit/pattern/config.py b/src/winml/modelkit/pattern/config.py
index d68a228f7..dfebb6332 100644
--- a/src/winml/modelkit/pattern/config.py
+++ b/src/winml/modelkit/pattern/config.py
@@ -19,8 +19,9 @@
 
 
 if TYPE_CHECKING:
-    from winml.modelkit.pattern.base import Pattern
-    from winml.modelkit.pattern.models import SubgraphPattern
+    from .base import Pattern
+    from .models import Pattern as PatternModel
+    from .models import SubgraphPattern
 
 logger = logging.getLogger(__name__)
 
@@ -196,7 +197,7 @@ def get_htp_patterns(self) -> list[SubgraphPattern]:
             self._load_config()
         return self._htp_patterns.copy()
 
-    def get_alternatives(self, pattern: Pattern) -> list[PatternAlternative]:
+    def get_alternatives(self, pattern: Pattern | PatternModel) -> list[PatternAlternative]:
         """Get alternative pattern metadata for the given pattern.
 
         Alternatives are used by runtime_checker to generate PatternTestResult,
diff --git a/src/winml/modelkit/pattern/match.py b/src/winml/modelkit/pattern/match.py
index e905399e7..25a9909f7 100644
--- a/src/winml/modelkit/pattern/match.py
+++ b/src/winml/modelkit/pattern/match.py
@@ -20,7 +20,8 @@
 if TYPE_CHECKING:
     from onnx import NodeProto
 
-    from winml.modelkit.pattern.base import Pattern, PatternMatcher
+    from .base import Pattern, PatternMatcher
+    from .models import Pattern as PatternModel
 
 
 @dataclass
@@ -59,7 +60,7 @@ class SkeletonMatchResult:
         matched_node_keys: List of stable node keys aligned with matched_nodes.
     """
 
-    pattern: "Pattern"  # Pattern instance
+    pattern: "Pattern | PatternModel"  # Pattern ABC instance or Pydantic Pattern model
     matched_nodes: list["NodeProto"]
     matcher: "PatternMatcher" = field(repr=False)  # PatternMatcher reference
     inputs: list[str] = field(default_factory=list)
@@ -102,7 +103,7 @@ class PatternMatchResult:
     match_id: str = field(default_factory=lambda: str(uuid.uuid4()))
 
     @property
-    def pattern(self):
+    def pattern(self) -> Pattern | PatternModel:
         """Get the pattern that was matched."""
         return self.skeleton_match_result.pattern
 

From 80bc7e0b5c1e8967cca84d1254ac58e7db0032d9 Mon Sep 17 00:00:00 2001
From: ssss141414 <407748083@qq.com>
Date: Fri, 29 May 2026 10:22:04 +0800
Subject: [PATCH 016/143] examples: add 12 builtin model recipes (fp16 + w8a8 +
 w8a16, 36 configs) (#785)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds curated recipe configs for the 12 builtin models — those that pass
fp16 eval on all 9 (EP, device) buckets.
---
 .../sentence-similarity_fp16_config.json      | 77 +++++++++++++++
 .../sentence-similarity_w8a16_config.json     | 94 +++++++++++++++++++
 .../sentence-similarity_w8a8_config.json      | 94 +++++++++++++++++++
 examples/recipes/README.md                    | 30 ++++++
 .../text-classification_fp16_config.json      | 63 +++++++++++++
 .../text-classification_w8a16_config.json     | 80 ++++++++++++++++
 .../text-classification_w8a8_config.json      | 80 ++++++++++++++++
 .../question-answering_fp16_config.json       | 69 ++++++++++++++
 .../question-answering_w8a16_config.json      | 86 +++++++++++++++++
 .../question-answering_w8a8_config.json       | 86 +++++++++++++++++
 .../question-answering_fp16_config.json       | 69 ++++++++++++++
 .../question-answering_w8a16_config.json      | 86 +++++++++++++++++
 .../question-answering_w8a8_config.json       | 86 +++++++++++++++++
 .../image-feature-extraction_fp16_config.json | 49 ++++++++++
 ...image-feature-extraction_w8a16_config.json | 66 +++++++++++++
 .../image-feature-extraction_w8a8_config.json | 66 +++++++++++++
 .../image-feature-extraction_fp16_config.json | 49 ++++++++++
 ...image-feature-extraction_w8a16_config.json | 66 +++++++++++++
 .../image-feature-extraction_w8a8_config.json | 66 +++++++++++++
 .../image-feature-extraction_fp16_config.json | 49 ++++++++++
 ...image-feature-extraction_w8a16_config.json | 66 +++++++++++++
 .../image-feature-extraction_w8a8_config.json | 66 +++++++++++++
 .../feature-extraction_fp16_config.json       | 71 ++++++++++++++
 .../feature-extraction_w8a16_config.json      | 88 +++++++++++++++++
 .../feature-extraction_w8a8_config.json       | 88 +++++++++++++++++
 .../image-feature-extraction_fp16_config.json | 49 ++++++++++
 ...image-feature-extraction_w8a16_config.json | 66 +++++++++++++
 .../image-feature-extraction_w8a8_config.json | 66 +++++++++++++
 .../feature-extraction_fp16_config.json       | 71 ++++++++++++++
 .../feature-extraction_w8a16_config.json      | 88 +++++++++++++++++
 .../feature-extraction_w8a8_config.json       | 88 +++++++++++++++++
 .../feature-extraction_fp16_config.json       | 77 +++++++++++++++
 .../feature-extraction_w8a16_config.json      | 94 +++++++++++++++++++
 .../feature-extraction_w8a8_config.json       | 94 +++++++++++++++++++
 .../sentence-similarity_fp16_config.json      | 77 +++++++++++++++
 .../sentence-similarity_w8a16_config.json     | 94 +++++++++++++++++++
 .../sentence-similarity_w8a8_config.json      | 94 +++++++++++++++++++
 37 files changed, 2748 insertions(+)
 create mode 100644 examples/recipes/BAAI_bge-large-en-v1.5/sentence-similarity_fp16_config.json
 create mode 100644 examples/recipes/BAAI_bge-large-en-v1.5/sentence-similarity_w8a16_config.json
 create mode 100644 examples/recipes/BAAI_bge-large-en-v1.5/sentence-similarity_w8a8_config.json
 create mode 100644 examples/recipes/README.md
 create mode 100644 examples/recipes/cardiffnlp_twitter-roberta-base-sentiment-latest/text-classification_fp16_config.json
 create mode 100644 examples/recipes/cardiffnlp_twitter-roberta-base-sentiment-latest/text-classification_w8a16_config.json
 create mode 100644 examples/recipes/cardiffnlp_twitter-roberta-base-sentiment-latest/text-classification_w8a8_config.json
 create mode 100644 examples/recipes/deepset_roberta-base-squad2/question-answering_fp16_config.json
 create mode 100644 examples/recipes/deepset_roberta-base-squad2/question-answering_w8a16_config.json
 create mode 100644 examples/recipes/deepset_roberta-base-squad2/question-answering_w8a8_config.json
 create mode 100644 examples/recipes/deepset_tinyroberta-squad2/question-answering_fp16_config.json
 create mode 100644 examples/recipes/deepset_tinyroberta-squad2/question-answering_w8a16_config.json
 create mode 100644 examples/recipes/deepset_tinyroberta-squad2/question-answering_w8a8_config.json
 create mode 100644 examples/recipes/facebook_dinov2-base/image-feature-extraction_fp16_config.json
 create mode 100644 examples/recipes/facebook_dinov2-base/image-feature-extraction_w8a16_config.json
 create mode 100644 examples/recipes/facebook_dinov2-base/image-feature-extraction_w8a8_config.json
 create mode 100644 examples/recipes/facebook_dinov2-small/image-feature-extraction_fp16_config.json
 create mode 100644 examples/recipes/facebook_dinov2-small/image-feature-extraction_w8a16_config.json
 create mode 100644 examples/recipes/facebook_dinov2-small/image-feature-extraction_w8a8_config.json
 create mode 100644 examples/recipes/google_vit-base-patch16-224-in21k/image-feature-extraction_fp16_config.json
 create mode 100644 examples/recipes/google_vit-base-patch16-224-in21k/image-feature-extraction_w8a16_config.json
 create mode 100644 examples/recipes/google_vit-base-patch16-224-in21k/image-feature-extraction_w8a8_config.json
 create mode 100644 examples/recipes/laion_CLIP-ViT-B-32-laion2B-s34B-b79K/feature-extraction_fp16_config.json
 create mode 100644 examples/recipes/laion_CLIP-ViT-B-32-laion2B-s34B-b79K/feature-extraction_w8a16_config.json
 create mode 100644 examples/recipes/laion_CLIP-ViT-B-32-laion2B-s34B-b79K/feature-extraction_w8a8_config.json
 create mode 100644 examples/recipes/microsoft_rad-dino/image-feature-extraction_fp16_config.json
 create mode 100644 examples/recipes/microsoft_rad-dino/image-feature-extraction_w8a16_config.json
 create mode 100644 examples/recipes/microsoft_rad-dino/image-feature-extraction_w8a8_config.json
 create mode 100644 examples/recipes/openai_clip-vit-base-patch16/feature-extraction_fp16_config.json
 create mode 100644 examples/recipes/openai_clip-vit-base-patch16/feature-extraction_w8a16_config.json
 create mode 100644 examples/recipes/openai_clip-vit-base-patch16/feature-extraction_w8a8_config.json
 create mode 100644 examples/recipes/sentence-transformers_all-MiniLM-L6-v2/feature-extraction_fp16_config.json
 create mode 100644 examples/recipes/sentence-transformers_all-MiniLM-L6-v2/feature-extraction_w8a16_config.json
 create mode 100644 examples/recipes/sentence-transformers_all-MiniLM-L6-v2/feature-extraction_w8a8_config.json
 create mode 100644 examples/recipes/sentence-transformers_all-MiniLM-L6-v2/sentence-similarity_fp16_config.json
 create mode 100644 examples/recipes/sentence-transformers_all-MiniLM-L6-v2/sentence-similarity_w8a16_config.json
 create mode 100644 examples/recipes/sentence-transformers_all-MiniLM-L6-v2/sentence-similarity_w8a8_config.json

diff --git a/examples/recipes/BAAI_bge-large-en-v1.5/sentence-similarity_fp16_config.json b/examples/recipes/BAAI_bge-large-en-v1.5/sentence-similarity_fp16_config.json
new file mode 100644
index 000000000..eed96889f
--- /dev/null
+++ b/examples/recipes/BAAI_bge-large-en-v1.5/sentence-similarity_fp16_config.json
@@ -0,0 +1,77 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          30522
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      },
+      {
+        "name": "token_type_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": null,
+  "loader": {
+    "task": "sentence-similarity",
+    "model_class": "AutoModel",
+    "model_type": "bert"
+  },
+  "eval": {
+    "task": "sentence-similarity",
+    "dataset": {
+      "path": "mteb/stsbenchmark-sts",
+      "split": "test",
+      "columns_mapping": {
+        "input_column_1": "sentence1",
+        "input_column_2": "sentence2",
+        "score_column": "score"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/BAAI_bge-large-en-v1.5/sentence-similarity_w8a16_config.json b/examples/recipes/BAAI_bge-large-en-v1.5/sentence-similarity_w8a16_config.json
new file mode 100644
index 000000000..6f1450a96
--- /dev/null
+++ b/examples/recipes/BAAI_bge-large-en-v1.5/sentence-similarity_w8a16_config.json
@@ -0,0 +1,94 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          30522
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      },
+      {
+        "name": "token_type_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint16",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "sentence-similarity",
+    "model_name": "BAAI/bge-large-en-v1.5"
+  },
+  "loader": {
+    "task": "sentence-similarity",
+    "model_class": "AutoModel",
+    "model_type": "bert"
+  },
+  "eval": {
+    "task": "sentence-similarity",
+    "dataset": {
+      "path": "mteb/stsbenchmark-sts",
+      "split": "test",
+      "columns_mapping": {
+        "input_column_1": "sentence1",
+        "input_column_2": "sentence2",
+        "score_column": "score"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/BAAI_bge-large-en-v1.5/sentence-similarity_w8a8_config.json b/examples/recipes/BAAI_bge-large-en-v1.5/sentence-similarity_w8a8_config.json
new file mode 100644
index 000000000..275363979
--- /dev/null
+++ b/examples/recipes/BAAI_bge-large-en-v1.5/sentence-similarity_w8a8_config.json
@@ -0,0 +1,94 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          30522
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      },
+      {
+        "name": "token_type_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint8",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "sentence-similarity",
+    "model_name": "BAAI/bge-large-en-v1.5"
+  },
+  "loader": {
+    "task": "sentence-similarity",
+    "model_class": "AutoModel",
+    "model_type": "bert"
+  },
+  "eval": {
+    "task": "sentence-similarity",
+    "dataset": {
+      "path": "mteb/stsbenchmark-sts",
+      "split": "test",
+      "columns_mapping": {
+        "input_column_1": "sentence1",
+        "input_column_2": "sentence2",
+        "score_column": "score"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/README.md b/examples/recipes/README.md
new file mode 100644
index 000000000..caaa2c15f
--- /dev/null
+++ b/examples/recipes/README.md
@@ -0,0 +1,30 @@
+# Built-in Model Recipes
+
+Curated recipe configuration samples for **portable, high-performance, and high-quality** AI models on Windows ML, working consistently across supported EPs.
+
+**Supported EPs:**
+
+DML/GPU, MLAS/CPU, OpenVINO (CPU/GPU/NPU), QNN (GPU/NPU), VitisAI/NPU, NVIDIA TensorRT RTX/GPU
+
+Each *(model, task)* includes:
+
+- `fp16`
+- `w8a8`
+- `w8a16` quantized variants
+
+## Models
+
+| Model | Task |
+|---|---|
+| BAAI/bge-large-en-v1.5 | sentence-similarity |
+| cardiffnlp/twitter-roberta-base-sentiment-latest | text-classification |
+| deepset/roberta-base-squad2 | question-answering |
+| deepset/tinyroberta-squad2 | question-answering |
+| facebook/dinov2-base | image-feature-extraction |
+| facebook/dinov2-small | image-feature-extraction |
+| google/vit-base-patch16-224-in21k | image-feature-extraction |
+| laion/CLIP-ViT-B-32-laion2B-s34B-b79K | feature-extraction |
+| microsoft/rad-dino | image-feature-extraction |
+| openai/clip-vit-base-patch16 | feature-extraction |
+| sentence-transformers/all-MiniLM-L6-v2 | feature-extraction |
+| sentence-transformers/all-MiniLM-L6-v2 | sentence-similarity |
diff --git a/examples/recipes/cardiffnlp_twitter-roberta-base-sentiment-latest/text-classification_fp16_config.json b/examples/recipes/cardiffnlp_twitter-roberta-base-sentiment-latest/text-classification_fp16_config.json
new file mode 100644
index 000000000..186a6cbb8
--- /dev/null
+++ b/examples/recipes/cardiffnlp_twitter-roberta-base-sentiment-latest/text-classification_fp16_config.json
@@ -0,0 +1,63 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          50265
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "logits"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": null,
+  "loader": {
+    "task": "text-classification",
+    "model_class": "AutoModelForSequenceClassification",
+    "model_type": "roberta"
+  },
+  "eval": {
+    "task": "text-classification",
+    "dataset": {
+      "path": "tweet_eval",
+      "name": "sentiment",
+      "columns_mapping": {
+        "input_column": "text"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/cardiffnlp_twitter-roberta-base-sentiment-latest/text-classification_w8a16_config.json b/examples/recipes/cardiffnlp_twitter-roberta-base-sentiment-latest/text-classification_w8a16_config.json
new file mode 100644
index 000000000..b96d6b2eb
--- /dev/null
+++ b/examples/recipes/cardiffnlp_twitter-roberta-base-sentiment-latest/text-classification_w8a16_config.json
@@ -0,0 +1,80 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          50265
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "logits"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint16",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "text-classification",
+    "model_name": "cardiffnlp/twitter-roberta-base-sentiment-latest"
+  },
+  "loader": {
+    "task": "text-classification",
+    "model_class": "AutoModelForSequenceClassification",
+    "model_type": "roberta"
+  },
+  "eval": {
+    "task": "text-classification",
+    "dataset": {
+      "path": "tweet_eval",
+      "name": "sentiment",
+      "columns_mapping": {
+        "input_column": "text"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/cardiffnlp_twitter-roberta-base-sentiment-latest/text-classification_w8a8_config.json b/examples/recipes/cardiffnlp_twitter-roberta-base-sentiment-latest/text-classification_w8a8_config.json
new file mode 100644
index 000000000..2cb7114d2
--- /dev/null
+++ b/examples/recipes/cardiffnlp_twitter-roberta-base-sentiment-latest/text-classification_w8a8_config.json
@@ -0,0 +1,80 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          50265
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "logits"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint8",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "text-classification",
+    "model_name": "cardiffnlp/twitter-roberta-base-sentiment-latest"
+  },
+  "loader": {
+    "task": "text-classification",
+    "model_class": "AutoModelForSequenceClassification",
+    "model_type": "roberta"
+  },
+  "eval": {
+    "task": "text-classification",
+    "dataset": {
+      "path": "tweet_eval",
+      "name": "sentiment",
+      "columns_mapping": {
+        "input_column": "text"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/deepset_roberta-base-squad2/question-answering_fp16_config.json b/examples/recipes/deepset_roberta-base-squad2/question-answering_fp16_config.json
new file mode 100644
index 000000000..e97d94cb3
--- /dev/null
+++ b/examples/recipes/deepset_roberta-base-squad2/question-answering_fp16_config.json
@@ -0,0 +1,69 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          50265
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "start_logits"
+      },
+      {
+        "name": "end_logits"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": null,
+  "loader": {
+    "task": "question-answering",
+    "model_class": "AutoModelForQuestionAnswering",
+    "model_type": "roberta"
+  },
+  "eval": {
+    "task": "question-answering",
+    "dataset": {
+      "path": "rajpurkar/squad_v2",
+      "split": "validation",
+      "columns_mapping": {
+        "question_column": "question",
+        "context_column": "context",
+        "id_column": "id",
+        "label_column": "answers"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/deepset_roberta-base-squad2/question-answering_w8a16_config.json b/examples/recipes/deepset_roberta-base-squad2/question-answering_w8a16_config.json
new file mode 100644
index 000000000..5fdbafca2
--- /dev/null
+++ b/examples/recipes/deepset_roberta-base-squad2/question-answering_w8a16_config.json
@@ -0,0 +1,86 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          50265
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "start_logits"
+      },
+      {
+        "name": "end_logits"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint16",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "question-answering",
+    "model_name": "deepset/roberta-base-squad2"
+  },
+  "loader": {
+    "task": "question-answering",
+    "model_class": "AutoModelForQuestionAnswering",
+    "model_type": "roberta"
+  },
+  "eval": {
+    "task": "question-answering",
+    "dataset": {
+      "path": "rajpurkar/squad_v2",
+      "split": "validation",
+      "columns_mapping": {
+        "question_column": "question",
+        "context_column": "context",
+        "id_column": "id",
+        "label_column": "answers"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/deepset_roberta-base-squad2/question-answering_w8a8_config.json b/examples/recipes/deepset_roberta-base-squad2/question-answering_w8a8_config.json
new file mode 100644
index 000000000..60f6039f1
--- /dev/null
+++ b/examples/recipes/deepset_roberta-base-squad2/question-answering_w8a8_config.json
@@ -0,0 +1,86 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          50265
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "start_logits"
+      },
+      {
+        "name": "end_logits"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint8",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "question-answering",
+    "model_name": "deepset/roberta-base-squad2"
+  },
+  "loader": {
+    "task": "question-answering",
+    "model_class": "AutoModelForQuestionAnswering",
+    "model_type": "roberta"
+  },
+  "eval": {
+    "task": "question-answering",
+    "dataset": {
+      "path": "rajpurkar/squad_v2",
+      "split": "validation",
+      "columns_mapping": {
+        "question_column": "question",
+        "context_column": "context",
+        "id_column": "id",
+        "label_column": "answers"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/deepset_tinyroberta-squad2/question-answering_fp16_config.json b/examples/recipes/deepset_tinyroberta-squad2/question-answering_fp16_config.json
new file mode 100644
index 000000000..e97d94cb3
--- /dev/null
+++ b/examples/recipes/deepset_tinyroberta-squad2/question-answering_fp16_config.json
@@ -0,0 +1,69 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          50265
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "start_logits"
+      },
+      {
+        "name": "end_logits"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": null,
+  "loader": {
+    "task": "question-answering",
+    "model_class": "AutoModelForQuestionAnswering",
+    "model_type": "roberta"
+  },
+  "eval": {
+    "task": "question-answering",
+    "dataset": {
+      "path": "rajpurkar/squad_v2",
+      "split": "validation",
+      "columns_mapping": {
+        "question_column": "question",
+        "context_column": "context",
+        "id_column": "id",
+        "label_column": "answers"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/deepset_tinyroberta-squad2/question-answering_w8a16_config.json b/examples/recipes/deepset_tinyroberta-squad2/question-answering_w8a16_config.json
new file mode 100644
index 000000000..7a006bd7f
--- /dev/null
+++ b/examples/recipes/deepset_tinyroberta-squad2/question-answering_w8a16_config.json
@@ -0,0 +1,86 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          50265
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "start_logits"
+      },
+      {
+        "name": "end_logits"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint16",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "question-answering",
+    "model_name": "deepset/tinyroberta-squad2"
+  },
+  "loader": {
+    "task": "question-answering",
+    "model_class": "AutoModelForQuestionAnswering",
+    "model_type": "roberta"
+  },
+  "eval": {
+    "task": "question-answering",
+    "dataset": {
+      "path": "rajpurkar/squad_v2",
+      "split": "validation",
+      "columns_mapping": {
+        "question_column": "question",
+        "context_column": "context",
+        "id_column": "id",
+        "label_column": "answers"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/deepset_tinyroberta-squad2/question-answering_w8a8_config.json b/examples/recipes/deepset_tinyroberta-squad2/question-answering_w8a8_config.json
new file mode 100644
index 000000000..0d8e54344
--- /dev/null
+++ b/examples/recipes/deepset_tinyroberta-squad2/question-answering_w8a8_config.json
@@ -0,0 +1,86 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          50265
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "start_logits"
+      },
+      {
+        "name": "end_logits"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint8",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "question-answering",
+    "model_name": "deepset/tinyroberta-squad2"
+  },
+  "loader": {
+    "task": "question-answering",
+    "model_class": "AutoModelForQuestionAnswering",
+    "model_type": "roberta"
+  },
+  "eval": {
+    "task": "question-answering",
+    "dataset": {
+      "path": "rajpurkar/squad_v2",
+      "split": "validation",
+      "columns_mapping": {
+        "question_column": "question",
+        "context_column": "context",
+        "id_column": "id",
+        "label_column": "answers"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/facebook_dinov2-base/image-feature-extraction_fp16_config.json b/examples/recipes/facebook_dinov2-base/image-feature-extraction_fp16_config.json
new file mode 100644
index 000000000..b3e1216fd
--- /dev/null
+++ b/examples/recipes/facebook_dinov2-base/image-feature-extraction_fp16_config.json
@@ -0,0 +1,49 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "pixel_values",
+        "dtype": "float32",
+        "shape": [
+          1,
+          3,
+          224,
+          224
+        ],
+        "value_range": [
+          0,
+          1
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {},
+  "quant": null,
+  "loader": {
+    "task": "image-feature-extraction",
+    "model_class": "AutoModel",
+    "model_type": "dinov2"
+  },
+  "eval": {
+    "task": "image-feature-extraction",
+    "dataset": {
+      "path": "timm/mini-imagenet",
+      "split": "test",
+      "samples": 1000
+    }
+  }
+}
diff --git a/examples/recipes/facebook_dinov2-base/image-feature-extraction_w8a16_config.json b/examples/recipes/facebook_dinov2-base/image-feature-extraction_w8a16_config.json
new file mode 100644
index 000000000..95915049a
--- /dev/null
+++ b/examples/recipes/facebook_dinov2-base/image-feature-extraction_w8a16_config.json
@@ -0,0 +1,66 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "pixel_values",
+        "dtype": "float32",
+        "shape": [
+          1,
+          3,
+          224,
+          224
+        ],
+        "value_range": [
+          0,
+          1
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {},
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint16",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "image-feature-extraction",
+    "model_name": "facebook/dinov2-base"
+  },
+  "loader": {
+    "task": "image-feature-extraction",
+    "model_class": "AutoModel",
+    "model_type": "dinov2"
+  },
+  "eval": {
+    "task": "image-feature-extraction",
+    "dataset": {
+      "path": "timm/mini-imagenet",
+      "split": "test",
+      "samples": 1000
+    }
+  }
+}
diff --git a/examples/recipes/facebook_dinov2-base/image-feature-extraction_w8a8_config.json b/examples/recipes/facebook_dinov2-base/image-feature-extraction_w8a8_config.json
new file mode 100644
index 000000000..f9e729c97
--- /dev/null
+++ b/examples/recipes/facebook_dinov2-base/image-feature-extraction_w8a8_config.json
@@ -0,0 +1,66 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "pixel_values",
+        "dtype": "float32",
+        "shape": [
+          1,
+          3,
+          224,
+          224
+        ],
+        "value_range": [
+          0,
+          1
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {},
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint8",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "image-feature-extraction",
+    "model_name": "facebook/dinov2-base"
+  },
+  "loader": {
+    "task": "image-feature-extraction",
+    "model_class": "AutoModel",
+    "model_type": "dinov2"
+  },
+  "eval": {
+    "task": "image-feature-extraction",
+    "dataset": {
+      "path": "timm/mini-imagenet",
+      "split": "test",
+      "samples": 1000
+    }
+  }
+}
diff --git a/examples/recipes/facebook_dinov2-small/image-feature-extraction_fp16_config.json b/examples/recipes/facebook_dinov2-small/image-feature-extraction_fp16_config.json
new file mode 100644
index 000000000..b3e1216fd
--- /dev/null
+++ b/examples/recipes/facebook_dinov2-small/image-feature-extraction_fp16_config.json
@@ -0,0 +1,49 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "pixel_values",
+        "dtype": "float32",
+        "shape": [
+          1,
+          3,
+          224,
+          224
+        ],
+        "value_range": [
+          0,
+          1
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {},
+  "quant": null,
+  "loader": {
+    "task": "image-feature-extraction",
+    "model_class": "AutoModel",
+    "model_type": "dinov2"
+  },
+  "eval": {
+    "task": "image-feature-extraction",
+    "dataset": {
+      "path": "timm/mini-imagenet",
+      "split": "test",
+      "samples": 1000
+    }
+  }
+}
diff --git a/examples/recipes/facebook_dinov2-small/image-feature-extraction_w8a16_config.json b/examples/recipes/facebook_dinov2-small/image-feature-extraction_w8a16_config.json
new file mode 100644
index 000000000..800258542
--- /dev/null
+++ b/examples/recipes/facebook_dinov2-small/image-feature-extraction_w8a16_config.json
@@ -0,0 +1,66 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "pixel_values",
+        "dtype": "float32",
+        "shape": [
+          1,
+          3,
+          224,
+          224
+        ],
+        "value_range": [
+          0,
+          1
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {},
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint16",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "image-feature-extraction",
+    "model_name": "facebook/dinov2-small"
+  },
+  "loader": {
+    "task": "image-feature-extraction",
+    "model_class": "AutoModel",
+    "model_type": "dinov2"
+  },
+  "eval": {
+    "task": "image-feature-extraction",
+    "dataset": {
+      "path": "timm/mini-imagenet",
+      "split": "test",
+      "samples": 1000
+    }
+  }
+}
diff --git a/examples/recipes/facebook_dinov2-small/image-feature-extraction_w8a8_config.json b/examples/recipes/facebook_dinov2-small/image-feature-extraction_w8a8_config.json
new file mode 100644
index 000000000..c3a3af051
--- /dev/null
+++ b/examples/recipes/facebook_dinov2-small/image-feature-extraction_w8a8_config.json
@@ -0,0 +1,66 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "pixel_values",
+        "dtype": "float32",
+        "shape": [
+          1,
+          3,
+          224,
+          224
+        ],
+        "value_range": [
+          0,
+          1
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {},
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint8",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "image-feature-extraction",
+    "model_name": "facebook/dinov2-small"
+  },
+  "loader": {
+    "task": "image-feature-extraction",
+    "model_class": "AutoModel",
+    "model_type": "dinov2"
+  },
+  "eval": {
+    "task": "image-feature-extraction",
+    "dataset": {
+      "path": "timm/mini-imagenet",
+      "split": "test",
+      "samples": 1000
+    }
+  }
+}
diff --git a/examples/recipes/google_vit-base-patch16-224-in21k/image-feature-extraction_fp16_config.json b/examples/recipes/google_vit-base-patch16-224-in21k/image-feature-extraction_fp16_config.json
new file mode 100644
index 000000000..628768221
--- /dev/null
+++ b/examples/recipes/google_vit-base-patch16-224-in21k/image-feature-extraction_fp16_config.json
@@ -0,0 +1,49 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "pixel_values",
+        "dtype": "float32",
+        "shape": [
+          1,
+          3,
+          224,
+          224
+        ],
+        "value_range": [
+          0,
+          1
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {},
+  "quant": null,
+  "loader": {
+    "task": "image-feature-extraction",
+    "model_class": "AutoModel",
+    "model_type": "vit"
+  },
+  "eval": {
+    "task": "image-feature-extraction",
+    "dataset": {
+      "path": "timm/mini-imagenet",
+      "split": "test",
+      "samples": 1000
+    }
+  }
+}
diff --git a/examples/recipes/google_vit-base-patch16-224-in21k/image-feature-extraction_w8a16_config.json b/examples/recipes/google_vit-base-patch16-224-in21k/image-feature-extraction_w8a16_config.json
new file mode 100644
index 000000000..fa44cce05
--- /dev/null
+++ b/examples/recipes/google_vit-base-patch16-224-in21k/image-feature-extraction_w8a16_config.json
@@ -0,0 +1,66 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "pixel_values",
+        "dtype": "float32",
+        "shape": [
+          1,
+          3,
+          224,
+          224
+        ],
+        "value_range": [
+          0,
+          1
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {},
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint16",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "image-feature-extraction",
+    "model_name": "google/vit-base-patch16-224-in21k"
+  },
+  "loader": {
+    "task": "image-feature-extraction",
+    "model_class": "AutoModel",
+    "model_type": "vit"
+  },
+  "eval": {
+    "task": "image-feature-extraction",
+    "dataset": {
+      "path": "timm/mini-imagenet",
+      "split": "test",
+      "samples": 1000
+    }
+  }
+}
diff --git a/examples/recipes/google_vit-base-patch16-224-in21k/image-feature-extraction_w8a8_config.json b/examples/recipes/google_vit-base-patch16-224-in21k/image-feature-extraction_w8a8_config.json
new file mode 100644
index 000000000..7124b916a
--- /dev/null
+++ b/examples/recipes/google_vit-base-patch16-224-in21k/image-feature-extraction_w8a8_config.json
@@ -0,0 +1,66 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "pixel_values",
+        "dtype": "float32",
+        "shape": [
+          1,
+          3,
+          224,
+          224
+        ],
+        "value_range": [
+          0,
+          1
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {},
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint8",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "image-feature-extraction",
+    "model_name": "google/vit-base-patch16-224-in21k"
+  },
+  "loader": {
+    "task": "image-feature-extraction",
+    "model_class": "AutoModel",
+    "model_type": "vit"
+  },
+  "eval": {
+    "task": "image-feature-extraction",
+    "dataset": {
+      "path": "timm/mini-imagenet",
+      "split": "test",
+      "samples": 1000
+    }
+  }
+}
diff --git a/examples/recipes/laion_CLIP-ViT-B-32-laion2B-s34B-b79K/feature-extraction_fp16_config.json b/examples/recipes/laion_CLIP-ViT-B-32-laion2B-s34B-b79K/feature-extraction_fp16_config.json
new file mode 100644
index 000000000..5a186260f
--- /dev/null
+++ b/examples/recipes/laion_CLIP-ViT-B-32-laion2B-s34B-b79K/feature-extraction_fp16_config.json
@@ -0,0 +1,71 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          77
+        ],
+        "value_range": [
+          0,
+          49408
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          77
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "text_embeds"
+      },
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true,
+    "gelu_fusion": true,
+    "layer_norm_fusion": true,
+    "matmul_add_fusion": true
+  },
+  "quant": null,
+  "loader": {
+    "task": "feature-extraction",
+    "model_class": "CLIPTextModelWithProjection",
+    "model_type": "clip_text_model"
+  },
+  "eval": {
+    "task": "feature-extraction",
+    "dataset": {
+      "path": "mteb/stsbenchmark-sts",
+      "split": "test",
+      "columns_mapping": {
+        "input_column_1": "sentence1",
+        "input_column_2": "sentence2",
+        "score_column": "score"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/laion_CLIP-ViT-B-32-laion2B-s34B-b79K/feature-extraction_w8a16_config.json b/examples/recipes/laion_CLIP-ViT-B-32-laion2B-s34B-b79K/feature-extraction_w8a16_config.json
new file mode 100644
index 000000000..11ba24b65
--- /dev/null
+++ b/examples/recipes/laion_CLIP-ViT-B-32-laion2B-s34B-b79K/feature-extraction_w8a16_config.json
@@ -0,0 +1,88 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          77
+        ],
+        "value_range": [
+          0,
+          49408
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          77
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "text_embeds"
+      },
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true,
+    "gelu_fusion": true,
+    "layer_norm_fusion": true,
+    "matmul_add_fusion": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint16",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "feature-extraction",
+    "model_name": "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
+  },
+  "loader": {
+    "task": "feature-extraction",
+    "model_class": "CLIPTextModelWithProjection",
+    "model_type": "clip_text_model"
+  },
+  "eval": {
+    "task": "feature-extraction",
+    "dataset": {
+      "path": "mteb/stsbenchmark-sts",
+      "split": "test",
+      "columns_mapping": {
+        "input_column_1": "sentence1",
+        "input_column_2": "sentence2",
+        "score_column": "score"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/laion_CLIP-ViT-B-32-laion2B-s34B-b79K/feature-extraction_w8a8_config.json b/examples/recipes/laion_CLIP-ViT-B-32-laion2B-s34B-b79K/feature-extraction_w8a8_config.json
new file mode 100644
index 000000000..849dc5481
--- /dev/null
+++ b/examples/recipes/laion_CLIP-ViT-B-32-laion2B-s34B-b79K/feature-extraction_w8a8_config.json
@@ -0,0 +1,88 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          77
+        ],
+        "value_range": [
+          0,
+          49408
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          77
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "text_embeds"
+      },
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true,
+    "gelu_fusion": true,
+    "layer_norm_fusion": true,
+    "matmul_add_fusion": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint8",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "feature-extraction",
+    "model_name": "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
+  },
+  "loader": {
+    "task": "feature-extraction",
+    "model_class": "CLIPTextModelWithProjection",
+    "model_type": "clip_text_model"
+  },
+  "eval": {
+    "task": "feature-extraction",
+    "dataset": {
+      "path": "mteb/stsbenchmark-sts",
+      "split": "test",
+      "columns_mapping": {
+        "input_column_1": "sentence1",
+        "input_column_2": "sentence2",
+        "score_column": "score"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/microsoft_rad-dino/image-feature-extraction_fp16_config.json b/examples/recipes/microsoft_rad-dino/image-feature-extraction_fp16_config.json
new file mode 100644
index 000000000..35f88c701
--- /dev/null
+++ b/examples/recipes/microsoft_rad-dino/image-feature-extraction_fp16_config.json
@@ -0,0 +1,49 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "pixel_values",
+        "dtype": "float32",
+        "shape": [
+          1,
+          3,
+          518,
+          518
+        ],
+        "value_range": [
+          0,
+          1
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {},
+  "quant": null,
+  "loader": {
+    "task": "image-feature-extraction",
+    "model_class": "AutoModel",
+    "model_type": "dinov2"
+  },
+  "eval": {
+    "task": "image-feature-extraction",
+    "dataset": {
+      "path": "Ewakaa/pneumonia_classification_chest_xray",
+      "split": "test",
+      "samples": 582
+    }
+  }
+}
diff --git a/examples/recipes/microsoft_rad-dino/image-feature-extraction_w8a16_config.json b/examples/recipes/microsoft_rad-dino/image-feature-extraction_w8a16_config.json
new file mode 100644
index 000000000..5e037c79c
--- /dev/null
+++ b/examples/recipes/microsoft_rad-dino/image-feature-extraction_w8a16_config.json
@@ -0,0 +1,66 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "pixel_values",
+        "dtype": "float32",
+        "shape": [
+          1,
+          3,
+          518,
+          518
+        ],
+        "value_range": [
+          0,
+          1
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {},
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint16",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "image-feature-extraction",
+    "model_name": "microsoft/rad-dino"
+  },
+  "loader": {
+    "task": "image-feature-extraction",
+    "model_class": "AutoModel",
+    "model_type": "dinov2"
+  },
+  "eval": {
+    "task": "image-feature-extraction",
+    "dataset": {
+      "path": "Ewakaa/pneumonia_classification_chest_xray",
+      "split": "test",
+      "samples": 582
+    }
+  }
+}
diff --git a/examples/recipes/microsoft_rad-dino/image-feature-extraction_w8a8_config.json b/examples/recipes/microsoft_rad-dino/image-feature-extraction_w8a8_config.json
new file mode 100644
index 000000000..7f02b647f
--- /dev/null
+++ b/examples/recipes/microsoft_rad-dino/image-feature-extraction_w8a8_config.json
@@ -0,0 +1,66 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "pixel_values",
+        "dtype": "float32",
+        "shape": [
+          1,
+          3,
+          518,
+          518
+        ],
+        "value_range": [
+          0,
+          1
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {},
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint8",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "image-feature-extraction",
+    "model_name": "microsoft/rad-dino"
+  },
+  "loader": {
+    "task": "image-feature-extraction",
+    "model_class": "AutoModel",
+    "model_type": "dinov2"
+  },
+  "eval": {
+    "task": "image-feature-extraction",
+    "dataset": {
+      "path": "Ewakaa/pneumonia_classification_chest_xray",
+      "split": "test",
+      "samples": 582
+    }
+  }
+}
diff --git a/examples/recipes/openai_clip-vit-base-patch16/feature-extraction_fp16_config.json b/examples/recipes/openai_clip-vit-base-patch16/feature-extraction_fp16_config.json
new file mode 100644
index 000000000..5a186260f
--- /dev/null
+++ b/examples/recipes/openai_clip-vit-base-patch16/feature-extraction_fp16_config.json
@@ -0,0 +1,71 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          77
+        ],
+        "value_range": [
+          0,
+          49408
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          77
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "text_embeds"
+      },
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true,
+    "gelu_fusion": true,
+    "layer_norm_fusion": true,
+    "matmul_add_fusion": true
+  },
+  "quant": null,
+  "loader": {
+    "task": "feature-extraction",
+    "model_class": "CLIPTextModelWithProjection",
+    "model_type": "clip_text_model"
+  },
+  "eval": {
+    "task": "feature-extraction",
+    "dataset": {
+      "path": "mteb/stsbenchmark-sts",
+      "split": "test",
+      "columns_mapping": {
+        "input_column_1": "sentence1",
+        "input_column_2": "sentence2",
+        "score_column": "score"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/openai_clip-vit-base-patch16/feature-extraction_w8a16_config.json b/examples/recipes/openai_clip-vit-base-patch16/feature-extraction_w8a16_config.json
new file mode 100644
index 000000000..98323e872
--- /dev/null
+++ b/examples/recipes/openai_clip-vit-base-patch16/feature-extraction_w8a16_config.json
@@ -0,0 +1,88 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          77
+        ],
+        "value_range": [
+          0,
+          49408
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          77
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "text_embeds"
+      },
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true,
+    "gelu_fusion": true,
+    "layer_norm_fusion": true,
+    "matmul_add_fusion": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint16",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "feature-extraction",
+    "model_name": "openai/clip-vit-base-patch16"
+  },
+  "loader": {
+    "task": "feature-extraction",
+    "model_class": "CLIPTextModelWithProjection",
+    "model_type": "clip_text_model"
+  },
+  "eval": {
+    "task": "feature-extraction",
+    "dataset": {
+      "path": "mteb/stsbenchmark-sts",
+      "split": "test",
+      "columns_mapping": {
+        "input_column_1": "sentence1",
+        "input_column_2": "sentence2",
+        "score_column": "score"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/openai_clip-vit-base-patch16/feature-extraction_w8a8_config.json b/examples/recipes/openai_clip-vit-base-patch16/feature-extraction_w8a8_config.json
new file mode 100644
index 000000000..ae40a470a
--- /dev/null
+++ b/examples/recipes/openai_clip-vit-base-patch16/feature-extraction_w8a8_config.json
@@ -0,0 +1,88 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          77
+        ],
+        "value_range": [
+          0,
+          49408
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          77
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "text_embeds"
+      },
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true,
+    "gelu_fusion": true,
+    "layer_norm_fusion": true,
+    "matmul_add_fusion": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint8",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "feature-extraction",
+    "model_name": "openai/clip-vit-base-patch16"
+  },
+  "loader": {
+    "task": "feature-extraction",
+    "model_class": "CLIPTextModelWithProjection",
+    "model_type": "clip_text_model"
+  },
+  "eval": {
+    "task": "feature-extraction",
+    "dataset": {
+      "path": "mteb/stsbenchmark-sts",
+      "split": "test",
+      "columns_mapping": {
+        "input_column_1": "sentence1",
+        "input_column_2": "sentence2",
+        "score_column": "score"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/feature-extraction_fp16_config.json b/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/feature-extraction_fp16_config.json
new file mode 100644
index 000000000..ba6575a35
--- /dev/null
+++ b/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/feature-extraction_fp16_config.json
@@ -0,0 +1,77 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          30522
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      },
+      {
+        "name": "token_type_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": null,
+  "loader": {
+    "task": "feature-extraction",
+    "model_class": "AutoModel",
+    "model_type": "bert"
+  },
+  "eval": {
+    "task": "feature-extraction",
+    "dataset": {
+      "path": "mteb/stsbenchmark-sts",
+      "split": "test",
+      "columns_mapping": {
+        "input_column_1": "sentence1",
+        "input_column_2": "sentence2",
+        "score_column": "score"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/feature-extraction_w8a16_config.json b/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/feature-extraction_w8a16_config.json
new file mode 100644
index 000000000..d05456800
--- /dev/null
+++ b/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/feature-extraction_w8a16_config.json
@@ -0,0 +1,94 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          30522
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      },
+      {
+        "name": "token_type_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint16",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "feature-extraction",
+    "model_name": "sentence-transformers/all-MiniLM-L6-v2"
+  },
+  "loader": {
+    "task": "feature-extraction",
+    "model_class": "AutoModel",
+    "model_type": "bert"
+  },
+  "eval": {
+    "task": "feature-extraction",
+    "dataset": {
+      "path": "mteb/stsbenchmark-sts",
+      "split": "test",
+      "columns_mapping": {
+        "input_column_1": "sentence1",
+        "input_column_2": "sentence2",
+        "score_column": "score"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/feature-extraction_w8a8_config.json b/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/feature-extraction_w8a8_config.json
new file mode 100644
index 000000000..c7de443dc
--- /dev/null
+++ b/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/feature-extraction_w8a8_config.json
@@ -0,0 +1,94 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          30522
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      },
+      {
+        "name": "token_type_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint8",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "feature-extraction",
+    "model_name": "sentence-transformers/all-MiniLM-L6-v2"
+  },
+  "loader": {
+    "task": "feature-extraction",
+    "model_class": "AutoModel",
+    "model_type": "bert"
+  },
+  "eval": {
+    "task": "feature-extraction",
+    "dataset": {
+      "path": "mteb/stsbenchmark-sts",
+      "split": "test",
+      "columns_mapping": {
+        "input_column_1": "sentence1",
+        "input_column_2": "sentence2",
+        "score_column": "score"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/sentence-similarity_fp16_config.json b/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/sentence-similarity_fp16_config.json
new file mode 100644
index 000000000..eed96889f
--- /dev/null
+++ b/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/sentence-similarity_fp16_config.json
@@ -0,0 +1,77 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          30522
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      },
+      {
+        "name": "token_type_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": null,
+  "loader": {
+    "task": "sentence-similarity",
+    "model_class": "AutoModel",
+    "model_type": "bert"
+  },
+  "eval": {
+    "task": "sentence-similarity",
+    "dataset": {
+      "path": "mteb/stsbenchmark-sts",
+      "split": "test",
+      "columns_mapping": {
+        "input_column_1": "sentence1",
+        "input_column_2": "sentence2",
+        "score_column": "score"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/sentence-similarity_w8a16_config.json b/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/sentence-similarity_w8a16_config.json
new file mode 100644
index 000000000..49c2bb8b8
--- /dev/null
+++ b/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/sentence-similarity_w8a16_config.json
@@ -0,0 +1,94 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          30522
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      },
+      {
+        "name": "token_type_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint16",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "sentence-similarity",
+    "model_name": "sentence-transformers/all-MiniLM-L6-v2"
+  },
+  "loader": {
+    "task": "sentence-similarity",
+    "model_class": "AutoModel",
+    "model_type": "bert"
+  },
+  "eval": {
+    "task": "sentence-similarity",
+    "dataset": {
+      "path": "mteb/stsbenchmark-sts",
+      "split": "test",
+      "columns_mapping": {
+        "input_column_1": "sentence1",
+        "input_column_2": "sentence2",
+        "score_column": "score"
+      }
+    }
+  }
+}
diff --git a/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/sentence-similarity_w8a8_config.json b/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/sentence-similarity_w8a8_config.json
new file mode 100644
index 000000000..29534eb50
--- /dev/null
+++ b/examples/recipes/sentence-transformers_all-MiniLM-L6-v2/sentence-similarity_w8a8_config.json
@@ -0,0 +1,94 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "input_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          30522
+        ]
+      },
+      {
+        "name": "attention_mask",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      },
+      {
+        "name": "token_type_ids",
+        "dtype": "int32",
+        "shape": [
+          1,
+          512
+        ],
+        "value_range": [
+          0,
+          2
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "last_hidden_state"
+      }
+    ]
+  },
+  "optim": {
+    "clamp_constant_values": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint8",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "sentence-similarity",
+    "model_name": "sentence-transformers/all-MiniLM-L6-v2"
+  },
+  "loader": {
+    "task": "sentence-similarity",
+    "model_class": "AutoModel",
+    "model_type": "bert"
+  },
+  "eval": {
+    "task": "sentence-similarity",
+    "dataset": {
+      "path": "mteb/stsbenchmark-sts",
+      "split": "test",
+      "columns_mapping": {
+        "input_column_1": "sentence1",
+        "input_column_2": "sentence2",
+        "score_column": "score"
+      }
+    }
+  }
+}

From 7152b82333dd03a6f6ce44e5ce34544d8f54cb90 Mon Sep 17 00:00:00 2001
From: Charles Zhang <zhangchao@microsoft.com>
Date: Fri, 29 May 2026 15:01:14 +0800
Subject: [PATCH 017/143] Validate model task in config. (#723)

---
 src/winml/modelkit/build/hf.py                |  19 +-
 src/winml/modelkit/commands/build.py          | 191 +++++++++++-
 src/winml/modelkit/commands/config.py         |  10 +-
 src/winml/modelkit/loader/hf.py               |  13 +-
 src/winml/modelkit/utils/cli.py               |  10 +
 tests/unit/commands/test_build.py             |  45 ++-
 .../unit/commands/test_build_validate_task.py | 291 ++++++++++++++++++
 tests/unit/commands/test_config_cli.py        |   2 +-
 .../unit/loader/test_resolve_loader_config.py |   5 +-
 9 files changed, 565 insertions(+), 21 deletions(-)
 create mode 100644 tests/unit/commands/test_build_validate_task.py

diff --git a/src/winml/modelkit/build/hf.py b/src/winml/modelkit/build/hf.py
index 627f8a1c6..e9e67aeb3 100644
--- a/src/winml/modelkit/build/hf.py
+++ b/src/winml/modelkit/build/hf.py
@@ -435,14 +435,28 @@ def _load_model(
     model_id: str | None,
     trust_remote_code: bool,
     random_init: bool = False,
+    hf_config: Any | None = None,
 ) -> Any:
-    """Load PyTorch model — pretrained or random weights."""
+    """Load PyTorch model — pretrained or random weights.
+
+    Args:
+        config: Build config (loader fields used).
+        model_id: HuggingFace model ID or local path.
+        trust_remote_code: Whether to trust remote code.
+        random_init: If True, build with random weights (no download).
+        hf_config: Optional pre-loaded ``PretrainedConfig`` to reuse. When
+            provided, skips the ``AutoConfig.from_pretrained`` round-trip in
+            both the random-init path and the pretrained ``load_hf_model``
+            path (PR #719 dedup pattern).
+    """
     task = config.loader.task
 
     if random_init:
         from transformers import AutoConfig
 
-        if model_id is not None:
+        if hf_config is not None:
+            pass
+        elif model_id is not None:
             hf_config = AutoConfig.from_pretrained(model_id)
         else:
             logger.warning(
@@ -493,6 +507,7 @@ def _load_model(
             model_name_or_path=model_id,
             task=task,
             trust_remote_code=effective_trust,
+            hf_config=hf_config,
         )
         return pytorch_model
 
diff --git a/src/winml/modelkit/commands/build.py b/src/winml/modelkit/commands/build.py
index 11db0f68f..1a3cb68f2 100644
--- a/src/winml/modelkit/commands/build.py
+++ b/src/winml/modelkit/commands/build.py
@@ -224,6 +224,179 @@ def _build_modules(
     return results
 
 
+def _validate_task_supported_for_model(
+    model_id: str,
+    task: str,
+    *,
+    task_field_name: str = "task",
+    trust_remote_code: bool = False,
+    library_name: str = "transformers",
+    hf_config: Any | None = None,
+) -> Any:
+    """Validate that a task is supported for a model's architecture.
+
+    Private helper for ``winml build`` only. Loads HuggingFace config metadata
+    and validates against ``TasksManager`` supported-task mapping.
+
+    Why this lives here and not in ``loader/`` as public API:
+        Only ``winml build`` accepts task and model from independent sources
+        (config JSON's ``loader.task`` + ``--model``) and runs the full
+        export+optimize+quantize+compile pipeline that benefits from a fast
+        upfront fail. Other CLI entrypoints get equivalent coverage through
+        their existing resolution paths:
+
+        - ``winml config`` derives task from the model when both are present,
+          so the mismatch can't be silently constructed.
+        - ``winml export`` / ``winml perf`` surface incompatibilities through
+          ``resolve_cfg`` -> ``ONNXConfigNotFoundError`` later in the call.
+
+        Promoting this to public API would signal that any command should
+        wire it in, which is not the current design. If a second caller
+        appears, move this back to ``loader/`` and re-export it.
+
+    Args:
+        model_id: HuggingFace model ID or local path.
+        task: Requested task name.
+        task_field_name: Field label used in user-facing error messages.
+        trust_remote_code: Whether to trust remote/custom code while loading config.
+        library_name: Source library for TasksManager lookup.
+        hf_config: Optional pre-loaded HF config. When supplied, the
+            ``AutoConfig.from_pretrained`` round-trip is skipped. Used by
+            ``_validate_loader_tasks_for_model`` to preflight multiple tasks
+            against the same model without re-fetching.
+
+    Returns:
+        The loaded (or passed-through) HuggingFace config. Callers can reuse
+        this to avoid a duplicate ``AutoConfig.from_pretrained`` later
+        (see PR #719 -- same deduping pattern as ``resolve_loader_config``).
+
+    Raises:
+        ValueError: If the task is not supported for the model architecture.
+    """
+    from ..export.io import TASK_SYNONYM_EXTENSIONS, ensure_hf_models_registered
+    from ..loader.task import get_supported_tasks, normalize_task
+
+    if hf_config is None:
+        from transformers import AutoConfig
+
+        hf_config = AutoConfig.from_pretrained(
+            model_id,
+            trust_remote_code=trust_remote_code,
+        )
+    model_type = getattr(hf_config, "model_type", None)
+    if not model_type:
+        return hf_config
+
+    # Ensure optimum.exporters.onnx.model_configs is imported before querying
+    # the registry. TasksManager._SUPPORTED_MODEL_TYPE is populated lazily
+    # when optimum's ONNX model_configs module is first imported (triggered by
+    # any import of optimum.exporters.onnx). Without this, get_supported_tasks
+    # returns [] for models like resnet that are registered there, not in the
+    # winml custom registry.
+    ensure_hf_models_registered()
+
+    supported_tasks = get_supported_tasks(model_type, library_name=library_name)
+    # If the upstream registry has no task list for this architecture,
+    # defer to downstream loader resolution instead of hard-failing here.
+    if not supported_tasks:
+        return hf_config
+
+    # [1] Verbatim canonical match — definitive accept. Comparing without
+    #     normalization first means an arch that lists `image-feature-extraction`
+    #     in its supported set accepts that name as-is, while a text-only arch
+    #     that lists only `feature-extraction` does not silently accept it via
+    #     Optimum's synonym collapse on this branch.
+    if task in supported_tasks:
+        return hf_config
+
+    # [2] HF-pipeline-only task names that Optimum's TasksManager does not
+    #     know but the rest of the CLI accepts (e.g. ``next-sentence-prediction``
+    #     handled via HF_TASK_DEFAULTS, ``mask-generation`` preserved for SAM2).
+    #     These are routed downstream by export/io.py::_map_task_synonym, so
+    #     rejecting here would break invocations that ``winml config`` and
+    #     ``winml export`` accept.
+    if task in TASK_SYNONYM_EXTENSIONS:
+        return hf_config
+
+    # [3] Optimum synonym fallback — e.g. ``masked-lm`` -> ``fill-mask``.
+    #     Accept, but warn so users converge on the canonical spelling.
+    #
+    #     Known limitation: Optimum collapses text/image variants of
+    #     feature-extraction (``image-feature-extraction`` -> ``feature-extraction``)
+    #     and routes ``sentence-similarity`` -> ``feature-extraction``. This
+    #     branch therefore silently accepts cross-modality combinations such as
+    #     ``--task image-feature-extraction`` against a text-only arch. Such
+    #     mismatches must be caught downstream where the HF-pipeline-keyed
+    #     registries see the un-collapsed ``loader.task`` value.
+    normalized = normalize_task(task)
+    normalized_supported = {normalize_task(t) for t in supported_tasks}
+    if normalized in normalized_supported:
+        if normalized != task:
+            logger.warning(
+                "%s=%r matches via Optimum synonym mapping; consider using the canonical name %r.",
+                task_field_name,
+                task,
+                normalized,
+            )
+        return hf_config
+
+    supported_list = ", ".join(supported_tasks)
+    raise ValueError(
+        f"{task_field_name}='{task}' is not supported for --model {model_id} "
+        f"(architecture: {model_type}).\n"
+        f"Supported tasks: {supported_list}."
+    )
+
+
+def _validate_loader_tasks_for_model(
+    *,
+    model_id: str | None,
+    configs: list[WinMLBuildConfig],
+    trust_remote_code: bool,
+) -> Any | None:
+    """Validate config loader task(s) against --model architecture.
+
+    This runs at command entry before setup/stage output so incompatible
+    config/model combinations fail with an actionable one-line error.
+
+    Loads ``AutoConfig`` at most once and reuses it across every per-task
+    check, then returns it so the build pipeline can plumb it down to
+    ``load_hf_model`` and avoid the second/third round-trip that PR #719
+    deduped on the inspect path.
+
+    See ``_validate_task_supported_for_model`` for the rationale on why this
+    preflight is wired into ``winml build`` only.
+
+    Returns:
+        Pre-loaded ``PretrainedConfig`` (caller should pass this into
+        ``_run_single_build`` so ``load_hf_model`` skips its own
+        ``AutoConfig.from_pretrained`` call), or ``None`` when no model_id
+        was provided / model_id is an ONNX file / no task to validate.
+    """
+    if model_id is None:
+        return None
+
+    if cli_utils.is_onnx_file_path(model_id):
+        return None
+
+    tasks = {
+        cfg.loader.task for cfg in configs if cfg.loader is not None and cfg.loader.task is not None
+    }
+    if not tasks:
+        return None
+
+    hf_config: Any | None = None
+    for task in sorted(tasks):
+        hf_config = _validate_task_supported_for_model(
+            model_id=model_id,
+            task=task,
+            task_field_name="config.loader.task",
+            trust_remote_code=trust_remote_code,
+            hf_config=hf_config,
+        )
+    return hf_config
+
+
 # =============================================================================
 # CLI COMMAND
 # =============================================================================
@@ -476,6 +649,12 @@ def _patch_device(cfg: WinMLBuildConfig) -> None:
         except ValueError as e:
             raise click.UsageError(f"Config validation failed: {e}") from e
 
+        preloaded_hf_config = _validate_loader_tasks_for_model(
+            model_id=model_id,
+            configs=_configs_to_validate,
+            trust_remote_code=trust_remote_code,
+        )
+
         # Build extra kwargs for pipeline control
         extra_kwargs: dict[str, Any] = {}
         if no_optimize:
@@ -593,6 +772,7 @@ def _patch_device(cfg: WinMLBuildConfig) -> None:
                 ep=ep,
                 device=device,
                 extra_kwargs=extra_kwargs,
+                preloaded_hf_config=preloaded_hf_config,
             )
 
     except click.UsageError:
@@ -639,11 +819,10 @@ def _run_single_build(
     ep: EPNameOrAlias | None,
     device: str | None,
     extra_kwargs: dict[str, Any],
+    preloaded_hf_config: Any | None = None,
 ) -> None:
     """Run single-model build with Rich Live progress per stage."""
-    from .config import _is_onnx_file
-
-    _is_onnx = model_id is not None and _is_onnx_file(model_id)
+    _is_onnx = model_id is not None and cli_utils.is_onnx_file_path(model_id)
     # Derive source from _is_onnx to guarantee header label matches pipeline
     source = "ONNX" if _is_onnx else detect_model_source(model_id)
 
@@ -708,6 +887,7 @@ def _run_single_build(
                 ep=ep,
                 device=device,
                 extra_kwargs=extra_kwargs,
+                preloaded_hf_config=preloaded_hf_config,
             )
 
         elapsed = time.monotonic() - start_time
@@ -1098,6 +1278,7 @@ def _build_hf_pipeline(
     ep: EPNameOrAlias | None,
     device: str | None,
     extra_kwargs: dict[str, Any],
+    preloaded_hf_config: Any | None = None,
 ) -> list[tuple[str, float | None]] | None:
     """HF build pipeline with cascading StageLive per stage.
 
@@ -1151,7 +1332,9 @@ def _name(base: str) -> str:
         sl.set_status("Exporting to ONNX...")
 
         # Load + export (blocking)
-        pytorch_model = _load_model(config, model_id, trust_remote_code=False)
+        pytorch_model = _load_model(
+            config, model_id, trust_remote_code=False, hf_config=preloaded_hf_config
+        )
         t0 = time.monotonic()
         export_onnx(
             model=pytorch_model,
diff --git a/src/winml/modelkit/commands/config.py b/src/winml/modelkit/commands/config.py
index 4172c6573..489344e68 100644
--- a/src/winml/modelkit/commands/config.py
+++ b/src/winml/modelkit/commands/config.py
@@ -57,12 +57,6 @@ def _apply_stage_overrides(cfg: Any, *, no_quant: bool, no_compile: bool) -> Non
         cfg.compile = None
 
 
-def _is_onnx_file(model_input: str) -> bool:
-    """Check if input is a path to an existing .onnx file."""
-    path = Path(model_input)
-    return path.suffix == ".onnx" and path.exists()
-
-
 @click.command("config")
 @cli_utils.model_option(required=False, optional_message="Optional when --model-type is provided.")
 @click.option(
@@ -279,12 +273,12 @@ def config(
             _shape_config_file = shape_config_path.name
 
         # ONNX file detection: generate simpler config without loader/export
-        if hf_model and _is_onnx_file(hf_model) and module:
+        if hf_model and cli_utils.is_onnx_file_path(hf_model) and module:
             raise click.UsageError(
                 "--module is not supported with ONNX file input. "
                 "Module discovery requires a HuggingFace model."
             )
-        if hf_model and _is_onnx_file(hf_model):
+        if hf_model and cli_utils.is_onnx_file_path(hf_model):
             config_obj = generate_onnx_build_config(
                 hf_model,
                 task=task,
diff --git a/src/winml/modelkit/loader/hf.py b/src/winml/modelkit/loader/hf.py
index 8c0063dd2..5a90b5828 100644
--- a/src/winml/modelkit/loader/hf.py
+++ b/src/winml/modelkit/loader/hf.py
@@ -149,6 +149,7 @@ def load_hf_model(
     model_class: str | None = None,
     user_script: str | None = None,
     trust_remote_code: bool = False,
+    hf_config: PretrainedConfig | None = None,
 ) -> tuple[nn.Module, PretrainedConfig, str]:
     """Load, detect task, and prepare HuggingFace model.
 
@@ -173,6 +174,9 @@ def load_hf_model(
             The script must define a class matching `model_class` at module level.
             Requires trust_remote_code=True for security.
         trust_remote_code: Whether to trust remote code (required for user_script)
+        hf_config: Optional pre-loaded HF config. When supplied, the
+            ``AutoConfig.from_pretrained`` round-trip is skipped — same dedup
+            pattern as ``resolve_loader_config(hf_config=...)`` from PR #719.
 
     Returns:
         Tuple of (model, hf_config, task)
@@ -214,10 +218,11 @@ def load_hf_model(
             raise ValueError("model_class must be specified when using user_script")
 
     # [1] Load HF Config
-    hf_config = AutoConfig.from_pretrained(
-        model_name_or_path,
-        trust_remote_code=trust_remote_code,
-    )
+    if hf_config is None:
+        hf_config = AutoConfig.from_pretrained(
+            model_name_or_path,
+            trust_remote_code=trust_remote_code,
+        )
 
     # [2] Task & Model Class Resolution
     if user_script is not None:
diff --git a/src/winml/modelkit/utils/cli.py b/src/winml/modelkit/utils/cli.py
index b6967c54c..5d5d07ca9 100644
--- a/src/winml/modelkit/utils/cli.py
+++ b/src/winml/modelkit/utils/cli.py
@@ -292,6 +292,16 @@ def load_build_config(config_path: Path) -> tuple[WinMLBuildConfig, dict]:
     return WinMLBuildConfig.from_dict(data), data
 
 
+def is_onnx_file_path(model_input: str) -> bool:
+    """Check if input is a path to an existing ``.onnx`` file.
+
+    Shared helper for CLI commands that accept either a HuggingFace model ID
+    or a local ``.onnx`` file path for the ``-m/--model`` option.
+    """
+    path = Path(model_input)
+    return path.suffix == ".onnx" and path.exists()
+
+
 def is_cli_provided(ctx: click.Context, param_name: str) -> bool:
     """Check whether a CLI parameter was explicitly provided by the user.
 
diff --git a/tests/unit/commands/test_build.py b/tests/unit/commands/test_build.py
index a444b3463..d1a728211 100644
--- a/tests/unit/commands/test_build.py
+++ b/tests/unit/commands/test_build.py
@@ -78,6 +78,20 @@ def mock_resolve_device():
         yield
 
 
+@pytest.fixture(autouse=True)
+def mock_task_model_compatibility_validator():
+    """Default to no-op for preflight task/model compatibility checks.
+
+    Most build command unit tests are CLI plumbing tests and should not hit
+    HuggingFace config resolution paths.
+    """
+    with patch(
+        "winml.modelkit.commands.build._validate_task_supported_for_model",
+        return_value=None,
+    ):
+        yield
+
+
 @pytest.fixture
 def runner() -> CliRunner:
     """Create a CLI test runner."""
@@ -357,6 +371,35 @@ def test_module_array_non_object_entry(self, tmp_path: Path):
         assert result.exit_code != 0
         assert "object" in result.output.lower()
 
+    def test_rejects_incompatible_config_task_and_model(self, tmp_path: Path):
+        """config.loader.task + --model mismatch fails before pipeline starts."""
+        cfg = _make_minimal_config_file(tmp_path, task="text-generation")
+        msg = (
+            "config.loader.task='text-generation' is not supported for "
+            "--model microsoft/resnet-50 (architecture: resnet). "
+            "Supported tasks: image-classification, image-feature-extraction."
+        )
+
+        with (
+            patch(
+                "winml.modelkit.commands.build._validate_task_supported_for_model",
+                side_effect=ValueError(msg),
+            ) as mock_validate,
+            patch("winml.modelkit.commands.build._run_single_build") as mock_run,
+        ):
+            result = _invoke(["-c", cfg, "-m", "microsoft/resnet-50", "-o", str(tmp_path / "out")])
+
+        assert result.exit_code != 0
+        assert msg in result.output
+        mock_validate.assert_called_once_with(
+            model_id="microsoft/resnet-50",
+            task="text-generation",
+            task_field_name="config.loader.task",
+            trust_remote_code=False,
+            hf_config=None,
+        )
+        mock_run.assert_not_called()
+
     def test_help_lists_all_options(self):
         """``--help`` must surface every behavior-bearing option."""
         result = _invoke(["--help"])
@@ -1267,7 +1310,7 @@ def test_build_onnx_suffix_but_not_exists_uses_hf(
             ["-c", str(sample_config_file), "-m", "nonexistent.onnx", "-o", str(output_dir)],
             obj={"debug": False},
         )
-        # _is_onnx_file checks suffix AND exists(); nonexistent.onnx
+        # is_onnx_file_path checks suffix AND exists(); nonexistent.onnx
         # falls through to HF path since the file doesn't exist on disk
         assert result.exit_code == 0, f"Build failed: {result.output}"
         assert mock_build_api.called
diff --git a/tests/unit/commands/test_build_validate_task.py b/tests/unit/commands/test_build_validate_task.py
new file mode 100644
index 000000000..4448f2977
--- /dev/null
+++ b/tests/unit/commands/test_build_validate_task.py
@@ -0,0 +1,291 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+"""Tests for `_validate_task_supported_for_model` preflight in build CLI.
+
+This helper used to live at `loader/config.py::validate_task_supported_for_model`
+but was demoted to a private helper of the build command because it is the only
+caller. Tests live in a dedicated module so they bypass the autouse fixture in
+`test_build.py` that mocks the helper out for CLI-plumbing tests.
+"""
+
+from __future__ import annotations
+
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+from winml.modelkit.commands.build import _validate_task_supported_for_model
+
+
+class TestValidateTaskSupportedForModel:
+    """Tests for `_validate_task_supported_for_model` preflight helper."""
+
+    def test_raises_for_task_model_mismatch(self) -> None:
+        """Incompatible task/model combinations raise a clear ValueError."""
+        mock_config = MagicMock()
+        mock_config.model_type = "resnet"
+
+        with (
+            patch("transformers.AutoConfig.from_pretrained", return_value=mock_config),
+            patch(
+                "winml.modelkit.loader.task.get_supported_tasks",
+                return_value=["image-classification", "image-feature-extraction"],
+            ),
+            patch("winml.modelkit.loader.task.normalize_task", side_effect=lambda t: t),
+            pytest.raises(
+                ValueError,
+                match=r"config\.loader\.task='text-generation' is not supported",
+            ),
+        ):
+            _validate_task_supported_for_model(
+                model_id="microsoft/resnet-50",
+                task="text-generation",
+                task_field_name="config.loader.task",
+            )
+
+    def test_accepts_supported_task(self) -> None:
+        """A supported task should pass without raising."""
+        mock_config = MagicMock()
+        mock_config.model_type = "resnet"
+
+        with (
+            patch("transformers.AutoConfig.from_pretrained", return_value=mock_config),
+            patch(
+                "winml.modelkit.loader.task.get_supported_tasks",
+                return_value=["image-classification", "image-feature-extraction"],
+            ),
+            patch("winml.modelkit.loader.task.normalize_task", side_effect=lambda t: t),
+        ):
+            _validate_task_supported_for_model(
+                model_id="microsoft/resnet-50",
+                task="image-classification",
+            )
+
+    def test_ensure_hf_models_registered_called_before_lookup(self) -> None:
+        """ensure_hf_models_registered() is called to populate the ONNX registry
+        before get_supported_tasks, so models like resnet return the correct tasks."""
+        mock_config = MagicMock()
+        mock_config.model_type = "resnet"
+
+        with (
+            patch("transformers.AutoConfig.from_pretrained", return_value=mock_config),
+            patch("winml.modelkit.export.io.ensure_hf_models_registered") as mock_ensure,
+            patch(
+                "winml.modelkit.loader.task.get_supported_tasks",
+                return_value=["feature-extraction", "image-classification"],
+            ),
+            patch("winml.modelkit.loader.task.normalize_task", side_effect=lambda t: t),
+            pytest.raises(ValueError, match=r"text-generation.*is not supported"),
+        ):
+            _validate_task_supported_for_model(
+                model_id="microsoft/resnet-50",
+                task="text-generation",
+                task_field_name="config.loader.task",
+            )
+        mock_ensure.assert_called_once()
+
+    def test_defers_when_registry_still_empty_after_registration(self) -> None:
+        """When get_supported_tasks returns [] even after registry population,
+        validation defers to the downstream loader without raising."""
+        mock_config = MagicMock()
+        mock_config.model_type = "custom-model"
+
+        with (
+            patch("transformers.AutoConfig.from_pretrained", return_value=mock_config),
+            patch("winml.modelkit.export.io.ensure_hf_models_registered"),
+            patch("winml.modelkit.loader.task.get_supported_tasks", return_value=[]),
+        ):
+            # Should NOT raise — defer to downstream loader
+            _validate_task_supported_for_model(
+                model_id="org/custom-model",
+                task="text-generation",
+            )
+
+    def test_error_message_format(self) -> None:
+        """Error message has task/model/architecture on line 1, Supported tasks on line 2."""
+        mock_config = MagicMock()
+        mock_config.model_type = "resnet"
+
+        with (
+            patch("transformers.AutoConfig.from_pretrained", return_value=mock_config),
+            patch(
+                "winml.modelkit.loader.task.get_supported_tasks",
+                return_value=["image-classification"],
+            ),
+            patch("winml.modelkit.loader.task.normalize_task", side_effect=lambda t: t),
+            patch("winml.modelkit.export.io.ensure_hf_models_registered"),
+            pytest.raises(ValueError) as exc_info,
+        ):
+            _validate_task_supported_for_model(
+                model_id="microsoft/resnet-50",
+                task="text-generation",
+                task_field_name="config.loader.task",
+            )
+
+        msg = str(exc_info.value)
+        lines = msg.splitlines()
+        assert len(lines) == 2
+        assert lines[0].endswith("(architecture: resnet).")
+        assert lines[1].startswith("Supported tasks:")
+
+    def test_accepts_next_sentence_prediction_for_bert(self) -> None:
+        """``next-sentence-prediction`` is in ``TASK_SYNONYM_EXTENSIONS`` and must
+        be accepted, even though Optimum's per-arch supported_tasks does not list
+        it. Regression for pre-PR behavior, see review claim 2.
+        """
+        mock_config = MagicMock()
+        mock_config.model_type = "bert"
+
+        with (
+            patch("transformers.AutoConfig.from_pretrained", return_value=mock_config),
+            patch("winml.modelkit.export.io.ensure_hf_models_registered"),
+            patch(
+                "winml.modelkit.loader.task.get_supported_tasks",
+                return_value=["feature-extraction", "fill-mask", "text-classification"],
+            ),
+        ):
+            # Should NOT raise — short-circuited via TASK_SYNONYM_EXTENSIONS.
+            _validate_task_supported_for_model(
+                model_id="bert-base-uncased",
+                task="next-sentence-prediction",
+            )
+
+    def test_accepts_mask_generation_via_synonym_extensions(self) -> None:
+        """``mask-generation`` is preserved in ``TASK_SYNONYM_EXTENSIONS`` for SAM2.
+
+        Optimum's ``map_from_synonym`` would normalize it to ``feature-extraction``,
+        which is wrong for the HF-pipeline-keyed downstream registries.
+        """
+        mock_config = MagicMock()
+        mock_config.model_type = "sam"
+
+        with (
+            patch("transformers.AutoConfig.from_pretrained", return_value=mock_config),
+            patch("winml.modelkit.export.io.ensure_hf_models_registered"),
+            patch(
+                "winml.modelkit.loader.task.get_supported_tasks",
+                return_value=["feature-extraction"],
+            ),
+        ):
+            _validate_task_supported_for_model(
+                model_id="facebook/sam-vit-base",
+                task="mask-generation",
+            )
+
+    def test_accepts_optimum_synonym_with_warning(self, caplog: pytest.LogCaptureFixture) -> None:
+        """Optimum-known synonyms (e.g. ``masked-lm`` -> ``fill-mask``) are accepted
+        but logged as a warning so users converge on the canonical spelling.
+        """
+        mock_config = MagicMock()
+        mock_config.model_type = "bert"
+
+        with (
+            patch("transformers.AutoConfig.from_pretrained", return_value=mock_config),
+            patch("winml.modelkit.export.io.ensure_hf_models_registered"),
+            patch(
+                "winml.modelkit.loader.task.get_supported_tasks",
+                return_value=["feature-extraction", "fill-mask"],
+            ),
+            patch(
+                "winml.modelkit.loader.task.normalize_task",
+                side_effect=lambda t: {"masked-lm": "fill-mask"}.get(t, t),
+            ),
+            caplog.at_level("WARNING", logger="winml.modelkit.commands.build"),
+        ):
+            _validate_task_supported_for_model(
+                model_id="bert-base-uncased",
+                task="masked-lm",
+            )
+
+        assert any(
+            "synonym" in rec.message and "fill-mask" in rec.message for rec in caplog.records
+        ), f"Expected canonical-name hint, got: {[r.message for r in caplog.records]}"
+
+    def test_silently_accepts_cross_modality_feature_extraction(
+        self, caplog: pytest.LogCaptureFixture
+    ) -> None:
+        """Documented limitation: Optimum collapses ``image-feature-extraction``
+        and ``feature-extraction``. A text-only arch with ``--task
+        image-feature-extraction`` is therefore accepted (with a warning) at this
+        gate; cross-modality routing errors must surface downstream where the
+        HF-pipeline-keyed registries see the un-collapsed ``loader.task``.
+
+        See review claim 1 — this test documents the limitation rather than
+        asserting a fix.
+        """
+        mock_config = MagicMock()
+        mock_config.model_type = "bert"
+
+        with (
+            patch("transformers.AutoConfig.from_pretrained", return_value=mock_config),
+            patch("winml.modelkit.export.io.ensure_hf_models_registered"),
+            patch(
+                "winml.modelkit.loader.task.get_supported_tasks",
+                return_value=["feature-extraction", "fill-mask"],
+            ),
+            patch(
+                "winml.modelkit.loader.task.normalize_task",
+                side_effect=lambda t: "feature-extraction"
+                if t in {"image-feature-extraction", "feature-extraction"}
+                else t,
+            ),
+            caplog.at_level("WARNING", logger="winml.modelkit.commands.build"),
+        ):
+            _validate_task_supported_for_model(
+                model_id="bert-base-uncased",
+                task="image-feature-extraction",
+                task_field_name="config.loader.task",
+            )
+
+        # Accepted, but the warning must fire so the limitation is at least visible.
+        assert any("synonym" in rec.message for rec in caplog.records)
+
+    def test_rejects_unrelated_task_after_all_fallbacks(self) -> None:
+        """A task that is not verbatim-supported, not in ``TASK_SYNONYM_EXTENSIONS``,
+        and whose Optimum-normalized form is not in the arch's supported set is
+        still rejected. Ensures the new branches did not turn the gate into a
+        no-op.
+        """
+        mock_config = MagicMock()
+        mock_config.model_type = "resnet"
+
+        with (
+            patch("transformers.AutoConfig.from_pretrained", return_value=mock_config),
+            patch("winml.modelkit.export.io.ensure_hf_models_registered"),
+            patch(
+                "winml.modelkit.loader.task.get_supported_tasks",
+                return_value=["image-classification", "image-feature-extraction"],
+            ),
+            patch("winml.modelkit.loader.task.normalize_task", side_effect=lambda t: t),
+            pytest.raises(ValueError, match=r"text-generation.*is not supported"),
+        ):
+            _validate_task_supported_for_model(
+                model_id="microsoft/resnet-50",
+                task="text-generation",
+            )
+
+    def test_verbatim_match_does_not_warn(self, caplog: pytest.LogCaptureFixture) -> None:
+        """When the task is the exact canonical name in the supported set, no
+        synonym-warning should fire (verbatim branch short-circuits before
+        normalization).
+        """
+        mock_config = MagicMock()
+        mock_config.model_type = "vit"
+
+        with (
+            patch("transformers.AutoConfig.from_pretrained", return_value=mock_config),
+            patch("winml.modelkit.export.io.ensure_hf_models_registered"),
+            patch(
+                "winml.modelkit.loader.task.get_supported_tasks",
+                return_value=["feature-extraction", "image-classification"],
+            ),
+            caplog.at_level("WARNING", logger="winml.modelkit.commands.build"),
+        ):
+            _validate_task_supported_for_model(
+                model_id="google/vit-base-patch16-224",
+                task="image-classification",
+            )
+
+        assert not any("synonym" in rec.message for rec in caplog.records)
diff --git a/tests/unit/commands/test_config_cli.py b/tests/unit/commands/test_config_cli.py
index c993b0e32..1f16fa491 100644
--- a/tests/unit/commands/test_config_cli.py
+++ b/tests/unit/commands/test_config_cli.py
@@ -315,7 +315,7 @@ def test_onnx_no_quant(self, runner: CliRunner, tmp_path: Path) -> None:
         """--no-quant should set quant=None even for ONNX configs."""
         from winml.modelkit.commands.config import config
 
-        # Create a fake .onnx file so _is_onnx_file returns True
+        # Create a fake .onnx file so is_onnx_file_path returns True
         onnx_file = tmp_path / "model.onnx"
         onnx_file.write_bytes(b"fake")
 
diff --git a/tests/unit/loader/test_resolve_loader_config.py b/tests/unit/loader/test_resolve_loader_config.py
index a6438034a..659702bc8 100644
--- a/tests/unit/loader/test_resolve_loader_config.py
+++ b/tests/unit/loader/test_resolve_loader_config.py
@@ -15,7 +15,10 @@
 
 import pytest
 
-from winml.modelkit.loader import WinMLLoaderConfig, resolve_loader_config
+from winml.modelkit.loader import (
+    WinMLLoaderConfig,
+    resolve_loader_config,
+)
 
 
 # =============================================================================

From 32e2584f03726c4b3ad17e1e1c7d16227d773705 Mon Sep 17 00:00:00 2001
From: Yue Sun <yuesu@microsoft.com>
Date: Fri, 29 May 2026 15:26:36 +0800
Subject: [PATCH 018/143] fix: run eval failed on amd (#783)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Fixes `scripts/e2e_eval/run_eval.py` crashing on VitisAI EP (AMD Ryzen
AI NPU) and a latent bug in `winml build` that prevented the script's
`--no-quant` workaround from actually taking effect.

The crash: VitisAI ships its own internal quantizer and runs it at
session-create time. Layering winml's generic QDQ quantization pass on
top produces a model VitisAI cannot consume, which manifests as
`DpuKernelRunner.cpp:1920 DPU timeout` during `winml perf`. The fix is
to tell winml to skip its own quantization when the selected EP
quantizes natively.

## Changes

### `src/winml/modelkit/commands/build.py` — root-cause fix (1 line)

When `--device` was passed to `winml build`, the internal
`_patch_device` helper unconditionally re-populated `cfg.quant` with the
device's default quantization config, silently undoing any prior
`--no-quant`. The condition now respects `no_quant`:

```python
if no_quant or resolved_quant is None:
    cfg.quant = None
```

Without this, `winml build … --device npu --no-quant` still produced a
`_quantized.onnx` artifact.

### `scripts/e2e_eval/run_eval.py` — script wiring

- New canonical-name set `_NATIVE_QUANT_EPS =
{"VitisAIExecutionProvider"}` plus a helper `_ep_quantizes_natively(ep)`
that funnels both canonical names and user aliases (e.g. `vitisai`)
through `winml.modelkit.utils.constants.normalize_ep_name`. No hardcoded
aliases.
- `_resolve_precision(...)` gained an `ep` parameter; for native-quant
EPs it returns `None` so no precision flag is sent.
- `_run_build` now passes `--no-quant` to **both** `winml config` (so
the persisted `build_config.json` has `quant: null` up-front) and `winml
build` (defense in depth) when the EP quantizes natively.
- Call sites in `run_model` and `main` updated to thread `ep` through
`_resolve_precision`.

## Why the earlier commits in this branch weren't enough

The first attempt (`fix(run_eval): skip quantize when VitisAI EP is
selected`) wired `--no-quant` only into `winml build`. That didn't take
effect because of the `_patch_device` bug above. The second attempt
(`fix(vitisai): resolve auto-precision to w8a8 for VitisAI NPU`) tried
to switch precision instead of skipping — also wrong, since VitisAI
wants an fp32 input and quantizes it itself. The final state keeps the
script clean (`--no-quant`, no precision override) and fixes the actual
`winml build` bug.

## Verification

Manual end-to-end on AMD Ryzen AI (VitisAI NPU), with a clean
`~/.cache/winml/artifacts/...` and output dir:

```pwsh
uv run --no-sync python scripts/e2e_eval/run_eval.py `
  --hf-model facebook/convnext-tiny-224 `
  --task image-classification `
  --device npu --ep vitisai `
  --eval-type perf --no-report --verbose --timeout 1800 `
  --output-dir e2e-test\vitisai_npu
```

Before: `winml perf` crashed with `DpuKernelRunner.cpp:1920 DPU
timeout`.
After:
- Cached `imgcls_*_winml_build_config.json` has `"quant": null`.
- No `_quantized.onnx` artifact produced.
- Perf step: **PASS** in ~120 s.
---
 scripts/e2e_eval/run_eval.py            |  65 ++++++++-
 src/winml/modelkit/commands/build.py    |   2 +-
 tests/unit/eval/test_run_eval_script.py | 172 ++++++++++++++++++++++++
 3 files changed, 232 insertions(+), 7 deletions(-)
 create mode 100644 tests/unit/eval/test_run_eval_script.py

diff --git a/scripts/e2e_eval/run_eval.py b/scripts/e2e_eval/run_eval.py
index a4ef945e7..ae50351d1 100644
--- a/scripts/e2e_eval/run_eval.py
+++ b/scripts/e2e_eval/run_eval.py
@@ -81,8 +81,31 @@
 _DEFAULT_SAMPLES = 1000
 _DEFAULT_PRECISION_NPU = "w8a16"
 
-
-def _resolve_precision(device: str, explicit: str | None) -> str | None:
+# EPs whose eval track keeps the model unquantized (the "fp" variant)
+# rather than running winml's QDQ pass on top.  This is an eval-setup
+# choice -- e.g. VitisAI / AMD Ryzen AI is benchmarked on the fp32/fp16
+# model -- not a claim about the EP's internal pipeline.  For these EPs
+# the harness passes ``--no-quant`` to both ``winml config`` and
+# ``winml build`` (see :func:`_run_build` and :func:`run_model`).
+#
+# Entries are canonical ``EPName`` values (the ``*ExecutionProvider`` form);
+# user-facing aliases like ``vitisai`` are normalised via
+# ``normalize_ep_name`` in :func:`_should_skip_winml_quant` so each EP only
+# needs to be listed once.
+_EPS_SKIP_WINML_QUANT = frozenset({"VitisAIExecutionProvider"})
+
+
+def _should_skip_winml_quant(ep: str | None) -> bool:
+    """True if the eval harness should run this EP on the unquantized model."""
+    # Lazy import: keeps ``scripts/e2e_eval`` cheap to load (winml.modelkit
+    # transitively imports onnxruntime) and matches the existing in-function
+    # import pattern used elsewhere in this script.
+    from winml.modelkit.utils.constants import normalize_ep_name
+
+    return normalize_ep_name(ep) in _EPS_SKIP_WINML_QUANT
+
+
+def _resolve_precision(device: str, explicit: str | None, ep: str | None = None) -> str | None:
     """Return the precision to pass to winml config/perf, or None to omit the flag.
 
     w8a16 is only applied by default on NPU.  For CPU/GPU the flag is omitted
@@ -91,8 +114,22 @@ def _resolve_precision(device: str, explicit: str | None) -> str | None:
     (NHWC layout transformer inserts Conv nodes that QNN GPU's GetCapability
     does not claim).
 
-    An explicit per-model precision always takes precedence.
+    For EPs in :data:`_EPS_SKIP_WINML_QUANT` (e.g. VitisAI) the flag is forced
+    off regardless of ``explicit``: the harness pairs these EPs with
+    ``--no-quant`` at config/build time, so a non-empty ``--precision`` would
+    produce a config that says "quantize to X" while the build says "skip
+    quantization" -- a contradiction.  An explicit value is dropped with a
+    one-line warning so the override is visible in the log.
+
+    Otherwise an explicit per-model precision always takes precedence.
     """
+    if _should_skip_winml_quant(ep):
+        if explicit:
+            safe_print(
+                f"  [precision] Ignoring explicit precision={explicit!r} for EP {ep!r}: "
+                "this EP is run on the unquantized variant (--no-quant)."
+            )
+        return None
     if explicit:
         return explicit
     return _DEFAULT_PRECISION_NPU if device == "npu" else None
@@ -406,6 +443,13 @@ def _run_build(
         config_args += ["--task", entry.task]
     if ep:
         config_args += ["--ep", ep]
+    # EPs in _EPS_SKIP_WINML_QUANT are evaluated on the unquantized variant.
+    # Pass --no-quant to winml config so the generated build_config.json is
+    # written with quant=None up-front; otherwise on NPU the config command
+    # would still apply its default precision (w8a16) and we'd be relying on
+    # --no-quant at build time alone to override it.
+    if _should_skip_winml_quant(ep):
+        config_args += ["--no-quant"]
 
     config_proc = _run_subprocess(config_args, timeout)
     if config_proc["exit_code"] != 0:
@@ -446,6 +490,11 @@ def _run_build(
         ]
         if ep:
             build_args += ["--ep", ep]
+        # Mirror the --no-quant passed to winml config above so the build
+        # stage also skips QDQ regardless of what the config carries (defence
+        # in depth; see _EPS_SKIP_WINML_QUANT for the rationale).
+        if _should_skip_winml_quant(ep):
+            build_args += ["--no-quant"]
 
         build_proc = _run_subprocess(build_args, timeout)
         last_proc = build_proc
@@ -548,8 +597,10 @@ def run_model(
     code, concatenated stdout/stderr, summed elapsed).
     """
     if not onnx_paths:
-        # No pre-built paths: fall back to HF model ID (single model only)
-        precision = _resolve_precision(device, None)
+        # No pre-built paths: fall back to HF model ID (single model only).
+        # winml perf builds internally; the same --no-quant gating used by
+        # _run_build must apply here so the EP sees the unquantized variant.
+        precision = _resolve_precision(device, None, ep=ep)
         args = [
             *WINML_CLI,
             "perf",
@@ -564,6 +615,8 @@ def run_model(
             args += ["--task", entry.task]
         if ep:
             args += ["--ep", ep]
+        if _should_skip_winml_quant(ep):
+            args += ["--no-quant"]
         args += ["--iterations", "10", "--warmup", "2"]
         args += entry.perf_args
 
@@ -1336,7 +1389,7 @@ def main() -> None:
             build_result = _run_build(
                 entry,
                 args.device,
-                _resolve_precision(args.device, entry.precision),
+                _resolve_precision(args.device, entry.precision, ep=args.ep),
                 args.timeout,
                 model_dir,
                 ep=args.ep,
diff --git a/src/winml/modelkit/commands/build.py b/src/winml/modelkit/commands/build.py
index 1a3cb68f2..e55f15bfc 100644
--- a/src/winml/modelkit/commands/build.py
+++ b/src/winml/modelkit/commands/build.py
@@ -610,7 +610,7 @@ def _patch_device(cfg: WinMLBuildConfig) -> None:
                 from ..config import resolve_quant_compile_config
 
                 resolved_quant, _ = resolve_quant_compile_config(device=device)
-                if resolved_quant is None:
+                if no_quant or resolved_quant is None:
                     cfg.quant = None
                 elif cfg.quant is None:
                     # Populate calibration identifiers from the loader/model
diff --git a/tests/unit/eval/test_run_eval_script.py b/tests/unit/eval/test_run_eval_script.py
new file mode 100644
index 000000000..85a9331cc
--- /dev/null
+++ b/tests/unit/eval/test_run_eval_script.py
@@ -0,0 +1,172 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""Unit tests for ``scripts/e2e_eval/run_eval.py``.
+
+The script is not packaged, so we load it via ``importlib`` (same pattern
+as ``TestBuildEvalResultEpField`` in ``test_eval.py``) and exercise the
+small helpers that gate the ``--no-quant`` / ``--precision`` injection
+for EPs run on the unquantized variant (currently VitisAI).
+"""
+
+from __future__ import annotations
+
+import importlib.util
+import sys
+from pathlib import Path
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+
+def _load_run_eval():
+    """Load scripts/e2e_eval/run_eval.py as a module."""
+    repo_root = Path(__file__).resolve().parents[3]
+    script_path = repo_root / "scripts" / "e2e_eval" / "run_eval.py"
+
+    # run_eval.py does ``sys.path.insert(0, str(Path(__file__).parent))``
+    # at import time so its sibling ``utils`` package resolves; mirror that
+    # here in case the module is loaded before the script runs.
+    scripts_dir = str(script_path.parent)
+    if scripts_dir not in sys.path:
+        sys.path.insert(0, scripts_dir)
+
+    spec = importlib.util.spec_from_file_location("_e2e_run_eval", script_path)
+    mod = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(mod)
+    return mod
+
+
+@pytest.fixture(scope="module")
+def run_eval():
+    return _load_run_eval()
+
+
+class TestShouldSkipWinmlQuant:
+    """Membership test for ``_should_skip_winml_quant``.
+
+    Aliases (case-insensitive) and the canonical ``*ExecutionProvider`` form
+    should both resolve via ``normalize_ep_name``.
+    """
+
+    @pytest.mark.parametrize(
+        "ep",
+        ["vitisai", "VitisAI", "VITISAI", "VitisAIExecutionProvider"],
+    )
+    def test_vitisai_skips_quant(self, run_eval, ep):
+        assert run_eval._should_skip_winml_quant(ep) is True
+
+    @pytest.mark.parametrize(
+        "ep",
+        [None, "", "cpu", "dml", "qnn", "QNNExecutionProvider", "DmlExecutionProvider"],
+    )
+    def test_other_eps_do_not_skip(self, run_eval, ep):
+        assert run_eval._should_skip_winml_quant(ep) is False
+
+
+class TestResolvePrecision:
+    """Behaviour of ``_resolve_precision`` for the new ``ep`` arg."""
+
+    def test_npu_default_unchanged(self, run_eval):
+        assert run_eval._resolve_precision("npu", None) == "w8a16"
+        assert run_eval._resolve_precision("npu", None, ep=None) == "w8a16"
+
+    def test_cpu_default_unchanged(self, run_eval):
+        assert run_eval._resolve_precision("cpu", None) is None
+        assert run_eval._resolve_precision("gpu", None) is None
+
+    def test_explicit_precision_takes_precedence_on_npu(self, run_eval):
+        assert run_eval._resolve_precision("npu", "fp16") == "fp16"
+        assert run_eval._resolve_precision("cpu", "fp16", ep="DmlExecutionProvider") == "fp16"
+
+    def test_skip_quant_ep_drops_default(self, run_eval):
+        assert run_eval._resolve_precision("npu", None, ep="vitisai") is None
+        assert (
+            run_eval._resolve_precision("npu", None, ep="VitisAIExecutionProvider") is None
+        )
+
+    def test_skip_quant_ep_drops_explicit_with_warning(self, run_eval, capsys):
+        result = run_eval._resolve_precision("npu", "w8a8", ep="vitisai")
+        assert result is None
+        captured = capsys.readouterr()
+        # Warning must mention the dropped value and the EP so the override
+        # is visible in the log when an explicit per-model precision is set
+        # for an EP that runs on the unquantized variant.
+        assert "w8a8" in captured.out
+        assert "vitisai" in captured.out
+
+
+class TestRunBuildNoQuantInjection:
+    """``_run_build`` must append ``--no-quant`` to both winml config and
+    winml build invocations when the EP is in ``_EPS_SKIP_WINML_QUANT``.
+    """
+
+    @staticmethod
+    def _make_entry(hf_id="microsoft/resnet-50", task="image-classification"):
+        entry = MagicMock()
+        entry.hf_id = hf_id
+        entry.task = task
+        entry.perf_args = []
+        return entry
+
+    @staticmethod
+    def _make_config_proc(config_path: Path):
+        return {
+            "exit_code": 0,
+            "stdout": f"Generated {config_path}",
+            "stderr": "",
+            "elapsed": 0.1,
+            "command": "winml config ...",
+        }
+
+    @staticmethod
+    def _make_build_proc():
+        return {
+            "exit_code": 0,
+            "stdout": "Build cache: /tmp/x_model.onnx",
+            "stderr": "",
+            "elapsed": 0.1,
+            "command": "winml build ...",
+        }
+
+    def _invoke(self, run_eval, ep, tmp_path):
+        entry = self._make_entry()
+        # _run_build composes config_path = model_dir / "build_config.json"
+        # internally; pre-create it so the post-config glob fallback resolves
+        # to a single sub-config and the build loop runs once.
+        config_path = tmp_path / "build_config.json"
+        config_path.write_text("{}")
+
+        captured_args: list[list[str]] = []
+
+        def fake_subprocess(args, _timeout):
+            captured_args.append(list(args))
+            if "config" in args:
+                return self._make_config_proc(config_path)
+            return self._make_build_proc()
+
+        with patch.object(run_eval, "_run_subprocess", side_effect=fake_subprocess), patch.object(
+            run_eval, "_extract_onnx_path", return_value=str(tmp_path / "model.onnx")
+        ):
+            run_eval._run_build(
+                entry,
+                "npu",
+                None,
+                300,
+                tmp_path,
+                ep=ep,
+            )
+        return captured_args
+
+    def test_vitisai_injects_no_quant_into_both_config_and_build(self, run_eval, tmp_path):
+        calls = self._invoke(run_eval, "vitisai", tmp_path)
+        config_call = next(args for args in calls if "config" in args)
+        build_call = next(args for args in calls if "build" in args)
+        assert "--no-quant" in config_call
+        assert "--no-quant" in build_call
+
+    def test_other_ep_omits_no_quant(self, run_eval, tmp_path):
+        calls = self._invoke(run_eval, "dml", tmp_path)
+        assert all("--no-quant" not in args for args in calls)

From 3dcee5b8e5e7fa4b57f35bd4c0130e7d7cb909ba Mon Sep 17 00:00:00 2001
From: ziyuanguo1998 <Siryuanshao@gmail.com>
Date: Fri, 29 May 2026 16:41:27 +0800
Subject: [PATCH 019/143] fix(inspect): validate --task at Click time with
 clean error (#546) (#771)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Closes #546.

`winml inspect --task bogus-task` was leaking optimum's internal
`TasksManager` class name and pointing users to optimum docs:

> Error: Inspection error: Task 'bogus-task' not supported by
TasksManager. Check optimum documentation for supported tasks.

Now the value is validated at Click parse time against the hand-coded
`KNOWN_TASKS` set, before any heavy imports:

```
$ winml inspect -m microsoft/resnet-50 --task bogus-task
Usage: winml inspect [OPTIONS]
Try 'winml inspect --help' for help.

Error: Invalid task 'bogus-task'. Valid: audio-classification, audio-frame-classification, audio-xvector, automatic-speech-recognition, depth-estimation, ... (35 total). See 'winml inspect --list-tasks' for the full list.
```

- Exit code 2 (Click UsageError)
- No third-party class names; no optimum-docs pointer
- Callback imports only `..loader.task.KNOWN_TASKS` — avoids the ~10s
optimum/transformers cold start, so the fail-fast stays fast
- `--list-tasks` and valid `--task` paths unchanged

Co-authored-by: Ziyuan Guo (WE TEAM) <ziyuanguo@microsoft.com>
---
 src/winml/modelkit/commands/inspect.py  | 22 +++++++++++++++
 tests/unit/commands/test_inspect_cli.py | 37 +++++++++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/src/winml/modelkit/commands/inspect.py b/src/winml/modelkit/commands/inspect.py
index d52b9d33c..1eec26467 100644
--- a/src/winml/modelkit/commands/inspect.py
+++ b/src/winml/modelkit/commands/inspect.py
@@ -40,6 +40,27 @@
 _LOCAL_FILE_EXTS = frozenset({".onnx", ".pt", ".pth", ".safetensors", ".bin"})
 
 
+def _validate_task(
+    ctx: click.Context, param: click.Parameter, value: str | None
+) -> str | None:
+    """Click-time validation for --task against the hand-coded KNOWN_TASKS set.
+
+    Imports only ..loader.task to keep validation cheap — going through optimum
+    would cost ~10s on a warm cache and defeats fail-fast on bad input.
+    """
+    if value is None:
+        return None
+    from ..loader.task import KNOWN_TASKS
+
+    if value in KNOWN_TASKS:
+        return value
+    examples = ", ".join(sorted(KNOWN_TASKS)[:5])
+    raise click.UsageError(
+        f"Invalid task '{value}'. Valid: {examples}, ... ({len(KNOWN_TASKS)} total). "
+        f"See 'winml inspect --list-tasks' for the full list."
+    )
+
+
 def _looks_like_local_path(model_id: str) -> bool:
     """Return True when model_id is explicitly a local path.
 
@@ -86,6 +107,7 @@ def _looks_like_local_path(model_id: str) -> bool:
     "-t",
     "--task",
     default=None,
+    callback=_validate_task,
     help="Override auto-detected task (e.g., image-classification, feature-extraction)",
 )
 @click.option(
diff --git a/tests/unit/commands/test_inspect_cli.py b/tests/unit/commands/test_inspect_cli.py
index 601f6ff02..13568d872 100644
--- a/tests/unit/commands/test_inspect_cli.py
+++ b/tests/unit/commands/test_inspect_cli.py
@@ -111,6 +111,43 @@ def test_invalid_format_rejected(self, runner: CliRunner) -> None:
         result = runner.invoke(inspect, ["-m", "test", "-f", "xml"], obj={})
         assert result.exit_code != 0
 
+    def test_invalid_task_rejected_at_click_time(self, runner: CliRunner) -> None:
+        """`--task bogus-task` must fail with a clean error before any heavy work.
+
+        Patches _inspect_model_v2 to assert validation kicks in *before* the API
+        is reached — fail-fast on bad input.
+        """
+        from winml.modelkit.commands.inspect import inspect
+
+        with patch(_INSPECT_MODEL) as mock_api:
+            result = runner.invoke(
+                inspect, ["-m", "test", "--task", "bogus-task"], obj={}
+            )
+            assert result.exit_code == 2, f"Expected exit 2, got {result.exit_code}"
+            mock_api.assert_not_called()
+            # User-facing error must name the bad value and point to --list-tasks,
+            # and must NOT leak internal optimum jargon (see issue #546).
+            assert "bogus-task" in result.output
+            assert "--list-tasks" in result.output
+            assert "TasksManager" not in result.output
+            assert "optimum" not in result.output.lower()
+
+    def test_valid_task_accepted(
+        self,
+        runner: CliRunner,
+        mock_inspect_result: MagicMock,
+    ) -> None:
+        from winml.modelkit.commands.inspect import inspect
+
+        with (
+            patch(_INSPECT_MODEL, return_value=mock_inspect_result),
+            patch(_OUTPUT_TABLE),
+        ):
+            result = runner.invoke(
+                inspect, ["-m", "test", "--task", "image-classification"], obj={}
+            )
+            assert result.exit_code == 0, f"Failed: {result.output}"
+
 
 # =============================================================================
 # OUTPUT FORMAT TESTS

From 3f0a890b1dc227f95a5c8fa56d2265015665fe7a Mon Sep 17 00:00:00 2001
From: ziyuanguo1998 <Siryuanshao@gmail.com>
Date: Fri, 29 May 2026 16:52:57 +0800
Subject: [PATCH 020/143] fix: align `winml catalog` `-t` short flag with other
 commands (#541) (#772)

## Summary

Fixes #541.

`winml catalog` was the only command where `-t` did NOT mean `--task`:

| Command   | `-t` means       |
|-----------|------------------|
| `inspect` | `--task`         |
| `export`  | `--task`         |
| `config`  | `--task`         |
| `catalog` | `--model-type` (inconsistent) |

A user who has memorized `-t` to mean `--task` in 3 commands would type
`-t image-classification` against `winml catalog` and silently get
`--model-type=image-classification` (no such model type) instead.

## Change

In `src/winml/modelkit/commands/catalog.py`:
- Dropped the `-t` short from `--model-type` (no short alias now).
- Moved `-t` to `--task` (replacing the previous `-k`).

`--model-type` is still fully supported via its long form.

Adds a regression guard test (`test_model_type_has_no_short_flag`) that
checks both the `--help` output AND that passing a model_type via `-t`
is interpreted as a task. All 115 catalog tests pass.

Co-authored-by: Ziyuan Guo (WE TEAM) <ziyuanguo@microsoft.com>
---
 src/winml/modelkit/commands/catalog.py |  3 +-
 tests/cli/test_catalog_cli.py          | 38 ++++++++++++++++++++------
 2 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/src/winml/modelkit/commands/catalog.py b/src/winml/modelkit/commands/catalog.py
index 4c0001f88..3341a9d72 100644
--- a/src/winml/modelkit/commands/catalog.py
+++ b/src/winml/modelkit/commands/catalog.py
@@ -363,14 +363,13 @@ def _save_json(data: Any, path: Path) -> None:
 @click.command()
 @click.option(
     "--model-type",
-    "-t",
     default=None,
     metavar="TYPE",
     help="Filter by model architecture (e.g. bert, roberta, vit).",
 )
 @click.option(
     "--task",
-    "-k",
+    "-t",
     default=None,
     metavar="TASK",
     help="Filter by HuggingFace task (e.g. text-classification, image-segmentation).",
diff --git a/tests/cli/test_catalog_cli.py b/tests/cli/test_catalog_cli.py
index ad3dad146..aad48f2ca 100644
--- a/tests/cli/test_catalog_cli.py
+++ b/tests/cli/test_catalog_cli.py
@@ -184,15 +184,35 @@ def test_invalid_device_choice_exits_two(self) -> None:
         assert result.exit_code == 2
         assert "Invalid value for '-d' / '--device'" in result.output
 
-    def test_short_flags_accepted(self, type_task_pair: tuple[str, str], tmp_path: Path) -> None:
-        """-t / -k short aliases are accepted by the parser."""
-        model_type, task = type_task_pair
-        models = _run_json(tmp_path / "out.json", "-t", model_type, "-k", task)
-        assert len(models) > 0, f"Expected at least one {model_type}/{task} model"
-        assert all(
-            m["model_type"].lower() == model_type.lower() and m["task"].lower() == task.lower()
-            for m in models
-        )
+    def test_short_flag_task_accepted(
+        self, type_task_pair: tuple[str, str], tmp_path: Path
+    ) -> None:
+        """``-t`` is the short alias for ``--task`` (consistent with other commands)."""
+        _, task = type_task_pair
+        models = _run_json(tmp_path / "out.json", "-t", task)
+        assert len(models) > 0, f"Expected at least one model with task {task}"
+        assert all(m["task"].lower() == task.lower() for m in models)
+
+    def test_model_type_has_no_short_flag(
+        self, help_output: str, model_types: list[str], tmp_path: Path
+    ) -> None:
+        """``-t`` must mean ``--task`` here, matching inspect/export/config.
+
+        Regression guard for issue #541: ``-t`` previously bound to
+        ``--model-type`` in catalog only, while every other command used
+        ``-t`` for ``--task``.
+        """
+        assert "-t, --task" in help_output
+        assert "-t, --model-type" not in help_output
+
+        # A real model_type passed via -t must be interpreted as a task,
+        # so the result is disjoint from filtering by --model-type.
+        if not model_types:
+            pytest.skip("catalog has no model_types to probe")
+        mtype = model_types[0]
+        as_task = _run_json(tmp_path / "as_task.json", "-t", mtype)
+        as_mtype = _run_json(tmp_path / "as_mtype.json", "--model-type", mtype)
+        assert _model_ids(as_task).isdisjoint(_model_ids(as_mtype)) or len(as_task) == 0
 
 
 # ===========================================================================

From b59a5d8249e9fbf65f1841ce3baa7ad41808f289 Mon Sep 17 00:00:00 2001
From: Zhenchao Ni <zhenni@microsoft.com>
Date: Mon, 1 Jun 2026 17:06:39 +0800
Subject: [PATCH 021/143] Skip some e2e test cases or assertion (#796)

**Skips compilation related cases**

There are some model fail to be compiled in VitisAI Execution Provider.
The error is an "Access Violation" error which causes the python process
to crash. This would be an EP side problem. To unblock our e2e test, I
have skipped them for VitisAI

**Skips npu usage assertion for small model**

Running small mock model can be super fast. For this case, the NPU usage
is zero. However, our assertion logic still expectes to have some NPU
usage. This makes the e2e not stable. Considering that we have already
this assertion on real model e2e test cases, I skip this assertion for
small model only.

**Skips eval metric value range assertion**

The eval e2e test only uses 10 samples because we aim to see the eval
pipeline is working rather than truly eval a model in e2e. In assertion
logic, we have a metric range. But the metric range is calcuated on qnn
device, which may not be the same for other devices. Using the same
range may cause e2e instable. Therefore, I only assert the metric range
for qnn. For other device, I just assert the metric value is available.
---
 tests/e2e/require_ep.py       | 19 +++++++++++++++
 tests/e2e/test_compile_e2e.py |  2 ++
 tests/e2e/test_eval_e2e.py    | 44 ++++++++++++++++++++++++++++-------
 tests/e2e/test_perf_e2e.py    | 34 ++++++++++++++++++++-------
 4 files changed, 82 insertions(+), 17 deletions(-)

diff --git a/tests/e2e/require_ep.py b/tests/e2e/require_ep.py
index 26fa12434..d4d2f744a 100644
--- a/tests/e2e/require_ep.py
+++ b/tests/e2e/require_ep.py
@@ -89,3 +89,22 @@ def require_not_ep(ep: str) -> None:
 
     if WinMLEPRegistry.get_instance().is_ep_available(provider):
         pytest.skip(f"EP is available on this host (test requires it absent): {provider}")
+
+
+def is_host(ep: str) -> bool:
+    """Return True iff ``ep`` is available on this host.
+
+    Non-skipping probe used to gate assertions whose tolerance depends on
+    the active EP (e.g. only enforce a metric-magnitude bound on QNN,
+    where quantization preserves accuracy, while still running the rest
+    of the test on every EP for pipeline-regression coverage).
+    """
+    from winml.modelkit.session import WinMLEPRegistry
+    from winml.modelkit.utils import normalize_ep_name
+
+    provider = normalize_ep_name(ep)
+    if provider is None:
+        return False
+    if provider in ("CPUExecutionProvider", "DmlExecutionProvider"):
+        return True
+    return WinMLEPRegistry.get_instance().is_ep_available(provider)
diff --git a/tests/e2e/test_compile_e2e.py b/tests/e2e/test_compile_e2e.py
index b2f52e1be..82e960f13 100644
--- a/tests/e2e/test_compile_e2e.py
+++ b/tests/e2e/test_compile_e2e.py
@@ -681,6 +681,8 @@ def test_good_input_compiles_and_runs(
     can load and run on the requested EP+device.
     """
     require_ep(require_ep_name)
+    # Skip e2e for VitisAI due to Windows Access violation in model compilation for some models
+    require_not_ep("vitisai")
 
     out = tmp_path / f"{device or 'nodev'}_{ep or 'noep'}.onnx"
     cmd = ["-m", str(simple_matmul_onnx), "-o", str(out)]
diff --git a/tests/e2e/test_eval_e2e.py b/tests/e2e/test_eval_e2e.py
index 14a11c65a..2e0926aa6 100644
--- a/tests/e2e/test_eval_e2e.py
+++ b/tests/e2e/test_eval_e2e.py
@@ -34,7 +34,7 @@
 from winml.modelkit.commands.eval import eval as eval_cmd
 
 from .conftest import find_cache_dir
-from .require_ep import require_ep
+from .require_ep import is_host, require_ep, require_not_ep
 
 
 if TYPE_CHECKING:
@@ -188,9 +188,14 @@ def test_text_classification(self, runner: CliRunner, tmp_path: Path) -> None:
         ])
         data = _assert_metrics_present(out, ["accuracy"])
         # bert-mrpc full MRPC ≈ 0.86; MRPC majority baseline ≈ 0.68.
-        _assert_in_range(data["metrics"], "accuracy", 0.6, 1.0)
+        # Magnitude assertion is QNN-only: VitisAI W8A8 quantization
+        # degrades this small BERT well below the floor.
+        if is_host("qnn"):
+            _assert_in_range(data["metrics"], "accuracy", 0.6, 1.0)
 
     def test_token_classification(self, runner: CliRunner, tmp_path: Path) -> None:
+        # Skip e2e for VitisAI due to Windows Access violation in model compilation for some models
+        require_not_ep("vitisai")
         out = tmp_path / "result.json"
         _invoke(runner, [
             "-m", "dslim/bert-base-NER",
@@ -226,6 +231,8 @@ def test_object_detection(self, runner: CliRunner, tmp_path: Path) -> None:
             assert -1.0 <= v <= 1.0, f"{k}={v} outside [-1, 1]"
 
     def test_image_segmentation(self, runner: CliRunner, tmp_path: Path) -> None:
+        # Skip e2e for VitisAI due to Windows Access violation in model compilation for some models
+        require_not_ep("vitisai")
         out = tmp_path / "result.json"
         _invoke(runner, [
             "-m", "nvidia/segformer-b1-finetuned-ade-512-512",
@@ -265,8 +272,11 @@ def test_feature_extraction(self, runner: CliRunner, tmp_path: Path) -> None:
         ])
         # Spearman correlation reported as percentage in [-100, 100].
         # MiniLM-L6-v2 full STSB ≈ 80; 10-sample noise can be large.
+        # Magnitude assertion is QNN-only: VitisAI W8A8 quantization
+        # produces near-random embeddings for this small encoder.
         data = _assert_metrics_present(out, ["cosine_spearman"])
-        _assert_in_range(data["metrics"], "cosine_spearman", 40.0, 100.0)
+        if is_host("qnn"):
+            _assert_in_range(data["metrics"], "cosine_spearman", 40.0, 100.0)
 
     def test_sentence_similarity(self, runner: CliRunner, tmp_path: Path) -> None:
         # Alias for feature-extraction.
@@ -277,8 +287,10 @@ def test_sentence_similarity(self, runner: CliRunner, tmp_path: Path) -> None:
             "--samples", SAMPLES,
             "-o", str(out),
         ])
+        # Same quantization caveat as test_feature_extraction.
         data = _assert_metrics_present(out, ["cosine_spearman"])
-        _assert_in_range(data["metrics"], "cosine_spearman", 40.0, 100.0)
+        if is_host("qnn"):
+            _assert_in_range(data["metrics"], "cosine_spearman", 40.0, 100.0)
 
     def test_image_feature_extraction(
         self, runner: CliRunner, tmp_path: Path,
@@ -305,6 +317,8 @@ def test_image_feature_extraction(
 
     def test_image_to_text_fp16(self, runner: CliRunner, tmp_path: Path) -> None:
         # Only test that exercises non-auto --precision.
+        # Skip e2e for VitisAI due to Windows Access violation in model compilation for some models
+        require_not_ep("vitisai")
         out = tmp_path / "result.json"
         _invoke(runner, [
             "-m", "Salesforce/blip-image-captioning-base",
@@ -369,6 +383,8 @@ def test_zero_shot_classification(
     def test_zero_shot_image_classification(
         self, runner: CliRunner, tmp_path: Path,
     ) -> None:
+        # Skip e2e for VitisAI due to Windows Access violation in model compilation for some models
+        require_not_ep("vitisai")
         out = tmp_path / "result.json"
         _invoke(runner, [
             "-m", "openai/clip-vit-base-patch32",
@@ -422,6 +438,8 @@ def test_onnx_file_mode_monolithic(
     def test_onnx_file_mode_split_encoder(
         self, runner: CliRunner, tmp_path: Path,
     ) -> None:
+        # Skip e2e for VitisAI due to Windows Access violation in model compilation for some models
+        require_not_ep("vitisai")
         hf_id = "openai/clip-vit-base-patch32"
         task = "zero-shot-image-classification"
 
@@ -573,12 +591,16 @@ def test_dataset_name_explicit(
             "--samples", SAMPLES,
             "-o", str(out),
         ])
+        # Same quantization caveat as TestEvalPerTask.test_text_classification.
         data = _assert_metrics_present(out, ["accuracy"])
-        _assert_in_range(data["metrics"], "accuracy", 0.6, 1.0)
+        if is_host("qnn"):
+            _assert_in_range(data["metrics"], "accuracy", 0.6, 1.0)
 
     def test_label_mapping_image_segmentation(
         self, runner: CliRunner, tmp_path: Path,
     ) -> None:
+        # Skip e2e for VitisAI due to Windows Access violation in model compilation for some models
+        require_not_ep("vitisai")
         from pathlib import Path as _Path
 
         label_map = _Path(ADE20K_LABEL_MAP)
@@ -614,8 +636,10 @@ def test_config_file_basic(
             "--config", str(cfg),
             "-o", str(out),
         ])
+        # Same quantization caveat as TestEvalPerTask.test_text_classification.
         data = _assert_metrics_present(out, ["accuracy"])
-        _assert_in_range(data["metrics"], "accuracy", 0.6, 1.0)
+        if is_host("qnn"):
+            _assert_in_range(data["metrics"], "accuracy", 0.6, 1.0)
         assert data["dataset"]["samples"] == 5, (
             f"expected samples=5 from config, got {data['dataset']['samples']}"
         )
@@ -636,8 +660,10 @@ def test_config_file_cli_override(
             "--samples", "7",
             "-o", str(out),
         ])
+        # Same quantization caveat as TestEvalPerTask.test_text_classification.
         data = _assert_metrics_present(out, ["accuracy"])
-        _assert_in_range(data["metrics"], "accuracy", 0.6, 1.0)
+        if is_host("qnn"):
+            _assert_in_range(data["metrics"], "accuracy", 0.6, 1.0)
         assert data["dataset"]["samples"] == 7, (
             f"expected CLI override samples=7, got {data['dataset']['samples']}"
         )
@@ -652,8 +678,10 @@ def test_auto_task_detection(
             "--samples", SAMPLES,
             "-o", str(out),
         ])
+        # Same quantization caveat as TestEvalPerTask.test_text_classification.
         data = _assert_metrics_present(out, ["accuracy"])
-        _assert_in_range(data["metrics"], "accuracy", 0.6, 1.0)
+        if is_host("qnn"):
+            _assert_in_range(data["metrics"], "accuracy", 0.6, 1.0)
         assert data.get("task") == "text-classification", (
             f"expected auto-detected task, got {data.get('task')!r}"
         )
diff --git a/tests/e2e/test_perf_e2e.py b/tests/e2e/test_perf_e2e.py
index 8952b5c7d..48c7da303 100644
--- a/tests/e2e/test_perf_e2e.py
+++ b/tests/e2e/test_perf_e2e.py
@@ -78,7 +78,9 @@ def _require_npu() -> None:
         pytest.skip("No NPU detected via PDH")
 
 
-def _assert_hw_monitor_section(data: dict, device_kind: str) -> None:
+def _assert_hw_monitor_section(
+    data: dict, device_kind: str, *, require_utilization: bool = True
+) -> None:
     """Assert the ``hw_monitor`` section is present and well-formed.
 
     Checks the section emitted by HWMonitor when --monitor is passed:
@@ -94,7 +96,8 @@ def _assert_hw_monitor_section(data: dict, device_kind: str) -> None:
     else:
         assert hw["device_kind"] == device_kind
         assert hw["adapter_luid"] is not None
-        assert hw[device_kind]["mean_pct"] > 0
+        if require_utilization:
+            assert hw[device_kind]["mean_pct"] > 0
 
 
 def _build_perf_args(
@@ -138,7 +141,12 @@ def _build_perf_args(
 
 
 def _assert_monitor_result(
-    data: dict, *, device: str, device_kind: str | None = None, ep: str | None = None
+    data: dict,
+    *,
+    device: str,
+    device_kind: str | None = None,
+    ep: str | None = None,
+    require_utilization: bool = True,
 ) -> None:
     """Assert a monitored perf run produced the expected device + hw_monitor data.
 
@@ -146,7 +154,7 @@ def _assert_monitor_result(
     measured, and delegates the hw_monitor checks to
     :func:`_assert_hw_monitor_section`. ``device_kind`` defaults to ``device``
     when not given (only differs for cases like VitisAI where ``--device`` and
-    the monitored hardware diverge).
+    the monitored hardware diverge). ``require_utilization`` is forwarded.
     """
     if device_kind is None:
         device_kind = device
@@ -154,7 +162,7 @@ def _assert_monitor_result(
     assert data["latency_ms"]["mean"] > 0
     if ep is not None:
         assert data["benchmark_info"]["ep"] == ep
-    _assert_hw_monitor_section(data, device_kind)
+    _assert_hw_monitor_section(data, device_kind, require_utilization=require_utilization)
 
 
 # ===========================================================================
@@ -283,7 +291,8 @@ def test_benchmark_gpu_monitor(self, tmp_path: Path, model_arg: str):
 
         assert output_file.exists(), f"Output file not created: {output_file}"
         data = json.loads(output_file.read_text())
-        _assert_monitor_result(data, device="gpu")
+        # Tiny synthetic fixture: below PDH utilization-publish floor.
+        _assert_monitor_result(data, device="gpu", require_utilization=False)
 
     def test_benchmark_npu_monitor(self, tmp_path: Path, model_arg: str):
         """Benchmark on NPU with --monitor.
@@ -308,7 +317,8 @@ def test_benchmark_npu_monitor(self, tmp_path: Path, model_arg: str):
 
         assert output_file.exists(), f"Output file not created: {output_file}"
         data = json.loads(output_file.read_text())
-        _assert_monitor_result(data, device="npu")
+        # Tiny synthetic fixture: below PDH utilization-publish floor.
+        _assert_monitor_result(data, device="npu", require_utilization=False)
 
     def test_benchmark_auto(self, tmp_path: Path, model_arg: str):
         """Benchmark with --device auto.
@@ -408,7 +418,10 @@ def test_benchmark_ep_device_gpu(self, ep: str, tmp_path: Path, model_arg: str):
 
         assert output_file.exists()
         data = json.loads(output_file.read_text())
-        _assert_monitor_result(data, device="gpu", ep=EP_ALIASES[ep])
+        # Tiny synthetic fixture: below PDH utilization-publish floor.
+        _assert_monitor_result(
+            data, device="gpu", ep=EP_ALIASES[ep], require_utilization=False
+        )
 
     @pytest.mark.parametrize("ep", NPU_EPS)
     def test_benchmark_ep_device_npu(self, ep: str, tmp_path: Path, model_arg: str):
@@ -434,7 +447,10 @@ def test_benchmark_ep_device_npu(self, ep: str, tmp_path: Path, model_arg: str):
 
         assert output_file.exists()
         data = json.loads(output_file.read_text())
-        _assert_monitor_result(data, device="npu", ep=EP_ALIASES[ep])
+        # Tiny synthetic fixture: below PDH utilization-publish floor.
+        _assert_monitor_result(
+            data, device="npu", ep=EP_ALIASES[ep], require_utilization=False
+        )
 
 
 # ===========================================================================

From 4db0187696c73f1a2bd4f42557ed761344e843c2 Mon Sep 17 00:00:00 2001
From: xieofxie <xieofxie@126.com>
Date: Tue, 2 Jun 2026 11:25:57 +0800
Subject: [PATCH 022/143] example: add microsoft/swin-large-patch4-window7-224
 (#787)

uv run
~\ModelKit\examples\microsoft-swin-large-patch4-window7-224\example.py
--onnx
~\.cache\winml\artifacts\microsoft_swin-large-patch4-window7-224\imgcls_ec485f4653d962b9_quantized.onnx
True label: house finch, linnet, Carpodacus mexicanus (synset=n01532829,
id=12)

Top 5 predictions:
  1. house finch, linnet, Carpodacus mexicanus (0.9127)
  2. brambling, Fringilla montifringilla (0.0122)
  3. goldfinch, Carduelis carduelis (0.0028)
  4. chickadee (0.0013)
  5. junco, snowbird (0.0013)

Verdict (top-1): PASS

Annotated image written to prediction.png

---------

Co-authored-by: hualxie <hualxie@microsoft.com>
---
 .../README.md                                 | 144 ++++++++++
 .../example.py                                | 252 ++++++++++++++++++
 .../README.md                                 |   0
 .../example.py                                |   2 +-
 4 files changed, 397 insertions(+), 1 deletion(-)
 create mode 100644 examples/microsoft_swin-large-patch4-window7-224/README.md
 create mode 100644 examples/microsoft_swin-large-patch4-window7-224/example.py
 rename examples/{microsoft-table-transformer-detection => microsoft_table-transformer-detection}/README.md (100%)
 rename examples/{microsoft-table-transformer-detection => microsoft_table-transformer-detection}/example.py (99%)

diff --git a/examples/microsoft_swin-large-patch4-window7-224/README.md b/examples/microsoft_swin-large-patch4-window7-224/README.md
new file mode 100644
index 000000000..ff9836712
--- /dev/null
+++ b/examples/microsoft_swin-large-patch4-window7-224/README.md
@@ -0,0 +1,144 @@
+# microsoft/swin-large-patch4-window7-224
+
+End-to-end build + accuracy + latency walkthrough for
+`microsoft/swin-large-patch4-window7-224` (task: `image-classification`)
+on the NPU, using the `timm/mini-imagenet` `test` split as the dataset.
+
+Run all commands from the `ModelKit` repo root.
+
+---
+
+## 1. Build the model on NPU
+
+Two steps: `winml config` generates a build config JSON, then
+`winml build` consumes it. `--precision w8a16` is the default NPU
+precision; the build produces a QDQ-quantized ONNX that executes on
+the NPU.
+
+```powershell
+winml config `
+  -m microsoft/swin-large-patch4-window7-224 `
+  --task image-classification `
+  --device npu `
+  --ep openvino `
+  --precision w8a16 `
+  -o build_config.json
+```
+
+```powershell
+winml build `
+  -c build_config.json `
+  -m microsoft/swin-large-patch4-window7-224 `
+  --device npu `
+  --ep openvino `
+  --use-cache
+```
+
+Artifacts land under
+`~/.cache/winml/artifacts/microsoft_swin-large-patch4-window7-224/` —
+the file to evaluate is `imgcls_*_quantized.onnx`.
+
+---
+
+## 2. Evaluate on NPU with `winml eval`
+
+The `timm/mini-imagenet` dataset is downloaded automatically from the
+HuggingFace Hub by `winml eval` — no separate dataset build step is
+needed.
+
+Pass the ONNX file to `-m` and the HuggingFace model ID to `--model-id`
+(needed for the image processor). `--output` writes a JSON file
+containing the parsed metrics:
+
+```powershell
+winml eval `
+  -m $HOME/.cache/winml/artifacts/microsoft_swin-large-patch4-window7-224/imgcls_<hash>_quantized.onnx `
+  --model-id microsoft/swin-large-patch4-window7-224 `
+  --task image-classification `
+  --device npu `
+  --ep openvino `
+  --dataset timm/mini-imagenet `
+  --split test `
+  --samples 1000 `
+  --output winml_eval_output.json
+```
+
+Replace `<hash>` with the actual filename produced by step 1.
+
+The accuracy value is `metrics.accuracy` inside
+`winml_eval_output.json`.
+
+---
+
+## 3. Measure latency with `winml perf`
+
+`winml perf` benchmarks the quantized ONNX directly using random
+inputs derived from the model's I/O configuration. Point `-m` at the
+same `*_quantized.onnx` produced in step 1. `--warmup` iterations are
+excluded from the statistics; `--iterations` is the measured sample
+count.
+
+```powershell
+winml perf `
+  -m $HOME/.cache/winml/artifacts/microsoft_swin-large-patch4-window7-224/imgcls_<hash>_quantized.onnx `
+  --device npu `
+  --ep openvino `
+  --warmup 10 `
+  --iterations 100 `
+  -o winml_perf_output.json
+```
+
+The output JSON contains `latency_ms` (`mean`, `min`, `max`, `p50`,
+`p90`, `p95`, `p99`, `std`) and `throughput` (`samples_per_sec`,
+`batches_per_sec`). Mean and p50 latency are the headline numbers;
+report them alongside the device and precision used.
+
+---
+
+## 4. Evaluate the original PyTorch model
+
+`run_pytorch_baseline.py` loads the HuggingFace checkpoint with native
+PyTorch on CPU and emits the same metric so the two runs are directly
+comparable. The last stdout line is a single JSON object:
+`{"metric": "accuracy", "value": <float>, "num_samples": <int>}`.
+
+Pass `--perf-iterations N` (and optionally `--perf-warmup K`, default
+`10`) to also measure PyTorch inference latency. When `N > 0`, the
+script reuses the HuggingFace pipeline on the first dataset sample,
+runs `K` untimed warmup iterations, then `N` timed iterations, and
+emits a latency JSON line on stdout immediately before the metric
+line. The metric line is still the final stdout line.
+
+```powershell
+uv run python scripts/e2e_eval/run_pytorch_baseline.py `
+  --model microsoft/swin-large-patch4-window7-224 `
+  --task image-classification `
+  --device cpu `
+  --num-samples 1000 `
+  --dataset timm/mini-imagenet `
+  --split test `
+  --winml-metric-key accuracy `
+  --perf-warmup 10 `
+  --perf-iterations 100
+```
+
+The latency JSON line has the same `mean_ms` / `min_ms` / `max_ms` /
+`p50_ms` / `p90_ms` / `p95_ms` / `p99_ms` keys as `winml perf` so the
+two runs can be compared directly.
+
+---
+
+## 5. Comparing the results
+
+For WinML, the accuracy value comes from `metrics.accuracy` in
+`winml_eval_output.json` while for the PyTorch baseline, it comes from
+the last stdout line. Latency comes from `latency_ms` in
+`winml_perf_output.json` for WinML and from the latency JSON line on
+stdout for the PyTorch baseline.
+
+Result on CPU Intel(R) Core(TM) Ultra 7 258V:
+
+| Model | Device | Precision | accuracy | mean latency (ms) | p50 latency (ms) | Size (MB) |
+|---|---|---|---|---|---|---|
+| PyTorch | CPU | fp32 | 0.837 | 662.3 | 647.9 | 750 |
+| WinML (ONNX) | OpenVINO NPU | w8a16 (QDQ) | 0.836 | 64.9 | 64.3 | 193 |
diff --git a/examples/microsoft_swin-large-patch4-window7-224/example.py b/examples/microsoft_swin-large-patch4-window7-224/example.py
new file mode 100644
index 000000000..0ade9ab60
--- /dev/null
+++ b/examples/microsoft_swin-large-patch4-window7-224/example.py
@@ -0,0 +1,252 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""Run one image-classification inference with the WinML-built ONNX.
+
+Mirrors the HuggingFace Swin Transformer usage example
+(https://huggingface.co/docs/transformers/main/en/model_doc/swin) but
+loads the quantized ONNX produced by ``winml build`` (step 1 of the
+README) via :class:`WinMLAutoModel` instead of the original PyTorch
+checkpoint.
+
+The script preprocesses one image, runs inference, prints the top-5
+predicted classes (HF-docs format), and writes an annotated image with
+the top-1 label drawn in the corner so the result is visually
+verifiable.
+
+Usage::
+
+    uv run python examples/microsoft_swin-large-patch4-window7-224/example.py `
+      --onnx $HOME/.cache/winml/artifacts/microsoft_swin-large-patch4-window7-224/`
+            `imgcls_<hash>_quantized.onnx
+"""
+
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+import numpy as np
+import torch
+from PIL import Image, ImageDraw, ImageFont
+from transformers import AutoConfig, AutoImageProcessor
+
+from winml.modelkit import WinMLAutoModel
+
+
+HF_MODEL_ID = "microsoft/swin-large-patch4-window7-224"
+DEFAULT_DATASET = "timm/mini-imagenet"
+DEFAULT_DATASET_SPLIT = "test"
+
+
+def parse_args() -> argparse.Namespace:
+    """Parse command-line arguments."""
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument(
+        "--onnx",
+        required=True,
+        type=Path,
+        help="Path to the quantized ONNX produced by step 1 of the README "
+        "(e.g. imgcls_<hash>_quantized.onnx).",
+    )
+    parser.add_argument(
+        "--device",
+        default="npu",
+        choices=["auto", "npu", "gpu", "cpu"],
+        help="Target device (default: npu).",
+    )
+    parser.add_argument(
+        "--ep",
+        default="openvino",
+        help="Execution provider alias (default: openvino).",
+    )
+    parser.add_argument(
+        "--image",
+        type=Path,
+        default=None,
+        help="Local image path. If omitted, streams the first image from "
+        f"the {DEFAULT_DATASET} {DEFAULT_DATASET_SPLIT} split.",
+    )
+    parser.add_argument(
+        "--top-k",
+        type=int,
+        default=5,
+        help="Number of top predictions to print (default: 5).",
+    )
+    parser.add_argument(
+        "--output",
+        type=Path,
+        default=Path("prediction.png"),
+        help="Where to write the annotated image (default: prediction.png).",
+    )
+    return parser.parse_args()
+
+
+def load_image(image_arg: Path | None) -> tuple[Image.Image, str | None]:
+    """Load an image and (when streamed from the eval dataset) its WordNet synset.
+
+    Returns ``(image, true_synset)``. ``true_synset`` is the WordNet ID
+    (e.g. ``"n01532829"``) for the dataset's labelled class, used as the
+    universal bridge between the dataset's class indexing and the model's.
+    ``None`` when the user supplied a custom ``--image``.
+    """
+    if image_arg is not None:
+        return Image.open(image_arg.expanduser()).convert("RGB"), None
+
+    from datasets import load_dataset
+
+    # streaming=True so we only fetch the first sample instead of downloading
+    # the whole split. The ClassLabel feature (and its .names list) is still
+    # available on the streamed dataset, so we can recover the WordNet synset
+    # for the sample's integer label. trust_remote_code=False refuses to run
+    # any dataset-bundled loading script.
+    dataset = load_dataset(
+        DEFAULT_DATASET,
+        split=DEFAULT_DATASET_SPLIT,
+        streaming=True,
+        trust_remote_code=False,
+    )
+    sample = next(iter(dataset))
+
+    image = sample["image"]
+    if not isinstance(image, Image.Image):
+        image = Image.fromarray(np.asarray(image))
+    image = image.convert("RGB")
+
+    label_value = sample.get("label")
+    label_feature = dataset.features.get("label")
+    if label_value is None or label_feature is None or not hasattr(label_feature, "names"):
+        return image, None
+    return image, label_feature.names[int(label_value)]
+
+
+def imagenet_synset_to_id() -> dict[str, int]:
+    """Map WordNet synset ID -> ImageNet-1k class id (0-999).
+
+    Uses ``timm.data.ImageNetInfo`` so we don't have to ship the 1000-entry
+    list inline. The mapping is the canonical ImageNet-1k ordering that
+    the model was trained against.
+
+    Requires the optional ``timm`` package (imported lazily here, like
+    ``datasets`` in ``load_image``); raises a clear error if it is missing.
+    """
+    try:
+        from timm.data import ImageNetInfo
+    except ImportError as e:
+        raise ImportError(
+            "imagenet_synset_to_id() requires the 'timm' package. "
+            "Install it with `pip install timm`."
+        ) from e
+
+    info = ImageNetInfo()
+    return {synset: idx for idx, synset in enumerate(info.label_names())}
+
+
+def draw_top_prediction(
+    image: Image.Image,
+    label: str,
+    score: float,
+) -> Image.Image:
+    """Draw the top-1 label + confidence on a copy of ``image``."""
+    annotated = image.copy()
+    draw = ImageDraw.Draw(annotated)
+    try:
+        font = ImageFont.truetype("arial.ttf", size=max(14, annotated.height // 30))
+    except OSError:
+        font = ImageFont.load_default()
+
+    caption = f"{label} ({score:.2f})"
+    tx0, ty0, tx1, ty1 = draw.textbbox((10, 10), caption, font=font)
+    pad = 6
+    draw.rectangle(
+        [(tx0 - pad, ty0 - pad), (tx1 + pad, ty1 + pad)],
+        fill=(0, 0, 0),
+    )
+    draw.text((10, 10), caption, fill=(255, 255, 255), font=font)
+    return annotated
+
+
+def main() -> None:
+    """Load the quantized ONNX, run one inference, print + save the result."""
+    args = parse_args()
+
+    image, true_synset = load_image(args.image)
+    image_processor = AutoImageProcessor.from_pretrained(HF_MODEL_ID)
+
+    # skip_build=True uses the ONNX as-is; it has already been optimized
+    # and quantized by `winml build`. use_cache=False avoids touching the
+    # winml artifact cache for this read-only example.
+    model = WinMLAutoModel.from_pretrained(
+        args.onnx.expanduser(),
+        task="image-classification",
+        device=args.device,
+        ep=args.ep,
+        skip_build=True,
+        use_cache=False,
+    )
+
+    # Match the processor's output size to the ONNX's static input shape so
+    # pixel_values matches (B, C, H, W) exactly.
+    input_shapes = (model.io_config.get("input_shapes") or [[]])[0]
+    # Only applies to 4D image inputs (B, C, H, W); skip for other shapes.
+    if len(input_shapes) == 4:
+        _, _, h, w = input_shapes
+        image_processor.size = {"height": h, "width": w}
+
+    inputs = image_processor(images=image, return_tensors="pt")
+    outputs = model(pixel_values=inputs["pixel_values"])
+
+    # logits: (1, num_classes). softmax → probabilities, then top-k.
+    logits = outputs.logits
+    probs = torch.softmax(logits, dim=-1)[0]
+    top_k = min(args.top_k, probs.numel())
+    top_scores, top_ids = torch.topk(probs, k=top_k)
+
+    # WinML's bare-ONNX path doesn't attach an HF config to the model, so
+    # pull id2label from the HF hub for human-readable label names.
+    id2label = AutoConfig.from_pretrained(HF_MODEL_ID).id2label
+
+    top_ids_list = top_ids.tolist()
+    top_label_names = [
+        id2label.get(label_id, str(label_id)) for label_id in top_ids_list
+    ]
+
+    # Resolve the dataset's WordNet synset to an ImageNet-1k class id so we
+    # can compare against the model's prediction. The dataset (e.g.
+    # timm/mini-imagenet) often uses its own 0..N indexing over a subset of
+    # ImageNet-1k, so the raw integer label from the dataset does NOT match
+    # the model's class id — the synset is the universal bridge.
+    true_label_id: int | None = None
+    if true_synset is not None:
+        synset_to_id = imagenet_synset_to_id()
+        true_label_id = synset_to_id.get(true_synset)
+
+    if true_synset is not None:
+        if true_label_id is not None:
+            true_label_name = id2label.get(true_label_id, str(true_label_id))
+            print(f"True label:  {true_label_name} (synset={true_synset}, id={true_label_id})")
+        else:
+            print(f"True label:  synset={true_synset} (not in ImageNet-1k vocabulary)")
+    else:
+        print("True label:  unknown (custom --image)")
+    print(f"\nTop {top_k} predictions:")
+    for rank, (label, score) in enumerate(
+        zip(top_label_names, top_scores.tolist(), strict=True), start=1,
+    ):
+        print(f"  {rank}. {label} ({score:.4f})")
+
+    if true_label_id is not None:
+        verdict = "PASS" if top_ids_list[0] == true_label_id else "FAIL"
+        print(f"\nVerdict (top-1): {verdict}")
+
+    annotated = draw_top_prediction(image, top_label_names[0], float(top_scores[0].item()))
+    output_path = args.output.expanduser()
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    annotated.save(output_path)
+    print(f"\nAnnotated image written to {output_path}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/examples/microsoft-table-transformer-detection/README.md b/examples/microsoft_table-transformer-detection/README.md
similarity index 100%
rename from examples/microsoft-table-transformer-detection/README.md
rename to examples/microsoft_table-transformer-detection/README.md
diff --git a/examples/microsoft-table-transformer-detection/example.py b/examples/microsoft_table-transformer-detection/example.py
similarity index 99%
rename from examples/microsoft-table-transformer-detection/example.py
rename to examples/microsoft_table-transformer-detection/example.py
index cea67b448..54d098b79 100644
--- a/examples/microsoft-table-transformer-detection/example.py
+++ b/examples/microsoft_table-transformer-detection/example.py
@@ -13,7 +13,7 @@
 
 Usage::
 
-    uv run python examples/microsoft-table-transformer-detection/example.py `
+    uv run python examples/microsoft_table-transformer-detection/example.py `
       --onnx $HOME/.cache/winml/artifacts/microsoft_table-transformer-detection/`
             `objdet_<hash>_quantized.onnx
 """

From 9c94bbc594ed791460a99f4dff5801e9d5ec0b60 Mon Sep 17 00:00:00 2001
From: vortex-captain <75063846+vortex-captain@users.noreply.github.com>
Date: Tue, 2 Jun 2026 13:08:46 +0800
Subject: [PATCH 023/143] feat(timm): enable timm image-classification models
 via library routing (#790)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

timm checkpoints load through transformers'' generic `TimmWrapper`
(`model_type="timm_wrapper"`) and previously failed in **every** `winml`
command with *"Cannot detect task: config has no ''architectures''
field"*. Two gaps:

1. **Task/class detection** — timm repos load as `TimmWrapperConfig`
with `architectures=None`, so auto-detection could not resolve a task or
class.
2. **OnnxConfig location** — Optimum registers timm''s config
(`TimmDefaultOnnxConfig`) only under `library_name="timm"`, but every
`winml` lookup defaults to `transformers`.

`timm_wrapper` is transformers'' generic bridge for the whole timm
library — not a model architecture — so it is resolved at the **shared
resolution layer**, not as a per-model config. Only the library is
recorded; the task is derived from Optimum.

## Changes (no `models/hf/` entry)

- **`loader/task.py`** — `WRAPPED_LIBRARY_MODEL_TYPES` (`model_type ->
optimum_library`) + `resolve_optimum_library()`. When a config has no
`architectures`, `_detect_task_and_class_from_config` derives the task
from Optimum''s task list for the library
(`get_supported_tasks("timm_wrapper", "timm")` ->
`["image-classification"]`) and the class from
`get_model_class_for_task` (generic `AutoModelForImageClassification`,
which transformers dispatches to `TimmWrapper` at load). The task is not
hardcoded; the branch imports `optimum.exporters.onnx.model_configs`
first to populate Optimum''s registry (scoped so normal model loading
never pays for it).
- **`export/io.py`** — `_get_onnx_config` routes the library via
`resolve_optimum_library`, so `timm_wrapper` resolves Optimum''s
`TimmDefaultOnnxConfig` from every call site
(config/build/export/inspect) with no `--library` flag.
- **`commands/inspect.py`** + **`inspect/resolver.py`** — route both the
CLI inspect path and the public `inspect_model` path the same way:
library routing for the OnnxConfig lookup, plus wrapped-library task
detection so the task is not mislabeled.
- Tests: `resolve_optimum_library` + wrapped-library architectures
fallback with task derivation (loader); timm library routing for
`resolve_io_specs` / `_get_onnx_config` (export); public inspect path
`detect_task` / `resolve_exporter` for timm (inspect).

## Validation

**Functional (end-to-end)** on a timm image-classification model:

| Command | Before | After |
|---|---|---|
| `winml config` | exit 2 — *no ''architectures'' field* |
task=image-classification, 1 input |
| `winml export` | exit 2 — same | `model.onnx` (pixel_values to logits)
|
| `winml inspect` | exit 1 — same | `AutoModelForImageClassification` +
`TimmDefaultOnnxConfig`, full I/O table |

`config` -> `export` -> `optimize` -> `model.onnx` validated end-to-end
for multiple timm CNN classifiers. Also resolves on a timm ViT backbone
(`num_labels=0`) -> task=image-classification, matching Optimum''s own
`infer_task_from_model`, so it generalizes across timm architectures
(CNN + ViT).

**No impact on existing models** — scanned all 439 entries / 401 unique
models in `scripts/e2e_eval/testsets/models_all.json`: **0** are
`timm_wrapper` (by JSON metadata and by loaded config; 330 loadable).
Since `timm_wrapper` is the only trigger of the new branch, no existing
model changes behavior. (71 fail to load a config — custom/GGUF/tabular
types that fail at `AutoConfig` regardless; 7 have empty `architectures`
but are not timm — a pre-existing "Cannot detect task", identical before
and after the PR.)

**No overhead for normal (non-timm) models** — `winml config` on a
standard non-timm model: this branch vs base, min ~12.6s vs ~12.5s
(within run-to-run noise). Non-timm configs have `architectures`, so
they skip the new branch; the only added cost is one dict lookup.

**Unit tests** — `tests/unit/loader` + `tests/unit/export` +
`tests/unit/inspect`: green.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Yi Ren <reny@microsoft.com>
---
 src/winml/modelkit/commands/inspect.py        |  8 +-
 src/winml/modelkit/export/io.py               | 10 ++-
 src/winml/modelkit/inspect/resolver.py        | 20 ++++-
 src/winml/modelkit/loader/__init__.py         |  2 +
 src/winml/modelkit/loader/task.py             | 80 ++++++++++++++++++-
 .../unit/export/test_timm_library_routing.py  | 56 +++++++++++++
 tests/unit/inspect/test_resolver_timm.py      | 48 +++++++++++
 .../unit/loader/test_detect_task_and_class.py | 58 +++++++++++++-
 8 files changed, 273 insertions(+), 9 deletions(-)
 create mode 100644 tests/unit/export/test_timm_library_routing.py
 create mode 100644 tests/unit/inspect/test_resolver_timm.py

diff --git a/src/winml/modelkit/commands/inspect.py b/src/winml/modelkit/commands/inspect.py
index 1eec26467..f0e94f0bf 100644
--- a/src/winml/modelkit/commands/inspect.py
+++ b/src/winml/modelkit/commands/inspect.py
@@ -40,9 +40,7 @@
 _LOCAL_FILE_EXTS = frozenset({".onnx", ".pt", ".pth", ".safetensors", ".bin"})
 
 
-def _validate_task(
-    ctx: click.Context, param: click.Parameter, value: str | None
-) -> str | None:
+def _validate_task(ctx: click.Context, param: click.Parameter, value: str | None) -> str | None:
     """Click-time validation for --task against the hand-coded KNOWN_TASKS set.
 
     Imports only ..loader.task to keep validation cheap — going through optimum
@@ -426,11 +424,13 @@ def _inspect_model_v2(
             import optimum.exporters.onnx.model_configs  # noqa: F401
             from optimum.exporters.tasks import TasksManager
 
+            from ..loader import resolve_optimum_library
+
             onnx_config_cls = TasksManager.get_exporter_config_constructor(
                 exporter="onnx",
                 model_type=model_type,
                 task=task,
-                library_name="transformers",
+                library_name=resolve_optimum_library(model_type),
             )
             if onnx_config_cls:
                 config_name = (
diff --git a/src/winml/modelkit/export/io.py b/src/winml/modelkit/export/io.py
index 99d9ac58f..c4f382409 100644
--- a/src/winml/modelkit/export/io.py
+++ b/src/winml/modelkit/export/io.py
@@ -210,11 +210,19 @@ def _get_onnx_config(
 
     normalized_task = _map_task_synonym(task)
 
+    # Route model_types whose Optimum OnnxConfig is registered under another
+    # library (e.g. timm via "timm_wrapper" -> "timm") so the lookup succeeds
+    # from every call site without an explicit --library flag.
+    from ..loader import resolve_optimum_library
+
+    library_name = resolve_optimum_library(model_type, library_name)
+
     logger.debug(
-        "Getting OnnxConfig: model_type=%s, task=%s -> %s",
+        "Getting OnnxConfig: model_type=%s, task=%s -> %s (library=%s)",
         model_type,
         task,
         normalized_task,
+        library_name,
     )
 
     try:
diff --git a/src/winml/modelkit/inspect/resolver.py b/src/winml/modelkit/inspect/resolver.py
index 51a6a9ccd..85f64cd19 100644
--- a/src/winml/modelkit/inspect/resolver.py
+++ b/src/winml/modelkit/inspect/resolver.py
@@ -15,8 +15,11 @@
 from ..loader.task import (
     HF_TASK_DEFAULTS,
     KNOWN_TASKS,
+    WRAPPED_LIBRARY_MODEL_TYPES,
+    _detect_task_and_class_from_config,
     _detect_task_from_config,
     _get_custom_model_class,
+    resolve_optimum_library,
 )
 from ..models import (
     HF_MODEL_CLASS_MAPPING,
@@ -46,6 +49,11 @@
 
 logger = logging.getLogger(__name__)
 
+# Task-detection provenance label returned by detect_task() for wrapped-library
+# model types (e.g. timm via "timm_wrapper"). Surfaced in `inspect` output as
+# "Task <task> (via <source>)" and in the JSON `task_source` field.
+WRAPPED_LIBRARY_SOURCE = "wrapped-library"
+
 # Mapping from pipeline stage verbs to the filenames build_hf_model() produces.
 # "export" is omitted because its stage name equals its filename — the
 # .get(stage, stage) fallback handles it.  Used only in the legacy
@@ -121,6 +129,16 @@ def detect_task(config: PretrainedConfig) -> tuple[str, str]:
         if mt == model_type_normalized:
             return task, "HF_MODEL_CLASS_MAPPING"
 
+    # Wrapped-library model types (e.g. timm via "timm_wrapper") carry no
+    # `architectures`; reuse the loader's resolution to derive the real task
+    # instead of falling through to the HF_TASK_DEFAULTS mislabel below.
+    if model_type in WRAPPED_LIBRARY_MODEL_TYPES and not getattr(config, "architectures", None):
+        try:
+            task, _ = _detect_task_and_class_from_config(config)
+            return task, WRAPPED_LIBRARY_SOURCE
+        except Exception:
+            logger.debug("wrapped-library task detection failed for %s", model_type, exc_info=True)
+
     # Use TasksManager detection
     try:
         task = _detect_task_from_config(config)
@@ -343,7 +361,7 @@ def resolve_exporter(
             exporter="onnx",
             model_type=model_type,
             task=task,
-            library_name="transformers",
+            library_name=resolve_optimum_library(model_type),
         )
         if onnx_config_cls:
             # Handle functools.partial returned by TasksManager
diff --git a/src/winml/modelkit/loader/__init__.py b/src/winml/modelkit/loader/__init__.py
index 1a6b9b2b7..89a8a0f55 100644
--- a/src/winml/modelkit/loader/__init__.py
+++ b/src/winml/modelkit/loader/__init__.py
@@ -32,6 +32,7 @@
     get_supported_tasks,
     get_task_abbrev,
     normalize_task,
+    resolve_optimum_library,
     resolve_task_and_model_class,
 )
 
@@ -46,6 +47,7 @@
     "normalize_task",
     "resolve_hf_model_class",
     "resolve_loader_config",
+    "resolve_optimum_library",
     "resolve_task_and_model_class",
 ]
 
diff --git a/src/winml/modelkit/loader/task.py b/src/winml/modelkit/loader/task.py
index 9aea05fce..08eb141aa 100644
--- a/src/winml/modelkit/loader/task.py
+++ b/src/winml/modelkit/loader/task.py
@@ -8,6 +8,7 @@
 
 Public API:
     resolve_task_and_model_class  - Main orchestrator (3 resolution cases)
+    resolve_optimum_library      - Route a model_type to the Optimum export library
     normalize_task               - Map task aliases to canonical names
     get_task_abbrev              - Abbreviated task name for cache keys
     get_supported_tasks          - List ONNX-exportable tasks for a model type
@@ -154,6 +155,40 @@
     ("prajjwal1/bert-tiny", None): "feature-extraction",
 }
 
+# Some transformers model_types are generic wrappers that expose an entire other
+# library through a single type (e.g. timm via "timm_wrapper"). Such configs
+# carry no `architectures` field, and their Optimum ONNX export config is
+# registered under the wrapped library, not "transformers". This is a
+# library-routing concern handled at the common resolution layer (the loader
+# below and export.io._get_onnx_config), not a per-model OnnxConfig.
+#
+# Only the library is recorded here -- it is the irreducible Optimum-taxonomy
+# fact. The export task is derived from Optimum's task list for that library
+# (get_supported_tasks), not hardcoded.
+# model_type -> optimum_library
+WRAPPED_LIBRARY_MODEL_TYPES: dict[str, str] = {
+    "timm_wrapper": "timm",
+}
+
+
+def resolve_optimum_library(model_type: str | None, library_name: str = "transformers") -> str:
+    """Route a transformers model_type to the Optimum library that owns its export.
+
+    Most models export under the library they were requested with. A few
+    transformers model_types are thin wrappers whose Optimum OnnxConfig lives in
+    another library (see :data:`WRAPPED_LIBRARY_MODEL_TYPES`); route those so the
+    OnnxConfig lookup succeeds without an explicit ``--library`` flag.
+
+    Only the ``"transformers"`` library is rerouted, so an explicit
+    non-``"transformers"`` library is returned unchanged. (An explicit
+    ``--library transformers`` is indistinguishable from the default and is
+    still rerouted for wrapped types -- harmless, since those types have no
+    OnnxConfig registered under transformers anyway.)
+    """
+    if library_name == "transformers" and model_type in WRAPPED_LIBRARY_MODEL_TYPES:
+        return WRAPPED_LIBRARY_MODEL_TYPES[model_type]
+    return library_name
+
 
 # =============================================================================
 # Internal Helpers
@@ -314,8 +349,49 @@ def _detect_task_and_class_from_config(config: PretrainedConfig) -> tuple[str, t
         return resolve_task_and_model_class(config, task=override_task)
 
     # [1] Resolve architecture class from config.
-    # If config.architectures is missing/empty, this raises ValueError and the
-    # caller should provide task explicitly.
+    # Some model_types (e.g. timm via "timm_wrapper") are generic library
+    # wrappers that carry no `architectures` field. Resolve those through their
+    # wrapped library: the task comes from Optimum's task list for that library
+    # (not hardcoded), and the class from get_model_class_for_task (a generic
+    # Auto* class that transformers dispatches to the wrapper at load).
+    if not getattr(config, "architectures", None):
+        model_type = getattr(config, "model_type", None)
+        library = WRAPPED_LIBRARY_MODEL_TYPES.get(model_type) if model_type else None
+        if library is not None:
+            # Populate Optimum's exporter registry (incl. the wrapped library's
+            # task list) before querying it; scoped to this rare branch so normal
+            # model loading never pays for the import.
+            import optimum.exporters.onnx.model_configs  # noqa: F401
+
+            supported = get_supported_tasks(model_type, library_name=library)
+            if supported:
+                # A wrapped library exposes a single ONNX export task today
+                # (timm -> "image-classification"), so supported[0] is the right
+                # default. If one ever exposes multiple, supported[0] is an
+                # arbitrary pick -- warn (listing the tasks) but still proceed;
+                # pass --task to choose a different one.
+                task = supported[0]
+                if len(supported) > 1:
+                    logger.warning(
+                        "config has no 'architectures' and the %s library exposes "
+                        "multiple export tasks for %s %s; defaulting to %r "
+                        "(pass --task to choose another).",
+                        library,
+                        model_type,
+                        supported,
+                        task,
+                    )
+                model_class = TasksManager.get_model_class_for_task(task, framework="pt")
+                logger.info(
+                    "config has no 'architectures'; resolved %s via %s library (task=%s, class=%s)",
+                    model_type,
+                    library,
+                    task,
+                    model_class.__name__,
+                )
+                return task, model_class
+    # If config.architectures is still missing/empty, this raises ValueError and
+    # the caller should provide task explicitly.
     arch_model_class = _resolve_model_class_from_config(config)
     arch_name = arch_model_class.__name__
 
diff --git a/tests/unit/export/test_timm_library_routing.py b/tests/unit/export/test_timm_library_routing.py
new file mode 100644
index 000000000..993ec85d0
--- /dev/null
+++ b/tests/unit/export/test_timm_library_routing.py
@@ -0,0 +1,56 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+"""Tests for timm library routing during OnnxConfig resolution.
+
+timm checkpoints load through transformers' TimmWrapper (model_type=
+"timm_wrapper"), but Optimum registers their OnnxConfig (TimmDefaultOnnxConfig)
+only under library_name="timm". ``resolve_optimum_library`` reroutes the lookup
+so ``resolve_io_specs`` / ``_get_onnx_config`` resolve it under the default
+"transformers" library, with no --library flag. See loader/task.py and
+export/io.py.
+"""
+
+from __future__ import annotations
+
+import pytest
+
+# Trigger OnnxConfig registration with TasksManager
+import winml.modelkit.models  # noqa: F401
+from winml.modelkit.export import resolve_io_specs
+from winml.modelkit.export.io import _get_onnx_config  # internal: routing under test
+
+
+@pytest.fixture(scope="module")
+def timm_wrapper_config():
+    """Minimal offline TimmWrapperConfig (no hub download)."""
+    from transformers import TimmWrapperConfig
+
+    return TimmWrapperConfig(num_labels=10)
+
+
+class TestTimmLibraryRouting:
+    """timm_wrapper resolves to Optimum's TimmDefaultOnnxConfig via library routing."""
+
+    def test_get_onnx_config_routes_to_timm_default(self, timm_wrapper_config) -> None:
+        """A default (transformers) lookup reroutes to Optimum's TimmDefaultOnnxConfig."""
+        from optimum.exporters.onnx.model_configs import TimmDefaultOnnxConfig
+
+        onnx_config = _get_onnx_config("timm_wrapper", "image-classification", timm_wrapper_config)
+        assert isinstance(onnx_config, TimmDefaultOnnxConfig), (
+            "timm_wrapper did not route to Optimum's timm OnnxConfig; "
+            "resolve_optimum_library routing may be inactive."
+        )
+
+    def test_io_specs_pixel_values_to_logits(self, timm_wrapper_config) -> None:
+        """resolve_io_specs yields the timm image-classifier I/O without a --library flag."""
+        specs = resolve_io_specs("timm_wrapper", "image-classification", timm_wrapper_config)
+        assert specs["input_names"] == ["pixel_values"]
+        assert "logits" in specs["output_names"]
+
+    def test_pixel_values_is_4d_nchw(self, timm_wrapper_config) -> None:
+        specs = resolve_io_specs("timm_wrapper", "image-classification", timm_wrapper_config)
+        shape = specs["input_shapes"][0]
+        assert len(shape) == 4, f"pixel_values should be 4D NCHW, got {shape}"
+        assert shape[1] == 3, f"expected 3 channels, got {shape[1]}"
diff --git a/tests/unit/inspect/test_resolver_timm.py b/tests/unit/inspect/test_resolver_timm.py
new file mode 100644
index 000000000..bb9f006b2
--- /dev/null
+++ b/tests/unit/inspect/test_resolver_timm.py
@@ -0,0 +1,48 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+"""timm (wrapped-library) resolution in the public `inspect` path.
+
+`inspect_model` resolves task/exporter via `resolver.detect_task` +
+`resolver.resolve_exporter` — a separate path from the CLI's `_inspect_model_v2`.
+timm checkpoints load as `TimmWrapperConfig` (model_type="timm_wrapper",
+architectures=None). Without wrapped-library handling, `detect_task` mislabels
+the task (HF_TASK_DEFAULTS fallback) and `resolve_exporter` hardcodes
+library_name="transformers" so the OnnxConfig lookup fails (UNSUPPORTED).
+
+These cover the fix that routes both through the timm library, matching the CLI.
+"""
+
+from __future__ import annotations
+
+import pytest
+
+from winml.modelkit.inspect import SupportLevel, detect_task, resolve_exporter
+
+
+@pytest.fixture(scope="module")
+def timm_wrapper_config():
+    """Minimal offline TimmWrapperConfig (no hub download)."""
+    from transformers import TimmWrapperConfig
+
+    return TimmWrapperConfig(num_labels=10)
+
+
+class TestDetectTaskTimm:
+    def test_timm_detects_image_classification(self, timm_wrapper_config) -> None:
+        """timm_wrapper (no architectures) resolves to image-classification, not a fallback."""
+        task, source = detect_task(timm_wrapper_config)
+        assert task == "image-classification", f"got task={task!r} source={source!r}"
+
+
+class TestResolveExporterTimm:
+    def test_timm_resolves_optimum_onnx_config(self, timm_wrapper_config) -> None:
+        """resolve_exporter routes timm_wrapper to Optimum's timm OnnxConfig + real I/O."""
+        info = resolve_exporter(
+            "timm_wrapper", "image-classification", hf_config=timm_wrapper_config
+        )
+        assert info.onnx_config_class == "TimmDefaultOnnxConfig", info.onnx_config_class
+        assert info.support_level is not SupportLevel.UNSUPPORTED
+        names = [t.name for t in info.input_tensors]
+        assert "pixel_values" in names, names
diff --git a/tests/unit/loader/test_detect_task_and_class.py b/tests/unit/loader/test_detect_task_and_class.py
index b2c1381ba..ec181a9ca 100644
--- a/tests/unit/loader/test_detect_task_and_class.py
+++ b/tests/unit/loader/test_detect_task_and_class.py
@@ -16,7 +16,11 @@
 
 import pytest
 
-from winml.modelkit.loader.task import _detect_task_and_class_from_config
+from winml.modelkit.loader.task import (
+    WRAPPED_LIBRARY_MODEL_TYPES,
+    _detect_task_and_class_from_config,
+    resolve_optimum_library,
+)
 
 
 class TestDetectTaskAndClassFromConfig:
@@ -148,3 +152,55 @@ def test_no_override_for_unrelated_model(self):
         assert task == "image-classification"
         # TasksManager returns AutoModelForImageClassification, not the arch class
         assert resolved_class is not ResNetForImageClassification or task == "image-classification"
+
+
+class TestResolveOptimumLibrary:
+    """Unit tests for the resolve_optimum_library wrapped-library router."""
+
+    def test_timm_wrapper_routes_to_timm(self):
+        """timm_wrapper under the default library routes to Optimum's 'timm'."""
+        assert resolve_optimum_library("timm_wrapper", "transformers") == "timm"
+
+    def test_unmapped_model_type_unchanged(self):
+        """A normal transformers model_type is not rerouted."""
+        assert resolve_optimum_library("bert", "transformers") == "transformers"
+
+    def test_none_model_type_unchanged(self):
+        assert resolve_optimum_library(None, "transformers") == "transformers"
+
+    def test_explicit_library_is_respected(self):
+        """An explicit (non-default) library always wins over the wrapper routing."""
+        assert resolve_optimum_library("timm_wrapper", "timm") == "timm"
+        assert resolve_optimum_library("timm_wrapper", "diffusers") == "diffusers"
+
+
+class TestWrappedLibraryArchitecturesFallback:
+    """Auto-detection for wrapper model_types that carry no `architectures`.
+
+    timm checkpoints load through transformers' TimmWrapper as TimmWrapperConfig
+    (architectures=None); the loader resolves them via WRAPPED_LIBRARY_MODEL_TYPES
+    instead of raising.
+    """
+
+    def test_timm_wrapper_resolves_without_architectures(self):
+        config = MagicMock()
+        config.architectures = None
+        config.model_type = "timm_wrapper"
+        config._name_or_path = ""
+
+        task, resolved_class = _detect_task_and_class_from_config(config)
+
+        # Task is derived from Optimum's task list for the timm library, not hardcoded.
+        assert WRAPPED_LIBRARY_MODEL_TYPES["timm_wrapper"] == "timm"
+        assert task == "image-classification"
+        # A generic Auto* class is used; it dispatches to TimmWrapper at load time.
+        assert resolved_class.__name__ == "AutoModelForImageClassification"
+
+    def test_missing_architectures_without_wrapper_still_raises(self):
+        config = MagicMock()
+        config.architectures = None
+        config.model_type = "totally-unknown-model-xyz"
+        config._name_or_path = ""
+
+        with pytest.raises(ValueError, match="no 'architectures' field"):
+            _detect_task_and_class_from_config(config)

From 679a11b77346a88f118f7ebdb5e72e9dd8040da3 Mon Sep 17 00:00:00 2001
From: Zhenchao Ni <zhenni@microsoft.com>
Date: Tue, 2 Jun 2026 13:50:56 +0800
Subject: [PATCH 024/143] Fix model-task inconsistency for vision
 feature-extraction models (#786)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Fix model-task inconsistency for vision feature-extraction models

Fixes #777, #778, #782.

### Principle

`winml inspect` is the source of truth for valid `(model_id, task)`
pairs. Both `feature-extraction` and `image-feature-extraction` are
valid ways to address an image-embedding model like
`facebook/dinov2-base`. Downstream commands must accept whichever name
`winml inspect` accepts, then use `(model_id, task)` to locate the
concrete class to act on.

### Root cause

Optimum's `TasksManager.get_exporter_config_constructor` only knows
canonical Optimum task names. Several call sites passed the raw
user-supplied task straight through, so HF aliases like
`image-feature-extraction` were rejected with "Unsupported". The
evaluator additionally needs to know which HF pipeline name to dispatch
on, which the canonical Optimum task name doesn't carry by itself for
bimodal tasks like `feature-extraction`.

### Fix

- **Inspect / export / HTP exporter**: normalize via
`_map_task_synonym(task)` (in `export/io.py`) before any `TasksManager`
lookup because it requires normalized task input. This is a single
function reused at each `TasksManager` boundary — no new global table.
- **Quantize**: `_resolve_dataset_class(task, io_config)` in
`datasets/__init__.py` dispatches to `TextDataset` / `ImageDataset`
based on the actual ONNX input names. No `AutoConfig.from_pretrained`
round-trip. Bimodal io_configs fall back to `RandomDataset` with a
warning.
- **Evaluate**: Because HF pipeline and evaluate library have their task
name convention, `to_hf_pipeline_task(task, model_id)` in
`eval/evaluate.py` translates to the HF pipeline name the underlying
`evaluate` library expects. Uses `OnnxConfig.inputs` (no weights loaded)
to pick the modality. Bimodal models (e.g. CLIP combined: both
`pixel_values` and `input_ids`) keep the task unchanged via a `len(hits)
== 1` guard, preserving the explicit user task.

### Validation

`facebook/dinov2-base`:

| Command | Before | After |
|---|---|---|
| `winml inspect -m facebook/dinov2-base --task
image-feature-extraction` | "Unsupported" | Resolves via
`Dinov2OnnxConfig` |
| `winml export -m facebook/dinov2-base -t image-feature-extraction` |
KeyError on TasksManager | Valid ONNX with `last_hidden_state` |
| `winml eval -m facebook/dinov2-base --task feature-extraction` |
`RuntimeError: Failed to create feature-extraction dataset` | kNN
metrics on mini-imagenet |
| `winml quantize <onnx> --task feature-extraction -m
facebook/dinov2-small` | Failure by using TextDataset | Routes to
`ImageDataset` |

`openai/clip-vit-base-patch32` (bimodal, regression check):

- `winml eval -m openai/clip-vit-base-patch32 --task feature-extraction`
→ stays `feature-extraction` (text STS evaluator); not silently rerouted
to image.
- `winml eval -m openai/clip-vit-base-patch32` (auto-detect) → resolves
to `feature-extraction` (text).

### Tests

Unit:
- `tests/unit/eval/test_eval.py::TestResolveTask` — auto-detect,
explicit task, bimodal guard, HF pipeline translation.
- test_random_dataset.py — `TASK_DATASET_MAPPING` covers all registered
tasks, including bimodal dict-of-dict.

E2E (`-m e2e`, dinov2 chosen because it isn't in `MODEL_BUILD_CONFIGS`
and so actually exercises the `TasksManager` path):
- `tests/e2e/test_inspect_e2e.py::TestInspectDinoV2` — both
`image-feature-extraction` and `feature-extraction` resolve.
-
`tests/e2e/test_export_e2e.py::TestExportDinoV2::test_image_feature_extraction`.
-
`tests/e2e/test_eval_e2e.py::TestEvalPerTask::test_image_feature_extraction`
parameterized over both task names.
-
`tests/e2e/test_quantize_e2e.py::test_feature_extraction_with_pixel_values_uses_image_dataset`.
---
 src/winml/modelkit/commands/build.py          |  2 +-
 src/winml/modelkit/commands/inspect.py        |  4 +-
 src/winml/modelkit/datasets/__init__.py       | 21 ++++-
 src/winml/modelkit/eval/evaluate.py           | 46 +++++++++-
 src/winml/modelkit/export/htp/exporter.py     |  5 +-
 src/winml/modelkit/export/io.py               |  4 +-
 src/winml/modelkit/inspect/resolver.py        |  5 +-
 tests/e2e/test_eval_e2e.py                    | 14 ++-
 tests/e2e/test_export_e2e.py                  | 26 ++++++
 tests/e2e/test_inspect_e2e.py                 | 28 ++++++
 tests/e2e/test_quantize_e2e.py                | 31 +++++++
 tests/unit/datasets/test_random_dataset.py    | 13 ++-
 tests/unit/eval/test_eval.py                  | 26 ++++++
 ...test_htp_exporter_patcher_task_synonyms.py | 85 +++++++++++++++++++
 tests/unit/export/test_io_specs.py            |  4 +-
 .../test_resolve_exporter_task_synonyms.py    | 52 ++++++++++++
 16 files changed, 347 insertions(+), 19 deletions(-)
 create mode 100644 tests/unit/export/test_htp_exporter_patcher_task_synonyms.py
 create mode 100644 tests/unit/inspect/test_resolve_exporter_task_synonyms.py

diff --git a/src/winml/modelkit/commands/build.py b/src/winml/modelkit/commands/build.py
index e55f15bfc..486d7e678 100644
--- a/src/winml/modelkit/commands/build.py
+++ b/src/winml/modelkit/commands/build.py
@@ -312,7 +312,7 @@ def _validate_task_supported_for_model(
     # [2] HF-pipeline-only task names that Optimum's TasksManager does not
     #     know but the rest of the CLI accepts (e.g. ``next-sentence-prediction``
     #     handled via HF_TASK_DEFAULTS, ``mask-generation`` preserved for SAM2).
-    #     These are routed downstream by export/io.py::_map_task_synonym, so
+    #     These are routed downstream by export/io.py::map_task_synonym, so
     #     rejecting here would break invocations that ``winml config`` and
     #     ``winml export`` accept.
     if task in TASK_SYNONYM_EXTENSIONS:
diff --git a/src/winml/modelkit/commands/inspect.py b/src/winml/modelkit/commands/inspect.py
index f0e94f0bf..1d3d735e9 100644
--- a/src/winml/modelkit/commands/inspect.py
+++ b/src/winml/modelkit/commands/inspect.py
@@ -424,12 +424,14 @@ def _inspect_model_v2(
             import optimum.exporters.onnx.model_configs  # noqa: F401
             from optimum.exporters.tasks import TasksManager
 
+            # TasksManager expects normalized task names
+            from ..export.io import map_task_synonym
             from ..loader import resolve_optimum_library
 
             onnx_config_cls = TasksManager.get_exporter_config_constructor(
                 exporter="onnx",
                 model_type=model_type,
-                task=task,
+                task=map_task_synonym(task),
                 library_name=resolve_optimum_library(model_type),
             )
             if onnx_config_cls:
diff --git a/src/winml/modelkit/datasets/__init__.py b/src/winml/modelkit/datasets/__init__.py
index 79a69dd95..e7caa4ab6 100644
--- a/src/winml/modelkit/datasets/__init__.py
+++ b/src/winml/modelkit/datasets/__init__.py
@@ -40,7 +40,7 @@
     "object-detection": ObjectDetectionDataset,
     "text-classification": TextDataset,
     "text-feature-extraction": TextDataset,
-    "feature-extraction": TextDataset,
+    "feature-extraction": {"input_ids": TextDataset, "pixel_values": ImageDataset},
     "sentence-similarity": TextDataset,
     "next-sentence-prediction": TextDataset,
     "fill-mask": TextDataset,
@@ -51,6 +51,23 @@
 }
 
 
+def _resolve_dataset_class(task: str, io_config: dict | None) -> tuple[type, str]:
+    """Resolve the dataset class for ``task``."""
+    dataset_class = TASK_DATASET_MAPPING[task]
+    if not isinstance(dataset_class, dict):
+        return dataset_class, task
+
+    hits = [name for name in (io_config or {}) if name in dataset_class]
+    if len(hits) == 1:
+        return dataset_class[hits[0]], task
+
+    logger.warning(
+        "Task '%s' is not supported for the model, falling back to RandomDataset",
+        task,
+    )
+    return RandomDataset, "random"
+
+
 def universal_calib_dataset(
     model_name: str,
     task: str,
@@ -98,7 +115,7 @@ def universal_calib_dataset(
 
     # Create dataset with error handling
     try:
-        dataset_class = TASK_DATASET_MAPPING[task]
+        dataset_class, task = _resolve_dataset_class(task, kwargs.get("io_config"))
 
         # Craft kwargs - only add optional parameters if provided
         dataset_kwargs = {
diff --git a/src/winml/modelkit/eval/evaluate.py b/src/winml/modelkit/eval/evaluate.py
index 3f9053b0c..0d6a728c7 100644
--- a/src/winml/modelkit/eval/evaluate.py
+++ b/src/winml/modelkit/eval/evaluate.py
@@ -207,24 +207,62 @@ def _load_model(config: WinMLEvaluationConfig) -> WinMLPreTrainedModel:
     )
 
 
+# Evaluator uses HF pipeline and evaluate library,
+# which have their own task naming conventions.
+# Inner dict maps an ONNX input name (from the model's IO config) to the
+# corresponding HF task name, so we can resolve ambiguous tasks by modality.
+HF_TASK_NAME_MAPPING: dict[str, dict[str, str]] = {
+    "feature-extraction": {
+        "input_ids": "feature-extraction",
+        "pixel_values": "image-feature-extraction",
+    },
+}
+
+
+def to_hf_pipeline_task(task: str, model_id: str | None) -> str:
+    """Convert task name to an HF-pipeline-recognizable name."""
+    mapping = HF_TASK_NAME_MAPPING.get(task)
+    if mapping is None or model_id is None:
+        return task
+
+    try:
+        from transformers import AutoConfig
+
+        from ..export.io import _get_onnx_config
+
+        hf_config = AutoConfig.from_pretrained(model_id)
+        io_config = _get_onnx_config(hf_config.model_type, task, hf_config).inputs
+    except Exception as e:
+        logger.debug("Static OnnxConfig probe failed for task %r: %s", task, e)
+        return task
+
+    hits = [mapping[n] for n in mapping if n in io_config]
+    if len(hits) != 1:
+        return task
+    return hits[0]
+
+
 def _resolve_task(config: WinMLEvaluationConfig) -> str:
     """Resolve task from config or model's HF config, and validate it is supported."""
+    console = Console()
+    console.print("[bold]Resolving task...[/bold]")
+
     if config.task is not None:
         task = config.task
     else:
         if config.model_id is None:
             raise ValueError("Cannot infer task without model_id. Provide --task.")
 
-        console = Console()
-        console.print("[bold]Detecting model task...[/bold]")
-
         from transformers import AutoConfig
 
         from ..loader.task import _detect_task_from_config
 
         hf_config = AutoConfig.from_pretrained(config.model_id)
         task = _detect_task_from_config(hf_config)
-        console.print(f"[dim]Detected task:[/dim] {task}")
+
+    # Convert to an HF-pipeline-recognizable task before evaluator lookup.
+    task = to_hf_pipeline_task(task, config.model_id)
+    console.print(f"[dim]Use[/dim] {task} [dim]to evaluate[/dim]")
 
     if task not in _EVALUATOR_REGISTRY:
         supported = ", ".join(sorted(_EVALUATOR_REGISTRY))
diff --git a/src/winml/modelkit/export/htp/exporter.py b/src/winml/modelkit/export/htp/exporter.py
index aa7c1fdeb..ce8109572 100644
--- a/src/winml/modelkit/export/htp/exporter.py
+++ b/src/winml/modelkit/export/htp/exporter.py
@@ -470,11 +470,14 @@ def _get_optimum_patcher(model: nn.Module, task: str | None) -> Any:
             logger.debug("Model has no config.model_type; skipping Optimum patcher.")
             return contextlib.nullcontext()
 
+        # TasksManager expects normalized task names
+        from ..io import map_task_synonym
+
         try:
             cfg_cls = TasksManager.get_exporter_config_constructor(
                 "onnx",
                 model_type=model_type,
-                task=task,
+                task=map_task_synonym(task),
                 library_name="transformers",
             )
             return cfg_cls(model_config).patch_model_for_export(model)
diff --git a/src/winml/modelkit/export/io.py b/src/winml/modelkit/export/io.py
index c4f382409..dcb374452 100644
--- a/src/winml/modelkit/export/io.py
+++ b/src/winml/modelkit/export/io.py
@@ -95,7 +95,7 @@ def ensure_hf_models_registered() -> None:
 }
 
 
-def _map_task_synonym(task: str) -> str:
+def map_task_synonym(task: str) -> str:
     """Map task name to canonical form, extending Optimum's synonym mapping.
 
     Our extensions take priority over Optimum's built-in synonym map.
@@ -208,7 +208,7 @@ def _get_onnx_config(
     """
     ensure_hf_models_registered()
 
-    normalized_task = _map_task_synonym(task)
+    normalized_task = map_task_synonym(task)
 
     # Route model_types whose Optimum OnnxConfig is registered under another
     # library (e.g. timm via "timm_wrapper" -> "timm") so the lookup succeeds
diff --git a/src/winml/modelkit/inspect/resolver.py b/src/winml/modelkit/inspect/resolver.py
index 85f64cd19..e42afdc5f 100644
--- a/src/winml/modelkit/inspect/resolver.py
+++ b/src/winml/modelkit/inspect/resolver.py
@@ -355,12 +355,15 @@ def resolve_exporter(
         import optimum.exporters.onnx.model_configs  # noqa: F401
         from optimum.exporters.tasks import TasksManager
 
+        # TasksManager expects normalized task names
+        from ..export.io import map_task_synonym
+
         # TasksManager uses underscores (sam2_video), not hyphens (sam2-video)
         # Use original model_type for TasksManager lookup
         onnx_config_cls = TasksManager.get_exporter_config_constructor(
             exporter="onnx",
             model_type=model_type,
-            task=task,
+            task=map_task_synonym(task),
             library_name=resolve_optimum_library(model_type),
         )
         if onnx_config_cls:
diff --git a/tests/e2e/test_eval_e2e.py b/tests/e2e/test_eval_e2e.py
index 2e0926aa6..3c66612c7 100644
--- a/tests/e2e/test_eval_e2e.py
+++ b/tests/e2e/test_eval_e2e.py
@@ -292,15 +292,25 @@ def test_sentence_similarity(self, runner: CliRunner, tmp_path: Path) -> None:
         if is_host("qnn"):
             _assert_in_range(data["metrics"], "cosine_spearman", 40.0, 100.0)
 
+    @pytest.mark.parametrize(
+        "task",
+        ["image-feature-extraction", "feature-extraction"],
+    )
     def test_image_feature_extraction(
-        self, runner: CliRunner, tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path, task: str,
     ) -> None:
         # kNN accuracies reported as percentages 0..100.
         # --streaming avoids caching mini-imagenet.
+        # Parameterized over both task names accepted on the CLI:
+        #   - "image-feature-extraction" is the HF pipeline task name
+        #     and dispatches to the image evaluator directly.
+        #   - "feature-extraction" is bimodal; for a vision model it is
+        #     mapped internally to the HF pipeline name so the image
+        #     dataset and evaluator are selected.
         out = tmp_path / "result.json"
         _invoke(runner, [
             "-m", "facebook/dinov2-small",
-            "--task", "image-feature-extraction",
+            "--task", task,
             "--streaming",
             "--samples", SAMPLES,
             "-o", str(out),
diff --git a/tests/e2e/test_export_e2e.py b/tests/e2e/test_export_e2e.py
index 2a713fb55..42cec6b40 100644
--- a/tests/e2e/test_export_e2e.py
+++ b/tests/e2e/test_export_e2e.py
@@ -219,6 +219,32 @@ def test_minimal_resnet50(self, tmp_path: Path):
         _assert_all_nodes_have(model, "winml.hierarchy.depth")
 
 
+class TestExportDinoV2:
+
+    MODEL = "facebook/dinov2-base"
+
+    def test_image_feature_extraction(self, tmp_path: Path):
+        """``-t image-feature-extraction`` must produce a valid ONNX export."""
+        onnx_path = tmp_path / "model.onnx"
+        result = _invoke(["-m", self.MODEL, "-o", str(onnx_path),
+                          "-t", "image-feature-extraction"])
+        assert result.exit_code == 0, (
+            f"export failed (exit {result.exit_code}):\n{result.output}"
+        )
+        assert onnx_path.exists(), f"ONNX model not found at {onnx_path}"
+
+        model = onnx.load(str(onnx_path))
+        # Optimum-driven OnnxConfig for dinov2/feature-extraction produces
+        # last_hidden_state. If the patcher had fallen back to nullcontext,
+        # the trace-inferred output names (last_hidden_state, pooler_output)
+        # would have been used instead.
+        assert _output_names(model) == ["last_hidden_state"], (
+            f"expected outputs ['last_hidden_state'], got {_output_names(model)} "
+            "— Optimum patcher likely fell back to nullcontext because the "
+            "task wasn't normalised before TasksManager lookup."
+        )
+
+
 # ===========================================================================
 # Required-option failures
 # ===========================================================================
diff --git a/tests/e2e/test_inspect_e2e.py b/tests/e2e/test_inspect_e2e.py
index 1531440a3..b91570cb4 100644
--- a/tests/e2e/test_inspect_e2e.py
+++ b/tests/e2e/test_inspect_e2e.py
@@ -286,3 +286,31 @@ def test_auto_detect_object_detection(self):
         assert data["model_id"] == self.MODEL
         assert data["model_type"] == "detr"
         assert data["task"] == "object-detection"
+
+
+@pytest.mark.network
+class TestInspectDinoV2:
+
+    MODEL = "facebook/dinov2-base"
+
+    def test_image_feature_extraction_override(self):
+        """HF synonym 'image-feature-extraction' must resolve via TasksManager."""
+        data = _run_network(self.MODEL, task="image-feature-extraction")
+        _assert_common_structure(data, self.MODEL, "image-feature-extraction")
+        assert data["model_type"] == "dinov2"
+        exporter = data["exporter"]
+        assert exporter["onnx_config_class"] == "Dinov2OnnxConfig", (
+            f"expected Dinov2OnnxConfig, got {exporter['onnx_config_class']!r} "
+            "— task likely wasn't normalised before TasksManager lookup."
+        )
+        assert exporter["onnx_config_source"] == "TasksManager"
+        assert exporter["support_level"] != "unsupported"
+
+    def test_feature_extraction_override(self):
+        """'feature-extraction' (the Optimum task) must also resolve (control)."""
+        data = _run_network(self.MODEL, task="feature-extraction")
+        _assert_common_structure(data, self.MODEL, "feature-extraction")
+        assert data["model_type"] == "dinov2"
+        exporter = data["exporter"]
+        assert exporter["onnx_config_class"] == "Dinov2OnnxConfig"
+        assert exporter["onnx_config_source"] == "TasksManager"
diff --git a/tests/e2e/test_quantize_e2e.py b/tests/e2e/test_quantize_e2e.py
index 745ca043b..842d4abad 100644
--- a/tests/e2e/test_quantize_e2e.py
+++ b/tests/e2e/test_quantize_e2e.py
@@ -165,6 +165,13 @@ def onnx_imgseg() -> Path:
     )
 
 
+@pytest.fixture(scope="session")
+def onnx_dinov2() -> Path:
+    return _export_hf_to_onnx(
+        "facebook/dinov2-small", "image-feature-extraction", "dinov2_small",
+    )
+
+
 # ---------------------------------------------------------------------------
 # Standard assertions
 # ---------------------------------------------------------------------------
@@ -564,6 +571,30 @@ def test_unsupported_task_falls_back_to_random_dataset(
             f"fallback warning not emitted in CLI output:\n{r.output}"
         )
 
+    @pytest.mark.network
+    def test_feature_extraction_with_pixel_values_uses_image_dataset(
+        self, runner: CliRunner, onnx_dinov2: Path, tmp_path: Path
+    ):
+        # For a vision model the bimodal task 'feature-extraction' must
+        # dispatch via the model's ONNX inputs (pixel_values) to
+        # ImageDataset for calibration. The task label in the log stays
+        # 'feature-extraction' (the resolver only swaps the dataset class).
+        out = tmp_path / "d7.onnx"
+
+        r = _invoke(
+            runner,
+            [
+                "-m", str(onnx_dinov2), "-o", str(out),
+                "--task", "feature-extraction",
+                "--model-name", "facebook/dinov2-small",
+                "--samples", "4", "-v",
+            ],
+        )
+        _assert_quantized_output(input_onnx=onnx_dinov2, output_onnx=out, stdout=r.output)
+        assert (
+            "Creating feature-extraction dataset with ImageDataset" in r.output
+        ), r.output
+
 
 # ===========================================================================
 # Output behavior
diff --git a/tests/unit/datasets/test_random_dataset.py b/tests/unit/datasets/test_random_dataset.py
index 752fadc82..f46f4f6c4 100644
--- a/tests/unit/datasets/test_random_dataset.py
+++ b/tests/unit/datasets/test_random_dataset.py
@@ -154,11 +154,18 @@ class TestTaskDatasetMapping:
     """Verify all supported tasks map to correct dataset classes."""
 
     def test_all_tasks_have_mappings(self) -> None:
-        """Every task in TASK_DATASET_MAPPING maps to a callable dataset class."""
+        """Every task maps to either a dataset class or an input-name dispatch dict."""
         from winml.modelkit.datasets import TASK_DATASET_MAPPING
 
-        for task, cls in TASK_DATASET_MAPPING.items():
-            assert callable(cls), f"Task {task!r} maps to non-callable {cls}"
+        for task, entry in TASK_DATASET_MAPPING.items():
+            if isinstance(entry, dict):
+                assert entry, f"Task {task!r} maps to empty dict"
+                for input_name, cls in entry.items():
+                    assert callable(cls), (
+                        f"Task {task!r}[{input_name!r}] maps to non-callable {cls}"
+                    )
+            else:
+                assert callable(entry), f"Task {task!r} maps to non-callable {entry}"
 
     @pytest.mark.parametrize(
         ("task", "module_path", "class_name"),
diff --git a/tests/unit/eval/test_eval.py b/tests/unit/eval/test_eval.py
index 242322794..c7ca4af6d 100644
--- a/tests/unit/eval/test_eval.py
+++ b/tests/unit/eval/test_eval.py
@@ -80,6 +80,32 @@ def test_infer_from_model_id(self):
         ):
             assert _resolve_task(config) == "image-classification"
 
+    def test_feature_extraction_mapped_to_hf_image_feature_extraction_for_vision_model(self):
+        """Vision FE model with --task feature-extraction is mapped to the HF
+        pipeline task image-feature-extraction so the evaluator registry
+        lookup succeeds."""
+        from winml.modelkit.eval.evaluate import _resolve_task
+
+        fake_hf_config = MagicMock()
+        fake_hf_config.model_type = "dinov2"
+        fake_onnx_config = MagicMock()
+        fake_onnx_config.inputs = {"pixel_values": object()}
+
+        config = WinMLEvaluationConfig(
+            model_id="facebook/dinov2-base", task="feature-extraction"
+        )
+        with (
+            patch(
+                "transformers.AutoConfig.from_pretrained",
+                return_value=fake_hf_config,
+            ),
+            patch(
+                "winml.modelkit.export.io._get_onnx_config",
+                return_value=fake_onnx_config,
+            ),
+        ):
+            assert _resolve_task(config) == "image-feature-extraction"
+
 
 class TestGetEvaluatorClass:
     """Tests for get_evaluator_class registry lookup."""
diff --git a/tests/unit/export/test_htp_exporter_patcher_task_synonyms.py b/tests/unit/export/test_htp_exporter_patcher_task_synonyms.py
new file mode 100644
index 000000000..83279449b
--- /dev/null
+++ b/tests/unit/export/test_htp_exporter_patcher_task_synonyms.py
@@ -0,0 +1,85 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+"""Regression tests for `HTPExporter._get_optimum_patcher` task-synonym handling.
+
+Optimum's ``TasksManager.get_exporter_config_constructor`` only accepts
+canonical task names. When ``_get_optimum_patcher`` is invoked with a
+HuggingFace pipeline alias (e.g. ``image-feature-extraction``), the lookup
+raises and the patcher silently falls back to ``contextlib.nullcontext()``,
+producing ONNX exports without the Transformers >= 4.53 tracing patches.
+
+This test pins the contract that ``_get_optimum_patcher`` normalises the
+task argument via ``map_task_synonym`` before the TasksManager lookup.
+
+Regression for https://github.com/microsoft/winml-cli/issues/777.
+"""
+
+from __future__ import annotations
+
+from unittest.mock import patch
+
+import torch.nn as nn
+
+from winml.modelkit.export.htp import HTPExporter
+
+
+class _FakeConfig:
+    """Minimal HF-style config exposing the model_type the patcher checks."""
+
+    model_type = "dinov2"
+
+
+class _FakeModel(nn.Module):
+    def __init__(self) -> None:
+        super().__init__()
+        self.config = _FakeConfig()
+
+
+class TestGetOptimumPatcherTaskSynonyms:
+    """_get_optimum_patcher must normalise HF-alias tasks before TasksManager lookup."""
+
+    def test_hf_alias_image_feature_extraction_is_normalized(self) -> None:
+        """Calling the patcher with 'image-feature-extraction' must pass the canonical
+        'feature-extraction' to ``TasksManager.get_exporter_config_constructor``.
+
+        We patch the TasksManager call and capture the ``task`` kwarg. The
+        spy raises ``KeyError`` to short-circuit the rest of the patcher
+        (no need to construct a real OnnxConfig); the patcher returns
+        ``nullcontext()`` on KeyError, which is fine — we assert on the
+        captured task argument.
+        """
+        captured: dict[str, object] = {}
+
+        def spy(*args: object, **kwargs: object) -> None:
+            captured["task"] = kwargs.get("task")
+            raise KeyError("test sentinel — short-circuit after capture")
+
+        with patch(
+            "optimum.exporters.tasks.TasksManager.get_exporter_config_constructor",
+            side_effect=spy,
+        ):
+            HTPExporter._get_optimum_patcher(_FakeModel(), task="image-feature-extraction")
+
+        assert captured.get("task") == "feature-extraction", (
+            f"Expected normalised task 'feature-extraction' to reach "
+            f"TasksManager.get_exporter_config_constructor, got {captured.get('task')!r}. "
+            "_get_optimum_patcher must call map_task_synonym on the task argument."
+        )
+
+    def test_canonical_task_passes_through_unchanged(self) -> None:
+        """Control: canonical 'feature-extraction' passes through unchanged."""
+        captured: dict[str, object] = {}
+
+        def spy(*args: object, **kwargs: object) -> None:
+            captured["task"] = kwargs.get("task")
+            raise KeyError("test sentinel")
+
+        with patch(
+            "optimum.exporters.tasks.TasksManager.get_exporter_config_constructor",
+            side_effect=spy,
+        ):
+            HTPExporter._get_optimum_patcher(_FakeModel(), task="feature-extraction")
+
+        assert captured.get("task") == "feature-extraction"
diff --git a/tests/unit/export/test_io_specs.py b/tests/unit/export/test_io_specs.py
index 8abf38638..06a6b8af2 100644
--- a/tests/unit/export/test_io_specs.py
+++ b/tests/unit/export/test_io_specs.py
@@ -25,7 +25,7 @@
 from winml.modelkit.export import resolve_io_specs
 from winml.modelkit.export.io import (  # Testing internal implementation
     _get_onnx_config,
-    _map_task_synonym,
+    map_task_synonym,
 )
 
 
@@ -299,7 +299,7 @@ class TestMapTaskSynonymExport:
     )
     def test_map_task_synonym(self, task: str, expected: str) -> None:
         """map_task_synonym returns the expected canonical task name."""
-        assert _map_task_synonym(task) == expected
+        assert map_task_synonym(task) == expected
 
 
 # =============================================================================
diff --git a/tests/unit/inspect/test_resolve_exporter_task_synonyms.py b/tests/unit/inspect/test_resolve_exporter_task_synonyms.py
new file mode 100644
index 000000000..8d9c1c44d
--- /dev/null
+++ b/tests/unit/inspect/test_resolve_exporter_task_synonyms.py
@@ -0,0 +1,52 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+"""Regression tests for `inspect.resolver.resolve_exporter` task-synonym handling.
+
+Optimum's ``TasksManager.get_exporter_config_constructor`` only accepts
+canonical task names (e.g. ``feature-extraction``); HuggingFace pipeline
+aliases (e.g. ``image-feature-extraction``, ``sentence-similarity``) raise
+``KeyError``/``ValueError`` and cause the resolver to silently fall through
+to ``SupportLevel.UNSUPPORTED``.
+
+These tests pin the contract that ``resolve_exporter`` normalises HF
+aliases via ``map_task_synonym`` before the TasksManager lookup.
+
+Regression for https://github.com/microsoft/winml-cli/issues/782.
+"""
+
+from __future__ import annotations
+
+from winml.modelkit.inspect.resolver import resolve_exporter
+from winml.modelkit.inspect.types import SupportLevel
+
+
+class TestResolveExporterTaskSynonyms:
+    """resolve_exporter must accept HF-alias tasks and normalise them."""
+
+    def test_hf_alias_image_feature_extraction_resolves_for_dinov2(self) -> None:
+        """HF alias 'image-feature-extraction' must resolve to a TasksManager config.
+
+        Without normalisation, TasksManager raises and the resolver returns
+        ``SupportLevel.UNSUPPORTED`` with ``onnx_config_class=None``.
+        """
+        info = resolve_exporter("dinov2", "image-feature-extraction", hf_config=None)
+
+        assert info.support_level != SupportLevel.UNSUPPORTED, (
+            "dinov2/image-feature-extraction must resolve via TasksManager; "
+            "if this is UNSUPPORTED, the HF-alias task likely wasn't "
+            "normalised before TasksManager.get_exporter_config_constructor."
+        )
+        assert info.onnx_config_source == "TasksManager", (
+            f"Expected onnx_config_source='TasksManager', got {info.onnx_config_source!r}."
+        )
+        assert info.onnx_config_class is not None
+
+    def test_canonical_feature_extraction_resolves_for_dinov2(self) -> None:
+        """Control: canonical 'feature-extraction' resolves (no normalisation needed)."""
+        info = resolve_exporter("dinov2", "feature-extraction", hf_config=None)
+
+        assert info.support_level != SupportLevel.UNSUPPORTED
+        assert info.onnx_config_source == "TasksManager"
+        assert info.onnx_config_class is not None

From 2d967c1ecc28c79ffc6a1366e15db14abc5c84f4 Mon Sep 17 00:00:00 2001
From: Charles Zhang <zhangchao@microsoft.com>
Date: Tue, 2 Jun 2026 14:59:20 +0800
Subject: [PATCH 025/143] Fix integration tests. (#773)

---
 .../runtime_checker/not_qnn_results.json      |   560 +-
 .../reshape_qnn_results.actual.json           | 26191 +++++++++-------
 .../runtime_checker/reshape_qnn_results.json  |  6217 ++--
 .../analyze/runtime_checker/test_helper.py    |     6 +-
 4 files changed, 18732 insertions(+), 14242 deletions(-)

diff --git a/tests/integration/analyze/runtime_checker/not_qnn_results.json b/tests/integration/analyze/runtime_checker/not_qnn_results.json
index cea737487..fcc60b721 100644
--- a/tests/integration/analyze/runtime_checker/not_qnn_results.json
+++ b/tests/integration/analyze/runtime_checker/not_qnn_results.json
@@ -1,5 +1,40 @@
 {
   "check_results": [
+    {
+      "type_vars": {
+        "T_Not": "BOOL"
+      },
+      "input_constraints": {
+        "X": {
+          "type": "shape",
+          "shape": [],
+          "min_max": null
+        }
+      },
+      "attrs": {},
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "X": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (508 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (986 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (550 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (577 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (207 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2890 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (511 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1082 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (597 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (607 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (218 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (280 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (898 us)\nStarting stage: Completion\nCompleted stage: Completion (67 us)\nRun outputs: [array(False)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
     {
       "type_vars": {
         "T_Not": "BOOL"
@@ -9,10 +44,12 @@
           "type": "shape",
           "shape": [
             1
-          ]
+          ],
+          "min_max": null
         }
       },
       "attrs": {},
+      "dynamic_axes": {},
       "input_is_constant": {
         "X": false
       },
@@ -22,16 +59,16 @@
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (264 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (621 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (448 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (351 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (34 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (24 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (878 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (604 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1030 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (562 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (592 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (209 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (259 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2797 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (384 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (999 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (450 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (575 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (45 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (27 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (3628 us)\nStarting stage: Completion\nCompleted stage: Completion (13 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([False])]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (559 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (948 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (681 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (599 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (210 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (269 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2789 us)\nStarting stage: Completion\nCompleted stage: Completion (72 us)\nRun outputs: [array([False])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
@@ -44,10 +81,12 @@
           "type": "shape",
           "shape": [
             6
-          ]
+          ],
+          "min_max": null
         }
       },
       "attrs": {},
+      "dynamic_axes": {},
       "input_is_constant": {
         "X": false
       },
@@ -57,16 +96,16 @@
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (312 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (926 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (449 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (500 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (51 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (31 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (3500 us)\nStarting stage: Completion\nCompleted stage: Completion (14 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (551 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (994 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (548 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (695 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (267 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (265 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (891 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (496 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1071 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (474 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (480 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (45 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (23 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (873 us)\nStarting stage: Completion\nCompleted stage: Completion (10 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([ True, False, False, False, False,  True])]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (745 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1152 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (568 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (584 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (206 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (257 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2770 us)\nStarting stage: Completion\nCompleted stage: Completion (67 us)\nRun outputs: [array([False, False, False, False, False, False])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
@@ -80,10 +119,12 @@
           "shape": [
             4,
             5
-          ]
+          ],
+          "min_max": null
         }
       },
       "attrs": {},
+      "dynamic_axes": {},
       "input_is_constant": {
         "X": false
       },
@@ -93,16 +134,16 @@
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (338 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (524 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (354 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (336 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (33 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (23 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (828 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (557 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1404 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (643 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (692 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (223 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (269 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (915 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (312 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1126 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (482 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (515 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (53 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (31 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (3391 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[ True, False, False, False, False],\n       [False,  True,  True, False,  True],\n       [ True, False,  True, False, False],\n       [False, False,  True, False,  True]])]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (522 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (908 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (556 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (627 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (206 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (255 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (889 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[ True, False,  True,  True, False],\n       [False,  True,  True, False,  True],\n       [False,  True,  True,  True,  True],\n       [False,  True,  True, False, False]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
@@ -117,10 +158,12 @@
             3,
             2,
             5
-          ]
+          ],
+          "min_max": null
         }
       },
       "attrs": {},
+      "dynamic_axes": {},
       "input_is_constant": {
         "X": false
       },
@@ -130,16 +173,16 @@
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (378 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (745 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (350 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (410 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (38 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (28 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (3359 us)\nStarting stage: Completion\nCompleted stage: Completion (31 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (563 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1090 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (592 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (871 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (692 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (333 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (763 us)\nStarting stage: Completion\nCompleted stage: Completion (72 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (266 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (922 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (328 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (351 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (46 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (21 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2439 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[False, False, False,  True, False],\n        [False, False,  True,  True, False]],\n\n       [[ True,  True,  True,  True,  True],\n        [False,  True, False, False, False]],\n\n       [[ True,  True, False, False,  True],\n        [False,  True, False, False,  True]]])]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (664 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1371 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (574 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (600 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (270 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (267 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (996 us)\nStarting stage: Completion\nCompleted stage: Completion (68 us)\nRun outputs: [array([[[ True,  True,  True, False,  True],\n        [False,  True, False, False, False]],\n\n       [[False, False, False,  True,  True],\n        [ True,  True,  True,  True, False]],\n\n       [[False,  True, False, False,  True],\n        [False, False,  True,  True, False]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
@@ -155,10 +198,12 @@
             4,
             5,
             6
-          ]
+          ],
+          "min_max": null
         }
       },
       "attrs": {},
+      "dynamic_axes": {},
       "input_is_constant": {
         "X": false
       },
@@ -168,16 +213,16 @@
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (272 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (665 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (327 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (399 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (36 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (21 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=4096\nread_total_bytes=4096\n\nCompleted stage: Finalizing Graph Sequence (896 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (489 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1125 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (592 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (830 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (477 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (290 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=4096\nread_total_bytes=4096\n\nCompleted stage: Finalizing Graph Sequence (1147 us)\nStarting stage: Completion\nCompleted stage: Completion (68 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (218 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (641 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (345 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (377 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (38 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (21 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=4096\nread_total_bytes=4096\n\nCompleted stage: Finalizing Graph Sequence (797 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[False,  True,  True, False,  True, False],\n         [False, False,  True, False,  True,  True],\n         [False, False, False,  True, False,  True],\n         [False,  True,  True, False, False,  True],\n         [False, False,  True, False,  True, False]],\n\n        [[ True,  True, False,  True,  True,  True],\n         [ True,  True,  True,  True, False,  True],\n         [False, False, False,  True, False, False],\n         [ True, False,  True, False,  True, False],\n         [False, False,  True,  True,  True,  True]],\n\n        [[ True, False,  True, False, False,  True],\n         [False,  True, False, False, False,  True],\n         [ True,  True, False,  True,  True,  True],\n         [False, False,  True,  True, False, False],\n         [ True,  True,  True,  True,  True,  True]],\n\n        [[ True, False, False, False,  True,  True],\n         [False,  True, False,  True, False,  True],\n         [False,  True,  True, False, False, False],\n         [ True, False, False,  True, False, False],\n         [ True,  True,  True,  True, False,  True]]],\n\n\n       [[[False, False,  True, False, False, False],\n         [ True, False,  True, False, False, False],\n         [ True, False, False,  True,  True, False],\n         [False, False, False, False,  True,  True],\n         [ True,  True, False,  True, False, False]],\n\n        [[ True,  True, False,  True, False, False],\n         [ True,  True,  True,  True, False,  True],\n         [ True,  True,  True, False, False,  True],\n         [False, False, False,  True,  True, False],\n         [ True,  True, False, False,  True, False]],\n\n        [[False, False,  True, False,  True,  True],\n         [ True, False, False, False, False, False],\n         [False,  True,  True, False, False,  True],\n         [False,  True,  True,  True, False,  True],\n         [ True, False, False, False, False, False]],\n\n        [[ True, False, False,  True,  True,  True],\n         [False,  True,  True, False, False, False],\n         [ True, False,  True,  True, False, False],\n         [False,  True,  True, False, False,  True],\n         [ True,  True, False,  True,  True,  True]]]])]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (502 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1147 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (694 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (640 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (212 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (289 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=4096\nread_total_bytes=4096\n\nCompleted stage: Finalizing Graph Sequence (1813 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\nRun outputs: [array([[[[False, False, False,  True,  True,  True],\n         [False, False, False,  True,  True, False],\n         [False, False,  True, False, False, False],\n         [ True,  True,  True, False, False,  True],\n         [ True, False, False,  True, False,  True]],\n\n        [[ True,  True,  True, False,  True,  True],\n         [False,  True, False,  True, False, False],\n         [ True,  True, False,  True,  True, False],\n         [ True, False,  True,  True,  True,  True],\n         [False,  True, False, False, False, False]],\n\n        [[False, False,  True, False,  True, False],\n         [False, False, False, False, False,  True],\n         [ True,  True,  True, False, False, False],\n         [ True, False, False, False,  True,  True],\n         [ True,  True,  True,  True, False,  True]],\n\n        [[False,  True,  True,  True, False, False],\n         [ True,  True,  True,  True, False, False],\n         [False,  True, False, False,  True, False],\n         [False,  True,  True,  True, False, False],\n         [ True, False, False, False, False, False]]],\n\n\n       [[[False, False,  True,  True,  True,  True],\n         [ True, False,  True,  True, False,  True],\n         [ True, False,  True,  True,  True, False],\n         [False,  True,  True, False,  True, False],\n         [False,  True,  True, False, False,  True]],\n\n        [[False,  True, False, False, False, False],\n         [False,  True, False, False,  True, False],\n         [False,  True,  True,  True,  True, False],\n         [ True,  True,  True,  True, False, False],\n         [False,  True,  True, False, False,  True]],\n\n        [[False, False, False, False, False,  True],\n         [False, False,  True,  True,  True,  True],\n         [ True,  True,  True,  True, False, False],\n         [False, False,  True,  True, False, False],\n         [False,  True, False, False, False, False]],\n\n        [[ True, False, False,  True,  True,  True],\n         [ True,  True, False, False,  True,  True],\n         [ True, False,  True,  True, False, False],\n         [False, False,  True,  True, False, False],\n         [ True, False,  True, False, False, False]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
@@ -194,10 +239,12 @@
             3,
             4,
             5
-          ]
+          ],
+          "min_max": null
         }
       },
       "attrs": {},
+      "dynamic_axes": {},
       "input_is_constant": {
         "X": false
       },
@@ -205,18 +252,18 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": "\u001b[0;93m2025-12-08 16:49:54.0737702 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n0` of type `ElementWiseNot` with error code 3110\n\u001b[m\n"
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[False, False,  True,  True, False],\n          [False,  True, False, False, False],\n          [ True, False, False, False, False],\n          [False, False,  True, False,  True]],\n\n         [[False, False,  True, False, False],\n          [ True, False,  True, False,  True],\n          [ True, False, False,  True,  True],\n          [False, False,  True,  True,  True]],\n\n         [[False,  True, False,  True, False],\n          [ True,  True,  True,  True,  True],\n          [ True, False,  True, False, False],\n          [ True, False, False, False,  True]]],\n\n\n        [[[False,  True, False, False,  True],\n          [False, False,  True,  True,  True],\n          [ True, False, False,  True, False],\n          [False,  True, False, False,  True]],\n\n         [[ True, False, False,  True,  True],\n          [False,  True,  True, False,  True],\n          [False, False,  True, False,  True],\n          [False, False,  True,  True,  True]],\n\n         [[False, False,  True,  True,  True],\n          [ True, False,  True,  True,  True],\n          [False,  True, False, False,  True],\n          [False,  True, False,  True,  True]]]],\n\n\n\n       [[[[False,  True,  True,  True, False],\n          [ True,  True,  True, False, False],\n          [False,  True, False,  True,  True],\n          [ True, False, False, False,  True]],\n\n         [[False,  True,  True, False,  True],\n          [ True,  True,  True, False,  True],\n          [False,  True, False,  True,  True],\n          [ True,  True,  True, False,  True]],\n\n         [[ True,  True,  True,  True, False],\n          [ True, False, False,  True,  True],\n          [ True,  True, False,  True,  True],\n          [False,  True, False,  True, False]]],\n\n\n        [[[ True,  True, False,  True,  True],\n          [ True,  True,  True,  True, False],\n          [ True, False, False,  True, False],\n          [False, False, False,  True,  True]],\n\n         [[False, False, False, False, False],\n          [False,  True,  True, False,  True],\n          [ True,  True,  True, False, False],\n          [ True,  True,  True, False, False]],\n\n         [[False,  True,  True, False, False],\n          [ True, False,  True,  True,  True],\n          [ True,  True,  True, False, False],\n          [ True,  True, False, False, False]]]]])]\n",
-          "stderr": "\u001b[0;93m2025-12-08 16:49:54.4804230 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n0` of type `ElementWiseNot` with error code 3110\n\u001b[m\n"
+          "stdout": "Run outputs: [array([[[[[False, False,  True,  True,  True],\n          [False, False, False, False, False],\n          [False,  True, False, False, False],\n          [False, False, False,  True, False]],\n\n         [[ True,  True, False,  True,  True],\n          [False,  True, False, False,  True],\n          [False, False, False,  True, False],\n          [ True, False,  True,  True, False]],\n\n         [[False, False,  True, False, False],\n          [False,  True,  True,  True, False],\n          [ True, False,  True,  True,  True],\n          [ True,  True, False,  True,  True]]],\n\n\n        [[[ True, False,  True,  True, False],\n          [False, False,  True, False, False],\n          [ True,  True,  True,  True,  True],\n          [False, False, False, False, False]],\n\n         [[False, False,  True, False, False],\n          [ True, False,  True,  True, False],\n          [False,  True, False, False, False],\n          [ True, False,  True,  True, False]],\n\n         [[ True,  True,  True, False,  True],\n          [False,  True,  True, False, False],\n          [ True, False, False, False, False],\n          [False,  True, False,  True,  True]]]],\n\n\n\n       [[[[False,  True,  True, False,  True],\n          [ True,  True,  True, False,  True],\n          [ True,  True, False,  True,  True],\n          [False, False, False, False, False]],\n\n         [[ True,  True,  True, False,  True],\n          [ True, False,  True, False,  True],\n          [False, False, False, False,  True],\n          [False, False, False,  True,  True]],\n\n         [[False,  True, False, False, False],\n          [ True, False,  True, False, False],\n          [False,  True,  True,  True,  True],\n          [ True,  True, False,  True,  True]]],\n\n\n        [[[ True, False,  True, False, False],\n          [False, False,  True,  True,  True],\n          [ True, False,  True,  True, False],\n          [ True, False, False, False, False]],\n\n         [[False,  True, False,  True,  True],\n          [ True, False, False, False, False],\n          [False,  True, False, False, False],\n          [ True,  True, False,  True, False]],\n\n         [[ True, False,  True,  True, False],\n          [ True, False,  True, False,  True],\n          [False, False,  True, False,  True],\n          [ True, False,  True, False,  True]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
@@ -234,10 +281,12 @@
             2,
             2,
             2
-          ]
+          ],
+          "min_max": null
         }
       },
       "attrs": {},
+      "dynamic_axes": {},
       "input_is_constant": {
         "X": false
       },
@@ -245,18 +294,18 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": "\u001b[0;93m2025-12-08 16:49:54.9575328 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n0` of type `ElementWiseNot` with error code 3110\n\u001b[m\n"
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[[ True, False],\n           [ True, False]],\n\n          [[ True,  True],\n           [False,  True]]],\n\n\n         [[[ True, False],\n           [False, False]],\n\n          [[ True,  True],\n           [ True, False]]]],\n\n\n\n        [[[[ True, False],\n           [False, False]],\n\n          [[False, False],\n           [False,  True]]],\n\n\n         [[[ True, False],\n           [False,  True]],\n\n          [[ True, False],\n           [False, False]]]],\n\n\n\n        [[[[False, False],\n           [ True, False]],\n\n          [[False,  True],\n           [ True, False]]],\n\n\n         [[[ True,  True],\n           [False, False]],\n\n          [[False, False],\n           [ True, False]]]],\n\n\n\n        [[[[False, False],\n           [ True, False]],\n\n          [[False,  True],\n           [ True, False]]],\n\n\n         [[[ True,  True],\n           [False,  True]],\n\n          [[False,  True],\n           [ True,  True]]]],\n\n\n\n        [[[[False,  True],\n           [ True, False]],\n\n          [[False, False],\n           [False,  True]]],\n\n\n         [[[False,  True],\n           [ True, False]],\n\n          [[ True,  True],\n           [ True,  True]]]],\n\n\n\n        [[[[ True,  True],\n           [ True,  True]],\n\n          [[False, False],\n           [ True,  True]]],\n\n\n         [[[ True, False],\n           [ True, False]],\n\n          [[False, False],\n           [ True,  True]]]]],\n\n\n\n\n       [[[[[ True,  True],\n           [ True, False]],\n\n          [[False, False],\n           [ True, False]]],\n\n\n         [[[ True, False],\n           [False,  True]],\n\n          [[ True,  True],\n           [False,  True]]]],\n\n\n\n        [[[[ True, False],\n           [ True, False]],\n\n          [[False,  True],\n           [ True, False]]],\n\n\n         [[[ True, False],\n           [False,  True]],\n\n          [[ True, False],\n           [False, False]]]],\n\n\n\n        [[[[ True,  True],\n           [False, False]],\n\n          [[ True,  True],\n           [False, False]]],\n\n\n         [[[False, False],\n           [False,  True]],\n\n          [[False,  True],\n           [False, False]]]],\n\n\n\n        [[[[False,  True],\n           [ True,  True]],\n\n          [[ True,  True],\n           [False, False]]],\n\n\n         [[[ True,  True],\n           [False, False]],\n\n          [[ True, False],\n           [False,  True]]]],\n\n\n\n        [[[[False, False],\n           [ True,  True]],\n\n          [[False,  True],\n           [False, False]]],\n\n\n         [[[ True, False],\n           [ True, False]],\n\n          [[False,  True],\n           [False, False]]]],\n\n\n\n        [[[[ True, False],\n           [False, False]],\n\n          [[False,  True],\n           [ True,  True]]],\n\n\n         [[[False, False],\n           [False,  True]],\n\n          [[False, False],\n           [ True,  True]]]]],\n\n\n\n\n       [[[[[False, False],\n           [ True,  True]],\n\n          [[False,  True],\n           [False, False]]],\n\n\n         [[[ True,  True],\n           [ True,  True]],\n\n          [[ True,  True],\n           [False, False]]]],\n\n\n\n        [[[[ True,  True],\n           [False, False]],\n\n          [[False,  True],\n           [False,  True]]],\n\n\n         [[[False, False],\n           [False, False]],\n\n          [[False,  True],\n           [False, False]]]],\n\n\n\n        [[[[False,  True],\n           [False,  True]],\n\n          [[False,  True],\n           [False,  True]]],\n\n\n         [[[ True, False],\n           [False, False]],\n\n          [[ True, False],\n           [ True, False]]]],\n\n\n\n        [[[[False,  True],\n           [ True, False]],\n\n          [[ True,  True],\n           [ True, False]]],\n\n\n         [[[ True, False],\n           [False, False]],\n\n          [[False,  True],\n           [ True,  True]]]],\n\n\n\n        [[[[ True, False],\n           [False, False]],\n\n          [[False,  True],\n           [False, False]]],\n\n\n         [[[False,  True],\n           [False,  True]],\n\n          [[ True,  True],\n           [ True, False]]]],\n\n\n\n        [[[[ True,  True],\n           [False, False]],\n\n          [[ True, False],\n           [False, False]]],\n\n\n         [[[False,  True],\n           [ True,  True]],\n\n          [[ True, False],\n           [ True, False]]]]],\n\n\n\n\n       [[[[[ True, False],\n           [False,  True]],\n\n          [[ True,  True],\n           [False,  True]]],\n\n\n         [[[ True, False],\n           [False,  True]],\n\n          [[False, False],\n           [ True, False]]]],\n\n\n\n        [[[[False, False],\n           [False, False]],\n\n          [[ True, False],\n           [False,  True]]],\n\n\n         [[[ True,  True],\n           [ True,  True]],\n\n          [[False, False],\n           [ True,  True]]]],\n\n\n\n        [[[[ True, False],\n           [False, False]],\n\n          [[False,  True],\n           [ True, False]]],\n\n\n         [[[ True, False],\n           [False,  True]],\n\n          [[ True, False],\n           [ True,  True]]]],\n\n\n\n        [[[[False, False],\n           [False, False]],\n\n          [[False,  True],\n           [ True,  True]]],\n\n\n         [[[False, False],\n           [ True,  True]],\n\n          [[False, False],\n           [False,  True]]]],\n\n\n\n        [[[[False,  True],\n           [False,  True]],\n\n          [[False,  True],\n           [ True, False]]],\n\n\n         [[[False,  True],\n           [False,  True]],\n\n          [[False,  True],\n           [ True, False]]]],\n\n\n\n        [[[[False, False],\n           [False, False]],\n\n          [[ True,  True],\n           [False, False]]],\n\n\n         [[[ True, False],\n           [ True,  True]],\n\n          [[ True, False],\n           [ True,  True]]]]],\n\n\n\n\n       [[[[[False,  True],\n           [False, False]],\n\n          [[ True, False],\n           [ True,  True]]],\n\n\n         [[[ True, False],\n           [ True,  True]],\n\n          [[False, False],\n           [False, False]]]],\n\n\n\n        [[[[False,  True],\n           [False, False]],\n\n          [[False, False],\n           [False, False]]],\n\n\n         [[[ True,  True],\n           [False,  True]],\n\n          [[ True,  True],\n           [ True,  True]]]],\n\n\n\n        [[[[ True, False],\n           [False, False]],\n\n          [[False,  True],\n           [False, False]]],\n\n\n         [[[False,  True],\n           [ True,  True]],\n\n          [[ True, False],\n           [ True,  True]]]],\n\n\n\n        [[[[False, False],\n           [ True, False]],\n\n          [[ True, False],\n           [False, False]]],\n\n\n         [[[ True, False],\n           [ True,  True]],\n\n          [[False, False],\n           [False, False]]]],\n\n\n\n        [[[[False, False],\n           [False, False]],\n\n          [[False, False],\n           [ True, False]]],\n\n\n         [[[False,  True],\n           [ True, False]],\n\n          [[False, False],\n           [ True,  True]]]],\n\n\n\n        [[[[False, False],\n           [False, False]],\n\n          [[ True, False],\n           [False, False]]],\n\n\n         [[[ True, False],\n           [ True, False]],\n\n          [[ True,  True],\n           [False,  True]]]]],\n\n\n\n\n       [[[[[ True, False],\n           [False,  True]],\n\n          [[False, False],\n           [False,  True]]],\n\n\n         [[[ True,  True],\n           [ True, False]],\n\n          [[ True,  True],\n           [False, False]]]],\n\n\n\n        [[[[ True,  True],\n           [False, False]],\n\n          [[ True,  True],\n           [False, False]]],\n\n\n         [[[False,  True],\n           [False,  True]],\n\n          [[ True,  True],\n           [False,  True]]]],\n\n\n\n        [[[[False,  True],\n           [ True,  True]],\n\n          [[False,  True],\n           [False, False]]],\n\n\n         [[[ True, False],\n           [False,  True]],\n\n          [[False, False],\n           [ True, False]]]],\n\n\n\n        [[[[ True,  True],\n           [ True,  True]],\n\n          [[False, False],\n           [False,  True]]],\n\n\n         [[[False,  True],\n           [False,  True]],\n\n          [[ True,  True],\n           [ True, False]]]],\n\n\n\n        [[[[False, False],\n           [ True,  True]],\n\n          [[False, False],\n           [False,  True]]],\n\n\n         [[[ True,  True],\n           [ True, False]],\n\n          [[ True, False],\n           [ True,  True]]]],\n\n\n\n        [[[[False,  True],\n           [ True, False]],\n\n          [[ True,  True],\n           [False,  True]]],\n\n\n         [[[False,  True],\n           [False, False]],\n\n          [[False,  True],\n           [False,  True]]]]]])]\n",
-          "stderr": "\u001b[0;93m2025-12-08 16:49:55.3357474 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n0` of type `ElementWiseNot` with error code 3110\n\u001b[m\n"
+          "stdout": "Run outputs: [array([[[[[[False,  True],\n           [False, False]],\n\n          [[False, False],\n           [ True, False]]],\n\n\n         [[[False,  True],\n           [ True,  True]],\n\n          [[ True, False],\n           [ True, False]]]],\n\n\n\n        [[[[ True, False],\n           [False,  True]],\n\n          [[False,  True],\n           [False, False]]],\n\n\n         [[[False,  True],\n           [ True,  True]],\n\n          [[False,  True],\n           [False,  True]]]],\n\n\n\n        [[[[ True, False],\n           [False,  True]],\n\n          [[ True,  True],\n           [ True, False]]],\n\n\n         [[[ True, False],\n           [ True, False]],\n\n          [[False,  True],\n           [False,  True]]]],\n\n\n\n        [[[[False, False],\n           [ True,  True]],\n\n          [[ True,  True],\n           [False, False]]],\n\n\n         [[[False, False],\n           [ True,  True]],\n\n          [[ True, False],\n           [False,  True]]]],\n\n\n\n        [[[[ True,  True],\n           [ True,  True]],\n\n          [[ True,  True],\n           [False,  True]]],\n\n\n         [[[ True,  True],\n           [False, False]],\n\n          [[ True,  True],\n           [False, False]]]],\n\n\n\n        [[[[ True,  True],\n           [False,  True]],\n\n          [[False, False],\n           [ True,  True]]],\n\n\n         [[[False, False],\n           [ True,  True]],\n\n          [[ True,  True],\n           [ True, False]]]]],\n\n\n\n\n       [[[[[ True, False],\n           [False, False]],\n\n          [[False, False],\n           [ True,  True]]],\n\n\n         [[[False,  True],\n           [ True, False]],\n\n          [[False, False],\n           [False, False]]]],\n\n\n\n        [[[[False,  True],\n           [ True,  True]],\n\n          [[ True,  True],\n           [False, False]]],\n\n\n         [[[False,  True],\n           [ True, False]],\n\n          [[ True, False],\n           [False, False]]]],\n\n\n\n        [[[[ True,  True],\n           [False,  True]],\n\n          [[ True,  True],\n           [False,  True]]],\n\n\n         [[[ True,  True],\n           [ True, False]],\n\n          [[False,  True],\n           [ True, False]]]],\n\n\n\n        [[[[ True, False],\n           [ True,  True]],\n\n          [[False,  True],\n           [ True,  True]]],\n\n\n         [[[ True, False],\n           [ True, False]],\n\n          [[False, False],\n           [ True, False]]]],\n\n\n\n        [[[[False,  True],\n           [False,  True]],\n\n          [[False, False],\n           [ True, False]]],\n\n\n         [[[False,  True],\n           [False, False]],\n\n          [[False, False],\n           [ True, False]]]],\n\n\n\n        [[[[ True,  True],\n           [False,  True]],\n\n          [[False,  True],\n           [ True, False]]],\n\n\n         [[[False, False],\n           [False,  True]],\n\n          [[ True, False],\n           [ True,  True]]]]],\n\n\n\n\n       [[[[[ True,  True],\n           [False, False]],\n\n          [[False, False],\n           [False, False]]],\n\n\n         [[[False,  True],\n           [ True, False]],\n\n          [[False,  True],\n           [False,  True]]]],\n\n\n\n        [[[[ True, False],\n           [ True, False]],\n\n          [[ True, False],\n           [ True,  True]]],\n\n\n         [[[False,  True],\n           [False,  True]],\n\n          [[ True,  True],\n           [ True, False]]]],\n\n\n\n        [[[[False, False],\n           [False,  True]],\n\n          [[False,  True],\n           [ True, False]]],\n\n\n         [[[False, False],\n           [False, False]],\n\n          [[False,  True],\n           [ True, False]]]],\n\n\n\n        [[[[ True,  True],\n           [False, False]],\n\n          [[ True, False],\n           [False, False]]],\n\n\n         [[[False,  True],\n           [ True, False]],\n\n          [[False, False],\n           [ True,  True]]]],\n\n\n\n        [[[[ True,  True],\n           [ True, False]],\n\n          [[False,  True],\n           [False,  True]]],\n\n\n         [[[ True,  True],\n           [False, False]],\n\n          [[ True,  True],\n           [False,  True]]]],\n\n\n\n        [[[[ True, False],\n           [False, False]],\n\n          [[ True, False],\n           [ True, False]]],\n\n\n         [[[ True,  True],\n           [ True,  True]],\n\n          [[ True, False],\n           [False, False]]]]],\n\n\n\n\n       [[[[[ True, False],\n           [ True, False]],\n\n          [[ True,  True],\n           [ True, False]]],\n\n\n         [[[ True,  True],\n           [False,  True]],\n\n          [[False,  True],\n           [False,  True]]]],\n\n\n\n        [[[[ True,  True],\n           [False,  True]],\n\n          [[False, False],\n           [False,  True]]],\n\n\n         [[[False, False],\n           [False,  True]],\n\n          [[False,  True],\n           [False,  True]]]],\n\n\n\n        [[[[ True, False],\n           [ True, False]],\n\n          [[ True,  True],\n           [ True, False]]],\n\n\n         [[[False, False],\n           [ True,  True]],\n\n          [[False, False],\n           [False,  True]]]],\n\n\n\n        [[[[ True,  True],\n           [ True,  True]],\n\n          [[ True, False],\n           [False,  True]]],\n\n\n         [[[ True,  True],\n           [False,  True]],\n\n          [[False,  True],\n           [False, False]]]],\n\n\n\n        [[[[ True,  True],\n           [False,  True]],\n\n          [[ True, False],\n           [ True,  True]]],\n\n\n         [[[False,  True],\n           [ True, False]],\n\n          [[ True,  True],\n           [ True, False]]]],\n\n\n\n        [[[[False,  True],\n           [ True,  True]],\n\n          [[False, False],\n           [ True, False]]],\n\n\n         [[[ True,  True],\n           [False,  True]],\n\n          [[ True, False],\n           [ True, False]]]]],\n\n\n\n\n       [[[[[False, False],\n           [False, False]],\n\n          [[ True, False],\n           [False, False]]],\n\n\n         [[[ True,  True],\n           [ True, False]],\n\n          [[False, False],\n           [False, False]]]],\n\n\n\n        [[[[False, False],\n           [ True,  True]],\n\n          [[ True, False],\n           [ True, False]]],\n\n\n         [[[ True,  True],\n           [ True, False]],\n\n          [[False,  True],\n           [False,  True]]]],\n\n\n\n        [[[[ True, False],\n           [ True, False]],\n\n          [[False, False],\n           [ True,  True]]],\n\n\n         [[[False,  True],\n           [False, False]],\n\n          [[ True,  True],\n           [False,  True]]]],\n\n\n\n        [[[[False,  True],\n           [ True, False]],\n\n          [[ True,  True],\n           [ True, False]]],\n\n\n         [[[False, False],\n           [ True,  True]],\n\n          [[ True, False],\n           [False,  True]]]],\n\n\n\n        [[[[ True,  True],\n           [ True,  True]],\n\n          [[ True,  True],\n           [False,  True]]],\n\n\n         [[[False,  True],\n           [ True, False]],\n\n          [[ True, False],\n           [False,  True]]]],\n\n\n\n        [[[[False, False],\n           [ True,  True]],\n\n          [[ True,  True],\n           [False, False]]],\n\n\n         [[[ True, False],\n           [False, False]],\n\n          [[False,  True],\n           [False, False]]]]],\n\n\n\n\n       [[[[[False, False],\n           [ True,  True]],\n\n          [[ True, False],\n           [ True,  True]]],\n\n\n         [[[False,  True],\n           [ True,  True]],\n\n          [[False, False],\n           [ True, False]]]],\n\n\n\n        [[[[False,  True],\n           [ True, False]],\n\n          [[False, False],\n           [ True, False]]],\n\n\n         [[[ True, False],\n           [ True,  True]],\n\n          [[ True,  True],\n           [False, False]]]],\n\n\n\n        [[[[ True, False],\n           [ True,  True]],\n\n          [[False, False],\n           [False,  True]]],\n\n\n         [[[False, False],\n           [False,  True]],\n\n          [[False,  True],\n           [False, False]]]],\n\n\n\n        [[[[False,  True],\n           [False, False]],\n\n          [[ True, False],\n           [False,  True]]],\n\n\n         [[[False, False],\n           [ True,  True]],\n\n          [[False, False],\n           [False,  True]]]],\n\n\n\n        [[[[ True,  True],\n           [False, False]],\n\n          [[ True, False],\n           [False,  True]]],\n\n\n         [[[False,  True],\n           [ True, False]],\n\n          [[ True,  True],\n           [False,  True]]]],\n\n\n\n        [[[[False, False],\n           [ True, False]],\n\n          [[False,  True],\n           [False, False]]],\n\n\n         [[[False, False],\n           [False, False]],\n\n          [[False, False],\n           [False,  True]]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     }
@@ -264,7 +313,7 @@
   "sys_info": {
     "cpuList": [
       {
-        "name": "Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Oryon(TM) CPU",
+        "name": "Snapdragon(R) X Elite - X1E78100 - Qualcomm(R) Oryon(TM) CPU",
         "manufacturer": "Qualcomm Technologies Inc",
         "coreCount": 12,
         "threadCount": 12,
@@ -273,9 +322,9 @@
     ],
     "gpuList": [
       {
-        "name": "Qualcomm(R) Adreno(TM) X1-85 GPU",
+        "name": "Snapdragon(R) X Elite - X1E78100 - Qualcomm(R) Adreno(TM) GPU",
         "manufacturer": "Qualcomm Incorporated",
-        "driverVersion": "31.0.114.0",
+        "driverVersion": "31.0.57.0",
         "vramMib": 0,
         "vendorId": 1297040209,
         "deviceId": 909329200
@@ -283,9 +332,9 @@
     ],
     "npuList": [
       {
-        "name": "Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Hexagon(TM) NPU",
+        "name": "Snapdragon(R) X Elite - X1E78100 - Qualcomm(R) Hexagon(TM) NPU",
         "manufacturer": "Qualcomm Technologies, Inc.",
-        "driverVersion": "30.0.143.0",
+        "driverVersion": "30.0.220.3000",
         "vendorId": 1297040209,
         "deviceId": 1093682224
       }
@@ -293,47 +342,57 @@
     "ramList": [
       {
         "capacityMib": 32768,
-        "speedMt": 8448,
-        "manufacturer": "HYNIX"
+        "speedMt": 7372,
+        "manufacturer": ""
       }
     ],
     "os": {
       "caption": "Microsoft Windows 11 Enterprise",
       "version": "10.0.26200",
       "architecture": "ARM 64-bit Processor",
-      "sku": 4
+      "sku": 4,
+      "buildNumber": "26200",
+      "isWindows11": true
     },
     "pythonRuntime": {
-      "version": "3.12.10",
+      "version": "3.11.15",
       "implementation": "CPython",
-      "architecture": "ARM64",
-      "compiler": "MSC v.1943 64 bit (AMD64)",
-      "buildNumber": "May 30 2025 05:39:07"
+      "architecture": "AMD64",
+      "compiler": "MSC v.1944 64 bit (AMD64)",
+      "buildNumber": "Mar 20 2026 00:32:44"
     },
     "pipPackages": [
       {
-        "name": "winml-modelkit",
+        "name": "winml-cli",
         "version": "0.1.0"
       },
+      {
+        "name": "winml-modelkit",
+        "version": "0.0.2"
+      },
       {
         "name": "aiohappyeyeballs",
         "version": "2.6.1"
       },
       {
         "name": "aiohttp",
-        "version": "3.13.2"
+        "version": "3.13.5"
       },
       {
         "name": "aiosignal",
         "version": "1.4.0"
       },
+      {
+        "name": "annotated-doc",
+        "version": "0.0.4"
+      },
       {
         "name": "annotated-types",
         "version": "0.7.0"
       },
       {
         "name": "anyio",
-        "version": "4.12.0"
+        "version": "4.13.0"
       },
       {
         "name": "argon2-cffi",
@@ -351,17 +410,21 @@
         "name": "asttokens",
         "version": "3.0.1"
       },
+      {
+        "name": "ast_serialize",
+        "version": "0.5.0"
+      },
       {
         "name": "async-lru",
-        "version": "2.0.5"
+        "version": "2.3.0"
       },
       {
         "name": "attrs",
-        "version": "25.4.0"
+        "version": "26.1.0"
       },
       {
         "name": "babel",
-        "version": "2.17.0"
+        "version": "2.18.0"
       },
       {
         "name": "beautifulsoup4",
@@ -373,28 +436,28 @@
       },
       {
         "name": "certifi",
-        "version": "2025.11.12"
+        "version": "2026.2.25"
       },
       {
         "name": "cffi",
         "version": "2.0.0"
       },
+      {
+        "name": "cfgv",
+        "version": "3.5.0"
+      },
       {
         "name": "charset-normalizer",
-        "version": "3.4.4"
+        "version": "3.4.7"
       },
       {
         "name": "click",
-        "version": "8.3.1"
+        "version": "8.4.1"
       },
       {
         "name": "colorama",
         "version": "0.4.6"
       },
-      {
-        "name": "coloredlogs",
-        "version": "15.0.1"
-      },
       {
         "name": "comm",
         "version": "0.2.3"
@@ -405,7 +468,11 @@
       },
       {
         "name": "coverage",
-        "version": "7.12.0"
+        "version": "7.13.5"
+      },
+      {
+        "name": "cryptography",
+        "version": "46.0.7"
       },
       {
         "name": "cycler",
@@ -413,11 +480,11 @@
       },
       {
         "name": "datasets",
-        "version": "4.4.1"
+        "version": "4.8.4"
       },
       {
         "name": "debugpy",
-        "version": "1.8.17"
+        "version": "1.8.20"
       },
       {
         "name": "decorator",
@@ -427,8 +494,16 @@
         "name": "defusedxml",
         "version": "0.7.1"
       },
+      {
+        "name": "diffusers",
+        "version": "0.37.1"
+      },
       {
         "name": "dill",
+        "version": "0.4.1"
+      },
+      {
+        "name": "distlib",
         "version": "0.4.0"
       },
       {
@@ -439,21 +514,25 @@
         "name": "executing",
         "version": "2.2.1"
       },
+      {
+        "name": "fastapi",
+        "version": "0.136.0"
+      },
       {
         "name": "fastjsonschema",
         "version": "2.21.2"
       },
       {
         "name": "filelock",
-        "version": "3.20.0"
+        "version": "3.25.2"
       },
       {
         "name": "flatbuffers",
-        "version": "25.9.23"
+        "version": "25.12.19"
       },
       {
         "name": "fonttools",
-        "version": "4.61.0"
+        "version": "4.63.0"
       },
       {
         "name": "fqdn",
@@ -465,47 +544,59 @@
       },
       {
         "name": "fsspec",
-        "version": "2025.10.0"
+        "version": "2026.2.0"
       },
       {
         "name": "h11",
         "version": "0.16.0"
       },
+      {
+        "name": "hf-xet",
+        "version": "1.4.3"
+      },
       {
         "name": "httpcore",
         "version": "1.0.9"
       },
+      {
+        "name": "httptools",
+        "version": "0.7.1"
+      },
       {
         "name": "httpx",
         "version": "0.28.1"
       },
       {
-        "name": "huggingface-hub",
-        "version": "0.36.0"
+        "name": "httpx-sse",
+        "version": "0.4.3"
+      },
+      {
+        "name": "huggingface_hub",
+        "version": "0.36.2"
       },
       {
-        "name": "humanfriendly",
-        "version": "10.0"
+        "name": "identify",
+        "version": "2.6.18"
       },
       {
         "name": "idna",
         "version": "3.11"
       },
+      {
+        "name": "importlib_metadata",
+        "version": "8.7.1"
+      },
       {
         "name": "iniconfig",
         "version": "2.3.0"
       },
       {
         "name": "ipykernel",
-        "version": "7.1.0"
+        "version": "7.2.0"
       },
       {
         "name": "ipython",
-        "version": "9.7.0"
-      },
-      {
-        "name": "ipython_pygments_lexers",
-        "version": "1.1.1"
+        "version": "8.39.0"
       },
       {
         "name": "ipywidgets",
@@ -525,19 +616,19 @@
       },
       {
         "name": "joblib",
-        "version": "1.5.2"
+        "version": "1.5.3"
       },
       {
         "name": "json5",
-        "version": "0.12.1"
+        "version": "0.14.0"
       },
       {
         "name": "jsonpointer",
-        "version": "3.0.0"
+        "version": "3.1.1"
       },
       {
         "name": "jsonschema",
-        "version": "4.25.1"
+        "version": "4.26.0"
       },
       {
         "name": "jsonschema-specifications",
@@ -549,7 +640,7 @@
       },
       {
         "name": "jupyterlab",
-        "version": "4.5.0"
+        "version": "4.5.6"
       },
       {
         "name": "jupyterlab_pygments",
@@ -565,7 +656,7 @@
       },
       {
         "name": "jupyter_client",
-        "version": "8.6.3"
+        "version": "8.8.0"
       },
       {
         "name": "jupyter-console",
@@ -581,7 +672,7 @@
       },
       {
         "name": "jupyter-lsp",
-        "version": "2.3.0"
+        "version": "2.3.1"
       },
       {
         "name": "jupyter_server",
@@ -589,11 +680,11 @@
       },
       {
         "name": "jupyter_server_terminals",
-        "version": "0.5.3"
+        "version": "0.5.4"
       },
       {
         "name": "kiwisolver",
-        "version": "1.4.9"
+        "version": "1.5.0"
       },
       {
         "name": "lark",
@@ -601,7 +692,11 @@
       },
       {
         "name": "librt",
-        "version": "0.6.3"
+        "version": "0.11.0"
+      },
+      {
+        "name": "lightning-utilities",
+        "version": "0.15.3"
       },
       {
         "name": "markdown-it-py",
@@ -613,19 +708,23 @@
       },
       {
         "name": "matplotlib",
-        "version": "3.10.7"
+        "version": "3.10.9"
       },
       {
         "name": "matplotlib-inline",
         "version": "0.2.1"
       },
+      {
+        "name": "mcp",
+        "version": "1.27.0"
+      },
       {
         "name": "mdurl",
         "version": "0.1.2"
       },
       {
         "name": "mistune",
-        "version": "3.1.4"
+        "version": "3.2.0"
       },
       {
         "name": "ml_dtypes",
@@ -637,15 +736,15 @@
       },
       {
         "name": "multidict",
-        "version": "6.7.0"
+        "version": "6.7.1"
       },
       {
         "name": "multiprocess",
-        "version": "0.70.18"
+        "version": "0.70.19"
       },
       {
         "name": "mypy",
-        "version": "1.19.0"
+        "version": "2.1.0"
       },
       {
         "name": "mypy_extensions",
@@ -653,11 +752,11 @@
       },
       {
         "name": "nbclient",
-        "version": "0.10.2"
+        "version": "0.10.4"
       },
       {
         "name": "nbconvert",
-        "version": "7.16.6"
+        "version": "7.17.1"
       },
       {
         "name": "nbformat",
@@ -669,11 +768,15 @@
       },
       {
         "name": "networkx",
-        "version": "3.6"
+        "version": "3.4.2"
+      },
+      {
+        "name": "nodeenv",
+        "version": "1.10.0"
       },
       {
         "name": "notebook",
-        "version": "7.5.0"
+        "version": "7.5.5"
       },
       {
         "name": "notebook_shim",
@@ -685,27 +788,47 @@
       },
       {
         "name": "onnx",
-        "version": "1.20.0"
+        "version": "1.18.0"
       },
       {
-        "name": "onnxruntime",
-        "version": "1.23.2"
+        "name": "onnxruntime-windowsml",
+        "version": "1.24.5.202604171637"
       },
       {
         "name": "onnxscript",
-        "version": "0.5.6"
+        "version": "0.6.2"
       },
       {
         "name": "onnx-ir",
-        "version": "0.1.12"
+        "version": "0.2.0"
+      },
+      {
+        "name": "opentelemetry-api",
+        "version": "1.41.0"
+      },
+      {
+        "name": "opentelemetry-sdk",
+        "version": "1.41.0"
+      },
+      {
+        "name": "opentelemetry-semantic-conventions",
+        "version": "0.62b0"
       },
       {
         "name": "optimum",
-        "version": "2.0.0"
+        "version": "2.1.0"
+      },
+      {
+        "name": "optimum-onnx",
+        "version": "0.1.0"
+      },
+      {
+        "name": "overrides",
+        "version": "7.7.0"
       },
       {
         "name": "packaging",
-        "version": "25.0"
+        "version": "26.0"
       },
       {
         "name": "pandas",
@@ -717,27 +840,35 @@
       },
       {
         "name": "parso",
-        "version": "0.8.5"
+        "version": "0.8.6"
       },
       {
         "name": "pathspec",
-        "version": "0.12.1"
+        "version": "1.1.1"
       },
       {
         "name": "pillow",
-        "version": "12.0.0"
+        "version": "12.2.0"
       },
       {
         "name": "platformdirs",
-        "version": "4.5.0"
+        "version": "4.9.6"
+      },
+      {
+        "name": "plotext",
+        "version": "5.3.2"
       },
       {
         "name": "pluggy",
         "version": "1.6.0"
       },
+      {
+        "name": "pre_commit",
+        "version": "4.5.1"
+      },
       {
         "name": "prometheus_client",
-        "version": "0.23.1"
+        "version": "0.25.0"
       },
       {
         "name": "prompt_toolkit",
@@ -749,11 +880,11 @@
       },
       {
         "name": "protobuf",
-        "version": "6.33.1"
+        "version": "7.34.1"
       },
       {
         "name": "psutil",
-        "version": "7.1.3"
+        "version": "7.2.2"
       },
       {
         "name": "pure_eval",
@@ -761,55 +892,83 @@
       },
       {
         "name": "pyarrow",
-        "version": "22.0.0"
+        "version": "23.0.1"
+      },
+      {
+        "name": "pycocotools",
+        "version": "2.0.11"
       },
       {
         "name": "pycparser",
-        "version": "2.23"
+        "version": "3.0"
       },
       {
         "name": "pydantic",
-        "version": "2.12.5"
+        "version": "2.13.0"
       },
       {
         "name": "pydantic_core",
-        "version": "2.41.5"
+        "version": "2.46.0"
+      },
+      {
+        "name": "pydantic-settings",
+        "version": "2.14.0"
       },
       {
         "name": "Pygments",
-        "version": "2.19.2"
+        "version": "2.20.0"
       },
       {
-        "name": "pyparsing",
-        "version": "3.2.5"
+        "name": "PyJWT",
+        "version": "2.12.1"
       },
       {
-        "name": "pyreadline3",
-        "version": "3.5.4"
+        "name": "pyparsing",
+        "version": "3.3.2"
       },
       {
         "name": "pytest",
-        "version": "9.0.1"
+        "version": "9.0.3"
       },
       {
         "name": "pytest-cov",
-        "version": "7.0.0"
+        "version": "7.1.0"
+      },
+      {
+        "name": "pytest-timeout",
+        "version": "2.4.0"
       },
       {
         "name": "python-dateutil",
         "version": "2.9.0.post0"
       },
+      {
+        "name": "python-discovery",
+        "version": "1.2.2"
+      },
+      {
+        "name": "python-dotenv",
+        "version": "1.2.2"
+      },
       {
         "name": "python-json-logger",
-        "version": "4.0.0"
+        "version": "4.1.0"
+      },
+      {
+        "name": "python-multipart",
+        "version": "0.0.26"
       },
       {
         "name": "pytz",
-        "version": "2025.2"
+        "version": "2026.1.post1"
+      },
+      {
+        "name": "pywin32",
+        "version": "311"
       },
       {
         "name": "pywinpty",
-        "version": "3.0.2"
+        "version": "3.0.3"
       },
       {
         "name": "PyYAML",
@@ -819,17 +978,21 @@
         "name": "pyzmq",
         "version": "27.1.0"
       },
+      {
+        "name": "RapidFuzz",
+        "version": "3.14.5"
+      },
       {
         "name": "referencing",
         "version": "0.37.0"
       },
       {
         "name": "regex",
-        "version": "2025.11.3"
+        "version": "2026.4.4"
       },
       {
         "name": "requests",
-        "version": "2.32.5"
+        "version": "2.33.1"
       },
       {
         "name": "rfc3339-validator",
@@ -845,7 +1008,7 @@
       },
       {
         "name": "rich",
-        "version": "14.2.0"
+        "version": "15.0.0"
       },
       {
         "name": "rpds-py",
@@ -853,7 +1016,7 @@
       },
       {
         "name": "ruff",
-        "version": "0.14.7"
+        "version": "0.15.13"
       },
       {
         "name": "safetensors",
@@ -873,11 +1036,19 @@
       },
       {
         "name": "Send2Trash",
-        "version": "1.8.3"
+        "version": "2.1.0"
+      },
+      {
+        "name": "sentencepiece",
+        "version": "0.2.1"
+      },
+      {
+        "name": "seqeval",
+        "version": "1.2.2"
       },
       {
         "name": "setuptools",
-        "version": "80.9.0"
+        "version": "81.0.0"
       },
       {
         "name": "six",
@@ -889,12 +1060,20 @@
       },
       {
         "name": "soupsieve",
-        "version": "2.8"
+        "version": "2.8.3"
+      },
+      {
+        "name": "sse-starlette",
+        "version": "3.3.4"
       },
       {
         "name": "stack-data",
         "version": "0.6.3"
       },
+      {
+        "name": "starlette",
+        "version": "1.0.0"
+      },
       {
         "name": "sympy",
         "version": "1.14.0"
@@ -909,7 +1088,7 @@
       },
       {
         "name": "timm",
-        "version": "1.0.22"
+        "version": "1.0.26"
       },
       {
         "name": "tinycss2",
@@ -917,27 +1096,31 @@
       },
       {
         "name": "tokenizers",
-        "version": "0.22.1"
+        "version": "0.22.2"
       },
       {
         "name": "torch",
-        "version": "2.9.1"
+        "version": "2.11.0"
       },
       {
         "name": "torchinfo",
         "version": "1.8.0"
       },
+      {
+        "name": "torchmetrics",
+        "version": "1.9.0"
+      },
       {
         "name": "torchvision",
-        "version": "0.24.1"
+        "version": "0.26.0"
       },
       {
         "name": "tornado",
-        "version": "6.5.2"
+        "version": "6.5.5"
       },
       {
         "name": "tqdm",
-        "version": "4.67.1"
+        "version": "4.67.3"
       },
       {
         "name": "traitlets",
@@ -945,11 +1128,11 @@
       },
       {
         "name": "transformers",
-        "version": "4.57.3"
+        "version": "4.57.6"
       },
       {
         "name": "types-colorama",
-        "version": "0.4.15.20250801"
+        "version": "0.4.15.20260508"
       },
       {
         "name": "typing_extensions",
@@ -961,7 +1144,7 @@
       },
       {
         "name": "tzdata",
-        "version": "2025.2"
+        "version": "2026.1"
       },
       {
         "name": "uri-template",
@@ -969,19 +1152,23 @@
       },
       {
         "name": "urllib3",
-        "version": "2.5.0"
+        "version": "2.6.3"
+      },
+      {
+        "name": "uvicorn",
+        "version": "0.45.0"
       },
       {
-        "name": "wasdk-Microsoft.Windows.AI.MachineLearning",
-        "version": "1.8.251106002"
+        "name": "virtualenv",
+        "version": "21.2.3"
       },
       {
-        "name": "wasdk-Microsoft.Windows.ApplicationModel.DynamicDependency.Bootstrap",
-        "version": "1.8.251106002"
+        "name": "watchfiles",
+        "version": "1.1.1"
       },
       {
         "name": "wcwidth",
-        "version": "0.2.14"
+        "version": "0.6.0"
       },
       {
         "name": "webcolors",
@@ -991,6 +1178,10 @@
         "name": "webencodings",
         "version": "0.5.1"
       },
+      {
+        "name": "websockets",
+        "version": "16.0"
+      },
       {
         "name": "websocket-client",
         "version": "1.9.0"
@@ -1000,42 +1191,79 @@
         "version": "4.0.15"
       },
       {
-        "name": "winml-modelkit",
+        "name": "windowsml",
+        "version": "2.0.300"
+      },
+      {
+        "name": "winml-cli",
         "version": "0.1.0"
       },
       {
-        "name": "winrt-runtime",
-        "version": "3.2.1"
+        "name": "xxhash",
+        "version": "3.6.0"
       },
       {
-        "name": "winrt-Windows.Foundation",
-        "version": "3.2.1"
+        "name": "yarl",
+        "version": "1.23.0"
       },
       {
-        "name": "winrt-Windows.Foundation.Collections",
-        "version": "3.2.1"
+        "name": "zipp",
+        "version": "3.23.1"
       },
       {
-        "name": "xxhash",
-        "version": "3.6.0"
+        "name": "winml-cli",
+        "version": "0.1.0"
       },
       {
-        "name": "yarl",
-        "version": "1.22.0"
+        "name": "winml-modelkit",
+        "version": "0.0.2"
+      },
+      {
+        "name": "importlib_metadata",
+        "version": "8.7.1"
+      },
+      {
+        "name": "microvenv",
+        "version": "2025.0"
+      },
+      {
+        "name": "packaging",
+        "version": "26.0"
+      },
+      {
+        "name": "tomli",
+        "version": "2.4.0"
+      },
+      {
+        "name": "typing_extensions",
+        "version": "4.15.0"
+      },
+      {
+        "name": "zipp",
+        "version": "3.21.0"
       }
     ],
     "epPackages": [
       {
-        "name": "MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.1.8_1.8.21.0_arm64__8wekyb3d8bbwe",
-        "version": "1.8.21.0",
+        "name": "MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.1.8_1.8.30.0_arm64__8wekyb3d8bbwe",
+        "version": "1.8.30.0",
+        "publisher": "CN=Microsoft Corporation, O=Microsoft Corporation, L=Redmond, S=Washington, C=US",
+        "architecture": 12,
+        "signatureKind": "Developer",
+        "installLocation": "C:\\Program Files\\WindowsApps\\MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.1.8_1.8.30.0_arm64__8wekyb3d8bbwe",
+        "epHash": "0b4dd71044175fb927d3b44a50b7dee4b003a3dfe86a9b09c3ca83f11150979215c256b0301bced2c7e684f84e42ec964532215c147b8b770399d6b9441afc1a",
+        "status": 0
+      },
+      {
+        "name": "MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.2_2.2450.47.0_arm64__8wekyb3d8bbwe",
+        "version": "2.2450.47.0",
         "publisher": "CN=Microsoft Corporation, O=Microsoft Corporation, L=Redmond, S=Washington, C=US",
         "architecture": 12,
         "signatureKind": "Developer",
-        "installLocation": "C:\\Program Files\\WindowsApps\\MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.1.8_1.8.21.0_arm64__8wekyb3d8bbwe",
-        "epHash": "c62cee3f6a7ca26b76390f5158cf450373ac6caca058db519bf89867bf4c713c495b33deb11018342742eefe2f559f52239c2fd62ff039537e221e4589dbcbcf",
+        "installLocation": "C:\\Program Files\\WindowsApps\\MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.2_2.2450.47.0_arm64__8wekyb3d8bbwe",
+        "epHash": "343f2e6da7490f6721e40942a86a40fa01322c354d784b024491d151ec511e6dba7a9041c3594aa97ff0c0379cf627b88414b25328f931f8ddaabe78a6784102",
         "status": 0
       }
-    ],
-    "windowsAppRuntimeVersion": "1.8.251106002"
+    ]
   }
-}
\ No newline at end of file
+}
diff --git a/tests/integration/analyze/runtime_checker/reshape_qnn_results.actual.json b/tests/integration/analyze/runtime_checker/reshape_qnn_results.actual.json
index fdfa45c82..77d6e145b 100644
--- a/tests/integration/analyze/runtime_checker/reshape_qnn_results.actual.json
+++ b/tests/integration/analyze/runtime_checker/reshape_qnn_results.actual.json
@@ -1,12211 +1,13980 @@
-{
-  "check_results": [
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 0
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (345 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (919 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (393 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (366 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (49 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (36 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (710 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (233 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (802 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (353 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (318 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (38 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (22 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (655 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 1
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 2
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 3
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 4
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 5
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]]]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 6
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (275 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1116 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (405 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (348 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (36 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (26 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (801 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (279 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (858 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (349 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (316 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (36 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (23 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (465 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]]]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 7
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]]]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 8
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 9
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 10
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 11
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 12
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 13
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 14
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]]]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 15
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]]]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 16
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]]]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 17
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 18
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (272 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (751 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (364 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (320 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (41 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (22 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (406 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (247 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (715 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (352 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (391 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (65 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (38 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (457 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 19
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 20
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 21
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "Timeout/crash/fail for 1 attempts: A process in the process pool was terminated abruptly while the future was running or pending."
-          },
-          "stdout": null,
-          "stderr": null
-        },
-        "run": {
-          "result": {
-            "success": false,
-            "reason": "Timeout/crash/fail for 1 attempts: A process in the process pool was terminated abruptly while the future was running or pending."
-          },
-          "stdout": null,
-          "stderr": null
-        }
-      },
-      "case_index": 22
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 23
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.7104 , 0.52   ]],\n\n         [[0.4246 , 0.557  ]]],\n\n\n        [[[0.744  , 0.9097 ]],\n\n         [[0.3508 , 0.1466 ]]],\n\n\n        [[[0.8516 , 0.8647 ]],\n\n         [[0.6895 , 0.604  ]]]],\n\n\n\n       [[[[0.3484 , 0.3823 ]],\n\n         [[0.5474 , 0.1576 ]]],\n\n\n        [[[0.3896 , 0.8496 ]],\n\n         [[0.07336, 0.7676 ]]],\n\n\n        [[[0.6177 , 0.7407 ]],\n\n         [[0.2017 , 0.785  ]]]]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 24
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (241 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (797 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (427 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (369 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (43 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (26 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (421 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (324 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (835 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (359 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (317 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (30 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (25 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1653 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.7104 , 0.52   ]],\n\n         [[0.4246 , 0.557  ]]],\n\n\n        [[[0.744  , 0.9097 ]],\n\n         [[0.3508 , 0.1466 ]]],\n\n\n        [[[0.8516 , 0.8647 ]],\n\n         [[0.6895 , 0.604  ]]]],\n\n\n\n       [[[[0.3484 , 0.3823 ]],\n\n         [[0.5474 , 0.1576 ]]],\n\n\n        [[[0.3896 , 0.8496 ]],\n\n         [[0.07336, 0.7676 ]]],\n\n\n        [[[0.6177 , 0.7407 ]],\n\n         [[0.2017 , 0.785  ]]]]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 25
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.7104 , 0.52   ]],\n\n         [[0.4246 , 0.557  ]]],\n\n\n        [[[0.744  , 0.9097 ]],\n\n         [[0.3508 , 0.1466 ]]],\n\n\n        [[[0.8516 , 0.8647 ]],\n\n         [[0.6895 , 0.604  ]]]],\n\n\n\n       [[[[0.3484 , 0.3823 ]],\n\n         [[0.5474 , 0.1576 ]]],\n\n\n        [[[0.3896 , 0.8496 ]],\n\n         [[0.07336, 0.7676 ]]],\n\n\n        [[[0.6177 , 0.7407 ]],\n\n         [[0.2017 , 0.785  ]]]]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 26
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.0393098 , 0.7948867 ]],\n\n         [[0.6216604 , 0.09813837]]],\n\n\n        [[[0.49734533, 0.9793009 ]],\n\n         [[0.69941443, 0.5874072 ]]],\n\n\n        [[[0.6417446 , 0.7435387 ]],\n\n         [[0.6843516 , 0.65657014]]]],\n\n\n\n       [[[[0.87456334, 0.78532034]],\n\n         [[0.21084462, 0.82523394]]],\n\n\n        [[[0.4677332 , 0.66544783]],\n\n         [[0.4691091 , 0.37875095]]],\n\n\n        [[[0.8668522 , 0.8903503 ]],\n\n         [[0.39932477, 0.3904355 ]]]]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 27
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (309 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1527 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (422 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (542 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (45 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (32 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=12288\nread_total_bytes=4096\n\nCompleted stage: Finalizing Graph Sequence (415 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (244 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1421 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (404 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (469 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (39 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (41 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=12288\nread_total_bytes=4096\n\nCompleted stage: Finalizing Graph Sequence (2002 us)\nStarting stage: Completion\nCompleted stage: Completion (11 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.03930664, 0.79492193]],\n\n         [[0.6215821 , 0.09814454]]],\n\n\n        [[[0.49731448, 0.97949225]],\n\n         [[0.6992188 , 0.5874024 ]]],\n\n\n        [[[0.6416016 , 0.7436524 ]],\n\n         [[0.6845704 , 0.65673834]]]],\n\n\n\n       [[[[0.8745118 , 0.7851563 ]],\n\n         [[0.21081544, 0.8251954 ]]],\n\n\n        [[[0.46777347, 0.6655274 ]],\n\n         [[0.46899417, 0.37866214]]],\n\n\n        [[[0.8666993 , 0.8901368 ]],\n\n         [[0.3994141 , 0.3903809 ]]]]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 28
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.0393098 , 0.7948867 ]],\n\n         [[0.6216604 , 0.09813837]]],\n\n\n        [[[0.49734533, 0.9793009 ]],\n\n         [[0.69941443, 0.5874072 ]]],\n\n\n        [[[0.6417446 , 0.7435387 ]],\n\n         [[0.6843516 , 0.65657014]]]],\n\n\n\n       [[[[0.87456334, 0.78532034]],\n\n         [[0.21084462, 0.82523394]]],\n\n\n        [[[0.4677332 , 0.66544783]],\n\n         [[0.4691091 , 0.37875095]]],\n\n\n        [[[0.8668522 , 0.8903503 ]],\n\n         [[0.39932477, 0.3904355 ]]]]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 29
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.94180622, 0.329852  ]],\n\n         [[0.54705018, 0.75983199]]],\n\n\n        [[[0.74851817, 0.12602419]],\n\n         [[0.09905128, 0.57549229]]],\n\n\n        [[[0.77266316, 0.26270579]],\n\n         [[0.57776472, 0.99170772]]]],\n\n\n\n       [[[[0.16442415, 0.39302765]],\n\n         [[0.85948295, 0.50372579]]],\n\n\n        [[[0.10710679, 0.60623453]],\n\n         [[0.28677609, 0.20565963]]],\n\n\n        [[[0.64008218, 0.52265755]],\n\n         [[0.47659023, 0.24567916]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 30
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.94180622, 0.329852  ]],\n\n         [[0.54705018, 0.75983199]]],\n\n\n        [[[0.74851817, 0.12602419]],\n\n         [[0.09905128, 0.57549229]]],\n\n\n        [[[0.77266316, 0.26270579]],\n\n         [[0.57776472, 0.99170772]]]],\n\n\n\n       [[[[0.16442415, 0.39302765]],\n\n         [[0.85948295, 0.50372579]]],\n\n\n        [[[0.10710679, 0.60623453]],\n\n         [[0.28677609, 0.20565963]]],\n\n\n        [[[0.64008218, 0.52265755]],\n\n         [[0.47659023, 0.24567916]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 31
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.94180622, 0.329852  ]],\n\n         [[0.54705018, 0.75983199]]],\n\n\n        [[[0.74851817, 0.12602419]],\n\n         [[0.09905128, 0.57549229]]],\n\n\n        [[[0.77266316, 0.26270579]],\n\n         [[0.57776472, 0.99170772]]]],\n\n\n\n       [[[[0.16442415, 0.39302765]],\n\n         [[0.85948295, 0.50372579]]],\n\n\n        [[[0.10710679, 0.60623453]],\n\n         [[0.28677609, 0.20565963]]],\n\n\n        [[[0.64008218, 0.52265755]],\n\n         [[0.47659023, 0.24567916]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 32
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[False, False]],\n\n         [[False,  True]]],\n\n\n        [[[ True, False]],\n\n         [[ True,  True]]],\n\n\n        [[[False,  True]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True, False]],\n\n         [[False, False]]],\n\n\n        [[[False,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 33
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (262 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (739 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (352 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (326 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (36 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (24 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1644 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (308 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (891 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (385 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (335 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (36 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (25 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (406 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[False, False]],\n\n         [[False,  True]]],\n\n\n        [[[ True, False]],\n\n         [[ True,  True]]],\n\n\n        [[[False,  True]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True, False]],\n\n         [[False, False]]],\n\n\n        [[[False,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 34
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[False, False]],\n\n         [[False,  True]]],\n\n\n        [[[ True, False]],\n\n         [[ True,  True]]],\n\n\n        [[[False,  True]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True, False]],\n\n         [[False, False]]],\n\n\n        [[[False,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 35
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 36
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 37
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 38
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 39
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 40
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 41
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]]]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 42
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]]]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 43
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]]]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 44
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 45
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 46
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 47
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]]]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 48
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]]]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 49
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]]]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 50
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 51
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 52
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 53
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 54
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 55
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 56
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 57
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 58
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 59
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.03058, 0.9526 ]],\n\n         [[0.7227 , 0.8516 ]]],\n\n\n        [[[0.4473 , 0.5376 ]],\n\n         [[0.2856 , 0.729  ]]],\n\n\n        [[[0.642  , 0.2861 ]],\n\n         [[0.2017 , 0.6636 ]]]],\n\n\n\n       [[[[0.2094 , 0.41   ]],\n\n         [[0.378  , 0.767  ]]],\n\n\n        [[[0.4746 , 0.803  ]],\n\n         [[0.7295 , 0.1387 ]]],\n\n\n        [[[0.05588, 0.505  ]],\n\n         [[0.1509 , 0.406  ]]]]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 60
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.03058, 0.9526 ]],\n\n         [[0.7227 , 0.8516 ]]],\n\n\n        [[[0.4473 , 0.5376 ]],\n\n         [[0.2856 , 0.729  ]]],\n\n\n        [[[0.642  , 0.2861 ]],\n\n         [[0.2017 , 0.6636 ]]]],\n\n\n\n       [[[[0.2094 , 0.41   ]],\n\n         [[0.378  , 0.767  ]]],\n\n\n        [[[0.4746 , 0.803  ]],\n\n         [[0.7295 , 0.1387 ]]],\n\n\n        [[[0.05588, 0.505  ]],\n\n         [[0.1509 , 0.406  ]]]]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 61
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.03058, 0.9526 ]],\n\n         [[0.7227 , 0.8516 ]]],\n\n\n        [[[0.4473 , 0.5376 ]],\n\n         [[0.2856 , 0.729  ]]],\n\n\n        [[[0.642  , 0.2861 ]],\n\n         [[0.2017 , 0.6636 ]]]],\n\n\n\n       [[[[0.2094 , 0.41   ]],\n\n         [[0.378  , 0.767  ]]],\n\n\n        [[[0.4746 , 0.803  ]],\n\n         [[0.7295 , 0.1387 ]]],\n\n\n        [[[0.05588, 0.505  ]],\n\n         [[0.1509 , 0.406  ]]]]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 62
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.63922226, 0.17894042]],\n\n         [[0.08741882, 0.65493006]]],\n\n\n        [[[0.9698461 , 0.8760738 ]],\n\n         [[0.41246584, 0.8007294 ]]],\n\n\n        [[[0.05713157, 0.79523   ]],\n\n         [[0.60008955, 0.5768426 ]]]],\n\n\n\n       [[[[0.62097967, 0.10685302]],\n\n         [[0.95064485, 0.01417092]]],\n\n\n        [[[0.45014143, 0.8750445 ]],\n\n         [[0.6771011 , 0.04420584]]],\n\n\n        [[[0.0046921 , 0.9034181 ]],\n\n         [[0.1150645 , 0.15820271]]]]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 63
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.63922226, 0.17894042]],\n\n         [[0.08741882, 0.65493006]]],\n\n\n        [[[0.9698461 , 0.8760738 ]],\n\n         [[0.41246584, 0.8007294 ]]],\n\n\n        [[[0.05713157, 0.79523   ]],\n\n         [[0.60008955, 0.5768426 ]]]],\n\n\n\n       [[[[0.62097967, 0.10685302]],\n\n         [[0.95064485, 0.01417092]]],\n\n\n        [[[0.45014143, 0.8750445 ]],\n\n         [[0.6771011 , 0.04420584]]],\n\n\n        [[[0.0046921 , 0.9034181 ]],\n\n         [[0.1150645 , 0.15820271]]]]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 64
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.63922226, 0.17894042]],\n\n         [[0.08741882, 0.65493006]]],\n\n\n        [[[0.9698461 , 0.8760738 ]],\n\n         [[0.41246584, 0.8007294 ]]],\n\n\n        [[[0.05713157, 0.79523   ]],\n\n         [[0.60008955, 0.5768426 ]]]],\n\n\n\n       [[[[0.62097967, 0.10685302]],\n\n         [[0.95064485, 0.01417092]]],\n\n\n        [[[0.45014143, 0.8750445 ]],\n\n         [[0.6771011 , 0.04420584]]],\n\n\n        [[[0.0046921 , 0.9034181 ]],\n\n         [[0.1150645 , 0.15820271]]]]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 65
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.9178509 , 0.40035466]],\n\n         [[0.88002886, 0.03290044]]],\n\n\n        [[[0.05224522, 0.62310831]],\n\n         [[0.90110776, 0.08791647]]],\n\n\n        [[[0.32965502, 0.70987051]],\n\n         [[0.64410169, 0.99882561]]]],\n\n\n\n       [[[[0.0578552 , 0.38522854]],\n\n         [[0.98587072, 0.11847896]]],\n\n\n        [[[0.60226632, 0.38976599]],\n\n         [[0.73962604, 0.79055144]]],\n\n\n        [[[0.1206523 , 0.88338741]],\n\n         [[0.56466072, 0.23340987]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 66
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.9178509 , 0.40035466]],\n\n         [[0.88002886, 0.03290044]]],\n\n\n        [[[0.05224522, 0.62310831]],\n\n         [[0.90110776, 0.08791647]]],\n\n\n        [[[0.32965502, 0.70987051]],\n\n         [[0.64410169, 0.99882561]]]],\n\n\n\n       [[[[0.0578552 , 0.38522854]],\n\n         [[0.98587072, 0.11847896]]],\n\n\n        [[[0.60226632, 0.38976599]],\n\n         [[0.73962604, 0.79055144]]],\n\n\n        [[[0.1206523 , 0.88338741]],\n\n         [[0.56466072, 0.23340987]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 67
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.9178509 , 0.40035466]],\n\n         [[0.88002886, 0.03290044]]],\n\n\n        [[[0.05224522, 0.62310831]],\n\n         [[0.90110776, 0.08791647]]],\n\n\n        [[[0.32965502, 0.70987051]],\n\n         [[0.64410169, 0.99882561]]]],\n\n\n\n       [[[[0.0578552 , 0.38522854]],\n\n         [[0.98587072, 0.11847896]]],\n\n\n        [[[0.60226632, 0.38976599]],\n\n         [[0.73962604, 0.79055144]]],\n\n\n        [[[0.1206523 , 0.88338741]],\n\n         [[0.56466072, 0.23340987]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 68
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[ True, False]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True, False]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True, False]]],\n\n\n        [[[False,  True]],\n\n         [[ True,  True]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 69
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[ True, False]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True, False]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True, False]]],\n\n\n        [[[False,  True]],\n\n         [[ True,  True]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 70
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            2,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            2,
-            3,
-            2,
-            1,
-            2
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[ True, False]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True, False]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True, False]]],\n\n\n        [[[False,  True]],\n\n         [[ True,  True]]]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 71
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 0],\n       [1, 1, 1, 0],\n       [1, 1, 0, 0],\n       [0, 0, 1, 0],\n       [0, 1, 1, 1],\n       [0, 1, 1, 0]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 72
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (247 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (644 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (359 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (358 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (33 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (23 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (428 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (253 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (791 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (417 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (398 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (41 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (25 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (415 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 0],\n       [1, 1, 1, 0],\n       [1, 1, 0, 0],\n       [0, 0, 1, 0],\n       [0, 1, 1, 1],\n       [0, 1, 1, 0]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 73
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 0],\n       [1, 1, 1, 0],\n       [1, 1, 0, 0],\n       [0, 0, 1, 0],\n       [0, 1, 1, 1],\n       [0, 1, 1, 0]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 74
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 0],\n       [1, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 0, 0],\n       [1, 1, 1, 0],\n       [0, 1, 1, 1]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 75
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 0],\n       [1, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 0, 0],\n       [1, 1, 1, 0],\n       [0, 1, 1, 1]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 76
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 0],\n       [1, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 0, 0],\n       [1, 1, 1, 0],\n       [0, 1, 1, 1]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 77
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 0, 0],\n       [1, 0, 1, 1],\n       [1, 1, 0, 0],\n       [1, 0, 0, 0],\n       [0, 0, 1, 0],\n       [0, 1, 0, 0]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 78
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (231 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (670 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (413 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (331 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (36 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (25 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2182 us)\nStarting stage: Completion\nCompleted stage: Completion (15 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (300 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (918 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (407 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (333 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (32 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (31 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1735 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 0, 0],\n       [1, 0, 1, 1],\n       [1, 1, 0, 0],\n       [1, 0, 0, 0],\n       [0, 0, 1, 0],\n       [0, 1, 0, 0]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 79
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 0, 0],\n       [1, 0, 1, 1],\n       [1, 1, 0, 0],\n       [1, 0, 0, 0],\n       [0, 0, 1, 0],\n       [0, 1, 0, 0]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 80
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 1, 0],\n       [1, 1, 0, 1],\n       [0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [1, 1, 0, 1],\n       [0, 0, 0, 0]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 81
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 1, 0],\n       [1, 1, 0, 1],\n       [0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [1, 1, 0, 1],\n       [0, 0, 0, 0]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 82
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 1, 0],\n       [1, 1, 0, 1],\n       [0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [1, 1, 0, 1],\n       [0, 0, 0, 0]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 83
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 1, 1],\n       [1, 0, 0, 0],\n       [1, 0, 1, 1],\n       [1, 0, 0, 1],\n       [1, 1, 1, 1],\n       [0, 1, 1, 1]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 84
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 1, 1],\n       [1, 0, 0, 0],\n       [1, 0, 1, 1],\n       [1, 0, 0, 1],\n       [1, 1, 1, 1],\n       [0, 1, 1, 1]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 85
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 1, 1],\n       [1, 0, 0, 0],\n       [1, 0, 1, 1],\n       [1, 0, 0, 1],\n       [1, 1, 1, 1],\n       [0, 1, 1, 1]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 86
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 1, 0],\n       [0, 1, 0, 0],\n       [1, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 1, 0, 1],\n       [1, 1, 0, 0]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 87
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 1, 0],\n       [0, 1, 0, 0],\n       [1, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 1, 0, 1],\n       [1, 1, 0, 0]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 88
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 1, 0],\n       [0, 1, 0, 0],\n       [1, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 1, 0, 1],\n       [1, 1, 0, 0]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 89
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 0, 1],\n       [0, 0, 0, 1],\n       [1, 0, 0, 1],\n       [0, 1, 1, 0],\n       [0, 1, 1, 0],\n       [0, 0, 0, 1]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 90
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (220 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (692 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (394 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (365 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (43 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (36 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (440 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (255 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (727 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (395 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (364 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (38 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (27 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1725 us)\nStarting stage: Completion\nCompleted stage: Completion (12 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 0, 1],\n       [0, 0, 0, 1],\n       [1, 0, 0, 1],\n       [0, 1, 1, 0],\n       [0, 1, 1, 0],\n       [0, 0, 0, 1]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 91
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 0, 1],\n       [0, 0, 0, 1],\n       [1, 0, 0, 1],\n       [0, 1, 1, 0],\n       [0, 1, 1, 0],\n       [0, 0, 0, 1]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 92
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 0, 0],\n       [1, 0, 0, 0],\n       [1, 1, 1, 1],\n       [0, 1, 1, 0],\n       [0, 0, 1, 1],\n       [0, 0, 0, 1]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 93
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (280 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (811 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (335 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (328 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (33 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (22 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (420 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (242 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1008 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (412 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (395 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (40 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (27 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (373 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 0, 0],\n       [1, 0, 0, 0],\n       [1, 1, 1, 1],\n       [0, 1, 1, 0],\n       [0, 0, 1, 1],\n       [0, 0, 0, 1]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 94
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 0, 0],\n       [1, 0, 0, 0],\n       [1, 1, 1, 1],\n       [0, 1, 1, 0],\n       [0, 0, 1, 1],\n       [0, 0, 0, 1]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 95
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.00456, 0.8516 , 0.0676 , 0.849  ],\n       [0.1794 , 0.69   , 0.2089 , 0.945  ],\n       [0.01103, 0.9155 , 0.2532 , 0.634  ],\n       [0.674  , 0.4946 , 0.856  , 0.03983],\n       [0.1182 , 0.1051 , 0.665  , 0.252  ],\n       [0.4841 , 0.6587 , 0.4473 , 0.4253 ]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 96
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (262 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (647 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (372 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (327 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (37 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (25 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1788 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (320 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (819 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (645 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (471 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (54 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (30 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (458 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.00456, 0.8516 , 0.0676 , 0.849  ],\n       [0.1794 , 0.69   , 0.2089 , 0.945  ],\n       [0.01103, 0.9155 , 0.2532 , 0.634  ],\n       [0.674  , 0.4946 , 0.856  , 0.03983],\n       [0.1182 , 0.1051 , 0.665  , 0.252  ],\n       [0.4841 , 0.6587 , 0.4473 , 0.4253 ]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 97
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.00456, 0.8516 , 0.0676 , 0.849  ],\n       [0.1794 , 0.69   , 0.2089 , 0.945  ],\n       [0.01103, 0.9155 , 0.2532 , 0.634  ],\n       [0.674  , 0.4946 , 0.856  , 0.03983],\n       [0.1182 , 0.1051 , 0.665  , 0.252  ],\n       [0.4841 , 0.6587 , 0.4473 , 0.4253 ]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 98
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.57721984, 0.6175653 , 0.33685857, 0.37910724],\n       [0.7542863 , 0.865875  , 0.9944359 , 0.13393559],\n       [0.4002901 , 0.04045516, 0.6937004 , 0.7391789 ],\n       [0.05844252, 0.76202255, 0.17620714, 0.04230648],\n       [0.17935614, 0.43033835, 0.7344435 , 0.44562268],\n       [0.5878331 , 0.8859885 , 0.870084  , 0.9414579 ]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 99
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (279 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1033 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (401 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (362 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (38 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (23 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1681 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (235 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (754 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (342 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (326 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (31 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (23 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1648 us)\nStarting stage: Completion\nCompleted stage: Completion (6 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.5771485 , 0.61767584, 0.3369141 , 0.37915042],\n       [0.7543946 , 0.8657227 , 0.99462897, 0.13391115],\n       [0.40039065, 0.04046631, 0.6938477 , 0.7392579 ],\n       [0.05844117, 0.7622071 , 0.17614748, 0.04229737],\n       [0.1793213 , 0.43041995, 0.73437506, 0.44555667],\n       [0.5878907 , 0.8862305 , 0.87011725, 0.9414063 ]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 100
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.57721984, 0.6175653 , 0.33685857, 0.37910724],\n       [0.7542863 , 0.865875  , 0.9944359 , 0.13393559],\n       [0.4002901 , 0.04045516, 0.6937004 , 0.7391789 ],\n       [0.05844252, 0.76202255, 0.17620714, 0.04230648],\n       [0.17935614, 0.43033835, 0.7344435 , 0.44562268],\n       [0.5878331 , 0.8859885 , 0.870084  , 0.9414579 ]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 101
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.60640412, 0.05799561, 0.17735087, 0.70129627],\n       [0.0457393 , 0.47337211, 0.48786106, 0.86883648],\n       [0.88236886, 0.3849495 , 0.2725411 , 0.70055755],\n       [0.83554663, 0.34790709, 0.22230966, 0.33308908],\n       [0.61698858, 0.47142033, 0.22995764, 0.61536449],\n       [0.60761402, 0.36613814, 0.7061954 , 0.02876156]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 102
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.60640412, 0.05799561, 0.17735087, 0.70129627],\n       [0.0457393 , 0.47337211, 0.48786106, 0.86883648],\n       [0.88236886, 0.3849495 , 0.2725411 , 0.70055755],\n       [0.83554663, 0.34790709, 0.22230966, 0.33308908],\n       [0.61698858, 0.47142033, 0.22995764, 0.61536449],\n       [0.60761402, 0.36613814, 0.7061954 , 0.02876156]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 103
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.60640412, 0.05799561, 0.17735087, 0.70129627],\n       [0.0457393 , 0.47337211, 0.48786106, 0.86883648],\n       [0.88236886, 0.3849495 , 0.2725411 , 0.70055755],\n       [0.83554663, 0.34790709, 0.22230966, 0.33308908],\n       [0.61698858, 0.47142033, 0.22995764, 0.61536449],\n       [0.60761402, 0.36613814, 0.7061954 , 0.02876156]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 104
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[False, False, False,  True],\n       [False, False, False, False],\n       [ True, False,  True,  True],\n       [False, False, False, False],\n       [ True,  True,  True,  True],\n       [ True,  True,  True,  True]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 105
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (247 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (559 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (335 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (309 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (38 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (27 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1738 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (255 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (709 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (438 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (393 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (38 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (33 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (395 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[False, False, False,  True],\n       [False, False, False, False],\n       [ True, False,  True,  True],\n       [False, False, False, False],\n       [ True,  True,  True,  True],\n       [ True,  True,  True,  True]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 106
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[False, False, False,  True],\n       [False, False, False, False],\n       [ True, False,  True,  True],\n       [False, False, False, False],\n       [ True,  True,  True,  True],\n       [ True,  True,  True,  True]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 107
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 1],\n       [1, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 1, 0, 0],\n       [0, 1, 1, 0],\n       [1, 0, 0, 1]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 108
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 1],\n       [1, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 1, 0, 0],\n       [0, 1, 1, 0],\n       [1, 0, 0, 1]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 109
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 1],\n       [1, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 1, 0, 0],\n       [0, 1, 1, 0],\n       [1, 0, 0, 1]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 110
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 1, 0],\n       [0, 0, 0, 0],\n       [1, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 1, 0, 1],\n       [1, 1, 0, 1]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 111
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 1, 0],\n       [0, 0, 0, 0],\n       [1, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 1, 0, 1],\n       [1, 1, 0, 1]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 112
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 1, 0],\n       [0, 0, 0, 0],\n       [1, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 1, 0, 1],\n       [1, 1, 0, 1]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 113
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 0, 1],\n       [1, 0, 1, 1],\n       [0, 0, 0, 1],\n       [0, 0, 1, 1],\n       [1, 0, 0, 1],\n       [1, 0, 0, 0]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 114
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 0, 1],\n       [1, 0, 1, 1],\n       [0, 0, 0, 1],\n       [0, 0, 1, 1],\n       [1, 0, 0, 1],\n       [1, 0, 0, 0]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 115
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 0, 1],\n       [1, 0, 1, 1],\n       [0, 0, 0, 1],\n       [0, 0, 1, 1],\n       [1, 0, 0, 1],\n       [1, 0, 0, 0]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 116
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 0],\n       [1, 0, 1, 1],\n       [1, 0, 1, 0],\n       [0, 0, 1, 1],\n       [0, 1, 1, 1],\n       [1, 1, 0, 0]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 117
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 0],\n       [1, 0, 1, 1],\n       [1, 0, 1, 0],\n       [0, 0, 1, 1],\n       [0, 1, 1, 1],\n       [1, 1, 0, 0]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 118
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 0],\n       [1, 0, 1, 1],\n       [1, 0, 1, 0],\n       [0, 0, 1, 1],\n       [0, 1, 1, 1],\n       [1, 1, 0, 0]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 119
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 1],\n       [0, 1, 1, 1],\n       [1, 0, 1, 0],\n       [0, 1, 0, 0]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 120
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 1],\n       [0, 1, 1, 1],\n       [1, 0, 1, 0],\n       [0, 1, 0, 0]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 121
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 1],\n       [0, 1, 1, 1],\n       [1, 0, 1, 0],\n       [0, 1, 0, 0]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 122
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 1, 1],\n       [0, 0, 0, 0],\n       [0, 0, 1, 1],\n       [1, 1, 1, 1],\n       [0, 0, 1, 1],\n       [0, 0, 1, 0]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 123
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 1, 1],\n       [0, 0, 0, 0],\n       [0, 0, 1, 1],\n       [1, 1, 1, 1],\n       [0, 0, 1, 1],\n       [0, 0, 1, 0]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 124
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 1, 1],\n       [0, 0, 0, 0],\n       [0, 0, 1, 1],\n       [1, 1, 1, 1],\n       [0, 0, 1, 1],\n       [0, 0, 1, 0]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 125
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 1],\n       [0, 0, 1, 0],\n       [1, 1, 0, 0],\n       [0, 0, 0, 0]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 126
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 1],\n       [0, 0, 1, 0],\n       [1, 1, 0, 0],\n       [0, 0, 0, 0]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 127
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 1],\n       [0, 0, 1, 0],\n       [1, 1, 0, 0],\n       [0, 0, 0, 0]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 128
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 1, 1],\n       [1, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 0, 0, 1],\n       [1, 1, 0, 0],\n       [1, 0, 0, 0]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 129
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 1, 1],\n       [1, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 0, 0, 1],\n       [1, 1, 0, 0],\n       [1, 0, 0, 0]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 130
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 1, 1],\n       [1, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 0, 0, 1],\n       [1, 1, 0, 0],\n       [1, 0, 0, 0]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 131
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.8096 , 0.716  , 0.3347 , 0.197  ],\n       [0.6357 , 0.7114 , 0.6396 , 0.8237 ],\n       [0.3135 , 0.5093 , 0.7793 , 0.684  ],\n       [0.605  , 0.0214 , 0.6367 , 0.8203 ],\n       [0.898  , 0.8525 , 0.8804 , 0.1649 ],\n       [0.5015 , 0.885  , 0.2358 , 0.01933]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 132
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.8096 , 0.716  , 0.3347 , 0.197  ],\n       [0.6357 , 0.7114 , 0.6396 , 0.8237 ],\n       [0.3135 , 0.5093 , 0.7793 , 0.684  ],\n       [0.605  , 0.0214 , 0.6367 , 0.8203 ],\n       [0.898  , 0.8525 , 0.8804 , 0.1649 ],\n       [0.5015 , 0.885  , 0.2358 , 0.01933]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 133
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.8096 , 0.716  , 0.3347 , 0.197  ],\n       [0.6357 , 0.7114 , 0.6396 , 0.8237 ],\n       [0.3135 , 0.5093 , 0.7793 , 0.684  ],\n       [0.605  , 0.0214 , 0.6367 , 0.8203 ],\n       [0.898  , 0.8525 , 0.8804 , 0.1649 ],\n       [0.5015 , 0.885  , 0.2358 , 0.01933]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 134
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.01977482, 0.06426259, 0.04470741, 0.83626044],\n       [0.7062268 , 0.6846669 , 0.26305023, 0.71305007],\n       [0.5540244 , 0.34656578, 0.39261878, 0.9526849 ],\n       [0.10212482, 0.14857323, 0.7955949 , 0.8064764 ],\n       [0.86753035, 0.3029296 , 0.74265486, 0.4485994 ],\n       [0.6753025 , 0.6427471 , 0.613621  , 0.7108333 ]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 135
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.01977482, 0.06426259, 0.04470741, 0.83626044],\n       [0.7062268 , 0.6846669 , 0.26305023, 0.71305007],\n       [0.5540244 , 0.34656578, 0.39261878, 0.9526849 ],\n       [0.10212482, 0.14857323, 0.7955949 , 0.8064764 ],\n       [0.86753035, 0.3029296 , 0.74265486, 0.4485994 ],\n       [0.6753025 , 0.6427471 , 0.613621  , 0.7108333 ]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 136
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.01977482, 0.06426259, 0.04470741, 0.83626044],\n       [0.7062268 , 0.6846669 , 0.26305023, 0.71305007],\n       [0.5540244 , 0.34656578, 0.39261878, 0.9526849 ],\n       [0.10212482, 0.14857323, 0.7955949 , 0.8064764 ],\n       [0.86753035, 0.3029296 , 0.74265486, 0.4485994 ],\n       [0.6753025 , 0.6427471 , 0.613621  , 0.7108333 ]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 137
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.18301385, 0.19550544, 0.10409003, 0.08363854],\n       [0.87981794, 0.33812733, 0.87205346, 0.62819755],\n       [0.09455447, 0.08901047, 0.56723161, 0.05968211],\n       [0.96150731, 0.1945697 , 0.15973391, 0.96513497],\n       [0.16760298, 0.70401933, 0.98163165, 0.31306567],\n       [0.33033851, 0.9911523 , 0.24430964, 0.87116251]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 138
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.18301385, 0.19550544, 0.10409003, 0.08363854],\n       [0.87981794, 0.33812733, 0.87205346, 0.62819755],\n       [0.09455447, 0.08901047, 0.56723161, 0.05968211],\n       [0.96150731, 0.1945697 , 0.15973391, 0.96513497],\n       [0.16760298, 0.70401933, 0.98163165, 0.31306567],\n       [0.33033851, 0.9911523 , 0.24430964, 0.87116251]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 139
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.18301385, 0.19550544, 0.10409003, 0.08363854],\n       [0.87981794, 0.33812733, 0.87205346, 0.62819755],\n       [0.09455447, 0.08901047, 0.56723161, 0.05968211],\n       [0.96150731, 0.1945697 , 0.15973391, 0.96513497],\n       [0.16760298, 0.70401933, 0.98163165, 0.31306567],\n       [0.33033851, 0.9911523 , 0.24430964, 0.87116251]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 140
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[ True,  True,  True,  True],\n       [False, False, False, False],\n       [ True, False, False, False],\n       [ True, False,  True,  True],\n       [ True, False,  True, False],\n       [ True, False,  True, False]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 141
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[ True,  True,  True,  True],\n       [False, False, False, False],\n       [ True, False, False, False],\n       [ True, False,  True,  True],\n       [ True, False,  True, False],\n       [ True, False,  True, False]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 142
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            2,
-            3,
-            4
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            6,
-            4
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[ True,  True,  True,  True],\n       [False, False, False, False],\n       [ True, False, False, False],\n       [ True, False,  True,  True],\n       [ True, False,  True, False],\n       [ True, False,  True, False]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 143
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 144
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (277 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (688 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (370 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (366 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (43 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (25 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (941 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (316 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (783 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (372 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (355 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (57 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (29 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (424 us)\nStarting stage: Completion\nCompleted stage: Completion (10 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 145
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 146
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 147
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 148
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 149
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 150
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (313 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (575 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (336 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (309 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (35 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (29 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (373 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (280 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (735 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (386 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (368 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (42 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (28 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (606 us)\nStarting stage: Completion\nCompleted stage: Completion (10 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 151
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 152
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 153
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 154
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 155
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 156
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 157
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 158
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 159
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 160
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 161
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 162
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (293 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (627 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (350 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (336 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (39 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (26 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (362 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (250 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (785 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (447 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (429 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (55 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (28 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (435 us)\nStarting stage: Completion\nCompleted stage: Completion (12 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 163
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 164
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 165
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (300 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (969 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (417 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (408 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (36 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (26 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1810 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (247 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (811 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (338 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (380 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (44 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (23 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (386 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 166
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 167
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.2825 ]],\n\n       [[0.08044]],\n\n       [[0.349  ]],\n\n       [[0.833  ]],\n\n       [[0.4724 ]],\n\n       [[0.1008 ]],\n\n       [[0.534  ]],\n\n       [[0.576  ]],\n\n       [[0.982  ]],\n\n       [[0.1023 ]]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 168
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (274 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (908 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (536 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (383 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (48 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (29 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (543 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (363 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (754 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (396 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (358 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (38 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (26 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (957 us)\nStarting stage: Completion\nCompleted stage: Completion (10 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.2825 ]],\n\n       [[0.08044]],\n\n       [[0.349  ]],\n\n       [[0.833  ]],\n\n       [[0.4724 ]],\n\n       [[0.1008 ]],\n\n       [[0.534  ]],\n\n       [[0.576  ]],\n\n       [[0.982  ]],\n\n       [[0.1023 ]]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 169
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.2825 ]],\n\n       [[0.08044]],\n\n       [[0.349  ]],\n\n       [[0.833  ]],\n\n       [[0.4724 ]],\n\n       [[0.1008 ]],\n\n       [[0.534  ]],\n\n       [[0.576  ]],\n\n       [[0.982  ]],\n\n       [[0.1023 ]]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 170
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.11045645]],\n\n       [[0.20617837]],\n\n       [[0.20179154]],\n\n       [[0.58244705]],\n\n       [[0.9916574 ]],\n\n       [[0.5824676 ]],\n\n       [[0.9515329 ]],\n\n       [[0.65452904]],\n\n       [[0.95641637]],\n\n       [[0.3729659 ]]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 171
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (243 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (984 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (397 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (409 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (42 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (28 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (700 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (259 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (830 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (354 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (381 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (37 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (23 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (469 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.11047364]],\n\n       [[0.20617677]],\n\n       [[0.20178224]],\n\n       [[0.5825196 ]],\n\n       [[0.9916993 ]],\n\n       [[0.5825196 ]],\n\n       [[0.9516602 ]],\n\n       [[0.65429693]],\n\n       [[0.956543  ]],\n\n       [[0.3730469 ]]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 172
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.11045645]],\n\n       [[0.20617837]],\n\n       [[0.20179154]],\n\n       [[0.58244705]],\n\n       [[0.9916574 ]],\n\n       [[0.5824676 ]],\n\n       [[0.9515329 ]],\n\n       [[0.65452904]],\n\n       [[0.95641637]],\n\n       [[0.3729659 ]]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 173
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.14319767]],\n\n       [[0.70493698]],\n\n       [[0.74885664]],\n\n       [[0.78289561]],\n\n       [[0.33843087]],\n\n       [[0.21122853]],\n\n       [[0.93091182]],\n\n       [[0.6251956 ]],\n\n       [[0.10556374]],\n\n       [[0.35250938]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 174
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.14319767]],\n\n       [[0.70493698]],\n\n       [[0.74885664]],\n\n       [[0.78289561]],\n\n       [[0.33843087]],\n\n       [[0.21122853]],\n\n       [[0.93091182]],\n\n       [[0.6251956 ]],\n\n       [[0.10556374]],\n\n       [[0.35250938]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 175
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.14319767]],\n\n       [[0.70493698]],\n\n       [[0.74885664]],\n\n       [[0.78289561]],\n\n       [[0.33843087]],\n\n       [[0.21122853]],\n\n       [[0.93091182]],\n\n       [[0.6251956 ]],\n\n       [[0.10556374]],\n\n       [[0.35250938]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 176
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 177
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (273 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (682 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (393 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (353 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (34 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (29 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1729 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (240 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (552 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (319 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (307 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (37 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (25 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1714 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 178
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 0
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 179
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 180
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 181
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]]], dtype=uint8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 182
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 183
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 184
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]]], dtype=uint16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 185
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 186
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 187
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]]], dtype=uint32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 188
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 189
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 190
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "UINT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]]], dtype=uint64)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 191
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 192
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 193
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT8"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]]], dtype=int8)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 194
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 195
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 196
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 197
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 198
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 199
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT32"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]]], dtype=int32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 200
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 201
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 202
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "INT64"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 203
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.0869 ]],\n\n       [[0.214  ]],\n\n       [[0.584  ]],\n\n       [[0.517  ]],\n\n       [[0.6816 ]],\n\n       [[0.05176]],\n\n       [[0.6357 ]],\n\n       [[0.555  ]],\n\n       [[0.3464 ]],\n\n       [[0.1284 ]]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 204
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.0869 ]],\n\n       [[0.214  ]],\n\n       [[0.584  ]],\n\n       [[0.517  ]],\n\n       [[0.6816 ]],\n\n       [[0.05176]],\n\n       [[0.6357 ]],\n\n       [[0.555  ]],\n\n       [[0.3464 ]],\n\n       [[0.1284 ]]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 205
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT16"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.0869 ]],\n\n       [[0.214  ]],\n\n       [[0.584  ]],\n\n       [[0.517  ]],\n\n       [[0.6816 ]],\n\n       [[0.05176]],\n\n       [[0.6357 ]],\n\n       [[0.555  ]],\n\n       [[0.3464 ]],\n\n       [[0.1284 ]]], dtype=float16)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 206
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.60952276]],\n\n       [[0.79065   ]],\n\n       [[0.85073066]],\n\n       [[0.29458553]],\n\n       [[0.7404776 ]],\n\n       [[0.4325886 ]],\n\n       [[0.69595295]],\n\n       [[0.7883838 ]],\n\n       [[0.8335169 ]],\n\n       [[0.998671  ]]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 207
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.60952276]],\n\n       [[0.79065   ]],\n\n       [[0.85073066]],\n\n       [[0.29458553]],\n\n       [[0.7404776 ]],\n\n       [[0.4325886 ]],\n\n       [[0.69595295]],\n\n       [[0.7883838 ]],\n\n       [[0.8335169 ]],\n\n       [[0.998671  ]]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 208
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "FLOAT"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.60952276]],\n\n       [[0.79065   ]],\n\n       [[0.85073066]],\n\n       [[0.29458553]],\n\n       [[0.7404776 ]],\n\n       [[0.4325886 ]],\n\n       [[0.69595295]],\n\n       [[0.7883838 ]],\n\n       [[0.8335169 ]],\n\n       [[0.998671  ]]], dtype=float32)]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 209
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.86444616]],\n\n       [[0.14365307]],\n\n       [[0.77758983]],\n\n       [[0.27630252]],\n\n       [[0.62122999]],\n\n       [[0.4240276 ]],\n\n       [[0.856709  ]],\n\n       [[0.59285635]],\n\n       [[0.93268937]],\n\n       [[0.81407405]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 210
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.86444616]],\n\n       [[0.14365307]],\n\n       [[0.77758983]],\n\n       [[0.27630252]],\n\n       [[0.62122999]],\n\n       [[0.4240276 ]],\n\n       [[0.856709  ]],\n\n       [[0.59285635]],\n\n       [[0.93268937]],\n\n       [[0.81407405]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 211
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "DOUBLE"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.86444616]],\n\n       [[0.14365307]],\n\n       [[0.77758983]],\n\n       [[0.27630252]],\n\n       [[0.62122999]],\n\n       [[0.4240276 ]],\n\n       [[0.856709  ]],\n\n       [[0.59285635]],\n\n       [[0.93268937]],\n\n       [[0.81407405]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 212
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": true,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 213
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": true
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 214
-    },
-    {
-      "type_vars": {
-        "T_Reshape": "BOOL"
-      },
-      "input_constraints": {
-        "data": {
-          "type": "shape",
-          "shape": [
-            5,
-            1,
-            2
-          ],
-          "min_max": null
-        },
-        "shape": {
-          "type": "value",
-          "value": [
-            10,
-            1,
-            1
-          ],
-          "dtype": "int64"
-        }
-      },
-      "attrs": {
-        "allowzero": 1
-      },
-      "dynamic_axes": {},
-      "input_is_constant": {
-        "data": false,
-        "shape": false
-      },
-      "check_result": {
-        "compile": {
-          "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
-        },
-        "run": {
-          "result": {
-            "success": true,
-            "reason": null
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]]])]\n",
-          "stderr": ""
-        }
-      },
-      "case_index": 215
-    }
-  ],
-  "sys_info": {
-    "cpuList": [
-      {
-        "name": "Snapdragon(R) X 12-core X1E80100 @ 3.40 GHz",
-        "manufacturer": "Qualcomm Technologies Inc",
-        "coreCount": 12,
-        "threadCount": 12,
-        "architecture": "ARM64"
-      }
-    ],
-    "gpuList": [
-      {
-        "name": "Qualcomm(R) Adreno(TM) X1-85 GPU",
-        "manufacturer": "Qualcomm Incorporated",
-        "driverVersion": "31.0.112.0",
-        "vramMib": 0,
-        "vendorId": 1297040209,
-        "deviceId": 909329200
-      }
-    ],
-    "npuList": [
-      {
-        "name": "Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Hexagon(TM) NPU",
-        "manufacturer": "Qualcomm Technologies, Inc.",
-        "driverVersion": "30.0.219.1000",
-        "vendorId": 1297040209,
-        "deviceId": 1093682224
-      }
-    ],
-    "ramList": [
-      {
-        "capacityMib": 32768,
-        "speedMt": 8448,
-        "manufacturer": "Hynix"
-      }
-    ],
-    "os": {
-      "caption": "Microsoft Windows 11 Enterprise",
-      "version": "10.0.26200",
-      "architecture": "ARM 64-bit Processor",
-      "sku": 4,
-      "buildNumber": "26200",
-      "isWindows11": true
-    },
-    "pythonRuntime": {
-      "version": "3.10.17",
-      "implementation": "CPython",
-      "architecture": "AMD64",
-      "compiler": "MSC v.1943 64 bit (AMD64)",
-      "buildNumber": "May 30 2025 05:32:15"
-    },
-    "pipPackages": [
-      {
-        "name": "winml-modelkit",
-        "version": "0.0.1.dev0"
-      },
-      {
-        "name": "aiohappyeyeballs",
-        "version": "2.6.1"
-      },
-      {
-        "name": "aiohttp",
-        "version": "3.13.3"
-      },
-      {
-        "name": "aiosignal",
-        "version": "1.4.0"
-      },
-      {
-        "name": "annotated-types",
-        "version": "0.7.0"
-      },
-      {
-        "name": "anyio",
-        "version": "4.12.1"
-      },
-      {
-        "name": "argon2-cffi",
-        "version": "25.1.0"
-      },
-      {
-        "name": "argon2-cffi-bindings",
-        "version": "25.1.0"
-      },
-      {
-        "name": "arrow",
-        "version": "1.4.0"
-      },
-      {
-        "name": "asttokens",
-        "version": "3.0.1"
-      },
-      {
-        "name": "async-lru",
-        "version": "2.2.0"
-      },
-      {
-        "name": "async-timeout",
-        "version": "5.0.1"
-      },
-      {
-        "name": "attrs",
-        "version": "25.4.0"
-      },
-      {
-        "name": "babel",
-        "version": "2.18.0"
-      },
-      {
-        "name": "beautifulsoup4",
-        "version": "4.14.3"
-      },
-      {
-        "name": "bleach",
-        "version": "6.3.0"
-      },
-      {
-        "name": "certifi",
-        "version": "2026.1.4"
-      },
-      {
-        "name": "cffi",
-        "version": "2.0.0"
-      },
-      {
-        "name": "charset-normalizer",
-        "version": "3.4.4"
-      },
-      {
-        "name": "click",
-        "version": "8.3.1"
-      },
-      {
-        "name": "colorama",
-        "version": "0.4.6"
-      },
-      {
-        "name": "coloredlogs",
-        "version": "15.0.1"
-      },
-      {
-        "name": "comm",
-        "version": "0.2.3"
-      },
-      {
-        "name": "contourpy",
-        "version": "1.3.2"
-      },
-      {
-        "name": "coverage",
-        "version": "7.13.4"
-      },
-      {
-        "name": "cycler",
-        "version": "0.12.1"
-      },
-      {
-        "name": "datasets",
-        "version": "4.5.0"
-      },
-      {
-        "name": "debugpy",
-        "version": "1.8.20"
-      },
-      {
-        "name": "decorator",
-        "version": "5.2.1"
-      },
-      {
-        "name": "defusedxml",
-        "version": "0.7.1"
-      },
-      {
-        "name": "diffusers",
-        "version": "0.36.0"
-      },
-      {
-        "name": "dill",
-        "version": "0.4.0"
-      },
-      {
-        "name": "evaluate",
-        "version": "0.4.6"
-      },
-      {
-        "name": "exceptiongroup",
-        "version": "1.3.1"
-      },
-      {
-        "name": "execnet",
-        "version": "2.1.2"
-      },
-      {
-        "name": "executing",
-        "version": "2.2.1"
-      },
-      {
-        "name": "fastjsonschema",
-        "version": "2.21.2"
-      },
-      {
-        "name": "filelock",
-        "version": "3.24.3"
-      },
-      {
-        "name": "flatbuffers",
-        "version": "25.12.19"
-      },
-      {
-        "name": "fonttools",
-        "version": "4.61.1"
-      },
-      {
-        "name": "fqdn",
-        "version": "1.5.1"
-      },
-      {
-        "name": "frozenlist",
-        "version": "1.8.0"
-      },
-      {
-        "name": "fsspec",
-        "version": "2025.10.0"
-      },
-      {
-        "name": "h11",
-        "version": "0.16.0"
-      },
-      {
-        "name": "httpcore",
-        "version": "1.0.9"
-      },
-      {
-        "name": "httpx",
-        "version": "0.28.1"
-      },
-      {
-        "name": "huggingface_hub",
-        "version": "0.36.2"
-      },
-      {
-        "name": "humanfriendly",
-        "version": "10.0"
-      },
-      {
-        "name": "idna",
-        "version": "3.11"
-      },
-      {
-        "name": "importlib_metadata",
-        "version": "8.7.1"
-      },
-      {
-        "name": "iniconfig",
-        "version": "2.3.0"
-      },
-      {
-        "name": "ipykernel",
-        "version": "7.2.0"
-      },
-      {
-        "name": "ipython",
-        "version": "8.38.0"
-      },
-      {
-        "name": "ipywidgets",
-        "version": "8.1.8"
-      },
-      {
-        "name": "isoduration",
-        "version": "20.11.0"
-      },
-      {
-        "name": "jedi",
-        "version": "0.19.2"
-      },
-      {
-        "name": "Jinja2",
-        "version": "3.1.6"
-      },
-      {
-        "name": "joblib",
-        "version": "1.5.3"
-      },
-      {
-        "name": "json5",
-        "version": "0.13.0"
-      },
-      {
-        "name": "jsonpointer",
-        "version": "3.0.0"
-      },
-      {
-        "name": "jsonschema",
-        "version": "4.26.0"
-      },
-      {
-        "name": "jsonschema-specifications",
-        "version": "2025.9.1"
-      },
-      {
-        "name": "jupyter",
-        "version": "1.1.1"
-      },
-      {
-        "name": "jupyterlab",
-        "version": "4.5.4"
-      },
-      {
-        "name": "jupyterlab_pygments",
-        "version": "0.3.0"
-      },
-      {
-        "name": "jupyterlab_server",
-        "version": "2.28.0"
-      },
-      {
-        "name": "jupyterlab_widgets",
-        "version": "3.0.16"
-      },
-      {
-        "name": "jupyter_client",
-        "version": "8.8.0"
-      },
-      {
-        "name": "jupyter-console",
-        "version": "6.6.3"
-      },
-      {
-        "name": "jupyter_core",
-        "version": "5.9.1"
-      },
-      {
-        "name": "jupyter-events",
-        "version": "0.12.0"
-      },
-      {
-        "name": "jupyter-lsp",
-        "version": "2.3.0"
-      },
-      {
-        "name": "jupyter_server",
-        "version": "2.17.0"
-      },
-      {
-        "name": "jupyter_server_terminals",
-        "version": "0.5.4"
-      },
-      {
-        "name": "kiwisolver",
-        "version": "1.4.9"
-      },
-      {
-        "name": "lark",
-        "version": "1.3.1"
-      },
-      {
-        "name": "librt",
-        "version": "0.8.1"
-      },
-      {
-        "name": "markdown-it-py",
-        "version": "4.0.0"
-      },
-      {
-        "name": "MarkupSafe",
-        "version": "3.0.3"
-      },
-      {
-        "name": "matplotlib",
-        "version": "3.10.8"
-      },
-      {
-        "name": "matplotlib-inline",
-        "version": "0.2.1"
-      },
-      {
-        "name": "mdurl",
-        "version": "0.1.2"
-      },
-      {
-        "name": "mistune",
-        "version": "3.2.0"
-      },
-      {
-        "name": "ml_dtypes",
-        "version": "0.5.4"
-      },
-      {
-        "name": "mpmath",
-        "version": "1.3.0"
-      },
-      {
-        "name": "multidict",
-        "version": "6.7.1"
-      },
-      {
-        "name": "multiprocess",
-        "version": "0.70.18"
-      },
-      {
-        "name": "mypy",
-        "version": "1.19.1"
-      },
-      {
-        "name": "mypy_extensions",
-        "version": "1.1.0"
-      },
-      {
-        "name": "nbclient",
-        "version": "0.10.4"
-      },
-      {
-        "name": "nbconvert",
-        "version": "7.17.0"
-      },
-      {
-        "name": "nbformat",
-        "version": "5.10.4"
-      },
-      {
-        "name": "nest-asyncio",
-        "version": "1.6.0"
-      },
-      {
-        "name": "networkx",
-        "version": "3.4.2"
-      },
-      {
-        "name": "notebook",
-        "version": "7.5.3"
-      },
-      {
-        "name": "notebook_shim",
-        "version": "0.2.4"
-      },
-      {
-        "name": "numpy",
-        "version": "2.2.6"
-      },
-      {
-        "name": "onnx",
-        "version": "1.18.0"
-      },
-      {
-        "name": "onnxruntime-windowsml",
-        "version": "1.23.3.202601221717"
-      },
-      {
-        "name": "onnxscript",
-        "version": "0.6.2"
-      },
-      {
-        "name": "onnx-ir",
-        "version": "0.1.16"
-      },
-      {
-        "name": "openvino",
-        "version": "2025.4.1"
-      },
-      {
-        "name": "openvino-telemetry",
-        "version": "2025.2.0"
-      },
-      {
-        "name": "optimum",
-        "version": "2.1.0"
-      },
-      {
-        "name": "optimum-onnx",
-        "version": "0.1.0"
-      },
-      {
-        "name": "overrides",
-        "version": "7.7.0"
-      },
-      {
-        "name": "packaging",
-        "version": "26.0"
-      },
-      {
-        "name": "pandas",
-        "version": "2.3.3"
-      },
-      {
-        "name": "pandocfilters",
-        "version": "1.5.1"
-      },
-      {
-        "name": "parso",
-        "version": "0.8.6"
-      },
-      {
-        "name": "pathspec",
-        "version": "1.0.4"
-      },
-      {
-        "name": "pillow",
-        "version": "12.1.1"
-      },
-      {
-        "name": "platformdirs",
-        "version": "4.9.2"
-      },
-      {
-        "name": "plotext",
-        "version": "5.3.2"
-      },
-      {
-        "name": "pluggy",
-        "version": "1.6.0"
-      },
-      {
-        "name": "prometheus_client",
-        "version": "0.24.1"
-      },
-      {
-        "name": "prompt_toolkit",
-        "version": "3.0.52"
-      },
-      {
-        "name": "propcache",
-        "version": "0.4.1"
-      },
-      {
-        "name": "protobuf",
-        "version": "6.33.5"
-      },
-      {
-        "name": "psutil",
-        "version": "7.2.2"
-      },
-      {
-        "name": "pure_eval",
-        "version": "0.2.3"
-      },
-      {
-        "name": "pyarrow",
-        "version": "23.0.1"
-      },
-      {
-        "name": "pycparser",
-        "version": "3.0"
-      },
-      {
-        "name": "pydantic",
-        "version": "2.12.5"
-      },
-      {
-        "name": "pydantic_core",
-        "version": "2.41.5"
-      },
-      {
-        "name": "Pygments",
-        "version": "2.19.2"
-      },
-      {
-        "name": "pyparsing",
-        "version": "3.3.2"
-      },
-      {
-        "name": "pyreadline3",
-        "version": "3.5.4"
-      },
-      {
-        "name": "pytest",
-        "version": "9.0.2"
-      },
-      {
-        "name": "pytest-cov",
-        "version": "7.0.0"
-      },
-      {
-        "name": "pytest-timeout",
-        "version": "2.4.0"
-      },
-      {
-        "name": "pytest-xdist",
-        "version": "3.8.0"
-      },
-      {
-        "name": "python-dateutil",
-        "version": "2.9.0.post0"
-      },
-      {
-        "name": "python-json-logger",
-        "version": "4.0.0"
-      },
-      {
-        "name": "pytz",
-        "version": "2025.2"
-      },
-      {
-        "name": "pywinpty",
-        "version": "3.0.3"
-      },
-      {
-        "name": "PyYAML",
-        "version": "6.0.3"
-      },
-      {
-        "name": "pyzmq",
-        "version": "27.1.0"
-      },
-      {
-        "name": "referencing",
-        "version": "0.37.0"
-      },
-      {
-        "name": "regex",
-        "version": "2026.2.19"
-      },
-      {
-        "name": "requests",
-        "version": "2.32.5"
-      },
-      {
-        "name": "rfc3339-validator",
-        "version": "0.1.4"
-      },
-      {
-        "name": "rfc3986-validator",
-        "version": "0.1.1"
-      },
-      {
-        "name": "rfc3987-syntax",
-        "version": "1.1.0"
-      },
-      {
-        "name": "rich",
-        "version": "14.3.3"
-      },
-      {
-        "name": "rpds-py",
-        "version": "0.30.0"
-      },
-      {
-        "name": "ruff",
-        "version": "0.15.2"
-      },
-      {
-        "name": "safetensors",
-        "version": "0.7.0"
-      },
-      {
-        "name": "scikit-learn",
-        "version": "1.7.2"
-      },
-      {
-        "name": "scipy",
-        "version": "1.15.3"
-      },
-      {
-        "name": "seaborn",
-        "version": "0.13.2"
-      },
-      {
-        "name": "Send2Trash",
-        "version": "2.1.0"
-      },
-      {
-        "name": "setuptools",
-        "version": "82.0.0"
-      },
-      {
-        "name": "six",
-        "version": "1.17.0"
-      },
-      {
-        "name": "SnakeMD",
-        "version": "2.4.0"
-      },
-      {
-        "name": "soupsieve",
-        "version": "2.8.3"
-      },
-      {
-        "name": "stack-data",
-        "version": "0.6.3"
-      },
-      {
-        "name": "sympy",
-        "version": "1.14.0"
-      },
-      {
-        "name": "terminado",
-        "version": "0.18.1"
-      },
-      {
-        "name": "threadpoolctl",
-        "version": "3.6.0"
-      },
-      {
-        "name": "timm",
-        "version": "1.0.24"
-      },
-      {
-        "name": "tinycss2",
-        "version": "1.4.0"
-      },
-      {
-        "name": "tokenizers",
-        "version": "0.22.2"
-      },
-      {
-        "name": "tomli",
-        "version": "2.4.0"
-      },
-      {
-        "name": "torch",
-        "version": "2.10.0"
-      },
-      {
-        "name": "torchinfo",
-        "version": "1.8.0"
-      },
-      {
-        "name": "torchvision",
-        "version": "0.25.0"
-      },
-      {
-        "name": "tornado",
-        "version": "6.5.4"
-      },
-      {
-        "name": "tqdm",
-        "version": "4.67.3"
-      },
-      {
-        "name": "traitlets",
-        "version": "5.14.3"
-      },
-      {
-        "name": "transformers",
-        "version": "4.57.6"
-      },
-      {
-        "name": "types-colorama",
-        "version": "0.4.15.20250801"
-      },
-      {
-        "name": "typing_extensions",
-        "version": "4.15.0"
-      },
-      {
-        "name": "typing-inspection",
-        "version": "0.4.2"
-      },
-      {
-        "name": "tzdata",
-        "version": "2025.3"
-      },
-      {
-        "name": "uri-template",
-        "version": "1.3.0"
-      },
-      {
-        "name": "urllib3",
-        "version": "2.6.3"
-      },
-      {
-        "name": "wasdk-Microsoft.Windows.AI.MachineLearning",
-        "version": "1.8.260209005"
-      },
-      {
-        "name": "wasdk-Microsoft.Windows.ApplicationModel.DynamicDependency.Bootstrap",
-        "version": "1.8.260209005"
-      },
-      {
-        "name": "wcwidth",
-        "version": "0.6.0"
-      },
-      {
-        "name": "webcolors",
-        "version": "25.10.0"
-      },
-      {
-        "name": "webencodings",
-        "version": "0.5.1"
-      },
-      {
-        "name": "websocket-client",
-        "version": "1.9.0"
-      },
-      {
-        "name": "widgetsnbextension",
-        "version": "4.0.15"
-      },
-      {
-        "name": "winml-modelkit",
-        "version": "0.0.1.dev0"
-      },
-      {
-        "name": "winrt-runtime",
-        "version": "3.2.1"
-      },
-      {
-        "name": "winrt-Windows.Foundation",
-        "version": "3.2.1"
-      },
-      {
-        "name": "winrt-Windows.Foundation.Collections",
-        "version": "3.2.1"
-      },
-      {
-        "name": "xxhash",
-        "version": "3.6.0"
-      },
-      {
-        "name": "yarl",
-        "version": "1.22.0"
-      },
-      {
-        "name": "zipp",
-        "version": "3.23.0"
-      }
-    ],
-    "epPackages": [
-      {
-        "name": "MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.1.8_1.8.30.0_arm64__8wekyb3d8bbwe",
-        "version": "1.8.30.0",
-        "publisher": "CN=Microsoft Corporation, O=Microsoft Corporation, L=Redmond, S=Washington, C=US",
-        "architecture": 12,
-        "signatureKind": "Developer",
-        "installLocation": "C:\\Program Files\\WindowsApps\\MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.1.8_1.8.30.0_arm64__8wekyb3d8bbwe",
-        "epHash": "0b4dd71044175fb927d3b44a50b7dee4b003a3dfe86a9b09c3ca83f11150979215c256b0301bced2c7e684f84e42ec964532215c147b8b770399d6b9441afc1a",
-        "status": 0
-      }
-    ],
-    "windowsAppRuntimeVersion": "1.8.260209005"
-  }
-}
\ No newline at end of file
+{
+  "check_results": [
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (708 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1393 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (943 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (550 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (222 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (261 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (811 us)\nStarting stage: Completion\nCompleted stage: Completion (77 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (609 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1240 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (622 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (814 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (240 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (289 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2747 us)\nStarting stage: Completion\nCompleted stage: Completion (70 us)\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (674 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1345 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (576 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (649 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (275 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (284 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1356 us)\nStarting stage: Completion\nCompleted stage: Completion (77 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (484 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1083 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (561 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (551 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (218 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (257 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (956 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (481 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1135 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (540 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (557 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (204 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (257 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (883 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (669 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1354 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (596 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (624 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (216 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (265 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (900 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: A process in the process pool was terminated abruptly while the future was running or pending."
+          },
+          "stdout": null,
+          "stderr": null
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: A process in the process pool was terminated abruptly while the future was running or pending."
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.288  , 0.524  ]],\n\n         [[0.3127 , 0.4429 ]]],\n\n\n        [[[0.4634 , 0.2025 ]],\n\n         [[0.01224, 0.4246 ]]],\n\n\n        [[[0.366  , 0.82   ]],\n\n         [[0.268  , 0.8643 ]]]],\n\n\n\n       [[[[0.538  , 0.1625 ]],\n\n         [[0.4614 , 0.9854 ]]],\n\n\n        [[[0.7344 , 0.4775 ]],\n\n         [[0.1675 , 0.1559 ]]],\n\n\n        [[[0.342  , 0.704  ]],\n\n         [[0.756  , 0.408  ]]]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (492 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1203 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (578 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (545 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (254 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (738 us)\nStarting stage: Completion\nCompleted stage: Completion (72 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (532 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1063 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (563 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (728 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (207 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (261 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (669 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[[[[0.288  , 0.524  ]],\n\n         [[0.3127 , 0.4429 ]]],\n\n\n        [[[0.4634 , 0.2025 ]],\n\n         [[0.01224, 0.4246 ]]],\n\n\n        [[[0.366  , 0.82   ]],\n\n         [[0.268  , 0.8643 ]]]],\n\n\n\n       [[[[0.538  , 0.1625 ]],\n\n         [[0.4614 , 0.9854 ]]],\n\n\n        [[[0.7344 , 0.4775 ]],\n\n         [[0.1675 , 0.1559 ]]],\n\n\n        [[[0.342  , 0.704  ]],\n\n         [[0.756  , 0.408  ]]]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.288  , 0.524  ]],\n\n         [[0.3127 , 0.4429 ]]],\n\n\n        [[[0.4634 , 0.2025 ]],\n\n         [[0.01224, 0.4246 ]]],\n\n\n        [[[0.366  , 0.82   ]],\n\n         [[0.268  , 0.8643 ]]]],\n\n\n\n       [[[[0.538  , 0.1625 ]],\n\n         [[0.4614 , 0.9854 ]]],\n\n\n        [[[0.7344 , 0.4775 ]],\n\n         [[0.1675 , 0.1559 ]]],\n\n\n        [[[0.342  , 0.704  ]],\n\n         [[0.756  , 0.408  ]]]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.95028025, 0.2468104 ]],\n\n         [[0.20439683, 0.37763873]]],\n\n\n        [[[0.09010915, 0.31433827]],\n\n         [[0.36242837, 0.24815027]]],\n\n\n        [[[0.03979172, 0.2304278 ]],\n\n         [[0.19243203, 0.81435317]]]],\n\n\n\n       [[[[0.4089026 , 0.6417816 ]],\n\n         [[0.95892185, 0.38288617]]],\n\n\n        [[[0.7642732 , 0.245576  ]],\n\n         [[0.34932667, 0.8457854 ]]],\n\n\n        [[[0.02115926, 0.43220004]],\n\n         [[0.7304893 , 0.7867989 ]]]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (538 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1921 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (635 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (782 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (217 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (277 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=12288\nread_total_bytes=4096\n\nCompleted stage: Finalizing Graph Sequence (946 us)\nStarting stage: Completion\nCompleted stage: Completion (68 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (610 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1978 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (674 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (1102 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (273 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (295 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=12288\nread_total_bytes=4096\n\nCompleted stage: Finalizing Graph Sequence (2949 us)\nStarting stage: Completion\nCompleted stage: Completion (70 us)\nRun outputs: [array([[[[[0.9501954 , 0.24682619]],\n\n         [[0.20434572, 0.37768558]]],\n\n\n        [[[0.0900879 , 0.31445315]],\n\n         [[0.36254886, 0.24816896]]],\n\n\n        [[[0.03979493, 0.23046876]],\n\n         [[0.19238283, 0.8144532 ]]]],\n\n\n\n       [[[[0.40893558, 0.6416016 ]],\n\n         [[0.95898443, 0.38281253]]],\n\n\n        [[[0.7641602 , 0.24560548]],\n\n         [[0.34936526, 0.8457032 ]]],\n\n\n        [[[0.02116394, 0.43212894]],\n\n         [[0.7304688 , 0.78662115]]]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.95028025, 0.2468104 ]],\n\n         [[0.20439683, 0.37763873]]],\n\n\n        [[[0.09010915, 0.31433827]],\n\n         [[0.36242837, 0.24815027]]],\n\n\n        [[[0.03979172, 0.2304278 ]],\n\n         [[0.19243203, 0.81435317]]]],\n\n\n\n       [[[[0.4089026 , 0.6417816 ]],\n\n         [[0.95892185, 0.38288617]]],\n\n\n        [[[0.7642732 , 0.245576  ]],\n\n         [[0.34932667, 0.8457854 ]]],\n\n\n        [[[0.02115926, 0.43220004]],\n\n         [[0.7304893 , 0.7867989 ]]]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.79210038, 0.60177336]],\n\n         [[0.4632819 , 0.65442976]]],\n\n\n        [[[0.96849369, 0.07982261]],\n\n         [[0.39645548, 0.23665723]]],\n\n\n        [[[0.74176789, 0.57894562]],\n\n         [[0.12453678, 0.69615266]]]],\n\n\n\n       [[[[0.61764472, 0.75840641]],\n\n         [[0.96470109, 0.91475654]]],\n\n\n        [[[0.76691218, 0.46454851]],\n\n         [[0.7617497 , 0.94924577]]],\n\n\n        [[[0.19961647, 0.09542246]],\n\n         [[0.57588561, 0.85517519]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.79210038, 0.60177336]],\n\n         [[0.4632819 , 0.65442976]]],\n\n\n        [[[0.96849369, 0.07982261]],\n\n         [[0.39645548, 0.23665723]]],\n\n\n        [[[0.74176789, 0.57894562]],\n\n         [[0.12453678, 0.69615266]]]],\n\n\n\n       [[[[0.61764472, 0.75840641]],\n\n         [[0.96470109, 0.91475654]]],\n\n\n        [[[0.76691218, 0.46454851]],\n\n         [[0.7617497 , 0.94924577]]],\n\n\n        [[[0.19961647, 0.09542246]],\n\n         [[0.57588561, 0.85517519]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.79210038, 0.60177336]],\n\n         [[0.4632819 , 0.65442976]]],\n\n\n        [[[0.96849369, 0.07982261]],\n\n         [[0.39645548, 0.23665723]]],\n\n\n        [[[0.74176789, 0.57894562]],\n\n         [[0.12453678, 0.69615266]]]],\n\n\n\n       [[[[0.61764472, 0.75840641]],\n\n         [[0.96470109, 0.91475654]]],\n\n\n        [[[0.76691218, 0.46454851]],\n\n         [[0.7617497 , 0.94924577]]],\n\n\n        [[[0.19961647, 0.09542246]],\n\n         [[0.57588561, 0.85517519]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True, False]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (473 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1071 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (639 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (539 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (207 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (280 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (794 us)\nStarting stage: Completion\nCompleted stage: Completion (73 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (642 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1136 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (543 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (530 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (211 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (292 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (900 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[[[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True, False]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True, False]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (500 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1254 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (547 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (651 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (218 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (294 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1099 us)\nStarting stage: Completion\nCompleted stage: Completion (76 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (552 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1149 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (608 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (642 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (212 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (284 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (992 us)\nStarting stage: Completion\nCompleted stage: Completion (72 us)\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (588 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1250 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (563 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (587 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (223 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (325 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2756 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (632 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1343 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (707 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (695 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (357 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (340 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (627 us)\nStarting stage: Completion\nCompleted stage: Completion (80 us)\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (483 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1054 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (541 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (572 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (936 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (483 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1024 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (679 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (563 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (218 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (261 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (872 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: A process in the process pool was terminated abruptly while the future was running or pending."
+          },
+          "stdout": null,
+          "stderr": null
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: A process in the process pool was terminated abruptly while the future was running or pending."
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.288  , 0.524  ]],\n\n         [[0.3127 , 0.4429 ]]],\n\n\n        [[[0.4634 , 0.2025 ]],\n\n         [[0.01224, 0.4246 ]]],\n\n\n        [[[0.366  , 0.82   ]],\n\n         [[0.268  , 0.8643 ]]]],\n\n\n\n       [[[[0.538  , 0.1625 ]],\n\n         [[0.4614 , 0.9854 ]]],\n\n\n        [[[0.7344 , 0.4775 ]],\n\n         [[0.1675 , 0.1559 ]]],\n\n\n        [[[0.342  , 0.704  ]],\n\n         [[0.756  , 0.408  ]]]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (482 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1007 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (553 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (536 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (208 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (355 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (880 us)\nStarting stage: Completion\nCompleted stage: Completion (63 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (639 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1067 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (560 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (530 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2792 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[[[[0.288  , 0.524  ]],\n\n         [[0.3127 , 0.4429 ]]],\n\n\n        [[[0.4634 , 0.2025 ]],\n\n         [[0.01224, 0.4246 ]]],\n\n\n        [[[0.366  , 0.82   ]],\n\n         [[0.268  , 0.8643 ]]]],\n\n\n\n       [[[[0.538  , 0.1625 ]],\n\n         [[0.4614 , 0.9854 ]]],\n\n\n        [[[0.7344 , 0.4775 ]],\n\n         [[0.1675 , 0.1559 ]]],\n\n\n        [[[0.342  , 0.704  ]],\n\n         [[0.756  , 0.408  ]]]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.288  , 0.524  ]],\n\n         [[0.3127 , 0.4429 ]]],\n\n\n        [[[0.4634 , 0.2025 ]],\n\n         [[0.01224, 0.4246 ]]],\n\n\n        [[[0.366  , 0.82   ]],\n\n         [[0.268  , 0.8643 ]]]],\n\n\n\n       [[[[0.538  , 0.1625 ]],\n\n         [[0.4614 , 0.9854 ]]],\n\n\n        [[[0.7344 , 0.4775 ]],\n\n         [[0.1675 , 0.1559 ]]],\n\n\n        [[[0.342  , 0.704  ]],\n\n         [[0.756  , 0.408  ]]]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.95028025, 0.2468104 ]],\n\n         [[0.20439683, 0.37763873]]],\n\n\n        [[[0.09010915, 0.31433827]],\n\n         [[0.36242837, 0.24815027]]],\n\n\n        [[[0.03979172, 0.2304278 ]],\n\n         [[0.19243203, 0.81435317]]]],\n\n\n\n       [[[[0.4089026 , 0.6417816 ]],\n\n         [[0.95892185, 0.38288617]]],\n\n\n        [[[0.7642732 , 0.245576  ]],\n\n         [[0.34932667, 0.8457854 ]]],\n\n\n        [[[0.02115926, 0.43220004]],\n\n         [[0.7304893 , 0.7867989 ]]]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (513 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (2112 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (592 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (762 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (216 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (263 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=12288\nread_total_bytes=4096\n\nCompleted stage: Finalizing Graph Sequence (2909 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (665 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (2059 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (676 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (986 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (256 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (271 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=12288\nread_total_bytes=4096\n\nCompleted stage: Finalizing Graph Sequence (2754 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\nRun outputs: [array([[[[[0.9501954 , 0.24682619]],\n\n         [[0.20434572, 0.37768558]]],\n\n\n        [[[0.0900879 , 0.31445315]],\n\n         [[0.36254886, 0.24816896]]],\n\n\n        [[[0.03979493, 0.23046876]],\n\n         [[0.19238283, 0.8144532 ]]]],\n\n\n\n       [[[[0.40893558, 0.6416016 ]],\n\n         [[0.95898443, 0.38281253]]],\n\n\n        [[[0.7641602 , 0.24560548]],\n\n         [[0.34936526, 0.8457032 ]]],\n\n\n        [[[0.02116394, 0.43212894]],\n\n         [[0.7304688 , 0.78662115]]]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.95028025, 0.2468104 ]],\n\n         [[0.20439683, 0.37763873]]],\n\n\n        [[[0.09010915, 0.31433827]],\n\n         [[0.36242837, 0.24815027]]],\n\n\n        [[[0.03979172, 0.2304278 ]],\n\n         [[0.19243203, 0.81435317]]]],\n\n\n\n       [[[[0.4089026 , 0.6417816 ]],\n\n         [[0.95892185, 0.38288617]]],\n\n\n        [[[0.7642732 , 0.245576  ]],\n\n         [[0.34932667, 0.8457854 ]]],\n\n\n        [[[0.02115926, 0.43220004]],\n\n         [[0.7304893 , 0.7867989 ]]]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.79210038, 0.60177336]],\n\n         [[0.4632819 , 0.65442976]]],\n\n\n        [[[0.96849369, 0.07982261]],\n\n         [[0.39645548, 0.23665723]]],\n\n\n        [[[0.74176789, 0.57894562]],\n\n         [[0.12453678, 0.69615266]]]],\n\n\n\n       [[[[0.61764472, 0.75840641]],\n\n         [[0.96470109, 0.91475654]]],\n\n\n        [[[0.76691218, 0.46454851]],\n\n         [[0.7617497 , 0.94924577]]],\n\n\n        [[[0.19961647, 0.09542246]],\n\n         [[0.57588561, 0.85517519]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.79210038, 0.60177336]],\n\n         [[0.4632819 , 0.65442976]]],\n\n\n        [[[0.96849369, 0.07982261]],\n\n         [[0.39645548, 0.23665723]]],\n\n\n        [[[0.74176789, 0.57894562]],\n\n         [[0.12453678, 0.69615266]]]],\n\n\n\n       [[[[0.61764472, 0.75840641]],\n\n         [[0.96470109, 0.91475654]]],\n\n\n        [[[0.76691218, 0.46454851]],\n\n         [[0.7617497 , 0.94924577]]],\n\n\n        [[[0.19961647, 0.09542246]],\n\n         [[0.57588561, 0.85517519]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.79210038, 0.60177336]],\n\n         [[0.4632819 , 0.65442976]]],\n\n\n        [[[0.96849369, 0.07982261]],\n\n         [[0.39645548, 0.23665723]]],\n\n\n        [[[0.74176789, 0.57894562]],\n\n         [[0.12453678, 0.69615266]]]],\n\n\n\n       [[[[0.61764472, 0.75840641]],\n\n         [[0.96470109, 0.91475654]]],\n\n\n        [[[0.76691218, 0.46454851]],\n\n         [[0.7617497 , 0.94924577]]],\n\n\n        [[[0.19961647, 0.09542246]],\n\n         [[0.57588561, 0.85517519]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True, False]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (497 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1108 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (562 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (558 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (235 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (328 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (790 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (499 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1135 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (615 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (593 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (322 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (272 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1046 us)\nStarting stage: Completion\nCompleted stage: Completion (68 us)\nRun outputs: [array([[[[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True, False]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True, False]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0, 1, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 1, 0, 0],\n       [0, 1, 1, 1],\n       [1, 0, 0, 1]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (493 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (828 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (533 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (579 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (206 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (281 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2722 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (504 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (898 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (546 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (565 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (202 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (952 us)\nStarting stage: Completion\nCompleted stage: Completion (63 us)\nRun outputs: [array([[0, 1, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 1, 0, 0],\n       [0, 1, 1, 1],\n       [1, 0, 0, 1]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0, 1, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 1, 0, 0],\n       [0, 1, 1, 1],\n       [1, 0, 0, 1]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [0, 0, 0, 0],\n       [1, 1, 0, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 1],\n       [1, 1, 0, 1]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [0, 0, 0, 0],\n       [1, 1, 0, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 1],\n       [1, 1, 0, 1]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [0, 0, 0, 0],\n       [1, 1, 0, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 1],\n       [1, 1, 0, 1]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 1, 1],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 1, 1],\n       [1, 1, 1, 1],\n       [1, 1, 0, 1]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (469 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (874 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (541 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (536 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (259 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (867 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (471 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (892 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (539 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (617 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (204 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (257 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2685 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[1, 1, 1, 1],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 1, 1],\n       [1, 1, 1, 1],\n       [1, 1, 0, 1]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 1, 1],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 1, 1],\n       [1, 1, 1, 1],\n       [1, 1, 0, 1]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 0, 1, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [1, 1, 1, 0]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 0, 1, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [1, 1, 1, 0]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 0, 1, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [1, 1, 1, 0]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 1],\n       [0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 0, 0],\n       [0, 0, 1, 0]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 1],\n       [0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 0, 0],\n       [0, 0, 1, 0]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 1],\n       [0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 0, 0],\n       [0, 0, 1, 0]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [1, 0, 0, 1],\n       [0, 0, 1, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 0],\n       [1, 1, 0, 1]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [1, 0, 0, 1],\n       [0, 0, 1, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 0],\n       [1, 1, 0, 1]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [1, 0, 0, 1],\n       [0, 0, 1, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 0],\n       [1, 1, 0, 1]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 0, 0],\n       [0, 0, 1, 0],\n       [0, 0, 1, 1]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (528 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (868 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (584 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (639 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (222 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (257 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2702 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (478 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1210 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (846 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (616 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (262 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2688 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[0, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 0, 0],\n       [0, 0, 1, 0],\n       [0, 0, 1, 1]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 0, 0],\n       [0, 0, 1, 0],\n       [0, 0, 1, 1]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0, 0, 0, 0],\n       [0, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 1, 1],\n       [1, 0, 0, 1]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (785 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1466 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (630 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (885 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (327 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (271 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1678 us)\nStarting stage: Completion\nCompleted stage: Completion (69 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (508 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1200 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (553 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (588 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (206 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (259 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1016 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\nRun outputs: [array([[0, 0, 0, 0],\n       [0, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 1, 1],\n       [1, 0, 0, 1]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0, 0, 0, 0],\n       [0, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 1, 1],\n       [1, 0, 0, 1]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0.56   , 0.6523 , 0.4167 , 0.1021 ],\n       [0.89   , 0.206  , 0.421  , 0.3638 ],\n       [0.51   , 0.12274, 0.1451 , 0.571  ],\n       [0.3245 , 0.496  , 0.1783 , 0.7173 ],\n       [0.4492 , 0.705  , 0.2454 , 0.03049],\n       [0.2345 , 0.891  , 0.1499 , 0.957  ]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (622 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (914 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (784 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (680 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (216 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (347 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (932 us)\nStarting stage: Completion\nCompleted stage: Completion (68 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (531 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1094 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (640 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (579 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (294 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (331 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2768 us)\nStarting stage: Completion\nCompleted stage: Completion (76 us)\nRun outputs: [array([[0.56   , 0.6523 , 0.4167 , 0.1021 ],\n       [0.89   , 0.206  , 0.421  , 0.3638 ],\n       [0.51   , 0.12274, 0.1451 , 0.571  ],\n       [0.3245 , 0.496  , 0.1783 , 0.7173 ],\n       [0.4492 , 0.705  , 0.2454 , 0.03049],\n       [0.2345 , 0.891  , 0.1499 , 0.957  ]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0.56   , 0.6523 , 0.4167 , 0.1021 ],\n       [0.89   , 0.206  , 0.421  , 0.3638 ],\n       [0.51   , 0.12274, 0.1451 , 0.571  ],\n       [0.3245 , 0.496  , 0.1783 , 0.7173 ],\n       [0.4492 , 0.705  , 0.2454 , 0.03049],\n       [0.2345 , 0.891  , 0.1499 , 0.957  ]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0.15106224, 0.27046126, 0.08752598, 0.3377456 ],\n       [0.91206604, 0.07197218, 0.8500704 , 0.06078569],\n       [0.48790687, 0.9228181 , 0.03722728, 0.76907235],\n       [0.62741214, 0.9071317 , 0.67140186, 0.4399309 ],\n       [0.18454204, 0.27770287, 0.04102697, 0.30583474],\n       [0.35007593, 0.6697418 , 0.94376886, 0.46025437]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (492 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1181 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (773 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (881 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (282 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (303 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1321 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (573 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1134 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (550 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (576 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (204 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (261 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (876 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[0.15112306, 0.27050784, 0.08752442, 0.3376465 ],\n       [0.91210943, 0.07196046, 0.8500977 , 0.06079102],\n       [0.487793  , 0.9228516 , 0.03723145, 0.769043  ],\n       [0.62744147, 0.9072266 , 0.6713868 , 0.43994144],\n       [0.18457033, 0.27758792, 0.04101563, 0.30590823],\n       [0.3500977 , 0.66992193, 0.9438477 , 0.4602051 ]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0.15106224, 0.27046126, 0.08752598, 0.3377456 ],\n       [0.91206604, 0.07197218, 0.8500704 , 0.06078569],\n       [0.48790687, 0.9228181 , 0.03722728, 0.76907235],\n       [0.62741214, 0.9071317 , 0.67140186, 0.4399309 ],\n       [0.18454204, 0.27770287, 0.04102697, 0.30583474],\n       [0.35007593, 0.6697418 , 0.94376886, 0.46025437]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0.71700709, 0.19760002, 0.61780984, 0.31983466],\n       [0.59794199, 0.57715688, 0.6881818 , 0.67773427],\n       [0.50443168, 0.76637021, 0.07076356, 0.60439345],\n       [0.86926494, 0.9636245 , 0.58854585, 0.61047817],\n       [0.65700502, 0.34409379, 0.49143779, 0.56194767],\n       [0.405834  , 0.9617059 , 0.88996155, 0.06803201]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0.71700709, 0.19760002, 0.61780984, 0.31983466],\n       [0.59794199, 0.57715688, 0.6881818 , 0.67773427],\n       [0.50443168, 0.76637021, 0.07076356, 0.60439345],\n       [0.86926494, 0.9636245 , 0.58854585, 0.61047817],\n       [0.65700502, 0.34409379, 0.49143779, 0.56194767],\n       [0.405834  , 0.9617059 , 0.88996155, 0.06803201]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0.71700709, 0.19760002, 0.61780984, 0.31983466],\n       [0.59794199, 0.57715688, 0.6881818 , 0.67773427],\n       [0.50443168, 0.76637021, 0.07076356, 0.60439345],\n       [0.86926494, 0.9636245 , 0.58854585, 0.61047817],\n       [0.65700502, 0.34409379, 0.49143779, 0.56194767],\n       [0.405834  , 0.9617059 , 0.88996155, 0.06803201]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[ True,  True, False, False],\n       [ True,  True,  True,  True],\n       [False,  True,  True, False],\n       [False, False,  True,  True],\n       [ True,  True, False,  True],\n       [ True,  True, False,  True]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (636 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (917 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (562 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (585 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (200 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2834 us)\nStarting stage: Completion\nCompleted stage: Completion (68 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (504 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (857 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (536 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (530 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (200 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (4273 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[ True,  True, False, False],\n       [ True,  True,  True,  True],\n       [False,  True,  True, False],\n       [False, False,  True,  True],\n       [ True,  True, False,  True],\n       [ True,  True, False,  True]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[ True,  True, False, False],\n       [ True,  True,  True,  True],\n       [False,  True,  True, False],\n       [False, False,  True,  True],\n       [ True,  True, False,  True],\n       [ True,  True, False,  True]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0, 1, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 1, 0, 0],\n       [0, 1, 1, 1],\n       [1, 0, 0, 1]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (619 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1095 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (637 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (707 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (244 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (303 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (707 us)\nStarting stage: Completion\nCompleted stage: Completion (70 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (615 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1154 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (623 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (537 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (210 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (270 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2732 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[0, 1, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 1, 0, 0],\n       [0, 1, 1, 1],\n       [1, 0, 0, 1]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0, 1, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 1, 0, 0],\n       [0, 1, 1, 1],\n       [1, 0, 0, 1]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [0, 0, 0, 0],\n       [1, 1, 0, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 1],\n       [1, 1, 0, 1]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [0, 0, 0, 0],\n       [1, 1, 0, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 1],\n       [1, 1, 0, 1]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [0, 0, 0, 0],\n       [1, 1, 0, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 1],\n       [1, 1, 0, 1]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 1, 1],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 1, 1],\n       [1, 1, 1, 1],\n       [1, 1, 0, 1]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (481 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1067 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (639 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (560 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (207 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (923 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (495 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (907 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (557 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (689 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (235 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (257 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2843 us)\nStarting stage: Completion\nCompleted stage: Completion (73 us)\nRun outputs: [array([[1, 1, 1, 1],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 1, 1],\n       [1, 1, 1, 1],\n       [1, 1, 0, 1]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 1, 1],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 1, 1],\n       [1, 1, 1, 1],\n       [1, 1, 0, 1]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 0, 1, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [1, 1, 1, 0]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 0, 1, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [1, 1, 1, 0]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 0, 1, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [1, 1, 1, 0]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 1],\n       [0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 0, 0],\n       [0, 0, 1, 0]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 1],\n       [0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 0, 0],\n       [0, 0, 1, 0]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 1],\n       [0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 0, 0],\n       [0, 0, 1, 0]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [1, 0, 0, 1],\n       [0, 0, 1, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 0],\n       [1, 1, 0, 1]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [1, 0, 0, 1],\n       [0, 0, 1, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 0],\n       [1, 1, 0, 1]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [1, 0, 0, 1],\n       [0, 0, 1, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 0],\n       [1, 1, 0, 1]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 0, 0],\n       [0, 0, 1, 0],\n       [0, 0, 1, 1]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (693 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (864 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (678 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (609 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (240 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (264 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (959 us)\nStarting stage: Completion\nCompleted stage: Completion (78 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (593 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (913 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (571 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (539 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (206 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (255 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (852 us)\nStarting stage: Completion\nCompleted stage: Completion (63 us)\nRun outputs: [array([[0, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 0, 0],\n       [0, 0, 1, 0],\n       [0, 0, 1, 1]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 0, 0],\n       [0, 0, 1, 0],\n       [0, 0, 1, 1]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0, 0, 0, 0],\n       [0, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 1, 1],\n       [1, 0, 0, 1]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (622 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1140 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (571 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (668 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (241 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (269 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2772 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (520 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1186 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (599 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (674 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (218 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (277 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2741 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\nRun outputs: [array([[0, 0, 0, 0],\n       [0, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 1, 1],\n       [1, 0, 0, 1]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0, 0, 0, 0],\n       [0, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 1, 1],\n       [1, 0, 0, 1]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0.56   , 0.6523 , 0.4167 , 0.1021 ],\n       [0.89   , 0.206  , 0.421  , 0.3638 ],\n       [0.51   , 0.12274, 0.1451 , 0.571  ],\n       [0.3245 , 0.496  , 0.1783 , 0.7173 ],\n       [0.4492 , 0.705  , 0.2454 , 0.03049],\n       [0.2345 , 0.891  , 0.1499 , 0.957  ]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (598 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1163 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (623 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (599 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (220 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (286 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (3053 us)\nStarting stage: Completion\nCompleted stage: Completion (74 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (667 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1148 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (573 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (543 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (256 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (271 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2570 us)\nStarting stage: Completion\nCompleted stage: Completion (78 us)\nRun outputs: [array([[0.56   , 0.6523 , 0.4167 , 0.1021 ],\n       [0.89   , 0.206  , 0.421  , 0.3638 ],\n       [0.51   , 0.12274, 0.1451 , 0.571  ],\n       [0.3245 , 0.496  , 0.1783 , 0.7173 ],\n       [0.4492 , 0.705  , 0.2454 , 0.03049],\n       [0.2345 , 0.891  , 0.1499 , 0.957  ]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0.56   , 0.6523 , 0.4167 , 0.1021 ],\n       [0.89   , 0.206  , 0.421  , 0.3638 ],\n       [0.51   , 0.12274, 0.1451 , 0.571  ],\n       [0.3245 , 0.496  , 0.1783 , 0.7173 ],\n       [0.4492 , 0.705  , 0.2454 , 0.03049],\n       [0.2345 , 0.891  , 0.1499 , 0.957  ]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0.15106224, 0.27046126, 0.08752598, 0.3377456 ],\n       [0.91206604, 0.07197218, 0.8500704 , 0.06078569],\n       [0.48790687, 0.9228181 , 0.03722728, 0.76907235],\n       [0.62741214, 0.9071317 , 0.67140186, 0.4399309 ],\n       [0.18454204, 0.27770287, 0.04102697, 0.30583474],\n       [0.35007593, 0.6697418 , 0.94376886, 0.46025437]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (585 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1878 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (678 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (632 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (246 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (273 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1098 us)\nStarting stage: Completion\nCompleted stage: Completion (75 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (552 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1166 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (611 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (679 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (249 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (296 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2869 us)\nStarting stage: Completion\nCompleted stage: Completion (72 us)\nRun outputs: [array([[0.15112306, 0.27050784, 0.08752442, 0.3376465 ],\n       [0.91210943, 0.07196046, 0.8500977 , 0.06079102],\n       [0.487793  , 0.9228516 , 0.03723145, 0.769043  ],\n       [0.62744147, 0.9072266 , 0.6713868 , 0.43994144],\n       [0.18457033, 0.27758792, 0.04101563, 0.30590823],\n       [0.3500977 , 0.66992193, 0.9438477 , 0.4602051 ]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0.15106224, 0.27046126, 0.08752598, 0.3377456 ],\n       [0.91206604, 0.07197218, 0.8500704 , 0.06078569],\n       [0.48790687, 0.9228181 , 0.03722728, 0.76907235],\n       [0.62741214, 0.9071317 , 0.67140186, 0.4399309 ],\n       [0.18454204, 0.27770287, 0.04102697, 0.30583474],\n       [0.35007593, 0.6697418 , 0.94376886, 0.46025437]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0.71700709, 0.19760002, 0.61780984, 0.31983466],\n       [0.59794199, 0.57715688, 0.6881818 , 0.67773427],\n       [0.50443168, 0.76637021, 0.07076356, 0.60439345],\n       [0.86926494, 0.9636245 , 0.58854585, 0.61047817],\n       [0.65700502, 0.34409379, 0.49143779, 0.56194767],\n       [0.405834  , 0.9617059 , 0.88996155, 0.06803201]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0.71700709, 0.19760002, 0.61780984, 0.31983466],\n       [0.59794199, 0.57715688, 0.6881818 , 0.67773427],\n       [0.50443168, 0.76637021, 0.07076356, 0.60439345],\n       [0.86926494, 0.9636245 , 0.58854585, 0.61047817],\n       [0.65700502, 0.34409379, 0.49143779, 0.56194767],\n       [0.405834  , 0.9617059 , 0.88996155, 0.06803201]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0.71700709, 0.19760002, 0.61780984, 0.31983466],\n       [0.59794199, 0.57715688, 0.6881818 , 0.67773427],\n       [0.50443168, 0.76637021, 0.07076356, 0.60439345],\n       [0.86926494, 0.9636245 , 0.58854585, 0.61047817],\n       [0.65700502, 0.34409379, 0.49143779, 0.56194767],\n       [0.405834  , 0.9617059 , 0.88996155, 0.06803201]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[ True,  True, False, False],\n       [ True,  True,  True,  True],\n       [False,  True,  True, False],\n       [False, False,  True,  True],\n       [ True,  True, False,  True],\n       [ True,  True, False,  True]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (505 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (848 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (536 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (528 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (199 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (310 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (871 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (552 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (917 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (550 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (534 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (220 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (285 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (716 us)\nStarting stage: Completion\nCompleted stage: Completion (72 us)\nRun outputs: [array([[ True,  True, False, False],\n       [ True,  True,  True,  True],\n       [False,  True,  True, False],\n       [False, False,  True,  True],\n       [ True,  True, False,  True],\n       [ True,  True, False,  True]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[ True,  True, False, False],\n       [ True,  True,  True,  True],\n       [False,  True,  True, False],\n       [False, False,  True,  True],\n       [ True,  True, False,  True],\n       [ True,  True, False,  True]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (517 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (925 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (563 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (534 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (269 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (962 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (823 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1066 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (585 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (619 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (256 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (522 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (765 us)\nStarting stage: Completion\nCompleted stage: Completion (87 us)\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (620 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1486 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (583 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (643 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (253 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (269 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1208 us)\nStarting stage: Completion\nCompleted stage: Completion (68 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (539 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (911 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (535 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (533 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (275 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (868 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (584 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1086 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (605 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (717 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (246 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (272 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1031 us)\nStarting stage: Completion\nCompleted stage: Completion (72 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (637 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1179 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (839 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (649 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (303 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (353 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (839 us)\nStarting stage: Completion\nCompleted stage: Completion (71 us)\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (485 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1288 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (713 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (716 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (209 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (276 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (871 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (488 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1114 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (541 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (585 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (267 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2683 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0.6235 ]],\n\n       [[0.4824 ]],\n\n       [[0.2795 ]],\n\n       [[0.2053 ]],\n\n       [[0.4746 ]],\n\n       [[0.6553 ]],\n\n       [[0.728  ]],\n\n       [[0.01749]],\n\n       [[0.2054 ]],\n\n       [[0.5923 ]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (543 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1136 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (614 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (584 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (201 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (273 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1216 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (482 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (970 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (580 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (537 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (210 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (943 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[[0.6235 ]],\n\n       [[0.4824 ]],\n\n       [[0.2795 ]],\n\n       [[0.2053 ]],\n\n       [[0.4746 ]],\n\n       [[0.6553 ]],\n\n       [[0.728  ]],\n\n       [[0.01749]],\n\n       [[0.2054 ]],\n\n       [[0.5923 ]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0.6235 ]],\n\n       [[0.4824 ]],\n\n       [[0.2795 ]],\n\n       [[0.2053 ]],\n\n       [[0.4746 ]],\n\n       [[0.6553 ]],\n\n       [[0.728  ]],\n\n       [[0.01749]],\n\n       [[0.2054 ]],\n\n       [[0.5923 ]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0.21764947]],\n\n       [[0.13111596]],\n\n       [[0.2071834 ]],\n\n       [[0.4024154 ]],\n\n       [[0.44118935]],\n\n       [[0.84208393]],\n\n       [[0.40906036]],\n\n       [[0.41610724]],\n\n       [[0.6575011 ]],\n\n       [[0.1167326 ]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (494 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1154 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (737 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (573 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (257 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2705 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (483 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1209 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (569 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (574 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (201 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (284 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2711 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\nRun outputs: [array([[[0.21765138]],\n\n       [[0.13110353]],\n\n       [[0.20715334]],\n\n       [[0.40234378]],\n\n       [[0.44116214]],\n\n       [[0.8422852 ]],\n\n       [[0.40917972]],\n\n       [[0.41601565]],\n\n       [[0.6577149 ]],\n\n       [[0.11676026]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0.21764947]],\n\n       [[0.13111596]],\n\n       [[0.2071834 ]],\n\n       [[0.4024154 ]],\n\n       [[0.44118935]],\n\n       [[0.84208393]],\n\n       [[0.40906036]],\n\n       [[0.41610724]],\n\n       [[0.6575011 ]],\n\n       [[0.1167326 ]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0.47765808]],\n\n       [[0.25493646]],\n\n       [[0.10328827]],\n\n       [[0.22459393]],\n\n       [[0.20876352]],\n\n       [[0.75425285]],\n\n       [[0.40839143]],\n\n       [[0.73856112]],\n\n       [[0.8227161 ]],\n\n       [[0.37703054]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0.47765808]],\n\n       [[0.25493646]],\n\n       [[0.10328827]],\n\n       [[0.22459393]],\n\n       [[0.20876352]],\n\n       [[0.75425285]],\n\n       [[0.40839143]],\n\n       [[0.73856112]],\n\n       [[0.8227161 ]],\n\n       [[0.37703054]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0.47765808]],\n\n       [[0.25493646]],\n\n       [[0.10328827]],\n\n       [[0.22459393]],\n\n       [[0.20876352]],\n\n       [[0.75425285]],\n\n       [[0.40839143]],\n\n       [[0.73856112]],\n\n       [[0.8227161 ]],\n\n       [[0.37703054]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (525 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (973 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (584 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (537 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (200 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (264 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2805 us)\nStarting stage: Completion\nCompleted stage: Completion (67 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (560 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1076 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (681 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (551 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (203 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (257 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2694 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\nRun outputs: [array([[[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (602 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (924 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (577 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (809 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (210 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (258 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2774 us)\nStarting stage: Completion\nCompleted stage: Completion (69 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (594 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1262 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (593 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (637 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (222 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (286 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2793 us)\nStarting stage: Completion\nCompleted stage: Completion (69 us)\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (503 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1018 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (576 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (562 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (215 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (299 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1041 us)\nStarting stage: Completion\nCompleted stage: Completion (123 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (626 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (983 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (656 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (862 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (276 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (264 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (977 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (518 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (893 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (538 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (573 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (211 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (259 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (998 us)\nStarting stage: Completion\nCompleted stage: Completion (63 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (602 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1279 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (564 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (557 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (219 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (265 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2765 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (586 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1123 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (675 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (781 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (222 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (262 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2795 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (747 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1459 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (625 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (871 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (216 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (282 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (995 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0.6235 ]],\n\n       [[0.4824 ]],\n\n       [[0.2795 ]],\n\n       [[0.2053 ]],\n\n       [[0.4746 ]],\n\n       [[0.6553 ]],\n\n       [[0.728  ]],\n\n       [[0.01749]],\n\n       [[0.2054 ]],\n\n       [[0.5923 ]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (484 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1046 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (563 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (708 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (243 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (267 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (901 us)\nStarting stage: Completion\nCompleted stage: Completion (87 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (495 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1000 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (542 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (557 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (260 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (861 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[[0.6235 ]],\n\n       [[0.4824 ]],\n\n       [[0.2795 ]],\n\n       [[0.2053 ]],\n\n       [[0.4746 ]],\n\n       [[0.6553 ]],\n\n       [[0.728  ]],\n\n       [[0.01749]],\n\n       [[0.2054 ]],\n\n       [[0.5923 ]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0.6235 ]],\n\n       [[0.4824 ]],\n\n       [[0.2795 ]],\n\n       [[0.2053 ]],\n\n       [[0.4746 ]],\n\n       [[0.6553 ]],\n\n       [[0.728  ]],\n\n       [[0.01749]],\n\n       [[0.2054 ]],\n\n       [[0.5923 ]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0.21764947]],\n\n       [[0.13111596]],\n\n       [[0.2071834 ]],\n\n       [[0.4024154 ]],\n\n       [[0.44118935]],\n\n       [[0.84208393]],\n\n       [[0.40906036]],\n\n       [[0.41610724]],\n\n       [[0.6575011 ]],\n\n       [[0.1167326 ]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (479 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1232 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (699 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (1034 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (328 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (262 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (723 us)\nStarting stage: Completion\nCompleted stage: Completion (75 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (555 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1075 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (556 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (576 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (202 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2742 us)\nStarting stage: Completion\nCompleted stage: Completion (63 us)\nRun outputs: [array([[[0.21765138]],\n\n       [[0.13110353]],\n\n       [[0.20715334]],\n\n       [[0.40234378]],\n\n       [[0.44116214]],\n\n       [[0.8422852 ]],\n\n       [[0.40917972]],\n\n       [[0.41601565]],\n\n       [[0.6577149 ]],\n\n       [[0.11676026]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0.21764947]],\n\n       [[0.13111596]],\n\n       [[0.2071834 ]],\n\n       [[0.4024154 ]],\n\n       [[0.44118935]],\n\n       [[0.84208393]],\n\n       [[0.40906036]],\n\n       [[0.41610724]],\n\n       [[0.6575011 ]],\n\n       [[0.1167326 ]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0.47765808]],\n\n       [[0.25493646]],\n\n       [[0.10328827]],\n\n       [[0.22459393]],\n\n       [[0.20876352]],\n\n       [[0.75425285]],\n\n       [[0.40839143]],\n\n       [[0.73856112]],\n\n       [[0.8227161 ]],\n\n       [[0.37703054]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0.47765808]],\n\n       [[0.25493646]],\n\n       [[0.10328827]],\n\n       [[0.22459393]],\n\n       [[0.20876352]],\n\n       [[0.75425285]],\n\n       [[0.40839143]],\n\n       [[0.73856112]],\n\n       [[0.8227161 ]],\n\n       [[0.37703054]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[0.47765808]],\n\n       [[0.25493646]],\n\n       [[0.10328827]],\n\n       [[0.22459393]],\n\n       [[0.20876352]],\n\n       [[0.75425285]],\n\n       [[0.40839143]],\n\n       [[0.73856112]],\n\n       [[0.8227161 ]],\n\n       [[0.37703054]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (601 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (928 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (548 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (550 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (216 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (339 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2761 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (485 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (818 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (530 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (544 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (253 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (255 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1020 us)\nStarting stage: Completion\nCompleted stage: Completion (67 us)\nRun outputs: [array([[[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            5,
+            1,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            10,
+            1,
+            1
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    }
+  ],
+  "sys_info": {
+    "cpuList": [
+      {
+        "name": "Snapdragon(R) X Elite - X1E78100 - Qualcomm(R) Oryon(TM) CPU",
+        "manufacturer": "Qualcomm Technologies Inc",
+        "coreCount": 12,
+        "threadCount": 12,
+        "architecture": "ARM64"
+      }
+    ],
+    "gpuList": [
+      {
+        "name": "Snapdragon(R) X Elite - X1E78100 - Qualcomm(R) Adreno(TM) GPU",
+        "manufacturer": "Qualcomm Incorporated",
+        "driverVersion": "31.0.57.0",
+        "vramMib": 0,
+        "vendorId": 1297040209,
+        "deviceId": 909329200
+      }
+    ],
+    "npuList": [
+      {
+        "name": "Snapdragon(R) X Elite - X1E78100 - Qualcomm(R) Hexagon(TM) NPU",
+        "manufacturer": "Qualcomm Technologies, Inc.",
+        "driverVersion": "30.0.220.3000",
+        "vendorId": 1297040209,
+        "deviceId": 1093682224
+      }
+    ],
+    "ramList": [
+      {
+        "capacityMib": 32768,
+        "speedMt": 7372,
+        "manufacturer": ""
+      }
+    ],
+    "os": {
+      "caption": "Microsoft Windows 11 Enterprise",
+      "version": "10.0.26200",
+      "architecture": "ARM 64-bit Processor",
+      "sku": 4,
+      "buildNumber": "26200",
+      "isWindows11": true
+    },
+    "pythonRuntime": {
+      "version": "3.11.15",
+      "implementation": "CPython",
+      "architecture": "AMD64",
+      "compiler": "MSC v.1944 64 bit (AMD64)",
+      "buildNumber": "Mar 20 2026 00:32:44"
+    },
+    "pipPackages": [
+      {
+        "name": "winml-cli",
+        "version": "0.1.0"
+      },
+      {
+        "name": "winml-modelkit",
+        "version": "0.0.2"
+      },
+      {
+        "name": "aiohappyeyeballs",
+        "version": "2.6.1"
+      },
+      {
+        "name": "aiohttp",
+        "version": "3.13.5"
+      },
+      {
+        "name": "aiosignal",
+        "version": "1.4.0"
+      },
+      {
+        "name": "annotated-doc",
+        "version": "0.0.4"
+      },
+      {
+        "name": "annotated-types",
+        "version": "0.7.0"
+      },
+      {
+        "name": "anyio",
+        "version": "4.13.0"
+      },
+      {
+        "name": "argon2-cffi",
+        "version": "25.1.0"
+      },
+      {
+        "name": "argon2-cffi-bindings",
+        "version": "25.1.0"
+      },
+      {
+        "name": "arrow",
+        "version": "1.4.0"
+      },
+      {
+        "name": "asttokens",
+        "version": "3.0.1"
+      },
+      {
+        "name": "ast_serialize",
+        "version": "0.5.0"
+      },
+      {
+        "name": "async-lru",
+        "version": "2.3.0"
+      },
+      {
+        "name": "attrs",
+        "version": "26.1.0"
+      },
+      {
+        "name": "babel",
+        "version": "2.18.0"
+      },
+      {
+        "name": "beautifulsoup4",
+        "version": "4.14.3"
+      },
+      {
+        "name": "bleach",
+        "version": "6.3.0"
+      },
+      {
+        "name": "certifi",
+        "version": "2026.2.25"
+      },
+      {
+        "name": "cffi",
+        "version": "2.0.0"
+      },
+      {
+        "name": "cfgv",
+        "version": "3.5.0"
+      },
+      {
+        "name": "charset-normalizer",
+        "version": "3.4.7"
+      },
+      {
+        "name": "click",
+        "version": "8.4.1"
+      },
+      {
+        "name": "colorama",
+        "version": "0.4.6"
+      },
+      {
+        "name": "comm",
+        "version": "0.2.3"
+      },
+      {
+        "name": "contourpy",
+        "version": "1.3.3"
+      },
+      {
+        "name": "coverage",
+        "version": "7.13.5"
+      },
+      {
+        "name": "cryptography",
+        "version": "46.0.7"
+      },
+      {
+        "name": "cycler",
+        "version": "0.12.1"
+      },
+      {
+        "name": "datasets",
+        "version": "4.8.4"
+      },
+      {
+        "name": "debugpy",
+        "version": "1.8.20"
+      },
+      {
+        "name": "decorator",
+        "version": "5.2.1"
+      },
+      {
+        "name": "defusedxml",
+        "version": "0.7.1"
+      },
+      {
+        "name": "diffusers",
+        "version": "0.37.1"
+      },
+      {
+        "name": "dill",
+        "version": "0.4.1"
+      },
+      {
+        "name": "distlib",
+        "version": "0.4.0"
+      },
+      {
+        "name": "evaluate",
+        "version": "0.4.6"
+      },
+      {
+        "name": "executing",
+        "version": "2.2.1"
+      },
+      {
+        "name": "fastapi",
+        "version": "0.136.0"
+      },
+      {
+        "name": "fastjsonschema",
+        "version": "2.21.2"
+      },
+      {
+        "name": "filelock",
+        "version": "3.25.2"
+      },
+      {
+        "name": "flatbuffers",
+        "version": "25.12.19"
+      },
+      {
+        "name": "fonttools",
+        "version": "4.63.0"
+      },
+      {
+        "name": "fqdn",
+        "version": "1.5.1"
+      },
+      {
+        "name": "frozenlist",
+        "version": "1.8.0"
+      },
+      {
+        "name": "fsspec",
+        "version": "2026.2.0"
+      },
+      {
+        "name": "h11",
+        "version": "0.16.0"
+      },
+      {
+        "name": "hf-xet",
+        "version": "1.4.3"
+      },
+      {
+        "name": "httpcore",
+        "version": "1.0.9"
+      },
+      {
+        "name": "httptools",
+        "version": "0.7.1"
+      },
+      {
+        "name": "httpx",
+        "version": "0.28.1"
+      },
+      {
+        "name": "httpx-sse",
+        "version": "0.4.3"
+      },
+      {
+        "name": "huggingface_hub",
+        "version": "0.36.2"
+      },
+      {
+        "name": "identify",
+        "version": "2.6.18"
+      },
+      {
+        "name": "idna",
+        "version": "3.11"
+      },
+      {
+        "name": "importlib_metadata",
+        "version": "8.7.1"
+      },
+      {
+        "name": "iniconfig",
+        "version": "2.3.0"
+      },
+      {
+        "name": "ipykernel",
+        "version": "7.2.0"
+      },
+      {
+        "name": "ipython",
+        "version": "8.39.0"
+      },
+      {
+        "name": "ipywidgets",
+        "version": "8.1.8"
+      },
+      {
+        "name": "isoduration",
+        "version": "20.11.0"
+      },
+      {
+        "name": "jedi",
+        "version": "0.19.2"
+      },
+      {
+        "name": "Jinja2",
+        "version": "3.1.6"
+      },
+      {
+        "name": "joblib",
+        "version": "1.5.3"
+      },
+      {
+        "name": "json5",
+        "version": "0.14.0"
+      },
+      {
+        "name": "jsonpointer",
+        "version": "3.1.1"
+      },
+      {
+        "name": "jsonschema",
+        "version": "4.26.0"
+      },
+      {
+        "name": "jsonschema-specifications",
+        "version": "2025.9.1"
+      },
+      {
+        "name": "jupyter",
+        "version": "1.1.1"
+      },
+      {
+        "name": "jupyterlab",
+        "version": "4.5.6"
+      },
+      {
+        "name": "jupyterlab_pygments",
+        "version": "0.3.0"
+      },
+      {
+        "name": "jupyterlab_server",
+        "version": "2.28.0"
+      },
+      {
+        "name": "jupyterlab_widgets",
+        "version": "3.0.16"
+      },
+      {
+        "name": "jupyter_client",
+        "version": "8.8.0"
+      },
+      {
+        "name": "jupyter-console",
+        "version": "6.6.3"
+      },
+      {
+        "name": "jupyter_core",
+        "version": "5.9.1"
+      },
+      {
+        "name": "jupyter-events",
+        "version": "0.12.0"
+      },
+      {
+        "name": "jupyter-lsp",
+        "version": "2.3.1"
+      },
+      {
+        "name": "jupyter_server",
+        "version": "2.17.0"
+      },
+      {
+        "name": "jupyter_server_terminals",
+        "version": "0.5.4"
+      },
+      {
+        "name": "kiwisolver",
+        "version": "1.5.0"
+      },
+      {
+        "name": "lark",
+        "version": "1.3.1"
+      },
+      {
+        "name": "librt",
+        "version": "0.11.0"
+      },
+      {
+        "name": "lightning-utilities",
+        "version": "0.15.3"
+      },
+      {
+        "name": "markdown-it-py",
+        "version": "4.0.0"
+      },
+      {
+        "name": "MarkupSafe",
+        "version": "3.0.3"
+      },
+      {
+        "name": "matplotlib",
+        "version": "3.10.9"
+      },
+      {
+        "name": "matplotlib-inline",
+        "version": "0.2.1"
+      },
+      {
+        "name": "mcp",
+        "version": "1.27.0"
+      },
+      {
+        "name": "mdurl",
+        "version": "0.1.2"
+      },
+      {
+        "name": "mistune",
+        "version": "3.2.0"
+      },
+      {
+        "name": "ml_dtypes",
+        "version": "0.5.4"
+      },
+      {
+        "name": "mpmath",
+        "version": "1.3.0"
+      },
+      {
+        "name": "multidict",
+        "version": "6.7.1"
+      },
+      {
+        "name": "multiprocess",
+        "version": "0.70.19"
+      },
+      {
+        "name": "mypy",
+        "version": "2.1.0"
+      },
+      {
+        "name": "mypy_extensions",
+        "version": "1.1.0"
+      },
+      {
+        "name": "nbclient",
+        "version": "0.10.4"
+      },
+      {
+        "name": "nbconvert",
+        "version": "7.17.1"
+      },
+      {
+        "name": "nbformat",
+        "version": "5.10.4"
+      },
+      {
+        "name": "nest-asyncio",
+        "version": "1.6.0"
+      },
+      {
+        "name": "networkx",
+        "version": "3.4.2"
+      },
+      {
+        "name": "nodeenv",
+        "version": "1.10.0"
+      },
+      {
+        "name": "notebook",
+        "version": "7.5.5"
+      },
+      {
+        "name": "notebook_shim",
+        "version": "0.2.4"
+      },
+      {
+        "name": "numpy",
+        "version": "2.2.6"
+      },
+      {
+        "name": "onnx",
+        "version": "1.18.0"
+      },
+      {
+        "name": "onnxruntime-windowsml",
+        "version": "1.24.5.202604171637"
+      },
+      {
+        "name": "onnxscript",
+        "version": "0.6.2"
+      },
+      {
+        "name": "onnx-ir",
+        "version": "0.2.0"
+      },
+      {
+        "name": "opentelemetry-api",
+        "version": "1.41.0"
+      },
+      {
+        "name": "opentelemetry-sdk",
+        "version": "1.41.0"
+      },
+      {
+        "name": "opentelemetry-semantic-conventions",
+        "version": "0.62b0"
+      },
+      {
+        "name": "optimum",
+        "version": "2.1.0"
+      },
+      {
+        "name": "optimum-onnx",
+        "version": "0.1.0"
+      },
+      {
+        "name": "overrides",
+        "version": "7.7.0"
+      },
+      {
+        "name": "packaging",
+        "version": "26.0"
+      },
+      {
+        "name": "pandas",
+        "version": "2.3.3"
+      },
+      {
+        "name": "pandocfilters",
+        "version": "1.5.1"
+      },
+      {
+        "name": "parso",
+        "version": "0.8.6"
+      },
+      {
+        "name": "pathspec",
+        "version": "1.1.1"
+      },
+      {
+        "name": "pillow",
+        "version": "12.2.0"
+      },
+      {
+        "name": "platformdirs",
+        "version": "4.9.6"
+      },
+      {
+        "name": "plotext",
+        "version": "5.3.2"
+      },
+      {
+        "name": "pluggy",
+        "version": "1.6.0"
+      },
+      {
+        "name": "pre_commit",
+        "version": "4.5.1"
+      },
+      {
+        "name": "prometheus_client",
+        "version": "0.25.0"
+      },
+      {
+        "name": "prompt_toolkit",
+        "version": "3.0.52"
+      },
+      {
+        "name": "propcache",
+        "version": "0.4.1"
+      },
+      {
+        "name": "protobuf",
+        "version": "7.34.1"
+      },
+      {
+        "name": "psutil",
+        "version": "7.2.2"
+      },
+      {
+        "name": "pure_eval",
+        "version": "0.2.3"
+      },
+      {
+        "name": "pyarrow",
+        "version": "23.0.1"
+      },
+      {
+        "name": "pycocotools",
+        "version": "2.0.11"
+      },
+      {
+        "name": "pycparser",
+        "version": "3.0"
+      },
+      {
+        "name": "pydantic",
+        "version": "2.13.0"
+      },
+      {
+        "name": "pydantic_core",
+        "version": "2.46.0"
+      },
+      {
+        "name": "pydantic-settings",
+        "version": "2.14.0"
+      },
+      {
+        "name": "Pygments",
+        "version": "2.20.0"
+      },
+      {
+        "name": "PyJWT",
+        "version": "2.12.1"
+      },
+      {
+        "name": "pyparsing",
+        "version": "3.3.2"
+      },
+      {
+        "name": "pytest",
+        "version": "9.0.3"
+      },
+      {
+        "name": "pytest-cov",
+        "version": "7.1.0"
+      },
+      {
+        "name": "pytest-timeout",
+        "version": "2.4.0"
+      },
+      {
+        "name": "python-dateutil",
+        "version": "2.9.0.post0"
+      },
+      {
+        "name": "python-discovery",
+        "version": "1.2.2"
+      },
+      {
+        "name": "python-dotenv",
+        "version": "1.2.2"
+      },
+      {
+        "name": "python-json-logger",
+        "version": "4.1.0"
+      },
+      {
+        "name": "python-multipart",
+        "version": "0.0.26"
+      },
+      {
+        "name": "pytz",
+        "version": "2026.1.post1"
+      },
+      {
+        "name": "pywin32",
+        "version": "311"
+      },
+      {
+        "name": "pywinpty",
+        "version": "3.0.3"
+      },
+      {
+        "name": "PyYAML",
+        "version": "6.0.3"
+      },
+      {
+        "name": "pyzmq",
+        "version": "27.1.0"
+      },
+      {
+        "name": "RapidFuzz",
+        "version": "3.14.5"
+      },
+      {
+        "name": "referencing",
+        "version": "0.37.0"
+      },
+      {
+        "name": "regex",
+        "version": "2026.4.4"
+      },
+      {
+        "name": "requests",
+        "version": "2.33.1"
+      },
+      {
+        "name": "rfc3339-validator",
+        "version": "0.1.4"
+      },
+      {
+        "name": "rfc3986-validator",
+        "version": "0.1.1"
+      },
+      {
+        "name": "rfc3987-syntax",
+        "version": "1.1.0"
+      },
+      {
+        "name": "rich",
+        "version": "15.0.0"
+      },
+      {
+        "name": "rpds-py",
+        "version": "0.30.0"
+      },
+      {
+        "name": "ruff",
+        "version": "0.15.13"
+      },
+      {
+        "name": "safetensors",
+        "version": "0.7.0"
+      },
+      {
+        "name": "scikit-learn",
+        "version": "1.7.2"
+      },
+      {
+        "name": "scipy",
+        "version": "1.15.3"
+      },
+      {
+        "name": "seaborn",
+        "version": "0.13.2"
+      },
+      {
+        "name": "Send2Trash",
+        "version": "2.1.0"
+      },
+      {
+        "name": "sentencepiece",
+        "version": "0.2.1"
+      },
+      {
+        "name": "seqeval",
+        "version": "1.2.2"
+      },
+      {
+        "name": "setuptools",
+        "version": "81.0.0"
+      },
+      {
+        "name": "six",
+        "version": "1.17.0"
+      },
+      {
+        "name": "SnakeMD",
+        "version": "2.4.0"
+      },
+      {
+        "name": "soupsieve",
+        "version": "2.8.3"
+      },
+      {
+        "name": "sse-starlette",
+        "version": "3.3.4"
+      },
+      {
+        "name": "stack-data",
+        "version": "0.6.3"
+      },
+      {
+        "name": "starlette",
+        "version": "1.0.0"
+      },
+      {
+        "name": "sympy",
+        "version": "1.14.0"
+      },
+      {
+        "name": "terminado",
+        "version": "0.18.1"
+      },
+      {
+        "name": "threadpoolctl",
+        "version": "3.6.0"
+      },
+      {
+        "name": "timm",
+        "version": "1.0.26"
+      },
+      {
+        "name": "tinycss2",
+        "version": "1.4.0"
+      },
+      {
+        "name": "tokenizers",
+        "version": "0.22.2"
+      },
+      {
+        "name": "torch",
+        "version": "2.11.0"
+      },
+      {
+        "name": "torchinfo",
+        "version": "1.8.0"
+      },
+      {
+        "name": "torchmetrics",
+        "version": "1.9.0"
+      },
+      {
+        "name": "torchvision",
+        "version": "0.26.0"
+      },
+      {
+        "name": "tornado",
+        "version": "6.5.5"
+      },
+      {
+        "name": "tqdm",
+        "version": "4.67.3"
+      },
+      {
+        "name": "traitlets",
+        "version": "5.14.3"
+      },
+      {
+        "name": "transformers",
+        "version": "4.57.6"
+      },
+      {
+        "name": "types-colorama",
+        "version": "0.4.15.20260508"
+      },
+      {
+        "name": "typing_extensions",
+        "version": "4.15.0"
+      },
+      {
+        "name": "typing-inspection",
+        "version": "0.4.2"
+      },
+      {
+        "name": "tzdata",
+        "version": "2026.1"
+      },
+      {
+        "name": "uri-template",
+        "version": "1.3.0"
+      },
+      {
+        "name": "urllib3",
+        "version": "2.6.3"
+      },
+      {
+        "name": "uvicorn",
+        "version": "0.45.0"
+      },
+      {
+        "name": "virtualenv",
+        "version": "21.2.3"
+      },
+      {
+        "name": "watchfiles",
+        "version": "1.1.1"
+      },
+      {
+        "name": "wcwidth",
+        "version": "0.6.0"
+      },
+      {
+        "name": "webcolors",
+        "version": "25.10.0"
+      },
+      {
+        "name": "webencodings",
+        "version": "0.5.1"
+      },
+      {
+        "name": "websockets",
+        "version": "16.0"
+      },
+      {
+        "name": "websocket-client",
+        "version": "1.9.0"
+      },
+      {
+        "name": "widgetsnbextension",
+        "version": "4.0.15"
+      },
+      {
+        "name": "windowsml",
+        "version": "2.0.300"
+      },
+      {
+        "name": "winml-cli",
+        "version": "0.1.0"
+      },
+      {
+        "name": "xxhash",
+        "version": "3.6.0"
+      },
+      {
+        "name": "yarl",
+        "version": "1.23.0"
+      },
+      {
+        "name": "zipp",
+        "version": "3.23.1"
+      },
+      {
+        "name": "winml-cli",
+        "version": "0.1.0"
+      },
+      {
+        "name": "winml-modelkit",
+        "version": "0.0.2"
+      },
+      {
+        "name": "importlib_metadata",
+        "version": "8.7.1"
+      },
+      {
+        "name": "microvenv",
+        "version": "2025.0"
+      },
+      {
+        "name": "packaging",
+        "version": "26.0"
+      },
+      {
+        "name": "tomli",
+        "version": "2.4.0"
+      },
+      {
+        "name": "typing_extensions",
+        "version": "4.15.0"
+      },
+      {
+        "name": "zipp",
+        "version": "3.21.0"
+      }
+    ],
+    "epPackages": [
+      {
+        "name": "MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.1.8_1.8.30.0_arm64__8wekyb3d8bbwe",
+        "version": "1.8.30.0",
+        "publisher": "CN=Microsoft Corporation, O=Microsoft Corporation, L=Redmond, S=Washington, C=US",
+        "architecture": 12,
+        "signatureKind": "Developer",
+        "installLocation": "C:\\Program Files\\WindowsApps\\MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.1.8_1.8.30.0_arm64__8wekyb3d8bbwe",
+        "epHash": "0b4dd71044175fb927d3b44a50b7dee4b003a3dfe86a9b09c3ca83f11150979215c256b0301bced2c7e684f84e42ec964532215c147b8b770399d6b9441afc1a",
+        "status": 0
+      },
+      {
+        "name": "MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.2_2.2450.47.0_arm64__8wekyb3d8bbwe",
+        "version": "2.2450.47.0",
+        "publisher": "CN=Microsoft Corporation, O=Microsoft Corporation, L=Redmond, S=Washington, C=US",
+        "architecture": 12,
+        "signatureKind": "Developer",
+        "installLocation": "C:\\Program Files\\WindowsApps\\MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.2_2.2450.47.0_arm64__8wekyb3d8bbwe",
+        "epHash": "343f2e6da7490f6721e40942a86a40fa01322c354d784b024491d151ec511e6dba7a9041c3594aa97ff0c0379cf627b88414b25328f931f8ddaabe78a6784102",
+        "status": 0
+      }
+    ]
+  }
+}
diff --git a/tests/integration/analyze/runtime_checker/reshape_qnn_results.json b/tests/integration/analyze/runtime_checker/reshape_qnn_results.json
index b08f31439..3b150eba6 100644
--- a/tests/integration/analyze/runtime_checker/reshape_qnn_results.json
+++ b/tests/integration/analyze/runtime_checker/reshape_qnn_results.json
@@ -2,7 +2,7 @@
   "check_results": [
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "UINT8"
       },
       "input_constraints": {
         "data": {
@@ -12,7 +12,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -29,6 +30,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -37,24 +39,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[ True, False]],\n\n         [[ True, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[ True, False]],\n\n         [[ True, False]]]],\n\n\n\n       [[[[False,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False, False]]],\n\n\n        [[[ True, False]],\n\n         [[ True,  True]]]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "UINT8"
       },
       "input_constraints": {
         "data": {
@@ -64,7 +66,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -81,6 +84,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -91,22 +95,22 @@
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (284 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1587 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (961 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (1356 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (112 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (65 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (384 us)\nStarting stage: Completion\nCompleted stage: Completion (21 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (708 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1393 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (943 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (550 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (222 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (261 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (811 us)\nStarting stage: Completion\nCompleted stage: Completion (77 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (215 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (748 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (330 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (290 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (34 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (23 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (251 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[ True, False]],\n\n         [[ True, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[ True, False]],\n\n         [[ True, False]]]],\n\n\n\n       [[[[False,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False, False]]],\n\n\n        [[[ True, False]],\n\n         [[ True,  True]]]]])]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (609 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1240 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (622 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (814 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (240 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (289 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2747 us)\nStarting stage: Completion\nCompleted stage: Completion (70 us)\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "UINT8"
       },
       "input_constraints": {
         "data": {
@@ -116,7 +120,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -133,6 +138,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -141,24 +147,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[ True, False]],\n\n         [[ True, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[ True, False]],\n\n         [[ True, False]]]],\n\n\n\n       [[[[False,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False, False]]],\n\n\n        [[[ True, False]],\n\n         [[ True,  True]]]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "UINT16"
       },
       "input_constraints": {
         "data": {
@@ -168,7 +174,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -185,6 +192,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -193,24 +201,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.987917  , 0.25513449]],\n\n         [[0.47694917, 0.83572468]]],\n\n\n        [[[0.76387782, 0.4025907 ]],\n\n         [[0.87868388, 0.34866536]]],\n\n\n        [[[0.57085626, 0.89694446]],\n\n         [[0.63642498, 0.63965871]]]],\n\n\n\n       [[[[0.51069939, 0.98213299]],\n\n         [[0.42933215, 0.59071316]]],\n\n\n        [[[0.86885247, 0.15116338]],\n\n         [[0.85073914, 0.65639438]]],\n\n\n        [[[0.16017359, 0.9125194 ]],\n\n         [[0.92341703, 0.2728741 ]]]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "UINT16"
       },
       "input_constraints": {
         "data": {
@@ -220,7 +228,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -237,6 +246,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -245,24 +255,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.987917  , 0.25513449]],\n\n         [[0.47694917, 0.83572468]]],\n\n\n        [[[0.76387782, 0.4025907 ]],\n\n         [[0.87868388, 0.34866536]]],\n\n\n        [[[0.57085626, 0.89694446]],\n\n         [[0.63642498, 0.63965871]]]],\n\n\n\n       [[[[0.51069939, 0.98213299]],\n\n         [[0.42933215, 0.59071316]]],\n\n\n        [[[0.86885247, 0.15116338]],\n\n         [[0.85073914, 0.65639438]]],\n\n\n        [[[0.16017359, 0.9125194 ]],\n\n         [[0.92341703, 0.2728741 ]]]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "UINT16"
       },
       "input_constraints": {
         "data": {
@@ -272,7 +282,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -289,6 +300,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -297,24 +309,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.987917  , 0.25513449]],\n\n         [[0.47694917, 0.83572468]]],\n\n\n        [[[0.76387782, 0.4025907 ]],\n\n         [[0.87868388, 0.34866536]]],\n\n\n        [[[0.57085626, 0.89694446]],\n\n         [[0.63642498, 0.63965871]]]],\n\n\n\n       [[[[0.51069939, 0.98213299]],\n\n         [[0.42933215, 0.59071316]]],\n\n\n        [[[0.86885247, 0.15116338]],\n\n         [[0.85073914, 0.65639438]]],\n\n\n        [[[0.16017359, 0.9125194 ]],\n\n         [[0.92341703, 0.2728741 ]]]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "UINT32"
       },
       "input_constraints": {
         "data": {
@@ -324,7 +336,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -341,6 +354,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -349,24 +363,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.77037936, 0.09290103]],\n\n         [[0.72918564, 0.45355973]]],\n\n\n        [[[0.91096956, 0.34206676]],\n\n         [[0.76346993, 0.9757843 ]]],\n\n\n        [[[0.18567617, 0.5841243 ]],\n\n         [[0.84840614, 0.00886352]]]],\n\n\n\n       [[[[0.5074946 , 0.5569054 ]],\n\n         [[0.64646775, 0.8351761 ]]],\n\n\n        [[[0.78530526, 0.0057318 ]],\n\n         [[0.7397096 , 0.29197797]]],\n\n\n        [[[0.27925733, 0.7858911 ]],\n\n         [[0.08760667, 0.48966888]]]]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "UINT32"
       },
       "input_constraints": {
         "data": {
@@ -376,7 +390,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -393,6 +408,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -403,22 +419,22 @@
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (215 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1841 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (507 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (678 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (56 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (41 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=12288\nread_total_bytes=4096\n\nCompleted stage: Finalizing Graph Sequence (400 us)\nStarting stage: Completion\nCompleted stage: Completion (14 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (674 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1345 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (576 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (649 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (275 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (284 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1356 us)\nStarting stage: Completion\nCompleted stage: Completion (77 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (210 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1838 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (599 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (992 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (75 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (37 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=12288\nread_total_bytes=4096\n\nCompleted stage: Finalizing Graph Sequence (330 us)\nStarting stage: Completion\nCompleted stage: Completion (13 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.7705079 , 0.09289552]],\n\n         [[0.72900397, 0.4536133 ]]],\n\n\n        [[[0.9111329 , 0.34204105]],\n\n         [[0.76367193, 0.975586  ]]],\n\n\n        [[[0.18566896, 0.58398443]],\n\n         [[0.8486329 , 0.00886536]]]],\n\n\n\n       [[[[0.5073243 , 0.55712897]],\n\n         [[0.64648443, 0.834961  ]]],\n\n\n        [[[0.7851563 , 0.00573349]],\n\n         [[0.73974615, 0.29199222]]],\n\n\n        [[[0.2792969 , 0.7861329 ]],\n\n         [[0.08758546, 0.48974612]]]]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (484 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1083 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (561 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (551 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (218 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (257 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (956 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "UINT32"
       },
       "input_constraints": {
         "data": {
@@ -428,7 +444,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -445,6 +462,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -453,24 +471,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.77037936, 0.09290103]],\n\n         [[0.72918564, 0.45355973]]],\n\n\n        [[[0.91096956, 0.34206676]],\n\n         [[0.76346993, 0.9757843 ]]],\n\n\n        [[[0.18567617, 0.5841243 ]],\n\n         [[0.84840614, 0.00886352]]]],\n\n\n\n       [[[[0.5074946 , 0.5569054 ]],\n\n         [[0.64646775, 0.8351761 ]]],\n\n\n        [[[0.78530526, 0.0057318 ]],\n\n         [[0.7397096 , 0.29197797]]],\n\n\n        [[[0.27925733, 0.7858911 ]],\n\n         [[0.08760667, 0.48966888]]]]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "UINT64"
       },
       "input_constraints": {
         "data": {
@@ -480,7 +498,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -497,6 +516,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -505,24 +525,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.8877 , 0.6216 ]],\n\n         [[0.637  , 0.1967 ]]],\n\n\n        [[[0.07263, 0.3752 ]],\n\n         [[0.385  , 0.4514 ]]],\n\n\n        [[[0.4553 , 0.2527 ]],\n\n         [[0.2424 , 0.02464]]]],\n\n\n\n       [[[[0.0722 , 0.3506 ]],\n\n         [[0.9146 , 0.592  ]]],\n\n\n        [[[0.3774 , 0.268  ]],\n\n         [[0.6396 , 0.1536 ]]],\n\n\n        [[[0.5796 , 0.1262 ]],\n\n         [[0.2852 , 0.434  ]]]]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "UINT64"
       },
       "input_constraints": {
         "data": {
@@ -532,7 +552,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -549,6 +570,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -556,25 +578,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (280 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (845 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (408 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (332 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (38 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (23 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (889 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (258 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (879 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (408 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (383 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (43 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (61 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (217 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.8877 , 0.6216 ]],\n\n         [[0.637  , 0.1967 ]]],\n\n\n        [[[0.07263, 0.3752 ]],\n\n         [[0.385  , 0.4514 ]]],\n\n\n        [[[0.4553 , 0.2527 ]],\n\n         [[0.2424 , 0.02464]]]],\n\n\n\n       [[[[0.0722 , 0.3506 ]],\n\n         [[0.9146 , 0.592  ]]],\n\n\n        [[[0.3774 , 0.268  ]],\n\n         [[0.6396 , 0.1536 ]]],\n\n\n        [[[0.5796 , 0.1262 ]],\n\n         [[0.2852 , 0.434  ]]]]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "UINT64"
       },
       "input_constraints": {
         "data": {
@@ -584,7 +606,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -601,6 +624,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -609,24 +633,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.8877 , 0.6216 ]],\n\n         [[0.637  , 0.1967 ]]],\n\n\n        [[[0.07263, 0.3752 ]],\n\n         [[0.385  , 0.4514 ]]],\n\n\n        [[[0.4553 , 0.2527 ]],\n\n         [[0.2424 , 0.02464]]]],\n\n\n\n       [[[[0.0722 , 0.3506 ]],\n\n         [[0.9146 , 0.592  ]]],\n\n\n        [[[0.3774 , 0.268  ]],\n\n         [[0.6396 , 0.1536 ]]],\n\n\n        [[[0.5796 , 0.1262 ]],\n\n         [[0.2852 , 0.434  ]]]]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "INT8"
       },
       "input_constraints": {
         "data": {
@@ -636,7 +660,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -653,6 +678,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -661,24 +687,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]]]], dtype=int16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "INT8"
       },
       "input_constraints": {
         "data": {
@@ -688,7 +714,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -705,6 +732,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -713,24 +741,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:37:51.4757064 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]]]], dtype=int16)]\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:37:51.9295201 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "INT8"
       },
       "input_constraints": {
         "data": {
@@ -740,7 +768,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -757,6 +786,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -765,24 +795,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]]]], dtype=int16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "INT16"
       },
       "input_constraints": {
         "data": {
@@ -792,7 +822,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -809,6 +840,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -817,24 +849,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]]]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "INT16"
       },
       "input_constraints": {
         "data": {
@@ -844,7 +876,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -861,6 +894,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -868,25 +902,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (419 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (980 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (398 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (338 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (39 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (28 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1136 us)\nStarting stage: Completion\nCompleted stage: Completion (13 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (282 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (835 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (383 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (349 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (40 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (24 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (296 us)\nStarting stage: Completion\nCompleted stage: Completion (11 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]]]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "INT16"
       },
       "input_constraints": {
         "data": {
@@ -896,7 +930,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -913,6 +948,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -921,24 +957,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]]]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "INT32"
       },
       "input_constraints": {
         "data": {
@@ -948,7 +984,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -965,6 +1002,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -973,24 +1011,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "INT32"
       },
       "input_constraints": {
         "data": {
@@ -1000,7 +1038,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1017,6 +1056,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -1024,25 +1064,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "Timeout/crash/fail for 3 attempts: A process in the process pool was terminated abruptly while the future was running or pending."
+            "success": true,
+            "reason": null
           },
-          "stdout": null,
-          "stderr": null
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (481 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1135 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (540 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (557 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (204 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (257 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (883 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": false,
-            "reason": "Timeout/crash/fail for 3 attempts: A process in the process pool was terminated abruptly while the future was running or pending."
+            "success": true,
+            "reason": null
           },
-          "stdout": null,
-          "stderr": null
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (669 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1354 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (596 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (624 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (216 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (265 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (900 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "INT32"
       },
       "input_constraints": {
         "data": {
@@ -1052,7 +1092,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1069,6 +1110,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -1077,24 +1119,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -1104,7 +1146,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1121,6 +1164,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -1129,24 +1173,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]]]], dtype=int8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -1156,7 +1200,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1173,6 +1218,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -1181,24 +1227,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "Timeout/crash/fail for 1 attempts: A process in the process pool was terminated abruptly while the future was running or pending."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:38:25.1678426 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": null,
+          "stderr": null
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: A process in the process pool was terminated abruptly while the future was running or pending."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]]]], dtype=int8)]\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:38:25.6683410 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -1208,7 +1254,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1225,6 +1272,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -1233,24 +1281,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]]]], dtype=int8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "FLOAT16"
       },
       "input_constraints": {
         "data": {
@@ -1260,7 +1308,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1277,6 +1326,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -1285,24 +1335,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]]], dtype=uint16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0.288  , 0.524  ]],\n\n         [[0.3127 , 0.4429 ]]],\n\n\n        [[[0.4634 , 0.2025 ]],\n\n         [[0.01224, 0.4246 ]]],\n\n\n        [[[0.366  , 0.82   ]],\n\n         [[0.268  , 0.8643 ]]]],\n\n\n\n       [[[[0.538  , 0.1625 ]],\n\n         [[0.4614 , 0.9854 ]]],\n\n\n        [[[0.7344 , 0.4775 ]],\n\n         [[0.1675 , 0.1559 ]]],\n\n\n        [[[0.342  , 0.704  ]],\n\n         [[0.756  , 0.408  ]]]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "FLOAT16"
       },
       "input_constraints": {
         "data": {
@@ -1312,7 +1362,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1329,6 +1380,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -1336,25 +1388,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:38:26.9902828 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (492 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1203 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (578 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (545 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (254 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (738 us)\nStarting stage: Completion\nCompleted stage: Completion (72 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]]], dtype=uint16)]\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:38:27.4455164 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (532 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1063 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (563 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (728 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (207 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (261 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (669 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[[[[0.288  , 0.524  ]],\n\n         [[0.3127 , 0.4429 ]]],\n\n\n        [[[0.4634 , 0.2025 ]],\n\n         [[0.01224, 0.4246 ]]],\n\n\n        [[[0.366  , 0.82   ]],\n\n         [[0.268  , 0.8643 ]]]],\n\n\n\n       [[[[0.538  , 0.1625 ]],\n\n         [[0.4614 , 0.9854 ]]],\n\n\n        [[[0.7344 , 0.4775 ]],\n\n         [[0.1675 , 0.1559 ]]],\n\n\n        [[[0.342  , 0.704  ]],\n\n         [[0.756  , 0.408  ]]]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "FLOAT16"
       },
       "input_constraints": {
         "data": {
@@ -1364,7 +1416,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1381,6 +1434,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -1389,24 +1443,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]]], dtype=uint16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0.288  , 0.524  ]],\n\n         [[0.3127 , 0.4429 ]]],\n\n\n        [[[0.4634 , 0.2025 ]],\n\n         [[0.01224, 0.4246 ]]],\n\n\n        [[[0.366  , 0.82   ]],\n\n         [[0.268  , 0.8643 ]]]],\n\n\n\n       [[[[0.538  , 0.1625 ]],\n\n         [[0.4614 , 0.9854 ]]],\n\n\n        [[[0.7344 , 0.4775 ]],\n\n         [[0.1675 , 0.1559 ]]],\n\n\n        [[[0.342  , 0.704  ]],\n\n         [[0.756  , 0.408  ]]]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -1416,7 +1470,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1433,6 +1488,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -1441,24 +1497,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]]]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0.95028025, 0.2468104 ]],\n\n         [[0.20439683, 0.37763873]]],\n\n\n        [[[0.09010915, 0.31433827]],\n\n         [[0.36242837, 0.24815027]]],\n\n\n        [[[0.03979172, 0.2304278 ]],\n\n         [[0.19243203, 0.81435317]]]],\n\n\n\n       [[[[0.4089026 , 0.6417816 ]],\n\n         [[0.95892185, 0.38288617]]],\n\n\n        [[[0.7642732 , 0.245576  ]],\n\n         [[0.34932667, 0.8457854 ]]],\n\n\n        [[[0.02115926, 0.43220004]],\n\n         [[0.7304893 , 0.7867989 ]]]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -1468,7 +1524,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1485,6 +1542,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -1495,22 +1553,22 @@
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (281 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (872 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (434 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (357 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (49 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (30 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (251 us)\nStarting stage: Completion\nCompleted stage: Completion (10 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (538 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1921 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (635 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (782 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (217 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (277 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=12288\nread_total_bytes=4096\n\nCompleted stage: Finalizing Graph Sequence (946 us)\nStarting stage: Completion\nCompleted stage: Completion (68 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (316 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (823 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (413 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (407 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (40 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (22 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (438 us)\nStarting stage: Completion\nCompleted stage: Completion (10 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]]]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (610 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1978 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (674 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (1102 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (273 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (295 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=12288\nread_total_bytes=4096\n\nCompleted stage: Finalizing Graph Sequence (2949 us)\nStarting stage: Completion\nCompleted stage: Completion (70 us)\nRun outputs: [array([[[[[0.9501954 , 0.24682619]],\n\n         [[0.20434572, 0.37768558]]],\n\n\n        [[[0.0900879 , 0.31445315]],\n\n         [[0.36254886, 0.24816896]]],\n\n\n        [[[0.03979493, 0.23046876]],\n\n         [[0.19238283, 0.8144532 ]]]],\n\n\n\n       [[[[0.40893558, 0.6416016 ]],\n\n         [[0.95898443, 0.38281253]]],\n\n\n        [[[0.7641602 , 0.24560548]],\n\n         [[0.34936526, 0.8457032 ]]],\n\n\n        [[[0.02116394, 0.43212894]],\n\n         [[0.7304688 , 0.78662115]]]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -1520,7 +1578,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1537,6 +1596,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -1545,24 +1605,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]]]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0.95028025, 0.2468104 ]],\n\n         [[0.20439683, 0.37763873]]],\n\n\n        [[[0.09010915, 0.31433827]],\n\n         [[0.36242837, 0.24815027]]],\n\n\n        [[[0.03979172, 0.2304278 ]],\n\n         [[0.19243203, 0.81435317]]]],\n\n\n\n       [[[[0.4089026 , 0.6417816 ]],\n\n         [[0.95892185, 0.38288617]]],\n\n\n        [[[0.7642732 , 0.245576  ]],\n\n         [[0.34932667, 0.8457854 ]]],\n\n\n        [[[0.02115926, 0.43220004]],\n\n         [[0.7304893 , 0.7867989 ]]]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "DOUBLE"
       },
       "input_constraints": {
         "data": {
@@ -1572,7 +1632,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1589,6 +1650,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -1597,24 +1659,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]]], dtype=uint64)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0.79210038, 0.60177336]],\n\n         [[0.4632819 , 0.65442976]]],\n\n\n        [[[0.96849369, 0.07982261]],\n\n         [[0.39645548, 0.23665723]]],\n\n\n        [[[0.74176789, 0.57894562]],\n\n         [[0.12453678, 0.69615266]]]],\n\n\n\n       [[[[0.61764472, 0.75840641]],\n\n         [[0.96470109, 0.91475654]]],\n\n\n        [[[0.76691218, 0.46454851]],\n\n         [[0.7617497 , 0.94924577]]],\n\n\n        [[[0.19961647, 0.09542246]],\n\n         [[0.57588561, 0.85517519]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "DOUBLE"
       },
       "input_constraints": {
         "data": {
@@ -1624,7 +1686,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1641,6 +1704,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -1649,24 +1713,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:38:30.5973491 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]]], dtype=uint64)]\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:38:31.0601942 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "Run outputs: [array([[[[[0.79210038, 0.60177336]],\n\n         [[0.4632819 , 0.65442976]]],\n\n\n        [[[0.96849369, 0.07982261]],\n\n         [[0.39645548, 0.23665723]]],\n\n\n        [[[0.74176789, 0.57894562]],\n\n         [[0.12453678, 0.69615266]]]],\n\n\n\n       [[[[0.61764472, 0.75840641]],\n\n         [[0.96470109, 0.91475654]]],\n\n\n        [[[0.76691218, 0.46454851]],\n\n         [[0.7617497 , 0.94924577]]],\n\n\n        [[[0.19961647, 0.09542246]],\n\n         [[0.57588561, 0.85517519]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "DOUBLE"
       },
       "input_constraints": {
         "data": {
@@ -1676,7 +1740,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1693,6 +1758,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -1701,24 +1767,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]]], dtype=uint64)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0.79210038, 0.60177336]],\n\n         [[0.4632819 , 0.65442976]]],\n\n\n        [[[0.96849369, 0.07982261]],\n\n         [[0.39645548, 0.23665723]]],\n\n\n        [[[0.74176789, 0.57894562]],\n\n         [[0.12453678, 0.69615266]]]],\n\n\n\n       [[[[0.61764472, 0.75840641]],\n\n         [[0.96470109, 0.91475654]]],\n\n\n        [[[0.76691218, 0.46454851]],\n\n         [[0.7617497 , 0.94924577]]],\n\n\n        [[[0.19961647, 0.09542246]],\n\n         [[0.57588561, 0.85517519]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "BOOL"
       },
       "input_constraints": {
         "data": {
@@ -1728,7 +1794,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1745,6 +1812,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -1753,24 +1821,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]]]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True, False]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "BOOL"
       },
       "input_constraints": {
         "data": {
@@ -1780,7 +1848,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1797,6 +1866,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -1807,22 +1877,22 @@
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (295 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (788 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (387 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (351 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (36 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (25 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (863 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (473 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1071 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (639 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (539 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (207 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (280 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (794 us)\nStarting stage: Completion\nCompleted stage: Completion (73 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (274 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (773 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (392 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (340 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (36 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (25 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (800 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]]]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (642 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1136 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (543 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (530 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (211 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (292 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (900 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[[[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True, False]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "BOOL"
       },
       "input_constraints": {
         "data": {
@@ -1832,7 +1902,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1849,6 +1920,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -1857,24 +1929,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]]]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True, False]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "UINT4"
       },
       "input_constraints": {
         "data": {
@@ -1884,7 +1956,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1899,8 +1972,9 @@
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -1909,24 +1983,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[ True, False]],\n\n         [[False, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[False,  True]],\n\n         [[False, False]]]],\n\n\n\n       [[[[ True, False]],\n\n         [[False,  True]]],\n\n\n        [[[ True,  True]],\n\n         [[ True, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False,  True]]]]])]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "UINT4"
       },
       "input_constraints": {
         "data": {
@@ -1936,7 +2010,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -1951,8 +2026,9 @@
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -1961,24 +2037,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[ True, False]],\n\n         [[False, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[False,  True]],\n\n         [[False, False]]]],\n\n\n\n       [[[[ True, False]],\n\n         [[False,  True]]],\n\n\n        [[[ True,  True]],\n\n         [[ True, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False,  True]]]]])]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "UINT4"
       },
       "input_constraints": {
         "data": {
@@ -1988,7 +2064,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2003,8 +2080,9 @@
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -2013,24 +2091,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[ True, False]],\n\n         [[False, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[False,  True]],\n\n         [[False, False]]]],\n\n\n\n       [[[[ True, False]],\n\n         [[False,  True]]],\n\n\n        [[[ True,  True]],\n\n         [[ True, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False,  True]]]]])]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "INT4"
       },
       "input_constraints": {
         "data": {
@@ -2040,7 +2118,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2055,8 +2134,9 @@
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -2065,24 +2145,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.7703097 , 0.4447216 ]],\n\n         [[0.16715159, 0.99701406]]],\n\n\n        [[[0.04602288, 0.13176387]],\n\n         [[0.64325871, 0.78555805]]],\n\n\n        [[[0.4166604 , 0.88514966]],\n\n         [[0.28618207, 0.42701358]]]],\n\n\n\n       [[[[0.75428793, 0.76602813]],\n\n         [[0.04029357, 0.9535589 ]]],\n\n\n        [[[0.46711444, 0.76439454]],\n\n         [[0.61870435, 0.97964806]]],\n\n\n        [[[0.92665191, 0.51428296]],\n\n         [[0.3973498 , 0.01367921]]]]])]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "INT4"
       },
       "input_constraints": {
         "data": {
@@ -2092,7 +2172,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2107,8 +2188,9 @@
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -2117,24 +2199,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.7703097 , 0.4447216 ]],\n\n         [[0.16715159, 0.99701406]]],\n\n\n        [[[0.04602288, 0.13176387]],\n\n         [[0.64325871, 0.78555805]]],\n\n\n        [[[0.4166604 , 0.88514966]],\n\n         [[0.28618207, 0.42701358]]]],\n\n\n\n       [[[[0.75428793, 0.76602813]],\n\n         [[0.04029357, 0.9535589 ]]],\n\n\n        [[[0.46711444, 0.76439454]],\n\n         [[0.61870435, 0.97964806]]],\n\n\n        [[[0.92665191, 0.51428296]],\n\n         [[0.3973498 , 0.01367921]]]]])]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "INT4"
       },
       "input_constraints": {
         "data": {
@@ -2144,7 +2226,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2159,8 +2242,9 @@
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -2169,24 +2253,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.7703097 , 0.4447216 ]],\n\n         [[0.16715159, 0.99701406]]],\n\n\n        [[[0.04602288, 0.13176387]],\n\n         [[0.64325871, 0.78555805]]],\n\n\n        [[[0.4166604 , 0.88514966]],\n\n         [[0.28618207, 0.42701358]]]],\n\n\n\n       [[[[0.75428793, 0.76602813]],\n\n         [[0.04029357, 0.9535589 ]]],\n\n\n        [[[0.46711444, 0.76439454]],\n\n         [[0.61870435, 0.97964806]]],\n\n\n        [[[0.92665191, 0.51428296]],\n\n         [[0.3973498 , 0.01367921]]]]])]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "UINT8"
       },
       "input_constraints": {
         "data": {
@@ -2196,7 +2280,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2213,6 +2298,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -2221,24 +2307,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.0778326 , 0.36760816]],\n\n         [[0.9010404 , 0.36885676]]],\n\n\n        [[[0.90874445, 0.23003452]],\n\n         [[0.553229  , 0.51177216]]],\n\n\n        [[[0.7992253 , 0.9018119 ]],\n\n         [[0.689404  , 0.6994622 ]]]],\n\n\n\n       [[[[0.582026  , 0.35747382]],\n\n         [[0.7152774 , 0.05510924]]],\n\n\n        [[[0.34858698, 0.27468523]],\n\n         [[0.8509443 , 0.00628494]]],\n\n\n        [[[0.07876629, 0.14732139]],\n\n         [[0.15073965, 0.08870023]]]]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "UINT8"
       },
       "input_constraints": {
         "data": {
@@ -2248,7 +2334,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2265,6 +2352,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -2272,25 +2360,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (500 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1254 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (547 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (651 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (218 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (294 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1099 us)\nStarting stage: Completion\nCompleted stage: Completion (76 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.0778326 , 0.36760816]],\n\n         [[0.9010404 , 0.36885676]]],\n\n\n        [[[0.90874445, 0.23003452]],\n\n         [[0.553229  , 0.51177216]]],\n\n\n        [[[0.7992253 , 0.9018119 ]],\n\n         [[0.689404  , 0.6994622 ]]]],\n\n\n\n       [[[[0.582026  , 0.35747382]],\n\n         [[0.7152774 , 0.05510924]]],\n\n\n        [[[0.34858698, 0.27468523]],\n\n         [[0.8509443 , 0.00628494]]],\n\n\n        [[[0.07876629, 0.14732139]],\n\n         [[0.15073965, 0.08870023]]]]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (552 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1149 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (608 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (642 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (212 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (284 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (992 us)\nStarting stage: Completion\nCompleted stage: Completion (72 us)\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "UINT8"
       },
       "input_constraints": {
         "data": {
@@ -2300,7 +2388,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2317,6 +2406,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -2325,24 +2415,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.0778326 , 0.36760816]],\n\n         [[0.9010404 , 0.36885676]]],\n\n\n        [[[0.90874445, 0.23003452]],\n\n         [[0.553229  , 0.51177216]]],\n\n\n        [[[0.7992253 , 0.9018119 ]],\n\n         [[0.689404  , 0.6994622 ]]]],\n\n\n\n       [[[[0.582026  , 0.35747382]],\n\n         [[0.7152774 , 0.05510924]]],\n\n\n        [[[0.34858698, 0.27468523]],\n\n         [[0.8509443 , 0.00628494]]],\n\n\n        [[[0.07876629, 0.14732139]],\n\n         [[0.15073965, 0.08870023]]]]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "UINT16"
       },
       "input_constraints": {
         "data": {
@@ -2352,7 +2442,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2369,6 +2460,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -2377,24 +2469,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.1866, 0.4656]],\n\n         [[0.527 , 0.5117]]],\n\n\n        [[[0.9536, 0.6035]],\n\n         [[0.5645, 0.6484]]],\n\n\n        [[[0.387 , 0.3386]],\n\n         [[0.1891, 0.2832]]]],\n\n\n\n       [[[[0.4172, 0.685 ]],\n\n         [[0.013 , 0.4067]]],\n\n\n        [[[0.396 , 0.9795]],\n\n         [[0.4631, 0.2429]]],\n\n\n        [[[0.914 , 0.1726]],\n\n         [[0.7563, 0.3691]]]]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "UINT16"
       },
       "input_constraints": {
         "data": {
@@ -2404,7 +2496,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2421,6 +2514,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -2429,24 +2523,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.1866, 0.4656]],\n\n         [[0.527 , 0.5117]]],\n\n\n        [[[0.9536, 0.6035]],\n\n         [[0.5645, 0.6484]]],\n\n\n        [[[0.387 , 0.3386]],\n\n         [[0.1891, 0.2832]]]],\n\n\n\n       [[[[0.4172, 0.685 ]],\n\n         [[0.013 , 0.4067]]],\n\n\n        [[[0.396 , 0.9795]],\n\n         [[0.4631, 0.2429]]],\n\n\n        [[[0.914 , 0.1726]],\n\n         [[0.7563, 0.3691]]]]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "UINT16"
       },
       "input_constraints": {
         "data": {
@@ -2456,7 +2550,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2473,6 +2568,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -2481,24 +2577,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0.1866, 0.4656]],\n\n         [[0.527 , 0.5117]]],\n\n\n        [[[0.9536, 0.6035]],\n\n         [[0.5645, 0.6484]]],\n\n\n        [[[0.387 , 0.3386]],\n\n         [[0.1891, 0.2832]]]],\n\n\n\n       [[[[0.4172, 0.685 ]],\n\n         [[0.013 , 0.4067]]],\n\n\n        [[[0.396 , 0.9795]],\n\n         [[0.4631, 0.2429]]],\n\n\n        [[[0.914 , 0.1726]],\n\n         [[0.7563, 0.3691]]]]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "UINT32"
       },
       "input_constraints": {
         "data": {
@@ -2508,7 +2604,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2525,6 +2622,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -2533,24 +2631,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]]], dtype=int16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "UINT32"
       },
       "input_constraints": {
         "data": {
@@ -2560,7 +2658,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2577,6 +2676,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -2584,25 +2684,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (588 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1250 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (563 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (587 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (223 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (325 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2756 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]]], dtype=int16)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (632 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1343 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (707 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (695 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (357 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (340 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (627 us)\nStarting stage: Completion\nCompleted stage: Completion (80 us)\nRun outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "UINT32"
       },
       "input_constraints": {
         "data": {
@@ -2612,7 +2712,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2629,6 +2730,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -2637,24 +2739,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]]], dtype=int16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "UINT64"
       },
       "input_constraints": {
         "data": {
@@ -2664,7 +2766,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2681,6 +2784,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -2689,24 +2793,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "UINT64"
       },
       "input_constraints": {
         "data": {
@@ -2716,7 +2820,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2733,6 +2838,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -2741,24 +2847,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "UINT64"
       },
       "input_constraints": {
         "data": {
@@ -2768,7 +2874,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2785,6 +2892,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -2793,24 +2901,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "INT8"
       },
       "input_constraints": {
         "data": {
@@ -2820,7 +2928,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2837,6 +2946,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -2845,24 +2955,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "INT8"
       },
       "input_constraints": {
         "data": {
@@ -2872,7 +2982,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2889,6 +3000,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -2897,24 +3009,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "INT8"
       },
       "input_constraints": {
         "data": {
@@ -2924,7 +3036,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2941,6 +3054,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -2949,24 +3063,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 0]]]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "INT16"
       },
       "input_constraints": {
         "data": {
@@ -2976,7 +3090,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -2993,6 +3108,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -3001,24 +3117,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "INT16"
       },
       "input_constraints": {
         "data": {
@@ -3028,7 +3144,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -3045,6 +3162,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -3053,24 +3171,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "INT16"
       },
       "input_constraints": {
         "data": {
@@ -3080,7 +3198,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -3097,6 +3216,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -3105,24 +3225,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]]], dtype=int8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "INT32"
       },
       "input_constraints": {
         "data": {
@@ -3132,7 +3252,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -3149,6 +3270,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -3157,24 +3279,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]], dtype=uint16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "INT32"
       },
       "input_constraints": {
         "data": {
@@ -3184,7 +3306,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -3201,6 +3324,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -3208,25 +3332,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
-          },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (483 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1054 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (541 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (572 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (936 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]], dtype=uint16)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (483 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1024 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (679 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (563 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (218 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (261 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (872 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\nRun outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "INT32"
       },
       "input_constraints": {
         "data": {
@@ -3236,7 +3360,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -3253,6 +3378,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -3261,24 +3387,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 1]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]], dtype=uint16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -3288,7 +3414,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -3305,6 +3432,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -3313,24 +3441,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -3340,7 +3468,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -3357,6 +3486,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -3365,24 +3495,186 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "Timeout/crash/fail for 1 attempts: A process in the process pool was terminated abruptly while the future was running or pending."
+          },
+          "stdout": null,
+          "stderr": null
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: A process in the process pool was terminated abruptly while the future was running or pending."
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 1]],\n\n         [[1, 0]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[1, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.288  , 0.524  ]],\n\n         [[0.3127 , 0.4429 ]]],\n\n\n        [[[0.4634 , 0.2025 ]],\n\n         [[0.01224, 0.4246 ]]],\n\n\n        [[[0.366  , 0.82   ]],\n\n         [[0.268  , 0.8643 ]]]],\n\n\n\n       [[[[0.538  , 0.1625 ]],\n\n         [[0.4614 , 0.9854 ]]],\n\n\n        [[[0.7344 , 0.4775 ]],\n\n         [[0.1675 , 0.1559 ]]],\n\n\n        [[[0.342  , 0.704  ]],\n\n         [[0.756  , 0.408  ]]]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (482 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1007 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (553 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (536 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (208 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (355 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (880 us)\nStarting stage: Completion\nCompleted stage: Completion (63 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (639 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1067 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (560 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (530 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2792 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[[[[0.288  , 0.524  ]],\n\n         [[0.3127 , 0.4429 ]]],\n\n\n        [[[0.4634 , 0.2025 ]],\n\n         [[0.01224, 0.4246 ]]],\n\n\n        [[[0.366  , 0.82   ]],\n\n         [[0.268  , 0.8643 ]]]],\n\n\n\n       [[[[0.538  , 0.1625 ]],\n\n         [[0.4614 , 0.9854 ]]],\n\n\n        [[[0.7344 , 0.4775 ]],\n\n         [[0.1675 , 0.1559 ]]],\n\n\n        [[[0.342  , 0.704  ]],\n\n         [[0.756  , 0.408  ]]]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT16"
       },
       "input_constraints": {
         "data": {
@@ -3392,7 +3684,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -3409,6 +3702,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -3417,24 +3711,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[1, 0]]],\n\n\n        [[[1, 0]],\n\n         [[0, 0]]]],\n\n\n\n       [[[[0, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0.288  , 0.524  ]],\n\n         [[0.3127 , 0.4429 ]]],\n\n\n        [[[0.4634 , 0.2025 ]],\n\n         [[0.01224, 0.4246 ]]],\n\n\n        [[[0.366  , 0.82   ]],\n\n         [[0.268  , 0.8643 ]]]],\n\n\n\n       [[[[0.538  , 0.1625 ]],\n\n         [[0.4614 , 0.9854 ]]],\n\n\n        [[[0.7344 , 0.4775 ]],\n\n         [[0.1675 , 0.1559 ]]],\n\n\n        [[[0.342  , 0.704  ]],\n\n         [[0.756  , 0.408  ]]]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -3444,7 +3738,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -3461,6 +3756,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -3469,24 +3765,78 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[[[0.95028025, 0.2468104 ]],\n\n         [[0.20439683, 0.37763873]]],\n\n\n        [[[0.09010915, 0.31433827]],\n\n         [[0.36242837, 0.24815027]]],\n\n\n        [[[0.03979172, 0.2304278 ]],\n\n         [[0.19243203, 0.81435317]]]],\n\n\n\n       [[[[0.4089026 , 0.6417816 ]],\n\n         [[0.95892185, 0.38288617]]],\n\n\n        [[[0.7642732 , 0.245576  ]],\n\n         [[0.34932667, 0.8457854 ]]],\n\n\n        [[[0.02115926, 0.43220004]],\n\n         [[0.7304893 , 0.7867989 ]]]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "FLOAT"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (513 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (2112 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (592 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (762 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (216 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (263 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=12288\nread_total_bytes=4096\n\nCompleted stage: Finalizing Graph Sequence (2909 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (665 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (2059 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (676 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (986 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (256 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (271 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=12288\nread_total_bytes=4096\n\nCompleted stage: Finalizing Graph Sequence (2754 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\nRun outputs: [array([[[[[0.9501954 , 0.24682619]],\n\n         [[0.20434572, 0.37768558]]],\n\n\n        [[[0.0900879 , 0.31445315]],\n\n         [[0.36254886, 0.24816896]]],\n\n\n        [[[0.03979493, 0.23046876]],\n\n         [[0.19238283, 0.8144532 ]]]],\n\n\n\n       [[[[0.40893558, 0.6416016 ]],\n\n         [[0.95898443, 0.38281253]]],\n\n\n        [[[0.7641602 , 0.24560548]],\n\n         [[0.34936526, 0.8457032 ]]],\n\n\n        [[[0.02116394, 0.43212894]],\n\n         [[0.7304688 , 0.78662115]]]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -3496,7 +3846,8 @@
             3,
             2,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -3511,8 +3862,1657 @@
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.95028025, 0.2468104 ]],\n\n         [[0.20439683, 0.37763873]]],\n\n\n        [[[0.09010915, 0.31433827]],\n\n         [[0.36242837, 0.24815027]]],\n\n\n        [[[0.03979172, 0.2304278 ]],\n\n         [[0.19243203, 0.81435317]]]],\n\n\n\n       [[[[0.4089026 , 0.6417816 ]],\n\n         [[0.95892185, 0.38288617]]],\n\n\n        [[[0.7642732 , 0.245576  ]],\n\n         [[0.34932667, 0.8457854 ]]],\n\n\n        [[[0.02115926, 0.43220004]],\n\n         [[0.7304893 , 0.7867989 ]]]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.79210038, 0.60177336]],\n\n         [[0.4632819 , 0.65442976]]],\n\n\n        [[[0.96849369, 0.07982261]],\n\n         [[0.39645548, 0.23665723]]],\n\n\n        [[[0.74176789, 0.57894562]],\n\n         [[0.12453678, 0.69615266]]]],\n\n\n\n       [[[[0.61764472, 0.75840641]],\n\n         [[0.96470109, 0.91475654]]],\n\n\n        [[[0.76691218, 0.46454851]],\n\n         [[0.7617497 , 0.94924577]]],\n\n\n        [[[0.19961647, 0.09542246]],\n\n         [[0.57588561, 0.85517519]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.79210038, 0.60177336]],\n\n         [[0.4632819 , 0.65442976]]],\n\n\n        [[[0.96849369, 0.07982261]],\n\n         [[0.39645548, 0.23665723]]],\n\n\n        [[[0.74176789, 0.57894562]],\n\n         [[0.12453678, 0.69615266]]]],\n\n\n\n       [[[[0.61764472, 0.75840641]],\n\n         [[0.96470109, 0.91475654]]],\n\n\n        [[[0.76691218, 0.46454851]],\n\n         [[0.7617497 , 0.94924577]]],\n\n\n        [[[0.19961647, 0.09542246]],\n\n         [[0.57588561, 0.85517519]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "DOUBLE"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[0.79210038, 0.60177336]],\n\n         [[0.4632819 , 0.65442976]]],\n\n\n        [[[0.96849369, 0.07982261]],\n\n         [[0.39645548, 0.23665723]]],\n\n\n        [[[0.74176789, 0.57894562]],\n\n         [[0.12453678, 0.69615266]]]],\n\n\n\n       [[[[0.61764472, 0.75840641]],\n\n         [[0.96470109, 0.91475654]]],\n\n\n        [[[0.76691218, 0.46454851]],\n\n         [[0.7617497 , 0.94924577]]],\n\n\n        [[[0.19961647, 0.09542246]],\n\n         [[0.57588561, 0.85517519]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True, False]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (497 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1108 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (562 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (558 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (235 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (328 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (790 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (499 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1135 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (615 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (593 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (322 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (272 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1046 us)\nStarting stage: Completion\nCompleted stage: Completion (68 us)\nRun outputs: [array([[[[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True, False]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "BOOL"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[[[[ True,  True]],\n\n         [[False, False]]],\n\n\n        [[[ True,  True]],\n\n         [[False,  True]]],\n\n\n        [[[False, False]],\n\n         [[False,  True]]]],\n\n\n\n       [[[[ True,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False,  True]],\n\n         [[ True,  True]]],\n\n\n        [[[False, False]],\n\n         [[ True, False]]]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT4"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            2,
+            2
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            2,
+            3,
+            2,
+            1,
+            2
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 1
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
+          },
+          "stdout": null,
+          "stderr": null
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0, 1, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 1, 0, 0],\n       [0, 1, 1, 1],\n       [1, 0, 0, 1]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (493 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (828 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (533 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (579 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (206 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (281 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2722 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (504 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (898 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (546 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (565 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (202 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (952 us)\nStarting stage: Completion\nCompleted stage: Completion (63 us)\nRun outputs: [array([[0, 1, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 1, 0, 0],\n       [0, 1, 1, 1],\n       [1, 0, 0, 1]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0, 1, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 1, 0, 0],\n       [0, 1, 1, 1],\n       [1, 0, 0, 1]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [0, 0, 0, 0],\n       [1, 1, 0, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 1],\n       [1, 1, 0, 1]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [0, 0, 0, 0],\n       [1, 1, 0, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 1],\n       [1, 1, 0, 1]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [0, 0, 0, 0],\n       [1, 1, 0, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 1],\n       [1, 1, 0, 1]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 1, 1],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 1, 1],\n       [1, 1, 1, 1],\n       [1, 1, 0, 1]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (469 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (874 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (541 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (536 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (259 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (867 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (471 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (892 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (539 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (617 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (204 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (257 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2685 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[1, 1, 1, 1],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 1, 1],\n       [1, 1, 1, 1],\n       [1, 1, 0, 1]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 1, 1],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 1, 1],\n       [1, 1, 1, 1],\n       [1, 1, 0, 1]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 0, 1, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [1, 1, 1, 0]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 0, 1, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [1, 1, 1, 0]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "UINT64"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 0, 1, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [1, 1, 1, 0]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 1],\n       [0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 0, 0],\n       [0, 0, 1, 0]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 1],\n       [0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 0, 0],\n       [0, 0, 1, 0]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT8"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 1],\n       [0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 0, 0],\n       [0, 0, 1, 0]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [1, 0, 0, 1],\n       [0, 0, 1, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 0],\n       [1, 1, 0, 1]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": true
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [1, 0, 0, 1],\n       [0, 0, 1, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 0],\n       [1, 1, 0, 1]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT16"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": false,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [1, 0, 0, 1],\n       [0, 0, 1, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 0],\n       [1, 1, 0, 1]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
+      },
+      "dynamic_axes": {},
+      "input_is_constant": {
+        "data": true,
+        "shape": false
+      },
+      "check_result": {
+        "compile": {
+          "result": {
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+          },
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        },
+        "run": {
+          "result": {
+            "success": true,
+            "reason": null
+          },
+          "stdout": "Run outputs: [array([[0, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 0, 0],\n       [0, 0, 1, 0],\n       [0, 0, 1, 1]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
+        }
+      }
+    },
+    {
+      "type_vars": {
+        "T_Reshape": "INT32"
+      },
+      "input_constraints": {
+        "data": {
+          "type": "shape",
+          "shape": [
+            2,
+            3,
+            4
+          ],
+          "min_max": null
+        },
+        "shape": {
+          "type": "value",
+          "value": [
+            6,
+            4
+          ],
+          "dtype": "int64"
+        }
+      },
+      "attrs": {
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -3520,25 +5520,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (528 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (868 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (584 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (639 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (222 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (257 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2702 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (478 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1210 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (846 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (616 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (262 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2688 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[0, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 0, 0],\n       [0, 0, 1, 0],\n       [0, 0, 1, 1]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "INT32"
       },
       "input_constraints": {
         "data": {
@@ -3546,25 +5546,23 @@
           "shape": [
             2,
             3,
-            2,
-            2
-          ]
+            4
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
           "value": [
-            2,
-            3,
-            2,
-            1,
-            2
+            6,
+            4
           ],
           "dtype": "int64"
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -3573,24 +5571,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]],\n\n\n        [[[1, 0]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]]]], dtype=uint64)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 0, 0],\n       [0, 0, 1, 0],\n       [0, 0, 1, 1]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -3598,25 +5596,23 @@
           "shape": [
             2,
             3,
-            2,
-            2
-          ]
+            4
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
           "value": [
-            2,
-            3,
-            2,
-            1,
-            2
+            6,
+            4
           ],
           "dtype": "int64"
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -3625,24 +5621,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0, 0, 0, 0],\n       [0, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 1, 1],\n       [1, 0, 0, 1]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -3650,25 +5646,23 @@
           "shape": [
             2,
             3,
-            2,
-            2
-          ]
+            4
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
           "value": [
-            2,
-            3,
-            2,
-            1,
-            2
+            6,
+            4
           ],
           "dtype": "int64"
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -3676,25 +5670,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (785 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1466 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (630 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (885 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (327 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (271 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1678 us)\nStarting stage: Completion\nCompleted stage: Completion (69 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (508 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1200 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (553 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (588 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (206 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (259 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1016 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\nRun outputs: [array([[0, 0, 0, 0],\n       [0, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 1, 1],\n       [1, 0, 0, 1]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -3702,25 +5696,23 @@
           "shape": [
             2,
             3,
-            2,
-            2
-          ]
+            4
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
           "value": [
-            2,
-            3,
-            2,
-            1,
-            2
+            6,
+            4
           ],
           "dtype": "int64"
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -3729,24 +5721,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[[[0, 1]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 0]],\n\n         [[0, 1]]]],\n\n\n\n       [[[[1, 0]],\n\n         [[0, 0]]],\n\n\n        [[[0, 0]],\n\n         [[1, 0]]],\n\n\n        [[[0, 1]],\n\n         [[0, 1]]]]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0, 0, 0, 0],\n       [0, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 1, 1],\n       [1, 0, 0, 1]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "FLOAT16"
       },
       "input_constraints": {
         "data": {
@@ -3755,7 +5747,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -3769,6 +5762,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -3777,24 +5771,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[False,  True, False, False],\n       [False,  True, False, False],\n       [False, False,  True, False],\n       [ True,  True,  True,  True],\n       [ True,  True,  True,  True],\n       [ True,  True,  True,  True]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0.56   , 0.6523 , 0.4167 , 0.1021 ],\n       [0.89   , 0.206  , 0.421  , 0.3638 ],\n       [0.51   , 0.12274, 0.1451 , 0.571  ],\n       [0.3245 , 0.496  , 0.1783 , 0.7173 ],\n       [0.4492 , 0.705  , 0.2454 , 0.03049],\n       [0.2345 , 0.891  , 0.1499 , 0.957  ]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "FLOAT16"
       },
       "input_constraints": {
         "data": {
@@ -3803,7 +5797,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -3817,6 +5812,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -3827,22 +5823,22 @@
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (239 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (951 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (512 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (394 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (43 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (29 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (181 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (622 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (914 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (784 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (680 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (216 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (347 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (932 us)\nStarting stage: Completion\nCompleted stage: Completion (68 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (215 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (704 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (414 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (515 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (50 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (29 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (282 us)\nStarting stage: Completion\nCompleted stage: Completion (10 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[False,  True, False, False],\n       [False,  True, False, False],\n       [False, False,  True, False],\n       [ True,  True,  True,  True],\n       [ True,  True,  True,  True],\n       [ True,  True,  True,  True]])]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (531 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1094 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (640 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (579 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (294 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (331 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2768 us)\nStarting stage: Completion\nCompleted stage: Completion (76 us)\nRun outputs: [array([[0.56   , 0.6523 , 0.4167 , 0.1021 ],\n       [0.89   , 0.206  , 0.421  , 0.3638 ],\n       [0.51   , 0.12274, 0.1451 , 0.571  ],\n       [0.3245 , 0.496  , 0.1783 , 0.7173 ],\n       [0.4492 , 0.705  , 0.2454 , 0.03049],\n       [0.2345 , 0.891  , 0.1499 , 0.957  ]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "FLOAT16"
       },
       "input_constraints": {
         "data": {
@@ -3851,7 +5847,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -3865,6 +5862,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -3873,24 +5871,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[False,  True, False, False],\n       [False,  True, False, False],\n       [False, False,  True, False],\n       [ True,  True,  True,  True],\n       [ True,  True,  True,  True],\n       [ True,  True,  True,  True]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0.56   , 0.6523 , 0.4167 , 0.1021 ],\n       [0.89   , 0.206  , 0.421  , 0.3638 ],\n       [0.51   , 0.12274, 0.1451 , 0.571  ],\n       [0.3245 , 0.496  , 0.1783 , 0.7173 ],\n       [0.4492 , 0.705  , 0.2454 , 0.03049],\n       [0.2345 , 0.891  , 0.1499 , 0.957  ]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -3899,7 +5897,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -3913,6 +5912,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -3921,24 +5921,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.60983291, 0.2778478 , 0.1380285 , 0.3156136 ],\n       [0.81198878, 0.1938387 , 0.33014191, 0.12242782],\n       [0.25305774, 0.16749546, 0.96974912, 0.57184838],\n       [0.6548838 , 0.06171988, 0.1094998 , 0.97621954],\n       [0.66272206, 0.61239432, 0.59846803, 0.07054358],\n       [0.27664948, 0.60089015, 0.06975854, 0.52541525]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0.15106224, 0.27046126, 0.08752598, 0.3377456 ],\n       [0.91206604, 0.07197218, 0.8500704 , 0.06078569],\n       [0.48790687, 0.9228181 , 0.03722728, 0.76907235],\n       [0.62741214, 0.9071317 , 0.67140186, 0.4399309 ],\n       [0.18454204, 0.27770287, 0.04102697, 0.30583474],\n       [0.35007593, 0.6697418 , 0.94376886, 0.46025437]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -3947,7 +5947,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -3961,6 +5962,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -3968,25 +5970,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (492 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1181 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (773 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (881 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (282 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (303 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1321 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.60983291, 0.2778478 , 0.1380285 , 0.3156136 ],\n       [0.81198878, 0.1938387 , 0.33014191, 0.12242782],\n       [0.25305774, 0.16749546, 0.96974912, 0.57184838],\n       [0.6548838 , 0.06171988, 0.1094998 , 0.97621954],\n       [0.66272206, 0.61239432, 0.59846803, 0.07054358],\n       [0.27664948, 0.60089015, 0.06975854, 0.52541525]])]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (573 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1134 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (550 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (576 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (204 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (261 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (876 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[0.15112306, 0.27050784, 0.08752442, 0.3376465 ],\n       [0.91210943, 0.07196046, 0.8500977 , 0.06079102],\n       [0.487793  , 0.9228516 , 0.03723145, 0.769043  ],\n       [0.62744147, 0.9072266 , 0.6713868 , 0.43994144],\n       [0.18457033, 0.27758792, 0.04101563, 0.30590823],\n       [0.3500977 , 0.66992193, 0.9438477 , 0.4602051 ]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -3995,7 +5997,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4009,6 +6012,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -4017,24 +6021,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.60983291, 0.2778478 , 0.1380285 , 0.3156136 ],\n       [0.81198878, 0.1938387 , 0.33014191, 0.12242782],\n       [0.25305774, 0.16749546, 0.96974912, 0.57184838],\n       [0.6548838 , 0.06171988, 0.1094998 , 0.97621954],\n       [0.66272206, 0.61239432, 0.59846803, 0.07054358],\n       [0.27664948, 0.60089015, 0.06975854, 0.52541525]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0.15106224, 0.27046126, 0.08752598, 0.3377456 ],\n       [0.91206604, 0.07197218, 0.8500704 , 0.06078569],\n       [0.48790687, 0.9228181 , 0.03722728, 0.76907235],\n       [0.62741214, 0.9071317 , 0.67140186, 0.4399309 ],\n       [0.18454204, 0.27770287, 0.04102697, 0.30583474],\n       [0.35007593, 0.6697418 , 0.94376886, 0.46025437]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "DOUBLE"
       },
       "input_constraints": {
         "data": {
@@ -4043,7 +6047,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4057,6 +6062,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -4065,24 +6071,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.87837684, 0.12496535, 0.6410792 , 0.8155647 ],\n       [0.20442708, 0.07853543, 0.17279339, 0.39599288],\n       [0.68938524, 0.46515653, 0.47645187, 0.05560424],\n       [0.6362691 , 0.1925105 , 0.75034213, 0.2018522 ],\n       [0.63343334, 0.7399463 , 0.7165239 , 0.20509113],\n       [0.53105617, 0.07783564, 0.00458732, 0.82393163]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0.71700709, 0.19760002, 0.61780984, 0.31983466],\n       [0.59794199, 0.57715688, 0.6881818 , 0.67773427],\n       [0.50443168, 0.76637021, 0.07076356, 0.60439345],\n       [0.86926494, 0.9636245 , 0.58854585, 0.61047817],\n       [0.65700502, 0.34409379, 0.49143779, 0.56194767],\n       [0.405834  , 0.9617059 , 0.88996155, 0.06803201]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "DOUBLE"
       },
       "input_constraints": {
         "data": {
@@ -4091,7 +6097,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4105,6 +6112,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -4112,25 +6120,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (234 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (875 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (376 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (374 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (39 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (25 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (818 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (223 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (871 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (354 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (381 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (38 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (21 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (825 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.878418  , 0.12493897, 0.64111334, 0.81542975],\n       [0.20446779, 0.07855225, 0.17285158, 0.39599612],\n       [0.6894532 , 0.46508792, 0.47656253, 0.05560303],\n       [0.6362305 , 0.1925049 , 0.75048834, 0.20190431],\n       [0.63330084, 0.73974615, 0.71630865, 0.20507814],\n       [0.53125006, 0.07781983, 0.00458908, 0.8237305 ]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0.71700709, 0.19760002, 0.61780984, 0.31983466],\n       [0.59794199, 0.57715688, 0.6881818 , 0.67773427],\n       [0.50443168, 0.76637021, 0.07076356, 0.60439345],\n       [0.86926494, 0.9636245 , 0.58854585, 0.61047817],\n       [0.65700502, 0.34409379, 0.49143779, 0.56194767],\n       [0.405834  , 0.9617059 , 0.88996155, 0.06803201]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "DOUBLE"
       },
       "input_constraints": {
         "data": {
@@ -4139,7 +6147,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4153,6 +6162,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -4161,24 +6171,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.87837684, 0.12496535, 0.6410792 , 0.8155647 ],\n       [0.20442708, 0.07853543, 0.17279339, 0.39599288],\n       [0.68938524, 0.46515653, 0.47645187, 0.05560424],\n       [0.6362691 , 0.1925105 , 0.75034213, 0.2018522 ],\n       [0.63343334, 0.7399463 , 0.7165239 , 0.20509113],\n       [0.53105617, 0.07783564, 0.00458732, 0.82393163]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0.71700709, 0.19760002, 0.61780984, 0.31983466],\n       [0.59794199, 0.57715688, 0.6881818 , 0.67773427],\n       [0.50443168, 0.76637021, 0.07076356, 0.60439345],\n       [0.86926494, 0.9636245 , 0.58854585, 0.61047817],\n       [0.65700502, 0.34409379, 0.49143779, 0.56194767],\n       [0.405834  , 0.9617059 , 0.88996155, 0.06803201]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "BOOL"
       },
       "input_constraints": {
         "data": {
@@ -4187,7 +6197,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4201,6 +6212,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -4209,24 +6221,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.317  , 0.4302 , 0.3193 , 0.5054 ],\n       [0.10986, 0.58   , 0.3728 , 0.759  ],\n       [0.69   , 0.1855 , 0.2294 , 0.0855 ],\n       [0.2172 , 0.837  , 0.4014 , 0.5117 ],\n       [0.733  , 0.2405 , 0.2776 , 0.6704 ],\n       [0.84   , 0.667  , 0.4236 , 0.937  ]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[ True,  True, False, False],\n       [ True,  True,  True,  True],\n       [False,  True,  True, False],\n       [False, False,  True,  True],\n       [ True,  True, False,  True],\n       [ True,  True, False,  True]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "BOOL"
       },
       "input_constraints": {
         "data": {
@@ -4235,7 +6247,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4249,6 +6262,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -4259,22 +6273,22 @@
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (277 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (780 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (386 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (417 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (54 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (24 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (842 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (636 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (917 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (562 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (585 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (200 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2834 us)\nStarting stage: Completion\nCompleted stage: Completion (68 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (214 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (704 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (370 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (335 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (38 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (21 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (807 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.317  , 0.4302 , 0.3193 , 0.5054 ],\n       [0.10986, 0.58   , 0.3728 , 0.759  ],\n       [0.69   , 0.1855 , 0.2294 , 0.0855 ],\n       [0.2172 , 0.837  , 0.4014 , 0.5117 ],\n       [0.733  , 0.2405 , 0.2776 , 0.6704 ],\n       [0.84   , 0.667  , 0.4236 , 0.937  ]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (504 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (857 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (536 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (530 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (200 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (4273 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[ True,  True, False, False],\n       [ True,  True,  True,  True],\n       [False,  True,  True, False],\n       [False, False,  True,  True],\n       [ True,  True, False,  True],\n       [ True,  True, False,  True]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "BOOL"
       },
       "input_constraints": {
         "data": {
@@ -4283,7 +6297,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4297,6 +6312,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -4305,24 +6321,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.317  , 0.4302 , 0.3193 , 0.5054 ],\n       [0.10986, 0.58   , 0.3728 , 0.759  ],\n       [0.69   , 0.1855 , 0.2294 , 0.0855 ],\n       [0.2172 , 0.837  , 0.4014 , 0.5117 ],\n       [0.733  , 0.2405 , 0.2776 , 0.6704 ],\n       [0.84   , 0.667  , 0.4236 , 0.937  ]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[ True,  True, False, False],\n       [ True,  True,  True,  True],\n       [False,  True,  True, False],\n       [False, False,  True,  True],\n       [ True,  True, False,  True],\n       [ True,  True, False,  True]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "UINT4"
       },
       "input_constraints": {
         "data": {
@@ -4331,7 +6347,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4345,6 +6362,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -4353,24 +6371,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [1, 1, 1, 1]], dtype=int16)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "UINT4"
       },
       "input_constraints": {
         "data": {
@@ -4379,7 +6397,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4393,6 +6412,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -4401,24 +6421,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:38:55.6469635 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [1, 1, 1, 1]], dtype=int16)]\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:38:56.0605084 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "UINT4"
       },
       "input_constraints": {
         "data": {
@@ -4427,7 +6447,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4441,6 +6462,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -4449,24 +6471,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [1, 1, 1, 1]], dtype=int16)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "INT4"
       },
       "input_constraints": {
         "data": {
@@ -4475,7 +6497,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4489,6 +6512,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -4497,24 +6521,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 0],\n       [1, 0, 0, 1],\n       [0, 1, 1, 0],\n       [0, 1, 0, 1],\n       [0, 1, 1, 0],\n       [0, 1, 0, 0]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "INT4"
       },
       "input_constraints": {
         "data": {
@@ -4523,7 +6547,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4537,6 +6562,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -4544,25 +6570,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (226 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (926 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (407 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (376 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (44 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (23 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (798 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (201 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (652 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (379 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (340 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (42 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (28 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1046 us)\nStarting stage: Completion\nCompleted stage: Completion (12 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 0],\n       [1, 0, 0, 1],\n       [0, 1, 1, 0],\n       [0, 1, 0, 1],\n       [0, 1, 1, 0],\n       [0, 1, 0, 0]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "INT4"
       },
       "input_constraints": {
         "data": {
@@ -4571,7 +6597,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4585,6 +6612,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -4593,24 +6621,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 0],\n       [1, 0, 0, 1],\n       [0, 1, 1, 0],\n       [0, 1, 0, 1],\n       [0, 1, 1, 0],\n       [0, 1, 0, 0]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "UINT8"
       },
       "input_constraints": {
         "data": {
@@ -4619,7 +6647,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4631,8 +6660,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -4641,24 +6671,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 0],\n       [0, 1, 1, 0],\n       [0, 1, 1, 0],\n       [1, 1, 0, 1],\n       [1, 1, 1, 0],\n       [1, 1, 1, 0]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0, 1, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 1, 0, 0],\n       [0, 1, 1, 1],\n       [1, 0, 0, 1]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "UINT8"
       },
       "input_constraints": {
         "data": {
@@ -4667,7 +6697,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4679,8 +6710,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -4691,22 +6723,22 @@
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (227 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (994 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (385 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (391 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (39 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (22 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (224 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (619 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1095 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (637 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (707 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (244 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (303 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (707 us)\nStarting stage: Completion\nCompleted stage: Completion (70 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (189 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (925 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (378 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (381 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (41 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (23 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (164 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 0],\n       [0, 1, 1, 0],\n       [0, 1, 1, 0],\n       [1, 1, 0, 1],\n       [1, 1, 1, 0],\n       [1, 1, 1, 0]])]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (615 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1154 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (623 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (537 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (210 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (270 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2732 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[0, 1, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 1, 0, 0],\n       [0, 1, 1, 1],\n       [1, 0, 0, 1]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "UINT8"
       },
       "input_constraints": {
         "data": {
@@ -4715,7 +6747,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4727,8 +6760,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -4737,24 +6771,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 0],\n       [0, 1, 1, 0],\n       [0, 1, 1, 0],\n       [1, 1, 0, 1],\n       [1, 1, 1, 0],\n       [1, 1, 1, 0]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0, 1, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 1, 0, 0],\n       [0, 1, 1, 1],\n       [1, 0, 0, 1]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "UINT16"
       },
       "input_constraints": {
         "data": {
@@ -4763,7 +6797,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4775,8 +6810,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -4785,24 +6821,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 0, 0],\n       [1, 0, 1, 0],\n       [0, 1, 1, 0],\n       [1, 1, 1, 0],\n       [1, 1, 1, 1],\n       [0, 0, 1, 1]], dtype=int8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [0, 0, 0, 0],\n       [1, 1, 0, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 1],\n       [1, 1, 0, 1]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "UINT16"
       },
       "input_constraints": {
         "data": {
@@ -4811,7 +6847,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4823,8 +6860,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -4833,24 +6871,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:39:00.7873910 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 0, 0],\n       [1, 0, 1, 0],\n       [0, 1, 1, 0],\n       [1, 1, 1, 0],\n       [1, 1, 1, 1],\n       [0, 0, 1, 1]], dtype=int8)]\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:39:01.1774004 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [0, 0, 0, 0],\n       [1, 1, 0, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 1],\n       [1, 1, 0, 1]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "UINT16"
       },
       "input_constraints": {
         "data": {
@@ -4859,7 +6897,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4871,8 +6910,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -4881,24 +6921,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 0, 0],\n       [1, 0, 1, 0],\n       [0, 1, 1, 0],\n       [1, 1, 1, 0],\n       [1, 1, 1, 1],\n       [0, 0, 1, 1]], dtype=int8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [0, 0, 0, 0],\n       [1, 1, 0, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 1],\n       [1, 1, 0, 1]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "UINT32"
       },
       "input_constraints": {
         "data": {
@@ -4907,7 +6947,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4919,8 +6960,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -4929,24 +6971,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 0, 0],\n       [1, 0, 0, 1],\n       [1, 0, 1, 0],\n       [0, 1, 1, 0],\n       [0, 0, 1, 1],\n       [0, 0, 1, 0]], dtype=uint16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[1, 1, 1, 1],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 1, 1],\n       [1, 1, 1, 1],\n       [1, 1, 0, 1]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "UINT32"
       },
       "input_constraints": {
         "data": {
@@ -4955,7 +6997,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -4967,8 +7010,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -4976,25 +7020,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:39:02.4405851 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (481 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1067 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (639 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (560 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (207 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (923 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 0, 0],\n       [1, 0, 0, 1],\n       [1, 0, 1, 0],\n       [0, 1, 1, 0],\n       [0, 0, 1, 1],\n       [0, 0, 1, 0]], dtype=uint16)]\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:39:02.8550678 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (495 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (907 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (557 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (689 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (235 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (257 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2843 us)\nStarting stage: Completion\nCompleted stage: Completion (73 us)\nRun outputs: [array([[1, 1, 1, 1],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 1, 1],\n       [1, 1, 1, 1],\n       [1, 1, 0, 1]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "UINT32"
       },
       "input_constraints": {
         "data": {
@@ -5003,7 +7047,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5015,8 +7060,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -5025,24 +7071,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 0, 0],\n       [1, 0, 0, 1],\n       [1, 0, 1, 0],\n       [0, 1, 1, 0],\n       [0, 0, 1, 1],\n       [0, 0, 1, 0]], dtype=uint16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[1, 1, 1, 1],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 1, 1],\n       [1, 1, 1, 1],\n       [1, 1, 0, 1]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "UINT64"
       },
       "input_constraints": {
         "data": {
@@ -5051,7 +7097,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5063,8 +7110,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -5073,24 +7121,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 0],\n       [1, 0, 0, 0],\n       [1, 0, 0, 0],\n       [0, 1, 0, 0],\n       [1, 1, 0, 1],\n       [1, 1, 0, 1]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[1, 0, 1, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [1, 1, 1, 0]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "UINT64"
       },
       "input_constraints": {
         "data": {
@@ -5099,7 +7147,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5111,8 +7160,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -5120,25 +7170,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (237 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (707 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (334 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (320 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (33 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (19 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (789 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (219 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (562 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (414 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (341 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (36 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (21 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (792 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 0],\n       [1, 0, 0, 0],\n       [1, 0, 0, 0],\n       [0, 1, 0, 0],\n       [1, 1, 0, 1],\n       [1, 1, 0, 1]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[1, 0, 1, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [1, 1, 1, 0]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "UINT64"
       },
       "input_constraints": {
         "data": {
@@ -5147,7 +7197,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5159,8 +7210,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -5169,24 +7221,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 0],\n       [1, 0, 0, 0],\n       [1, 0, 0, 0],\n       [0, 1, 0, 0],\n       [1, 1, 0, 1],\n       [1, 1, 0, 1]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[1, 0, 1, 1],\n       [0, 0, 1, 0],\n       [1, 0, 0, 0],\n       [0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [1, 1, 1, 0]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "INT8"
       },
       "input_constraints": {
         "data": {
@@ -5195,7 +7247,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5207,8 +7260,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -5217,24 +7271,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [0, 1, 1, 1],\n       [0, 0, 0, 1],\n       [1, 1, 1, 0],\n       [0, 1, 0, 0]], dtype=uint64)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[1, 1, 0, 1],\n       [0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 0, 0],\n       [0, 0, 1, 0]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "INT8"
       },
       "input_constraints": {
         "data": {
@@ -5243,7 +7297,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5255,8 +7310,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -5265,24 +7321,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:39:05.6863299 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [0, 1, 1, 1],\n       [0, 0, 0, 1],\n       [1, 1, 1, 0],\n       [0, 1, 0, 0]], dtype=uint64)]\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:39:06.0706942 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "Run outputs: [array([[1, 1, 0, 1],\n       [0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 0, 0],\n       [0, 0, 1, 0]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "INT8"
       },
       "input_constraints": {
         "data": {
@@ -5291,7 +7347,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5303,8 +7360,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -5313,24 +7371,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 0],\n       [0, 1, 1, 0],\n       [0, 1, 1, 1],\n       [0, 0, 0, 1],\n       [1, 1, 1, 0],\n       [0, 1, 0, 0]], dtype=uint64)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[1, 1, 0, 1],\n       [0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 0, 0],\n       [0, 0, 1, 0]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "INT16"
       },
       "input_constraints": {
         "data": {
@@ -5339,7 +7397,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5351,8 +7410,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -5361,24 +7421,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 1],\n       [0, 1, 1, 0],\n       [1, 1, 0, 1],\n       [1, 1, 0, 0],\n       [0, 1, 0, 0],\n       [1, 0, 0, 0]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [1, 0, 0, 1],\n       [0, 0, 1, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 0],\n       [1, 1, 0, 1]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "INT16"
       },
       "input_constraints": {
         "data": {
@@ -5387,7 +7447,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5399,8 +7460,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -5408,25 +7470,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (178 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (769 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (364 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (439 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (50 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (23 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (231 us)\nStarting stage: Completion\nCompleted stage: Completion (10 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (228 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (878 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (333 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (343 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (33 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (19 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (793 us)\nStarting stage: Completion\nCompleted stage: Completion (8 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 1],\n       [0, 1, 1, 0],\n       [1, 1, 0, 1],\n       [1, 1, 0, 0],\n       [0, 1, 0, 0],\n       [1, 0, 0, 0]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [1, 0, 0, 1],\n       [0, 0, 1, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 0],\n       [1, 1, 0, 1]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "INT16"
       },
       "input_constraints": {
         "data": {
@@ -5435,7 +7497,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5447,8 +7510,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -5457,24 +7521,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 1],\n       [0, 1, 1, 0],\n       [1, 1, 0, 1],\n       [1, 1, 0, 0],\n       [0, 1, 0, 0],\n       [1, 0, 0, 0]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[1, 1, 0, 0],\n       [1, 0, 0, 1],\n       [0, 0, 1, 1],\n       [0, 0, 1, 1],\n       [1, 0, 1, 0],\n       [1, 1, 0, 1]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "INT32"
       },
       "input_constraints": {
         "data": {
@@ -5483,7 +7547,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5497,6 +7562,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -5505,24 +7571,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[False, False, False,  True],\n       [False,  True,  True,  True],\n       [False,  True,  True, False],\n       [ True, False,  True,  True],\n       [False,  True,  True, False],\n       [False,  True, False, False]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 0, 0],\n       [0, 0, 1, 0],\n       [0, 0, 1, 1]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "INT32"
       },
       "input_constraints": {
         "data": {
@@ -5531,7 +7597,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5545,6 +7612,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -5552,25 +7620,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (693 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (864 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (678 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (609 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (240 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (264 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (959 us)\nStarting stage: Completion\nCompleted stage: Completion (78 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[False, False, False,  True],\n       [False,  True,  True,  True],\n       [False,  True,  True, False],\n       [ True, False,  True,  True],\n       [False,  True,  True, False],\n       [False,  True, False, False]])]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (593 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (913 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (571 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (539 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (206 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (255 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (852 us)\nStarting stage: Completion\nCompleted stage: Completion (63 us)\nRun outputs: [array([[0, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 0, 0],\n       [0, 0, 1, 0],\n       [0, 0, 1, 1]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "INT32"
       },
       "input_constraints": {
         "data": {
@@ -5579,7 +7647,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5593,6 +7662,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -5601,24 +7671,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[False, False, False,  True],\n       [False,  True,  True,  True],\n       [False,  True,  True, False],\n       [ True, False,  True,  True],\n       [False,  True,  True, False],\n       [False,  True, False, False]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0, 0, 1, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 0, 0],\n       [0, 0, 1, 0],\n       [0, 0, 1, 1]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -5627,7 +7697,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5641,6 +7712,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -5649,24 +7721,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.94668719, 0.07156254, 0.62799387, 0.8700416 ],\n       [0.2376107 , 0.93218801, 0.11087389, 0.95768188],\n       [0.72714199, 0.18984931, 0.72178009, 0.18397676],\n       [0.30381081, 0.15009548, 0.41786938, 0.60956376],\n       [0.09410743, 0.04554653, 0.14003383, 0.71575823],\n       [0.74410953, 0.93568437, 0.93866713, 0.67738864]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0, 0, 0, 0],\n       [0, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 1, 1],\n       [1, 0, 0, 1]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -5675,7 +7747,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5689,6 +7762,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -5696,25 +7770,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (622 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1140 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (571 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (668 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (241 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (269 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2772 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.94668719, 0.07156254, 0.62799387, 0.8700416 ],\n       [0.2376107 , 0.93218801, 0.11087389, 0.95768188],\n       [0.72714199, 0.18984931, 0.72178009, 0.18397676],\n       [0.30381081, 0.15009548, 0.41786938, 0.60956376],\n       [0.09410743, 0.04554653, 0.14003383, 0.71575823],\n       [0.74410953, 0.93568437, 0.93866713, 0.67738864]])]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (520 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1186 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (599 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (674 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (218 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (277 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2741 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\nRun outputs: [array([[0, 0, 0, 0],\n       [0, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 1, 1],\n       [1, 0, 0, 1]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -5723,7 +7797,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5737,6 +7812,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -5745,24 +7821,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.94668719, 0.07156254, 0.62799387, 0.8700416 ],\n       [0.2376107 , 0.93218801, 0.11087389, 0.95768188],\n       [0.72714199, 0.18984931, 0.72178009, 0.18397676],\n       [0.30381081, 0.15009548, 0.41786938, 0.60956376],\n       [0.09410743, 0.04554653, 0.14003383, 0.71575823],\n       [0.74410953, 0.93568437, 0.93866713, 0.67738864]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0, 0, 0, 0],\n       [0, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 1, 0, 0],\n       [1, 1, 1, 1],\n       [1, 0, 0, 1]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "FLOAT16"
       },
       "input_constraints": {
         "data": {
@@ -5771,7 +7847,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5785,6 +7862,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -5793,24 +7871,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.6014032 , 0.12629846, 0.69034153, 0.9281836 ],\n       [0.00974776, 0.1003448 , 0.54806864, 0.41706815],\n       [0.12492406, 0.41772294, 0.8334446 , 0.5056909 ],\n       [0.00798008, 0.43987218, 0.7958037 , 0.90345496],\n       [0.8349494 , 0.9501986 , 0.13919163, 0.27675086],\n       [0.84382546, 0.63057137, 0.29212403, 0.7484627 ]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0.56   , 0.6523 , 0.4167 , 0.1021 ],\n       [0.89   , 0.206  , 0.421  , 0.3638 ],\n       [0.51   , 0.12274, 0.1451 , 0.571  ],\n       [0.3245 , 0.496  , 0.1783 , 0.7173 ],\n       [0.4492 , 0.705  , 0.2454 , 0.03049],\n       [0.2345 , 0.891  , 0.1499 , 0.957  ]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "FLOAT16"
       },
       "input_constraints": {
         "data": {
@@ -5819,7 +7897,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5833,6 +7912,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -5840,25 +7920,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (598 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1163 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (623 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (599 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (220 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (286 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (3053 us)\nStarting stage: Completion\nCompleted stage: Completion (74 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.6014032 , 0.12629846, 0.69034153, 0.9281836 ],\n       [0.00974776, 0.1003448 , 0.54806864, 0.41706815],\n       [0.12492406, 0.41772294, 0.8334446 , 0.5056909 ],\n       [0.00798008, 0.43987218, 0.7958037 , 0.90345496],\n       [0.8349494 , 0.9501986 , 0.13919163, 0.27675086],\n       [0.84382546, 0.63057137, 0.29212403, 0.7484627 ]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (667 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1148 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (573 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (543 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (256 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (271 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2570 us)\nStarting stage: Completion\nCompleted stage: Completion (78 us)\nRun outputs: [array([[0.56   , 0.6523 , 0.4167 , 0.1021 ],\n       [0.89   , 0.206  , 0.421  , 0.3638 ],\n       [0.51   , 0.12274, 0.1451 , 0.571  ],\n       [0.3245 , 0.496  , 0.1783 , 0.7173 ],\n       [0.4492 , 0.705  , 0.2454 , 0.03049],\n       [0.2345 , 0.891  , 0.1499 , 0.957  ]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "FLOAT16"
       },
       "input_constraints": {
         "data": {
@@ -5867,7 +7947,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5881,6 +7962,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -5889,24 +7971,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.6014032 , 0.12629846, 0.69034153, 0.9281836 ],\n       [0.00974776, 0.1003448 , 0.54806864, 0.41706815],\n       [0.12492406, 0.41772294, 0.8334446 , 0.5056909 ],\n       [0.00798008, 0.43987218, 0.7958037 , 0.90345496],\n       [0.8349494 , 0.9501986 , 0.13919163, 0.27675086],\n       [0.84382546, 0.63057137, 0.29212403, 0.7484627 ]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0.56   , 0.6523 , 0.4167 , 0.1021 ],\n       [0.89   , 0.206  , 0.421  , 0.3638 ],\n       [0.51   , 0.12274, 0.1451 , 0.571  ],\n       [0.3245 , 0.496  , 0.1783 , 0.7173 ],\n       [0.4492 , 0.705  , 0.2454 , 0.03049],\n       [0.2345 , 0.891  , 0.1499 , 0.957  ]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -5915,7 +7997,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5929,6 +8012,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -5937,24 +8021,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.2773 , 0.5527 , 0.4353 , 0.9277 ],\n       [0.724  , 0.3755 , 0.07184, 0.2112 ],\n       [0.1699 , 0.6963 , 0.162  , 0.5293 ],\n       [0.1726 , 0.097  , 0.7734 , 0.0512 ],\n       [0.8384 , 0.1781 , 0.4622 , 0.834  ],\n       [0.3992 , 0.7397 , 0.6284 , 0.4368 ]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0.15106224, 0.27046126, 0.08752598, 0.3377456 ],\n       [0.91206604, 0.07197218, 0.8500704 , 0.06078569],\n       [0.48790687, 0.9228181 , 0.03722728, 0.76907235],\n       [0.62741214, 0.9071317 , 0.67140186, 0.4399309 ],\n       [0.18454204, 0.27770287, 0.04102697, 0.30583474],\n       [0.35007593, 0.6697418 , 0.94376886, 0.46025437]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -5963,7 +8047,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -5977,6 +8062,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -5984,25 +8070,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (585 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1878 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (678 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (632 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (246 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (273 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1098 us)\nStarting stage: Completion\nCompleted stage: Completion (75 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.2773 , 0.5527 , 0.4353 , 0.9277 ],\n       [0.724  , 0.3755 , 0.07184, 0.2112 ],\n       [0.1699 , 0.6963 , 0.162  , 0.5293 ],\n       [0.1726 , 0.097  , 0.7734 , 0.0512 ],\n       [0.8384 , 0.1781 , 0.4622 , 0.834  ],\n       [0.3992 , 0.7397 , 0.6284 , 0.4368 ]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (552 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1166 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (611 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (679 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (249 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (296 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2869 us)\nStarting stage: Completion\nCompleted stage: Completion (72 us)\nRun outputs: [array([[0.15112306, 0.27050784, 0.08752442, 0.3376465 ],\n       [0.91210943, 0.07196046, 0.8500977 , 0.06079102],\n       [0.487793  , 0.9228516 , 0.03723145, 0.769043  ],\n       [0.62744147, 0.9072266 , 0.6713868 , 0.43994144],\n       [0.18457033, 0.27758792, 0.04101563, 0.30590823],\n       [0.3500977 , 0.66992193, 0.9438477 , 0.4602051 ]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -6011,7 +8097,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -6025,6 +8112,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -6033,24 +8121,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0.2773 , 0.5527 , 0.4353 , 0.9277 ],\n       [0.724  , 0.3755 , 0.07184, 0.2112 ],\n       [0.1699 , 0.6963 , 0.162  , 0.5293 ],\n       [0.1726 , 0.097  , 0.7734 , 0.0512 ],\n       [0.8384 , 0.1781 , 0.4622 , 0.834  ],\n       [0.3992 , 0.7397 , 0.6284 , 0.4368 ]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0.15106224, 0.27046126, 0.08752598, 0.3377456 ],\n       [0.91206604, 0.07197218, 0.8500704 , 0.06078569],\n       [0.48790687, 0.9228181 , 0.03722728, 0.76907235],\n       [0.62741214, 0.9071317 , 0.67140186, 0.4399309 ],\n       [0.18454204, 0.27770287, 0.04102697, 0.30583474],\n       [0.35007593, 0.6697418 , 0.94376886, 0.46025437]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "DOUBLE"
       },
       "input_constraints": {
         "data": {
@@ -6059,7 +8147,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -6073,6 +8162,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -6081,24 +8171,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 1, 1],\n       [0, 1, 1, 0],\n       [0, 1, 1, 1],\n       [1, 1, 0, 0],\n       [0, 0, 0, 0],\n       [0, 0, 1, 1]], dtype=int16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0.71700709, 0.19760002, 0.61780984, 0.31983466],\n       [0.59794199, 0.57715688, 0.6881818 , 0.67773427],\n       [0.50443168, 0.76637021, 0.07076356, 0.60439345],\n       [0.86926494, 0.9636245 , 0.58854585, 0.61047817],\n       [0.65700502, 0.34409379, 0.49143779, 0.56194767],\n       [0.405834  , 0.9617059 , 0.88996155, 0.06803201]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "DOUBLE"
       },
       "input_constraints": {
         "data": {
@@ -6107,7 +8197,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -6121,6 +8212,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -6129,24 +8221,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 1, 1],\n       [0, 1, 1, 0],\n       [0, 1, 1, 1],\n       [1, 1, 0, 0],\n       [0, 0, 0, 0],\n       [0, 0, 1, 1]], dtype=int16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0.71700709, 0.19760002, 0.61780984, 0.31983466],\n       [0.59794199, 0.57715688, 0.6881818 , 0.67773427],\n       [0.50443168, 0.76637021, 0.07076356, 0.60439345],\n       [0.86926494, 0.9636245 , 0.58854585, 0.61047817],\n       [0.65700502, 0.34409379, 0.49143779, 0.56194767],\n       [0.405834  , 0.9617059 , 0.88996155, 0.06803201]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "DOUBLE"
       },
       "input_constraints": {
         "data": {
@@ -6155,7 +8247,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -6169,6 +8262,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -6177,24 +8271,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 1, 1],\n       [0, 1, 1, 0],\n       [0, 1, 1, 1],\n       [1, 1, 0, 0],\n       [0, 0, 0, 0],\n       [0, 0, 1, 1]], dtype=int16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[0.71700709, 0.19760002, 0.61780984, 0.31983466],\n       [0.59794199, 0.57715688, 0.6881818 , 0.67773427],\n       [0.50443168, 0.76637021, 0.07076356, 0.60439345],\n       [0.86926494, 0.9636245 , 0.58854585, 0.61047817],\n       [0.65700502, 0.34409379, 0.49143779, 0.56194767],\n       [0.405834  , 0.9617059 , 0.88996155, 0.06803201]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "BOOL"
       },
       "input_constraints": {
         "data": {
@@ -6203,7 +8297,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -6217,6 +8312,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -6225,24 +8321,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 0, 1, 0],\n       [0, 1, 0, 1],\n       [0, 0, 1, 0],\n       [0, 1, 1, 0]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[ True,  True, False, False],\n       [ True,  True,  True,  True],\n       [False,  True,  True, False],\n       [False, False,  True,  True],\n       [ True,  True, False,  True],\n       [ True,  True, False,  True]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "BOOL"
       },
       "input_constraints": {
         "data": {
@@ -6251,7 +8347,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -6265,6 +8362,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -6272,25 +8370,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (505 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (848 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (536 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (528 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (199 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (310 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (871 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 0, 1, 0],\n       [0, 1, 0, 1],\n       [0, 0, 1, 0],\n       [0, 1, 1, 0]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (552 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (917 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (550 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (534 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (220 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (285 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (716 us)\nStarting stage: Completion\nCompleted stage: Completion (72 us)\nRun outputs: [array([[ True,  True, False, False],\n       [ True,  True,  True,  True],\n       [False,  True,  True, False],\n       [False, False,  True,  True],\n       [ True,  True, False,  True],\n       [ True,  True, False,  True]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "BOOL"
       },
       "input_constraints": {
         "data": {
@@ -6299,7 +8397,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -6313,6 +8412,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -6321,24 +8421,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 1, 1],\n       [0, 0, 0, 0],\n       [1, 0, 1, 0],\n       [0, 1, 0, 1],\n       [0, 0, 1, 0],\n       [0, 1, 1, 0]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[ True,  True, False, False],\n       [ True,  True,  True,  True],\n       [False,  True,  True, False],\n       [False, False,  True,  True],\n       [ True,  True, False,  True],\n       [ True,  True, False,  True]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "UINT4"
       },
       "input_constraints": {
         "data": {
@@ -6347,7 +8447,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -6361,6 +8462,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -6369,24 +8471,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 1],\n       [0, 1, 1, 0],\n       [1, 0, 0, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 1],\n       [0, 0, 0, 1]])]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "UINT4"
       },
       "input_constraints": {
         "data": {
@@ -6395,7 +8497,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -6409,6 +8512,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -6417,24 +8521,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 1],\n       [0, 1, 1, 0],\n       [1, 0, 0, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 1],\n       [0, 0, 0, 1]])]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "UINT4"
       },
       "input_constraints": {
         "data": {
@@ -6443,7 +8547,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -6457,6 +8562,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -6465,24 +8571,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 1],\n       [0, 1, 1, 0],\n       [1, 0, 0, 0],\n       [1, 1, 1, 0],\n       [1, 0, 1, 1],\n       [0, 0, 0, 1]])]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "INT4"
       },
       "input_constraints": {
         "data": {
@@ -6491,7 +8597,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -6505,6 +8612,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -6513,24 +8621,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 1, 0],\n       [0, 1, 0, 0],\n       [1, 0, 1, 1],\n       [1, 1, 1, 0],\n       [0, 1, 0, 1],\n       [1, 0, 0, 0]], dtype=int8)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "INT4"
       },
       "input_constraints": {
         "data": {
@@ -6539,7 +8647,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -6553,6 +8662,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -6561,24 +8671,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 1, 0],\n       [0, 1, 0, 0],\n       [1, 0, 1, 1],\n       [1, 1, 1, 0],\n       [0, 1, 0, 1],\n       [1, 0, 0, 0]], dtype=int8)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "INT4"
       },
       "input_constraints": {
         "data": {
@@ -6587,7 +8697,8 @@
             2,
             3,
             4
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -6601,6 +8712,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -6609,46 +8721,49 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 1, 1, 0],\n       [0, 1, 0, 0],\n       [1, 0, 1, 1],\n       [1, 1, 1, 0],\n       [0, 1, 0, 1],\n       [1, 0, 0, 0]], dtype=int8)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "UINT8"
       },
       "input_constraints": {
         "data": {
           "type": "shape",
           "shape": [
-            2,
-            3,
-            4
-          ]
+            5,
+            1,
+            2
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
           "value": [
-            6,
-            4
+            10,
+            1,
+            1
           ],
           "dtype": "int64"
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -6657,46 +8772,49 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 1, 0],\n       [0, 0, 0, 1],\n       [1, 1, 1, 1],\n       [1, 0, 0, 0],\n       [1, 0, 0, 0],\n       [0, 1, 1, 0]], dtype=uint16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "UINT8"
       },
       "input_constraints": {
         "data": {
           "type": "shape",
           "shape": [
-            2,
-            3,
-            4
-          ]
+            5,
+            1,
+            2
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
           "value": [
-            6,
-            4
+            10,
+            1,
+            1
           ],
           "dtype": "int64"
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -6704,47 +8822,50 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (517 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (925 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (563 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (534 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (269 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (962 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 1, 0],\n       [0, 0, 0, 1],\n       [1, 1, 1, 1],\n       [1, 0, 0, 0],\n       [1, 0, 0, 0],\n       [0, 1, 1, 0]], dtype=uint16)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (823 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1066 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (585 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (619 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (256 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (522 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (765 us)\nStarting stage: Completion\nCompleted stage: Completion (87 us)\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "UINT8"
       },
       "input_constraints": {
         "data": {
           "type": "shape",
           "shape": [
-            2,
-            3,
-            4
-          ]
+            5,
+            1,
+            2
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
           "value": [
-            6,
-            4
+            10,
+            1,
+            1
           ],
           "dtype": "int64"
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -6753,46 +8874,49 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 1, 1, 0],\n       [0, 0, 0, 1],\n       [1, 1, 1, 1],\n       [1, 0, 0, 0],\n       [1, 0, 0, 0],\n       [0, 1, 1, 0]], dtype=uint16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "UINT16"
       },
       "input_constraints": {
         "data": {
           "type": "shape",
           "shape": [
-            2,
-            3,
-            4
-          ]
+            5,
+            1,
+            2
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
           "value": [
-            6,
-            4
+            10,
+            1,
+            1
           ],
           "dtype": "int64"
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -6801,46 +8925,49 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 1],\n       [1, 1, 1, 1],\n       [1, 0, 1, 0],\n       [0, 1, 0, 1],\n       [0, 0, 0, 1],\n       [1, 1, 1, 0]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "UINT16"
       },
       "input_constraints": {
         "data": {
           "type": "shape",
           "shape": [
-            2,
-            3,
-            4
-          ]
+            5,
+            1,
+            2
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
           "value": [
-            6,
-            4
+            10,
+            1,
+            1
           ],
           "dtype": "int64"
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -6849,46 +8976,49 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 1],\n       [1, 1, 1, 1],\n       [1, 0, 1, 0],\n       [0, 1, 0, 1],\n       [0, 0, 0, 1],\n       [1, 1, 1, 0]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "UINT16"
       },
       "input_constraints": {
         "data": {
           "type": "shape",
           "shape": [
-            2,
-            3,
-            4
-          ]
+            5,
+            1,
+            2
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
           "value": [
-            6,
-            4
+            10,
+            1,
+            1
           ],
           "dtype": "int64"
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -6897,46 +9027,49 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 0, 1],\n       [1, 1, 1, 1],\n       [1, 0, 1, 0],\n       [0, 1, 0, 1],\n       [0, 0, 0, 1],\n       [1, 1, 1, 0]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "UINT32"
       },
       "input_constraints": {
         "data": {
           "type": "shape",
           "shape": [
-            2,
-            3,
-            4
-          ]
+            5,
+            1,
+            2
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
           "value": [
-            6,
-            4
+            10,
+            1,
+            1
           ],
           "dtype": "int64"
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -6945,46 +9078,49 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 1, 0],\n       [0, 1, 0, 1],\n       [1, 0, 1, 1],\n       [0, 1, 1, 1]], dtype=uint64)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "UINT32"
       },
       "input_constraints": {
         "data": {
           "type": "shape",
           "shape": [
-            2,
-            3,
-            4
-          ]
+            5,
+            1,
+            2
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
           "value": [
-            6,
-            4
+            10,
+            1,
+            1
           ],
           "dtype": "int64"
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -6992,47 +9128,50 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (620 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1486 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (583 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (643 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (253 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (269 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1208 us)\nStarting stage: Completion\nCompleted stage: Completion (68 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 1, 0],\n       [0, 1, 0, 1],\n       [1, 0, 1, 1],\n       [0, 1, 1, 1]], dtype=uint64)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (539 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (911 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (535 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (533 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (275 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (868 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "UINT32"
       },
       "input_constraints": {
         "data": {
           "type": "shape",
           "shape": [
-            2,
-            3,
-            4
-          ]
+            5,
+            1,
+            2
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
           "value": [
-            6,
-            4
+            10,
+            1,
+            1
           ],
           "dtype": "int64"
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -7041,46 +9180,49 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[1, 0, 1, 0],\n       [1, 0, 1, 0],\n       [0, 0, 1, 0],\n       [0, 1, 0, 1],\n       [1, 0, 1, 1],\n       [0, 1, 1, 1]], dtype=uint64)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "UINT64"
       },
       "input_constraints": {
         "data": {
           "type": "shape",
           "shape": [
-            2,
-            3,
-            4
-          ]
+            5,
+            1,
+            2
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
           "value": [
-            6,
-            4
+            10,
+            1,
+            1
           ],
           "dtype": "int64"
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -7089,46 +9231,49 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 1],\n       [0, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 1, 1, 1],\n       [0, 1, 1, 1],\n       [0, 1, 0, 1]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "UINT64"
       },
       "input_constraints": {
         "data": {
           "type": "shape",
           "shape": [
-            2,
-            3,
-            4
-          ]
+            5,
+            1,
+            2
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
           "value": [
-            6,
-            4
+            10,
+            1,
+            1
           ],
           "dtype": "int64"
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -7137,46 +9282,49 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 1],\n       [0, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 1, 1, 1],\n       [0, 1, 1, 1],\n       [0, 1, 0, 1]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "UINT64"
       },
       "input_constraints": {
         "data": {
           "type": "shape",
           "shape": [
-            2,
-            3,
-            4
-          ]
+            5,
+            1,
+            2
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
           "value": [
-            6,
-            4
+            10,
+            1,
+            1
           ],
           "dtype": "int64"
         }
       },
       "attrs": {
-        "allowzero": 1
+        "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -7185,24 +9333,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[0, 0, 0, 1],\n       [0, 0, 0, 1],\n       [0, 0, 0, 0],\n       [1, 1, 1, 1],\n       [0, 1, 1, 1],\n       [0, 1, 0, 1]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "INT8"
       },
       "input_constraints": {
         "data": {
@@ -7211,7 +9359,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -7226,6 +9375,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -7234,24 +9384,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "INT8"
       },
       "input_constraints": {
         "data": {
@@ -7260,7 +9410,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -7275,6 +9426,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -7282,25 +9434,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (351 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (699 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (350 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (341 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (38 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (22 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (3407 us)\nStarting stage: Completion\nCompleted stage: Completion (10 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (216 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (553 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (310 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (286 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (34 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (18 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (316 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "INT8"
       },
       "input_constraints": {
         "data": {
@@ -7309,7 +9461,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -7324,6 +9477,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -7332,24 +9486,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "INT16"
       },
       "input_constraints": {
         "data": {
@@ -7358,7 +9512,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -7373,6 +9528,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -7381,24 +9537,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.49458908]],\n\n       [[0.38552561]],\n\n       [[0.95893672]],\n\n       [[0.27257961]],\n\n       [[0.63461326]],\n\n       [[0.40915997]],\n\n       [[0.48572873]],\n\n       [[0.66916279]],\n\n       [[0.54785745]],\n\n       [[0.01230883]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "INT16"
       },
       "input_constraints": {
         "data": {
@@ -7407,7 +9563,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -7422,6 +9579,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -7430,24 +9588,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.49458908]],\n\n       [[0.38552561]],\n\n       [[0.95893672]],\n\n       [[0.27257961]],\n\n       [[0.63461326]],\n\n       [[0.40915997]],\n\n       [[0.48572873]],\n\n       [[0.66916279]],\n\n       [[0.54785745]],\n\n       [[0.01230883]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "INT16"
       },
       "input_constraints": {
         "data": {
@@ -7456,7 +9614,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -7471,6 +9630,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -7479,24 +9639,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.49458908]],\n\n       [[0.38552561]],\n\n       [[0.95893672]],\n\n       [[0.27257961]],\n\n       [[0.63461326]],\n\n       [[0.40915997]],\n\n       [[0.48572873]],\n\n       [[0.66916279]],\n\n       [[0.54785745]],\n\n       [[0.01230883]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "INT32"
       },
       "input_constraints": {
         "data": {
@@ -7505,7 +9665,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -7520,6 +9681,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -7528,24 +9690,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.8392793 ]],\n\n       [[0.03631059]],\n\n       [[0.16795743]],\n\n       [[0.3910664 ]],\n\n       [[0.4776759 ]],\n\n       [[0.70745665]],\n\n       [[0.88355815]],\n\n       [[0.92501724]],\n\n       [[0.84209317]],\n\n       [[0.39467472]]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "INT32"
       },
       "input_constraints": {
         "data": {
@@ -7554,7 +9716,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -7569,6 +9732,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -7579,22 +9743,22 @@
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (354 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (888 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (376 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (363 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (39 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (22 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (840 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (584 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1086 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (605 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (717 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (246 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (272 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1031 us)\nStarting stage: Completion\nCompleted stage: Completion (72 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (252 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (915 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (368 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (366 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (43 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (22 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (885 us)\nStarting stage: Completion\nCompleted stage: Completion (10 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.8393555 ]],\n\n       [[0.03631592]],\n\n       [[0.16796876]],\n\n       [[0.3911133 ]],\n\n       [[0.47778323]],\n\n       [[0.7075196 ]],\n\n       [[0.8837891 ]],\n\n       [[0.92480475]],\n\n       [[0.8422852 ]],\n\n       [[0.39477542]]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (637 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1179 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (839 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (649 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (303 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (353 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (839 us)\nStarting stage: Completion\nCompleted stage: Completion (71 us)\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "INT32"
       },
       "input_constraints": {
         "data": {
@@ -7603,7 +9767,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -7618,6 +9783,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -7626,24 +9792,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.8392793 ]],\n\n       [[0.03631059]],\n\n       [[0.16795743]],\n\n       [[0.3910664 ]],\n\n       [[0.4776759 ]],\n\n       [[0.70745665]],\n\n       [[0.88355815]],\n\n       [[0.92501724]],\n\n       [[0.84209317]],\n\n       [[0.39467472]]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -7652,7 +9818,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -7667,6 +9834,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -7675,24 +9843,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.4658 ]],\n\n       [[0.555  ]],\n\n       [[0.04022]],\n\n       [[0.655  ]],\n\n       [[0.4797 ]],\n\n       [[0.6665 ]],\n\n       [[0.7876 ]],\n\n       [[0.11456]],\n\n       [[0.9424 ]],\n\n       [[0.1887 ]]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -7701,7 +9869,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -7716,6 +9885,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -7726,22 +9896,22 @@
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (218 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (555 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (297 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (299 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (32 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (18 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (901 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (485 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1288 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (713 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (716 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (209 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (276 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (871 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (264 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (489 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (315 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (290 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (32 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (19 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (3369 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.4658 ]],\n\n       [[0.555  ]],\n\n       [[0.04022]],\n\n       [[0.655  ]],\n\n       [[0.4797 ]],\n\n       [[0.6665 ]],\n\n       [[0.7876 ]],\n\n       [[0.11456]],\n\n       [[0.9424 ]],\n\n       [[0.1887 ]]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (488 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1114 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (541 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (585 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (267 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2683 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -7750,7 +9920,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -7765,6 +9936,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -7773,24 +9945,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.4658 ]],\n\n       [[0.555  ]],\n\n       [[0.04022]],\n\n       [[0.655  ]],\n\n       [[0.4797 ]],\n\n       [[0.6665 ]],\n\n       [[0.7876 ]],\n\n       [[0.11456]],\n\n       [[0.9424 ]],\n\n       [[0.1887 ]]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "FLOAT16"
       },
       "input_constraints": {
         "data": {
@@ -7799,7 +9971,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -7814,6 +9987,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -7822,24 +9996,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0.6235 ]],\n\n       [[0.4824 ]],\n\n       [[0.2795 ]],\n\n       [[0.2053 ]],\n\n       [[0.4746 ]],\n\n       [[0.6553 ]],\n\n       [[0.728  ]],\n\n       [[0.01749]],\n\n       [[0.2054 ]],\n\n       [[0.5923 ]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "FLOAT16"
       },
       "input_constraints": {
         "data": {
@@ -7848,7 +10022,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -7863,6 +10038,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -7870,25 +10046,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:39:29.6220039 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (543 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1136 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (614 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (584 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (201 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (273 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1216 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:39:30.0069130 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (482 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (970 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (580 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (537 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (210 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (943 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[[0.6235 ]],\n\n       [[0.4824 ]],\n\n       [[0.2795 ]],\n\n       [[0.2053 ]],\n\n       [[0.4746 ]],\n\n       [[0.6553 ]],\n\n       [[0.728  ]],\n\n       [[0.01749]],\n\n       [[0.2054 ]],\n\n       [[0.5923 ]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "FLOAT16"
       },
       "input_constraints": {
         "data": {
@@ -7897,7 +10073,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -7912,6 +10089,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -7920,24 +10098,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0.6235 ]],\n\n       [[0.4824 ]],\n\n       [[0.2795 ]],\n\n       [[0.2053 ]],\n\n       [[0.4746 ]],\n\n       [[0.6553 ]],\n\n       [[0.728  ]],\n\n       [[0.01749]],\n\n       [[0.2054 ]],\n\n       [[0.5923 ]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -7946,7 +10124,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -7961,6 +10140,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -7969,24 +10149,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0.21764947]],\n\n       [[0.13111596]],\n\n       [[0.2071834 ]],\n\n       [[0.4024154 ]],\n\n       [[0.44118935]],\n\n       [[0.84208393]],\n\n       [[0.40906036]],\n\n       [[0.41610724]],\n\n       [[0.6575011 ]],\n\n       [[0.1167326 ]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -7995,7 +10175,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8010,6 +10191,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -8020,22 +10202,22 @@
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (372 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (836 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (386 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (438 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (41 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (23 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (940 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (494 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1154 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (737 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (573 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (257 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2705 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (238 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (631 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (350 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (317 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (41 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (21 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (846 us)\nStarting stage: Completion\nCompleted stage: Completion (9 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (483 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1209 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (569 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (574 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (201 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (284 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2711 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\nRun outputs: [array([[[0.21765138]],\n\n       [[0.13110353]],\n\n       [[0.20715334]],\n\n       [[0.40234378]],\n\n       [[0.44116214]],\n\n       [[0.8422852 ]],\n\n       [[0.40917972]],\n\n       [[0.41601565]],\n\n       [[0.6577149 ]],\n\n       [[0.11676026]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -8044,7 +10226,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8059,6 +10242,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -8067,24 +10251,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0.21764947]],\n\n       [[0.13111596]],\n\n       [[0.2071834 ]],\n\n       [[0.4024154 ]],\n\n       [[0.44118935]],\n\n       [[0.84208393]],\n\n       [[0.40906036]],\n\n       [[0.41610724]],\n\n       [[0.6575011 ]],\n\n       [[0.1167326 ]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "DOUBLE"
       },
       "input_constraints": {
         "data": {
@@ -8093,7 +10277,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8108,6 +10293,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -8116,24 +10302,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0.47765808]],\n\n       [[0.25493646]],\n\n       [[0.10328827]],\n\n       [[0.22459393]],\n\n       [[0.20876352]],\n\n       [[0.75425285]],\n\n       [[0.40839143]],\n\n       [[0.73856112]],\n\n       [[0.8227161 ]],\n\n       [[0.37703054]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "DOUBLE"
       },
       "input_constraints": {
         "data": {
@@ -8142,7 +10328,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8157,6 +10344,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -8164,25 +10352,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (280 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1336 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (398 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (553 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (55 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (24 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (3445 us)\nStarting stage: Completion\nCompleted stage: Completion (10 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (221 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (727 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (304 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (354 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (37 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (19 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (3256 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0.47765808]],\n\n       [[0.25493646]],\n\n       [[0.10328827]],\n\n       [[0.22459393]],\n\n       [[0.20876352]],\n\n       [[0.75425285]],\n\n       [[0.40839143]],\n\n       [[0.73856112]],\n\n       [[0.8227161 ]],\n\n       [[0.37703054]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "DOUBLE"
       },
       "input_constraints": {
         "data": {
@@ -8191,7 +10379,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8206,6 +10395,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -8214,24 +10404,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0.47765808]],\n\n       [[0.25493646]],\n\n       [[0.10328827]],\n\n       [[0.22459393]],\n\n       [[0.20876352]],\n\n       [[0.75425285]],\n\n       [[0.40839143]],\n\n       [[0.73856112]],\n\n       [[0.8227161 ]],\n\n       [[0.37703054]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "BOOL"
       },
       "input_constraints": {
         "data": {
@@ -8240,7 +10430,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8255,6 +10446,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -8263,24 +10455,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=int8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "BOOL"
       },
       "input_constraints": {
         "data": {
@@ -8289,7 +10481,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8304,6 +10497,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -8311,25 +10505,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:39:34.5906246 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (525 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (973 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (584 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (537 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (200 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (264 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2805 us)\nStarting stage: Completion\nCompleted stage: Completion (67 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=int8)]\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:39:34.9527032 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (560 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1076 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (681 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (551 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (203 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (257 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2694 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\nRun outputs: [array([[[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "BOOL"
       },
       "input_constraints": {
         "data": {
@@ -8338,7 +10532,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8353,6 +10548,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -8361,24 +10557,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=int8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "UINT4"
       },
       "input_constraints": {
         "data": {
@@ -8387,7 +10583,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8402,6 +10599,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -8410,24 +10608,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]]], dtype=uint16)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "UINT4"
       },
       "input_constraints": {
         "data": {
@@ -8436,7 +10634,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8451,6 +10650,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -8459,24 +10659,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:39:36.1186396 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]]], dtype=uint16)]\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:39:36.4991980 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "UINT4"
       },
       "input_constraints": {
         "data": {
@@ -8485,7 +10685,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8500,6 +10701,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -8508,24 +10710,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]]], dtype=uint16)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "INT4"
       },
       "input_constraints": {
         "data": {
@@ -8534,7 +10736,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8549,6 +10752,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -8557,24 +10761,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "INT4"
       },
       "input_constraints": {
         "data": {
@@ -8583,7 +10787,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8598,6 +10803,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -8605,25 +10811,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (193 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (527 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (309 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (286 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (33 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (18 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (265 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (200 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (512 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (311 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (290 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (33 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (19 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (837 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "INT4"
       },
       "input_constraints": {
         "data": {
@@ -8632,7 +10838,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8647,6 +10854,7 @@
       "attrs": {
         "allowzero": 0
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -8655,24 +10863,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "UINT8"
       },
       "input_constraints": {
         "data": {
@@ -8681,7 +10889,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8694,8 +10903,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -8704,24 +10914,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]]], dtype=uint64)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "UINT8"
       },
       "input_constraints": {
         "data": {
@@ -8730,7 +10940,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8743,8 +10954,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -8752,25 +10964,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:39:39.3266467 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (602 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (924 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (577 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (809 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (210 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (258 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2774 us)\nStarting stage: Completion\nCompleted stage: Completion (69 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]]], dtype=uint64)]\n",
-          "stderr": "\u001b[0;93m2025-12-02 15:39:39.6727008 [W:onnxruntime:, qnn_model_wrapper.cc:263 onnxruntime::qnn::QnnModelWrapper::CreateQnnNode] QNN.backendValidateOpConfig() failed for node `n1` of type `Reshape` with error code 3110\n\u001b[m\n"
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (594 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1262 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (593 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (637 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (222 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (286 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2793 us)\nStarting stage: Completion\nCompleted stage: Completion (69 us)\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "UINT8"
       },
       "input_constraints": {
         "data": {
@@ -8779,7 +10991,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8792,8 +11005,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -8802,24 +11016,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]]], dtype=uint64)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "UINT16"
       },
       "input_constraints": {
         "data": {
@@ -8828,7 +11042,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8841,8 +11056,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -8851,24 +11067,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "UINT16"
       },
       "input_constraints": {
         "data": {
@@ -8877,7 +11093,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8890,8 +11107,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -8899,25 +11117,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (207 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (570 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (309 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (288 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (32 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (19 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (3447 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (308 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (534 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (296 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (285 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (30 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (18 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (814 us)\nStarting stage: Completion\nCompleted stage: Completion (7 us)\nAdding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "UINT16"
       },
       "input_constraints": {
         "data": {
@@ -8926,7 +11144,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8939,8 +11158,9 @@
         }
       },
       "attrs": {
-        "allowzero": 0
+        "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -8949,24 +11169,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "UINT32"
       },
       "input_constraints": {
         "data": {
@@ -8975,7 +11195,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -8990,6 +11211,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -8998,24 +11220,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "UINT32"
       },
       "input_constraints": {
         "data": {
@@ -9024,7 +11246,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9039,6 +11262,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -9046,25 +11270,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (503 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1018 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (576 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (562 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (215 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (299 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1041 us)\nStarting stage: Completion\nCompleted stage: Completion (123 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]]])]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (626 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (983 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (656 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (862 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (276 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (264 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (977 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "BOOL"
+        "T_Reshape": "UINT32"
       },
       "input_constraints": {
         "data": {
@@ -9073,7 +11297,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9088,6 +11313,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -9096,24 +11322,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "UINT64"
       },
       "input_constraints": {
         "data": {
@@ -9122,7 +11348,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9137,6 +11364,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -9145,24 +11373,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.03723769]],\n\n       [[0.15519571]],\n\n       [[0.1784701 ]],\n\n       [[0.01496151]],\n\n       [[0.34108702]],\n\n       [[0.55019347]],\n\n       [[0.82353872]],\n\n       [[0.77362745]],\n\n       [[0.12224218]],\n\n       [[0.37872222]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "UINT64"
       },
       "input_constraints": {
         "data": {
@@ -9171,7 +11399,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9186,6 +11415,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -9194,24 +11424,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.03723769]],\n\n       [[0.15519571]],\n\n       [[0.1784701 ]],\n\n       [[0.01496151]],\n\n       [[0.34108702]],\n\n       [[0.55019347]],\n\n       [[0.82353872]],\n\n       [[0.77362745]],\n\n       [[0.12224218]],\n\n       [[0.37872222]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "DOUBLE"
+        "T_Reshape": "UINT64"
       },
       "input_constraints": {
         "data": {
@@ -9220,7 +11450,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9235,6 +11466,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -9243,24 +11475,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.03723769]],\n\n       [[0.15519571]],\n\n       [[0.1784701 ]],\n\n       [[0.01496151]],\n\n       [[0.34108702]],\n\n       [[0.55019347]],\n\n       [[0.82353872]],\n\n       [[0.77362745]],\n\n       [[0.12224218]],\n\n       [[0.37872222]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint64)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "INT8"
       },
       "input_constraints": {
         "data": {
@@ -9269,7 +11501,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9284,6 +11517,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -9292,24 +11526,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.3028835 ]],\n\n       [[0.9312545 ]],\n\n       [[0.5725771 ]],\n\n       [[0.20876952]],\n\n       [[0.8619418 ]],\n\n       [[0.68167454]],\n\n       [[0.2378018 ]],\n\n       [[0.47129697]],\n\n       [[0.45731708]],\n\n       [[0.80309325]]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "INT8"
       },
       "input_constraints": {
         "data": {
@@ -9318,7 +11552,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9333,6 +11568,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -9341,24 +11577,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.3028835 ]],\n\n       [[0.9312545 ]],\n\n       [[0.5725771 ]],\n\n       [[0.20876952]],\n\n       [[0.8619418 ]],\n\n       [[0.68167454]],\n\n       [[0.2378018 ]],\n\n       [[0.47129697]],\n\n       [[0.45731708]],\n\n       [[0.80309325]]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT"
+        "T_Reshape": "INT8"
       },
       "input_constraints": {
         "data": {
@@ -9367,7 +11603,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9382,6 +11619,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -9390,24 +11628,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.3028835 ]],\n\n       [[0.9312545 ]],\n\n       [[0.5725771 ]],\n\n       [[0.20876952]],\n\n       [[0.8619418 ]],\n\n       [[0.68167454]],\n\n       [[0.2378018 ]],\n\n       [[0.47129697]],\n\n       [[0.45731708]],\n\n       [[0.80309325]]], dtype=float32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]]], dtype=int8)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "INT16"
       },
       "input_constraints": {
         "data": {
@@ -9416,7 +11654,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9431,6 +11670,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -9439,24 +11679,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.55  ]],\n\n       [[0.736 ]],\n\n       [[0.8906]],\n\n       [[0.3862]],\n\n       [[0.164 ]],\n\n       [[0.9033]],\n\n       [[0.5938]],\n\n       [[0.2695]],\n\n       [[0.976 ]],\n\n       [[0.5654]]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "INT16"
       },
       "input_constraints": {
         "data": {
@@ -9465,7 +11705,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9480,6 +11721,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -9488,24 +11730,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.55  ]],\n\n       [[0.736 ]],\n\n       [[0.8906]],\n\n       [[0.3862]],\n\n       [[0.164 ]],\n\n       [[0.9033]],\n\n       [[0.5938]],\n\n       [[0.2695]],\n\n       [[0.976 ]],\n\n       [[0.5654]]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "FLOAT16"
+        "T_Reshape": "INT16"
       },
       "input_constraints": {
         "data": {
@@ -9514,7 +11756,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9529,6 +11772,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -9537,24 +11781,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0.55  ]],\n\n       [[0.736 ]],\n\n       [[0.8906]],\n\n       [[0.3862]],\n\n       [[0.164 ]],\n\n       [[0.9033]],\n\n       [[0.5938]],\n\n       [[0.2695]],\n\n       [[0.976 ]],\n\n       [[0.5654]]], dtype=float16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=int16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "INT32"
       },
       "input_constraints": {
         "data": {
@@ -9563,7 +11807,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9578,6 +11823,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -9586,24 +11832,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]]], dtype=int16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "INT32"
       },
       "input_constraints": {
         "data": {
@@ -9612,7 +11858,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9627,6 +11874,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -9634,25 +11882,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (518 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (893 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (538 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (573 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (211 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (259 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (998 us)\nStarting stage: Completion\nCompleted stage: Completion (63 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]]], dtype=int16)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (602 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1279 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (564 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (557 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (219 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (265 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2765 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT16"
+        "T_Reshape": "INT32"
       },
       "input_constraints": {
         "data": {
@@ -9661,7 +11909,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9676,6 +11925,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -9684,24 +11934,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]]], dtype=int16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -9710,7 +11960,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9725,6 +11976,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -9733,24 +11985,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -9759,7 +12011,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9774,6 +12027,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -9781,25 +12035,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (586 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1123 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (675 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (781 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (222 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (262 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2795 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (747 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1459 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (625 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (871 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (216 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (282 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (995 us)\nStarting stage: Completion\nCompleted stage: Completion (66 us)\nRun outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT32"
+        "T_Reshape": "INT64"
       },
       "input_constraints": {
         "data": {
@@ -9808,7 +12062,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9823,6 +12078,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -9831,24 +12087,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=int32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "FLOAT16"
       },
       "input_constraints": {
         "data": {
@@ -9857,7 +12113,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9872,6 +12129,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -9880,24 +12138,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0.6235 ]],\n\n       [[0.4824 ]],\n\n       [[0.2795 ]],\n\n       [[0.2053 ]],\n\n       [[0.4746 ]],\n\n       [[0.6553 ]],\n\n       [[0.728  ]],\n\n       [[0.01749]],\n\n       [[0.2054 ]],\n\n       [[0.5923 ]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "FLOAT16"
       },
       "input_constraints": {
         "data": {
@@ -9906,7 +12164,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9921,6 +12180,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -9928,25 +12188,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (484 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1046 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (563 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (708 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (243 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (267 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (901 us)\nStarting stage: Completion\nCompleted stage: Completion (87 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]])]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (495 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1000 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (542 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (557 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (205 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (260 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (861 us)\nStarting stage: Completion\nCompleted stage: Completion (64 us)\nRun outputs: [array([[[0.6235 ]],\n\n       [[0.4824 ]],\n\n       [[0.2795 ]],\n\n       [[0.2053 ]],\n\n       [[0.4746 ]],\n\n       [[0.6553 ]],\n\n       [[0.728  ]],\n\n       [[0.01749]],\n\n       [[0.2054 ]],\n\n       [[0.5923 ]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT64"
+        "T_Reshape": "FLOAT16"
       },
       "input_constraints": {
         "data": {
@@ -9955,7 +12215,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -9970,6 +12231,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -9978,24 +12240,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]])]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0.6235 ]],\n\n       [[0.4824 ]],\n\n       [[0.2795 ]],\n\n       [[0.2053 ]],\n\n       [[0.4746 ]],\n\n       [[0.6553 ]],\n\n       [[0.728  ]],\n\n       [[0.01749]],\n\n       [[0.2054 ]],\n\n       [[0.5923 ]]], dtype=float16)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -10004,7 +12266,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -10019,6 +12282,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -10027,24 +12291,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]]], dtype=int8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0.21764947]],\n\n       [[0.13111596]],\n\n       [[0.2071834 ]],\n\n       [[0.4024154 ]],\n\n       [[0.44118935]],\n\n       [[0.84208393]],\n\n       [[0.40906036]],\n\n       [[0.41610724]],\n\n       [[0.6575011 ]],\n\n       [[0.1167326 ]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -10053,7 +12317,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -10068,6 +12333,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -10075,25 +12341,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (479 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1232 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (699 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (1034 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (328 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (262 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (723 us)\nStarting stage: Completion\nCompleted stage: Completion (75 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]]], dtype=int8)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (555 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (1075 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (556 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (576 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (202 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (256 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2742 us)\nStarting stage: Completion\nCompleted stage: Completion (63 us)\nRun outputs: [array([[[0.21765138]],\n\n       [[0.13110353]],\n\n       [[0.20715334]],\n\n       [[0.40234378]],\n\n       [[0.44116214]],\n\n       [[0.8422852 ]],\n\n       [[0.40917972]],\n\n       [[0.41601565]],\n\n       [[0.6577149 ]],\n\n       [[0.11676026]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "INT8"
+        "T_Reshape": "FLOAT"
       },
       "input_constraints": {
         "data": {
@@ -10102,7 +12368,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -10117,6 +12384,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -10125,24 +12393,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]]], dtype=int8)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0.21764947]],\n\n       [[0.13111596]],\n\n       [[0.2071834 ]],\n\n       [[0.4024154 ]],\n\n       [[0.44118935]],\n\n       [[0.84208393]],\n\n       [[0.40906036]],\n\n       [[0.41610724]],\n\n       [[0.6575011 ]],\n\n       [[0.1167326 ]]], dtype=float32)]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "DOUBLE"
       },
       "input_constraints": {
         "data": {
@@ -10151,7 +12419,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -10166,6 +12435,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -10174,24 +12444,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]]], dtype=uint16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0.47765808]],\n\n       [[0.25493646]],\n\n       [[0.10328827]],\n\n       [[0.22459393]],\n\n       [[0.20876352]],\n\n       [[0.75425285]],\n\n       [[0.40839143]],\n\n       [[0.73856112]],\n\n       [[0.8227161 ]],\n\n       [[0.37703054]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "DOUBLE"
       },
       "input_constraints": {
         "data": {
@@ -10200,7 +12470,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -10215,6 +12486,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -10223,24 +12495,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]]], dtype=uint16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0.47765808]],\n\n       [[0.25493646]],\n\n       [[0.10328827]],\n\n       [[0.22459393]],\n\n       [[0.20876352]],\n\n       [[0.75425285]],\n\n       [[0.40839143]],\n\n       [[0.73856112]],\n\n       [[0.8227161 ]],\n\n       [[0.37703054]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT16"
+        "T_Reshape": "DOUBLE"
       },
       "input_constraints": {
         "data": {
@@ -10249,7 +12521,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -10264,6 +12537,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -10272,24 +12546,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]]], dtype=uint16)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[0.47765808]],\n\n       [[0.25493646]],\n\n       [[0.10328827]],\n\n       [[0.22459393]],\n\n       [[0.20876352]],\n\n       [[0.75425285]],\n\n       [[0.40839143]],\n\n       [[0.73856112]],\n\n       [[0.8227161 ]],\n\n       [[0.37703054]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "BOOL"
       },
       "input_constraints": {
         "data": {
@@ -10298,7 +12572,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -10313,6 +12588,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -10321,24 +12597,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "BOOL"
       },
       "input_constraints": {
         "data": {
@@ -10347,7 +12623,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -10362,6 +12639,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -10369,25 +12647,25 @@
       "check_result": {
         "compile": {
           "result": {
-            "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "success": true,
+            "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (601 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (928 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (548 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (550 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (216 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (339 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (2761 us)\nStarting stage: Completion\nCompleted stage: Completion (65 us)\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": "Starting stage: Graph Preparation Initializing\nCompleted stage: Graph Preparation Initializing (485 us)\nStarting stage: Graph Optimizations\nCompleted stage: Graph Optimizations (818 us)\nStarting stage: Post Graph Optimization\nCompleted stage: Post Graph Optimization (530 us)\nStarting stage: Graph Sequencing for Target\nCompleted stage: Graph Sequencing for Target (544 us)\nStarting stage: VTCM Allocation\nCompleted stage: VTCM Allocation (253 us)\nStarting stage: Parallelization Optimization\nCompleted stage: Parallelization Optimization (255 us)\nStarting stage: Finalizing Graph Sequence\n\n====== DDR bandwidth summary ======\nspill_bytes=0\nfill_bytes=0\nwrite_total_bytes=2048\nread_total_bytes=2048\n\nCompleted stage: Finalizing Graph Sequence (1020 us)\nStarting stage: Completion\nCompleted stage: Completion (67 us)\nRun outputs: [array([[[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT32"
+        "T_Reshape": "BOOL"
       },
       "input_constraints": {
         "data": {
@@ -10396,7 +12674,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -10411,6 +12690,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -10419,24 +12699,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
             "success": true,
             "reason": null
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]]], dtype=uint32)]\n",
-          "stderr": ""
+          "stdout": "Run outputs: [array([[[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]],\n\n       [[False]],\n\n       [[ True]],\n\n       [[ True]]])]\n",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "UINT4"
       },
       "input_constraints": {
         "data": {
@@ -10445,7 +12725,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -10460,6 +12741,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -10468,24 +12750,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint64)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "UINT4"
       },
       "input_constraints": {
         "data": {
@@ -10494,7 +12776,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -10509,6 +12792,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -10517,24 +12801,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint64)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT64"
+        "T_Reshape": "UINT4"
       },
       "input_constraints": {
         "data": {
@@ -10543,7 +12827,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -10558,6 +12843,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -10566,24 +12852,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]]], dtype=uint64)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "INT4"
       },
       "input_constraints": {
         "data": {
@@ -10592,7 +12878,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -10607,6 +12894,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": true,
         "shape": false
@@ -10615,24 +12903,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "INT4"
       },
       "input_constraints": {
         "data": {
@@ -10641,7 +12929,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -10656,6 +12945,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": true
@@ -10664,24 +12954,24 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     },
     {
       "type_vars": {
-        "T_Reshape": "UINT8"
+        "T_Reshape": "INT4"
       },
       "input_constraints": {
         "data": {
@@ -10690,7 +12980,8 @@
             5,
             1,
             2
-          ]
+          ],
+          "min_max": null
         },
         "shape": {
           "type": "value",
@@ -10705,6 +12996,7 @@
       "attrs": {
         "allowzero": 1
       },
+      "dynamic_axes": {},
       "input_is_constant": {
         "data": false,
         "shape": false
@@ -10713,18 +13005,18 @@
         "compile": {
           "result": {
             "success": false,
-            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:816 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
+            "reason": "[ONNXRuntimeError] : 1 : FAIL : graph_partitioner.cc:825 onnxruntime::CreateEpContextModel Unable to compile any nodes. Check that the session EPs support compilation and can execute at least one subgraph in the model."
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\n",
-          "stderr": ""
+          "stdout": "",
+          "stderr": "DSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\nDSP_INFO UNSUPPORTED_KEY: 49\nDSP_INFO UNSUPPORTED_KEY: 50\n"
         },
         "run": {
           "result": {
-            "success": true,
-            "reason": null
+            "success": false,
+            "reason": "Timeout/crash/fail for 1 attempts: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for Reshape(21) node with name ''"
           },
-          "stdout": "Adding QNNExecutionProvider for OrtHardwareDeviceType.NPU\nRun outputs: [array([[[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[1]],\n\n       [[1]],\n\n       [[1]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]],\n\n       [[0]]], dtype=uint8)]\n",
-          "stderr": ""
+          "stdout": null,
+          "stderr": null
         }
       }
     }
@@ -10732,7 +13024,7 @@
   "sys_info": {
     "cpuList": [
       {
-        "name": "Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Oryon(TM) CPU",
+        "name": "Snapdragon(R) X Elite - X1E78100 - Qualcomm(R) Oryon(TM) CPU",
         "manufacturer": "Qualcomm Technologies Inc",
         "coreCount": 12,
         "threadCount": 12,
@@ -10741,9 +13033,9 @@
     ],
     "gpuList": [
       {
-        "name": "Qualcomm(R) Adreno(TM) X1-85 GPU",
+        "name": "Snapdragon(R) X Elite - X1E78100 - Qualcomm(R) Adreno(TM) GPU",
         "manufacturer": "Qualcomm Incorporated",
-        "driverVersion": "31.0.114.0",
+        "driverVersion": "31.0.57.0",
         "vramMib": 0,
         "vendorId": 1297040209,
         "deviceId": 909329200
@@ -10751,9 +13043,9 @@
     ],
     "npuList": [
       {
-        "name": "Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Hexagon(TM) NPU",
+        "name": "Snapdragon(R) X Elite - X1E78100 - Qualcomm(R) Hexagon(TM) NPU",
         "manufacturer": "Qualcomm Technologies, Inc.",
-        "driverVersion": "30.0.143.0",
+        "driverVersion": "30.0.220.3000",
         "vendorId": 1297040209,
         "deviceId": 1093682224
       }
@@ -10761,43 +13053,57 @@
     "ramList": [
       {
         "capacityMib": 32768,
-        "speedMt": 8448,
-        "manufacturer": "HYNIX"
+        "speedMt": 7372,
+        "manufacturer": ""
       }
     ],
     "os": {
       "caption": "Microsoft Windows 11 Enterprise",
       "version": "10.0.26200",
       "architecture": "ARM 64-bit Processor",
-      "sku": 4
+      "sku": 4,
+      "buildNumber": "26200",
+      "isWindows11": true
     },
     "pythonRuntime": {
-      "version": "3.12.10",
+      "version": "3.11.15",
       "implementation": "CPython",
-      "architecture": "ARM64",
-      "compiler": "MSC v.1943 64 bit (AMD64)",
-      "buildNumber": "May 30 2025 05:39:07"
+      "architecture": "AMD64",
+      "compiler": "MSC v.1944 64 bit (AMD64)",
+      "buildNumber": "Mar 20 2026 00:32:44"
     },
     "pipPackages": [
       {
-        "name": "wml-modelkit",
+        "name": "winml-cli",
         "version": "0.1.0"
       },
+      {
+        "name": "winml-modelkit",
+        "version": "0.0.2"
+      },
       {
         "name": "aiohappyeyeballs",
         "version": "2.6.1"
       },
       {
         "name": "aiohttp",
-        "version": "3.13.2"
+        "version": "3.13.5"
       },
       {
         "name": "aiosignal",
         "version": "1.4.0"
       },
+      {
+        "name": "annotated-doc",
+        "version": "0.0.4"
+      },
+      {
+        "name": "annotated-types",
+        "version": "0.7.0"
+      },
       {
         "name": "anyio",
-        "version": "4.12.0"
+        "version": "4.13.0"
       },
       {
         "name": "argon2-cffi",
@@ -10815,17 +13121,21 @@
         "name": "asttokens",
         "version": "3.0.1"
       },
+      {
+        "name": "ast_serialize",
+        "version": "0.5.0"
+      },
       {
         "name": "async-lru",
-        "version": "2.0.5"
+        "version": "2.3.0"
       },
       {
         "name": "attrs",
-        "version": "25.4.0"
+        "version": "26.1.0"
       },
       {
         "name": "babel",
-        "version": "2.17.0"
+        "version": "2.18.0"
       },
       {
         "name": "beautifulsoup4",
@@ -10837,28 +13147,28 @@
       },
       {
         "name": "certifi",
-        "version": "2025.11.12"
+        "version": "2026.2.25"
       },
       {
         "name": "cffi",
         "version": "2.0.0"
       },
+      {
+        "name": "cfgv",
+        "version": "3.5.0"
+      },
       {
         "name": "charset-normalizer",
-        "version": "3.4.4"
+        "version": "3.4.7"
       },
       {
         "name": "click",
-        "version": "8.3.1"
+        "version": "8.4.1"
       },
       {
         "name": "colorama",
         "version": "0.4.6"
       },
-      {
-        "name": "coloredlogs",
-        "version": "15.0.1"
-      },
       {
         "name": "comm",
         "version": "0.2.3"
@@ -10869,7 +13179,11 @@
       },
       {
         "name": "coverage",
-        "version": "7.12.0"
+        "version": "7.13.5"
+      },
+      {
+        "name": "cryptography",
+        "version": "46.0.7"
       },
       {
         "name": "cycler",
@@ -10877,11 +13191,11 @@
       },
       {
         "name": "datasets",
-        "version": "4.4.1"
+        "version": "4.8.4"
       },
       {
         "name": "debugpy",
-        "version": "1.8.17"
+        "version": "1.8.20"
       },
       {
         "name": "decorator",
@@ -10891,8 +13205,16 @@
         "name": "defusedxml",
         "version": "0.7.1"
       },
+      {
+        "name": "diffusers",
+        "version": "0.37.1"
+      },
       {
         "name": "dill",
+        "version": "0.4.1"
+      },
+      {
+        "name": "distlib",
         "version": "0.4.0"
       },
       {
@@ -10903,21 +13225,25 @@
         "name": "executing",
         "version": "2.2.1"
       },
+      {
+        "name": "fastapi",
+        "version": "0.136.0"
+      },
       {
         "name": "fastjsonschema",
         "version": "2.21.2"
       },
       {
         "name": "filelock",
-        "version": "3.20.0"
+        "version": "3.25.2"
       },
       {
         "name": "flatbuffers",
-        "version": "25.9.23"
+        "version": "25.12.19"
       },
       {
         "name": "fonttools",
-        "version": "4.61.0"
+        "version": "4.63.0"
       },
       {
         "name": "fqdn",
@@ -10929,47 +13255,59 @@
       },
       {
         "name": "fsspec",
-        "version": "2025.10.0"
+        "version": "2026.2.0"
       },
       {
         "name": "h11",
         "version": "0.16.0"
       },
+      {
+        "name": "hf-xet",
+        "version": "1.4.3"
+      },
       {
         "name": "httpcore",
         "version": "1.0.9"
       },
+      {
+        "name": "httptools",
+        "version": "0.7.1"
+      },
       {
         "name": "httpx",
         "version": "0.28.1"
       },
       {
-        "name": "huggingface-hub",
-        "version": "0.36.0"
+        "name": "httpx-sse",
+        "version": "0.4.3"
+      },
+      {
+        "name": "huggingface_hub",
+        "version": "0.36.2"
       },
       {
-        "name": "humanfriendly",
-        "version": "10.0"
+        "name": "identify",
+        "version": "2.6.18"
       },
       {
         "name": "idna",
         "version": "3.11"
       },
+      {
+        "name": "importlib_metadata",
+        "version": "8.7.1"
+      },
       {
         "name": "iniconfig",
         "version": "2.3.0"
       },
       {
         "name": "ipykernel",
-        "version": "7.1.0"
+        "version": "7.2.0"
       },
       {
         "name": "ipython",
-        "version": "9.7.0"
-      },
-      {
-        "name": "ipython_pygments_lexers",
-        "version": "1.1.1"
+        "version": "8.39.0"
       },
       {
         "name": "ipywidgets",
@@ -10989,19 +13327,19 @@
       },
       {
         "name": "joblib",
-        "version": "1.5.2"
+        "version": "1.5.3"
       },
       {
         "name": "json5",
-        "version": "0.12.1"
+        "version": "0.14.0"
       },
       {
         "name": "jsonpointer",
-        "version": "3.0.0"
+        "version": "3.1.1"
       },
       {
         "name": "jsonschema",
-        "version": "4.25.1"
+        "version": "4.26.0"
       },
       {
         "name": "jsonschema-specifications",
@@ -11013,7 +13351,7 @@
       },
       {
         "name": "jupyterlab",
-        "version": "4.5.0"
+        "version": "4.5.6"
       },
       {
         "name": "jupyterlab_pygments",
@@ -11029,7 +13367,7 @@
       },
       {
         "name": "jupyter_client",
-        "version": "8.6.3"
+        "version": "8.8.0"
       },
       {
         "name": "jupyter-console",
@@ -11045,7 +13383,7 @@
       },
       {
         "name": "jupyter-lsp",
-        "version": "2.3.0"
+        "version": "2.3.1"
       },
       {
         "name": "jupyter_server",
@@ -11053,11 +13391,11 @@
       },
       {
         "name": "jupyter_server_terminals",
-        "version": "0.5.3"
+        "version": "0.5.4"
       },
       {
         "name": "kiwisolver",
-        "version": "1.4.9"
+        "version": "1.5.0"
       },
       {
         "name": "lark",
@@ -11065,7 +13403,11 @@
       },
       {
         "name": "librt",
-        "version": "0.6.3"
+        "version": "0.11.0"
+      },
+      {
+        "name": "lightning-utilities",
+        "version": "0.15.3"
       },
       {
         "name": "markdown-it-py",
@@ -11077,19 +13419,23 @@
       },
       {
         "name": "matplotlib",
-        "version": "3.10.7"
+        "version": "3.10.9"
       },
       {
         "name": "matplotlib-inline",
         "version": "0.2.1"
       },
+      {
+        "name": "mcp",
+        "version": "1.27.0"
+      },
       {
         "name": "mdurl",
         "version": "0.1.2"
       },
       {
         "name": "mistune",
-        "version": "3.1.4"
+        "version": "3.2.0"
       },
       {
         "name": "ml_dtypes",
@@ -11101,15 +13447,15 @@
       },
       {
         "name": "multidict",
-        "version": "6.7.0"
+        "version": "6.7.1"
       },
       {
         "name": "multiprocess",
-        "version": "0.70.18"
+        "version": "0.70.19"
       },
       {
         "name": "mypy",
-        "version": "1.19.0"
+        "version": "2.1.0"
       },
       {
         "name": "mypy_extensions",
@@ -11117,11 +13463,11 @@
       },
       {
         "name": "nbclient",
-        "version": "0.10.2"
+        "version": "0.10.4"
       },
       {
         "name": "nbconvert",
-        "version": "7.16.6"
+        "version": "7.17.1"
       },
       {
         "name": "nbformat",
@@ -11133,11 +13479,15 @@
       },
       {
         "name": "networkx",
-        "version": "3.6"
+        "version": "3.4.2"
+      },
+      {
+        "name": "nodeenv",
+        "version": "1.10.0"
       },
       {
         "name": "notebook",
-        "version": "7.5.0"
+        "version": "7.5.5"
       },
       {
         "name": "notebook_shim",
@@ -11149,27 +13499,47 @@
       },
       {
         "name": "onnx",
-        "version": "1.19.1"
+        "version": "1.18.0"
       },
       {
-        "name": "onnxruntime",
-        "version": "1.23.2"
+        "name": "onnxruntime-windowsml",
+        "version": "1.24.5.202604171637"
       },
       {
         "name": "onnxscript",
-        "version": "0.5.6"
+        "version": "0.6.2"
       },
       {
         "name": "onnx-ir",
-        "version": "0.1.12"
+        "version": "0.2.0"
+      },
+      {
+        "name": "opentelemetry-api",
+        "version": "1.41.0"
+      },
+      {
+        "name": "opentelemetry-sdk",
+        "version": "1.41.0"
+      },
+      {
+        "name": "opentelemetry-semantic-conventions",
+        "version": "0.62b0"
       },
       {
         "name": "optimum",
-        "version": "2.0.0"
+        "version": "2.1.0"
+      },
+      {
+        "name": "optimum-onnx",
+        "version": "0.1.0"
+      },
+      {
+        "name": "overrides",
+        "version": "7.7.0"
       },
       {
         "name": "packaging",
-        "version": "25.0"
+        "version": "26.0"
       },
       {
         "name": "pandas",
@@ -11181,27 +13551,35 @@
       },
       {
         "name": "parso",
-        "version": "0.8.5"
+        "version": "0.8.6"
       },
       {
         "name": "pathspec",
-        "version": "0.12.1"
+        "version": "1.1.1"
       },
       {
         "name": "pillow",
-        "version": "12.0.0"
+        "version": "12.2.0"
       },
       {
         "name": "platformdirs",
-        "version": "4.5.0"
+        "version": "4.9.6"
+      },
+      {
+        "name": "plotext",
+        "version": "5.3.2"
       },
       {
         "name": "pluggy",
         "version": "1.6.0"
       },
+      {
+        "name": "pre_commit",
+        "version": "4.5.1"
+      },
       {
         "name": "prometheus_client",
-        "version": "0.23.1"
+        "version": "0.25.0"
       },
       {
         "name": "prompt_toolkit",
@@ -11213,11 +13591,11 @@
       },
       {
         "name": "protobuf",
-        "version": "6.33.1"
+        "version": "7.34.1"
       },
       {
         "name": "psutil",
-        "version": "7.1.3"
+        "version": "7.2.2"
       },
       {
         "name": "pure_eval",
@@ -11225,47 +13603,83 @@
       },
       {
         "name": "pyarrow",
-        "version": "22.0.0"
+        "version": "23.0.1"
+      },
+      {
+        "name": "pycocotools",
+        "version": "2.0.11"
       },
       {
         "name": "pycparser",
-        "version": "2.23"
+        "version": "3.0"
+      },
+      {
+        "name": "pydantic",
+        "version": "2.13.0"
+      },
+      {
+        "name": "pydantic_core",
+        "version": "2.46.0"
+      },
+      {
+        "name": "pydantic-settings",
+        "version": "2.14.0"
       },
       {
         "name": "Pygments",
-        "version": "2.19.2"
+        "version": "2.20.0"
       },
       {
-        "name": "pyparsing",
-        "version": "3.2.5"
+        "name": "PyJWT",
+        "version": "2.12.1"
       },
       {
-        "name": "pyreadline3",
-        "version": "3.5.4"
+        "name": "pyparsing",
+        "version": "3.3.2"
       },
       {
         "name": "pytest",
-        "version": "9.0.1"
+        "version": "9.0.3"
       },
       {
         "name": "pytest-cov",
-        "version": "7.0.0"
+        "version": "7.1.0"
+      },
+      {
+        "name": "pytest-timeout",
+        "version": "2.4.0"
       },
       {
         "name": "python-dateutil",
         "version": "2.9.0.post0"
       },
+      {
+        "name": "python-discovery",
+        "version": "1.2.2"
+      },
+      {
+        "name": "python-dotenv",
+        "version": "1.2.2"
+      },
       {
         "name": "python-json-logger",
-        "version": "4.0.0"
+        "version": "4.1.0"
+      },
+      {
+        "name": "python-multipart",
+        "version": "0.0.26"
       },
       {
         "name": "pytz",
-        "version": "2025.2"
+        "version": "2026.1.post1"
+      },
+      {
+        "name": "pywin32",
+        "version": "311"
       },
       {
         "name": "pywinpty",
-        "version": "3.0.2"
+        "version": "3.0.3"
       },
       {
         "name": "PyYAML",
@@ -11275,17 +13689,21 @@
         "name": "pyzmq",
         "version": "27.1.0"
       },
+      {
+        "name": "RapidFuzz",
+        "version": "3.14.5"
+      },
       {
         "name": "referencing",
         "version": "0.37.0"
       },
       {
         "name": "regex",
-        "version": "2025.11.3"
+        "version": "2026.4.4"
       },
       {
         "name": "requests",
-        "version": "2.32.5"
+        "version": "2.33.1"
       },
       {
         "name": "rfc3339-validator",
@@ -11301,7 +13719,7 @@
       },
       {
         "name": "rich",
-        "version": "14.2.0"
+        "version": "15.0.0"
       },
       {
         "name": "rpds-py",
@@ -11309,7 +13727,7 @@
       },
       {
         "name": "ruff",
-        "version": "0.14.7"
+        "version": "0.15.13"
       },
       {
         "name": "safetensors",
@@ -11329,11 +13747,19 @@
       },
       {
         "name": "Send2Trash",
-        "version": "1.8.3"
+        "version": "2.1.0"
+      },
+      {
+        "name": "sentencepiece",
+        "version": "0.2.1"
+      },
+      {
+        "name": "seqeval",
+        "version": "1.2.2"
       },
       {
         "name": "setuptools",
-        "version": "80.9.0"
+        "version": "81.0.0"
       },
       {
         "name": "six",
@@ -11345,12 +13771,20 @@
       },
       {
         "name": "soupsieve",
-        "version": "2.8"
+        "version": "2.8.3"
+      },
+      {
+        "name": "sse-starlette",
+        "version": "3.3.4"
       },
       {
         "name": "stack-data",
         "version": "0.6.3"
       },
+      {
+        "name": "starlette",
+        "version": "1.0.0"
+      },
       {
         "name": "sympy",
         "version": "1.14.0"
@@ -11365,7 +13799,7 @@
       },
       {
         "name": "timm",
-        "version": "1.0.22"
+        "version": "1.0.26"
       },
       {
         "name": "tinycss2",
@@ -11373,27 +13807,31 @@
       },
       {
         "name": "tokenizers",
-        "version": "0.22.1"
+        "version": "0.22.2"
       },
       {
         "name": "torch",
-        "version": "2.9.1"
+        "version": "2.11.0"
       },
       {
         "name": "torchinfo",
         "version": "1.8.0"
       },
+      {
+        "name": "torchmetrics",
+        "version": "1.9.0"
+      },
       {
         "name": "torchvision",
-        "version": "0.24.1"
+        "version": "0.26.0"
       },
       {
         "name": "tornado",
-        "version": "6.5.2"
+        "version": "6.5.5"
       },
       {
         "name": "tqdm",
-        "version": "4.67.1"
+        "version": "4.67.3"
       },
       {
         "name": "traitlets",
@@ -11401,15 +13839,23 @@
       },
       {
         "name": "transformers",
-        "version": "4.57.3"
+        "version": "4.57.6"
+      },
+      {
+        "name": "types-colorama",
+        "version": "0.4.15.20260508"
       },
       {
         "name": "typing_extensions",
         "version": "4.15.0"
       },
+      {
+        "name": "typing-inspection",
+        "version": "0.4.2"
+      },
       {
         "name": "tzdata",
-        "version": "2025.2"
+        "version": "2026.1"
       },
       {
         "name": "uri-template",
@@ -11417,19 +13863,23 @@
       },
       {
         "name": "urllib3",
-        "version": "2.5.0"
+        "version": "2.6.3"
+      },
+      {
+        "name": "uvicorn",
+        "version": "0.45.0"
       },
       {
-        "name": "wasdk-Microsoft.Windows.AI.MachineLearning",
-        "version": "1.8.251106002"
+        "name": "virtualenv",
+        "version": "21.2.3"
       },
       {
-        "name": "wasdk-Microsoft.Windows.ApplicationModel.DynamicDependency.Bootstrap",
-        "version": "1.8.251106002"
+        "name": "watchfiles",
+        "version": "1.1.1"
       },
       {
         "name": "wcwidth",
-        "version": "0.2.14"
+        "version": "0.6.0"
       },
       {
         "name": "webcolors",
@@ -11439,6 +13889,10 @@
         "name": "webencodings",
         "version": "0.5.1"
       },
+      {
+        "name": "websockets",
+        "version": "16.0"
+      },
       {
         "name": "websocket-client",
         "version": "1.9.0"
@@ -11448,42 +13902,79 @@
         "version": "4.0.15"
       },
       {
-        "name": "winrt-runtime",
-        "version": "3.2.1"
+        "name": "windowsml",
+        "version": "2.0.300"
       },
       {
-        "name": "winrt-Windows.Foundation",
-        "version": "3.2.1"
+        "name": "winml-cli",
+        "version": "0.1.0"
+      },
+      {
+        "name": "xxhash",
+        "version": "3.6.0"
+      },
+      {
+        "name": "yarl",
+        "version": "1.23.0"
       },
       {
-        "name": "winrt-Windows.Foundation.Collections",
-        "version": "3.2.1"
+        "name": "zipp",
+        "version": "3.23.1"
       },
       {
-        "name": "wml-modelkit",
+        "name": "winml-cli",
         "version": "0.1.0"
       },
       {
-        "name": "xxhash",
-        "version": "3.6.0"
+        "name": "winml-modelkit",
+        "version": "0.0.2"
       },
       {
-        "name": "yarl",
-        "version": "1.22.0"
+        "name": "importlib_metadata",
+        "version": "8.7.1"
+      },
+      {
+        "name": "microvenv",
+        "version": "2025.0"
+      },
+      {
+        "name": "packaging",
+        "version": "26.0"
+      },
+      {
+        "name": "tomli",
+        "version": "2.4.0"
+      },
+      {
+        "name": "typing_extensions",
+        "version": "4.15.0"
+      },
+      {
+        "name": "zipp",
+        "version": "3.21.0"
       }
     ],
     "epPackages": [
       {
-        "name": "MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.1.8_1.8.21.0_arm64__8wekyb3d8bbwe",
-        "version": "1.8.21.0",
+        "name": "MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.1.8_1.8.30.0_arm64__8wekyb3d8bbwe",
+        "version": "1.8.30.0",
+        "publisher": "CN=Microsoft Corporation, O=Microsoft Corporation, L=Redmond, S=Washington, C=US",
+        "architecture": 12,
+        "signatureKind": "Developer",
+        "installLocation": "C:\\Program Files\\WindowsApps\\MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.1.8_1.8.30.0_arm64__8wekyb3d8bbwe",
+        "epHash": "0b4dd71044175fb927d3b44a50b7dee4b003a3dfe86a9b09c3ca83f11150979215c256b0301bced2c7e684f84e42ec964532215c147b8b770399d6b9441afc1a",
+        "status": 0
+      },
+      {
+        "name": "MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.2_2.2450.47.0_arm64__8wekyb3d8bbwe",
+        "version": "2.2450.47.0",
         "publisher": "CN=Microsoft Corporation, O=Microsoft Corporation, L=Redmond, S=Washington, C=US",
         "architecture": 12,
         "signatureKind": "Developer",
-        "installLocation": "C:\\Program Files\\WindowsApps\\MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.1.8_1.8.21.0_arm64__8wekyb3d8bbwe",
-        "epHash": "c62cee3f6a7ca26b76390f5158cf450373ac6caca058db519bf89867bf4c713c495b33deb11018342742eefe2f559f52239c2fd62ff039537e221e4589dbcbcf",
+        "installLocation": "C:\\Program Files\\WindowsApps\\MicrosoftCorporationII.WinML.Qualcomm.QNN.EP.2_2.2450.47.0_arm64__8wekyb3d8bbwe",
+        "epHash": "343f2e6da7490f6721e40942a86a40fa01322c354d784b024491d151ec511e6dba7a9041c3594aa97ff0c0379cf627b88414b25328f931f8ddaabe78a6784102",
         "status": 0
       }
-    ],
-    "windowsAppRuntimeVersion": "1.8.251106002"
+    ]
   }
-}
\ No newline at end of file
+}
diff --git a/tests/integration/analyze/runtime_checker/test_helper.py b/tests/integration/analyze/runtime_checker/test_helper.py
index 90d6de830..9e96372c8 100644
--- a/tests/integration/analyze/runtime_checker/test_helper.py
+++ b/tests/integration/analyze/runtime_checker/test_helper.py
@@ -60,6 +60,7 @@ def reshape_quick_helper(
         ep_checker,
         capture_output=True,
     )
+    # check_on_ep returns a lazy iterator; materialize it now so results are captured once.
     result = {"check_results": list(test_results_iter), "sys_info": sys_info}
     with truth_file.open() as f:
         truth_object = json.load(f)
@@ -99,11 +100,11 @@ def op_quick_helper(
     schema = ONNXDomain.AI_ONNX.get_op_schema(op_name, opset)
     gen = generator_class(schema)
 
-    test_results = gen.check_on_ep(
+    test_results_iter = gen.check_on_ep(
         ep_checker,
         capture_output=True,
     )
-    result = {"check_results": test_results, "sys_info": sys_info}
+    result = {"check_results": list(test_results_iter), "sys_info": sys_info}
 
     with truth_file.open() as f:
         truth_object = json.load(f)
@@ -120,6 +121,7 @@ def should_run_ep_test(ep_name: str, device_type, skip_message: str | None = Non
     # Run if hardware is available
     try:
         from winml.modelkit import winml
+
         winml.register_execution_providers(ort=True)
         import onnxruntime as ort
 

From 4923fd903e4c78bfa4e5910227a8f728c4748197 Mon Sep 17 00:00:00 2001
From: xieofxie <xieofxie@126.com>
Date: Wed, 3 Jun 2026 10:17:23 +0800
Subject: [PATCH 026/143] chore: typechecking build, commands, compiler folder
 (#789)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Table for stub

```
  ┌──────────────┬──────────┬────────────────────────────────┬────────────────────────────────────────────┐
  │     Lib      │ py.typed │            Reality             │              Override status               │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ torch        │ yes      │ Has inline types (v2.11)       │ Override is a no-op — mypy uses real types │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ torchvision  │ no       │ No types, no community stubs   │ Genuinely needed                           │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ onnx         │ yes      │ Has inline types (v1.18)       │ Override is a no-op                        │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ onnxruntime  │ no       │ Untyped; no community stubs    │ Genuinely needed                           │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ transformers │ yes      │ Inline types but partial/loose │ Override is a no-op — types ARE used       │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ datasets     │ no       │ Untyped                        │ Genuinely needed                           │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ optimum      │ no       │ Untyped                        │ Genuinely needed                           │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ timm         │ yes      │ Has inline types (v1.0.26)     │ Override is a no-op                        │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ onnxscript   │ yes      │ Has inline types (v0.7)        │ Override is a no-op                        │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ snakemd      │ no       │ Untyped                        │ Genuinely needed                           │
  ├──────────────┼──────────┼────────────────────────────────┼────────────────────────────────────────────┤
  │ openvino     │ n/a      │ Not installed locally          │ n/a                                        │
  └──────────────┴──────────┴────────────────────────────────┴────────────────────────────────────────────┘
```

plotext added to ignore_missing_imports (no community stubs, untyped
library)

---------

Co-authored-by: Hualiang Xie <hualxie@microsoft.com>
---
 pyproject.toml                                | 22 +----
 src/winml/modelkit/analyze/analyzer.py        | 10 +--
 src/winml/modelkit/build/__init__.py          |  4 +-
 src/winml/modelkit/build/common.py            | 20 +++--
 src/winml/modelkit/build/hf.py                |  5 +-
 src/winml/modelkit/commands/analyze.py        | 56 ++++++++-----
 src/winml/modelkit/commands/build.py          | 50 +++++++-----
 src/winml/modelkit/commands/config.py         | 43 +++++-----
 src/winml/modelkit/commands/eval.py           | 58 +++++++------
 src/winml/modelkit/commands/inspect.py        | 12 ++-
 src/winml/modelkit/commands/optimize.py       |  8 +-
 src/winml/modelkit/commands/perf.py           | 23 ++++--
 src/winml/modelkit/commands/quantize.py       | 15 +++-
 src/winml/modelkit/commands/run.py            | 26 +++---
 src/winml/modelkit/commands/sys.py            | 50 +++++++-----
 src/winml/modelkit/compiler/__init__.py       | 15 +++-
 src/winml/modelkit/compiler/compiler.py       |  2 +-
 src/winml/modelkit/compiler/configs.py        |  4 +-
 src/winml/modelkit/compiler/context.py        |  8 +-
 src/winml/modelkit/compiler/stages/compile.py |  2 +-
 .../modelkit/models/winml/composite_model.py  |  2 +-
 src/winml/modelkit/optracing/base.py          | 18 ++++-
 src/winml/modelkit/optracing/qnn/profiler.py  |  4 +-
 src/winml/modelkit/serve/app.py               |  2 +-
 src/winml/modelkit/utils/cli.py               | 33 +++++---
 tests/unit/analyze/test_analyze_onnx.py       | 24 +-----
 tests/unit/commands/test_eval.py              | 81 +++++++++++++++++++
 27 files changed, 374 insertions(+), 223 deletions(-)

diff --git a/pyproject.toml b/pyproject.toml
index 6ce286c71..9d8b6049c 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -116,6 +116,7 @@ dev = [
   "pytest-timeout>=2.4.0",
   "types-jsonschema>=4.26.0.20260518",
   "types-protobuf>=7.34.1.20260518",
+  "types-pyyaml>=6.0.12.20260518",
   "types-tqdm>=4.67.3.20260518",
 ]
 
@@ -472,43 +473,24 @@ cache_dir = "nul"
 [[tool.mypy.overrides]]
 module = [
   # Third-party ML libraries (no stubs)
-  "torch.*",
   "torchvision.*",
-  "onnx.*",
   "onnxruntime.*",
-  "transformers.*",
   "datasets.*",
   "optimum.*",
-  "timm.*",
-  "onnxscript.*",
   "snakemd",
   "openvino",
   "openvino.*",
-  # Namespace package
-  "winml",
-  "winml.*",
+  "plotext",
 ]
 ignore_missing_imports = true
 
 # Relaxed modules: tests and WIP code
-# TODO: Gradually fix types in modelkit.* and tighten rules
 
 [[tool.mypy.overrides]]
 module = [
   # Tests
   "tests",
   "tests.*",
-  "winml.modelkit.configs",
-  "winml.modelkit.loader",
-  "winml.modelkit.export",
-  "winml.modelkit.export.*",
-  "winml.modelkit.optimization",
-  "winml.modelkit.optimization.*",
-  "winml.modelkit.models",
-  "winml.modelkit.models.*",
-  "winml.modelkit.utils",
-  "winml.modelkit.utils.*",
-  "winml.modelkit.commands.export",
 ]
 disallow_untyped_defs = false
 disallow_incomplete_defs = false
diff --git a/src/winml/modelkit/analyze/analyzer.py b/src/winml/modelkit/analyze/analyzer.py
index 5525bf061..1bc8b60c8 100644
--- a/src/winml/modelkit/analyze/analyzer.py
+++ b/src/winml/modelkit/analyze/analyzer.py
@@ -838,14 +838,6 @@ def has_errors(self) -> bool:
         """True if blocking errors (unsupported patterns) exist."""
         return self.lint.errors > 0
 
-    @property
-    def autoconf(self) -> WinMLOptimizationConfig | None:
-        """Auto-discovered optimization config, or None/empty if nothing found.
-
-        Falsy when no opportunities: ``if result.autoconf: ...``
-        """
-        return self.optimization_config
-
 
 def analyze_onnx(
     model: str | Path,
@@ -892,7 +884,7 @@ def analyze_onnx(
         >>> result = analyze_onnx("optimized.onnx", ep="qnn", device="NPU")
         >>> if result.has_errors:
         ...     print(f"Errors: {result.lint.error_patterns}")
-        >>> if result.autoconf:
+        >>> if result.optimization_config:
         ...     print(f"Autoconf: {result.optimization_config.to_dict()}")
 
         >>> # Save full analysis JSON alongside the model
diff --git a/src/winml/modelkit/build/__init__.py b/src/winml/modelkit/build/__init__.py
index 2a6c6f107..8f283b26b 100644
--- a/src/winml/modelkit/build/__init__.py
+++ b/src/winml/modelkit/build/__init__.py
@@ -25,6 +25,8 @@
     )
 """
 
+from typing import Any
+
 from .hf import BuildResult, build_hf_model
 from .onnx import build_onnx_model
 
@@ -42,7 +44,7 @@
 }
 
 
-def __getattr__(name: str):
+def __getattr__(name: str) -> Any:
     """Lazy-load build helpers to avoid pulling in heavy deps at import time."""
     if name in _LAZY_IMPORTS:
         module_path, attr_name = _LAZY_IMPORTS[name]
diff --git a/src/winml/modelkit/build/common.py b/src/winml/modelkit/build/common.py
index 462da96ec..6c2dda0c8 100644
--- a/src/winml/modelkit/build/common.py
+++ b/src/winml/modelkit/build/common.py
@@ -162,31 +162,32 @@ def _run_analyze_loop(
             )
             analyze_iterations += 1
 
-            if not analysis.autoconf:
+            optim_config = analysis.optimization_config
+            if not optim_config:
                 break
 
             logger.info(
                 "Autoconf iteration %d: discovered %s",
                 _iteration + 1,
-                analysis.optimization_config.to_dict(),
+                optim_config.to_dict(),
             )
 
             # Notify: patterns discovered
             if on_patterns_discovered is not None:
-                on_patterns_discovered(analysis.optimization_config)
+                on_patterns_discovered(optim_config)
 
             # Notify: re-optimizing with discovered flags
             if on_reoptimize is not None:
-                on_reoptimize(analysis.optimization_config)
+                on_reoptimize(optim_config)
 
             # Re-optimize with ONLY the autoconf flags (not merged with original)
             optimize_onnx(
                 model=iter_model,
                 output=iter_model,
                 **kwargs,
-                **analysis.optimization_config,
+                **optim_config,
             )
-            discovered_optim.update(analysis.optimization_config)
+            discovered_optim.update(optim_config)
         else:
             logger.warning(
                 "Autoconf did not converge after %d iteration(s)",
@@ -214,10 +215,13 @@ def _run_analyze_loop(
         config.optim.update(discovered_optim)
         logger.info("  [autoconf] final config: %s", discovered_optim)
 
-    if analysis.autoconf:
+    # analysis is None only when max_optim_iterations == 0 (the loop body never
+    # ran, so analyze_onnx was never called).
+    final_optim_config = analysis.optimization_config if analysis else None
+    if final_optim_config:
         logger.warning(
             "Analysis still has autoconf suggestions: %s",
-            analysis.optimization_config.to_dict(),
+            final_optim_config.to_dict(),
         )
 
     if analysis is not None and analysis.has_errors:
diff --git a/src/winml/modelkit/build/hf.py b/src/winml/modelkit/build/hf.py
index e9e67aeb3..26356a6eb 100644
--- a/src/winml/modelkit/build/hf.py
+++ b/src/winml/modelkit/build/hf.py
@@ -478,7 +478,10 @@ def _load_model(
 
         # Prefer explicit model_class from loader config (set by winml config),
         # fall back to resolve_task_and_model_class for auto-detection.
-        model_class = None
+        # Annotated Any: resolvers return bare `type`, but the actual classes are
+        # HF model classes with extra methods (from_config, from_pretrained, etc.)
+        # that bare `type` doesn't expose.
+        model_class: Any = None
         if config.loader.model_class:
             from ..loader import resolve_hf_model_class
 
diff --git a/src/winml/modelkit/commands/analyze.py b/src/winml/modelkit/commands/analyze.py
index bc60f1752..29f8b0d23 100644
--- a/src/winml/modelkit/commands/analyze.py
+++ b/src/winml/modelkit/commands/analyze.py
@@ -17,7 +17,7 @@
 import re
 import sys
 from pathlib import Path
-from typing import Literal
+from typing import TYPE_CHECKING, Literal, cast
 
 import click
 from rich.console import Console
@@ -48,6 +48,10 @@
 from ..utils.logging import configure_logging
 
 
+if TYPE_CHECKING:
+    from ..analyze.models.runtime_checks import PatternRuntime
+
+
 logger = logging.getLogger(__name__)
 
 # ── Rich visualization helpers ────────────────────────────────────────────
@@ -396,7 +400,7 @@ def _render_analysis_summary(
     console.print("═" * 80)
 
     if not results:
-        ep_label = ep or "all EPs"
+        ep_label: str = ep or "all EPs"
         if device:
             msg = (
                 f"   [dim]No runtime check results for [bold]{ep_label}[/bold] "
@@ -512,10 +516,10 @@ def _render_analysis_summary(
 
 
 def _resolve_run_unknown_op(
-    ep: str,
+    ep: EPName,
     device: str,
     run_unknown_op: bool,
-    local_pairs: set[tuple[str, str]],
+    local_pairs: set[tuple[EPName, str]],
 ) -> bool:
     """Resolve whether to run unknown operators for a given (EP, device) pair.
 
@@ -565,7 +569,10 @@ def _get_local_ep_device_pairs() -> list[tuple[EPName, str]]:
             if not ep_name_raw or ep_name_raw.endswith(".AUTO"):
                 continue
 
-            ep_name = normalize_ep_name(ep_name_raw)
+            # ep_name_raw is an arbitrary attribute string from ORT; cast lets
+            # normalize_ep_name (typed for EPNameOrAlias | None) accept it.
+            # Unknown values return None and get filtered below.
+            ep_name = normalize_ep_name(cast("EPNameOrAlias", ep_name_raw))
             if ep_name is None or ep_name not in SUPPORTED_EPS:
                 continue
 
@@ -746,33 +753,44 @@ def analyze(
             has_rule_data_for_ep,
         )
 
+        devices: list[str]
         if device == "auto":
             from ..sysinfo.device import _get_available_devices
 
-            devices = _get_available_devices()
+            devices = list(_get_available_devices())
         elif device == "all":
-            devices = SUPPORTED_DEVICES
-        else:
+            devices = list(SUPPORTED_DEVICES)
+        elif device is not None:
             devices = [device]
+        else:
+            devices = []
         devices = sorted(d.upper() for d in devices)
 
+        eps: list[EPName | None]
         if ep == "auto":
             from ..sysinfo.device import _get_available_eps
 
-            eps = _get_available_eps()
+            eps = list(_get_available_eps())
         elif ep == "all":
-            eps = SUPPORTED_EPS
+            eps = list(SUPPORTED_EPS)
         else:
             # ep is a specific EP or alias
             eps = [normalize_ep_name(ep)]
 
-        execution_pairs = [
-            (candidate_ep, candidate_device)
-            for candidate_ep in eps
-            for candidate_device in devices
-            if candidate_ep in EP_SUPPORTED_DEVICES
-            and candidate_device.lower() in EP_SUPPORTED_DEVICES[candidate_ep]
-        ]
+        # Build with a for-loop rather than a single nested comprehension so
+        # the `candidate_ep is not None and ... in EP_SUPPORTED_DEVICES`
+        # narrowing carries through to the appended tuple's type (EPName,
+        # not str). The inner generator stays a comprehension to satisfy
+        # ruff PERF401.
+        execution_pairs: list[tuple[EPName, str]] = []
+        for candidate_ep in eps:
+            if candidate_ep is None or candidate_ep not in EP_SUPPORTED_DEVICES:
+                continue
+            execution_pairs.extend(
+                (candidate_ep, candidate_device)
+                for candidate_device in devices
+                if candidate_device.lower() in EP_SUPPORTED_DEVICES[candidate_ep]
+            )
         execution_pairs = _sort_ep_device_pairs(execution_pairs)
 
         local_pairs = set(_get_local_ep_device_pairs())
@@ -943,7 +961,7 @@ def _finalize_live(mark_complete: bool = True) -> None:
                 live.stop()
                 live = None
 
-        def on_ep_start(ep_name, operator_counts):
+        def on_ep_start(ep_name: EPName, operator_counts: dict[str, int]) -> None:
             """Called when analysis starts for a new EP."""
             nonlocal current_ep_device_pair
             nonlocal instance_counts, all_op_counts, ep_counter, live
@@ -1023,7 +1041,7 @@ def on_ep_start(ep_name, operator_counts):
             )
             live.start()
 
-        def on_node_result(pattern_runtime):
+        def on_node_result(pattern_runtime: PatternRuntime) -> None:
             """Callback invoked per-node during analysis."""
             op = _display_name(pattern_runtime.pattern_id)
             level = pattern_runtime.result.classification.value
diff --git a/src/winml/modelkit/commands/build.py b/src/winml/modelkit/commands/build.py
index 486d7e678..19272e7dc 100644
--- a/src/winml/modelkit/commands/build.py
+++ b/src/winml/modelkit/commands/build.py
@@ -22,7 +22,7 @@
 import logging
 import time
 from pathlib import Path
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Any, cast
 
 import click
 from rich.logging import RichHandler
@@ -140,10 +140,13 @@ def _instantiate_parent_model(model_type: str, task: str | None = None) -> nn.Mo
     """
     from ..loader import resolve_loader_config
 
-    _, hf_config, resolved_class = resolve_loader_config(
+    _, hf_config, resolved_class_typed = resolve_loader_config(
         model_type=model_type,
         task=task,
     )
+    # Annotated Any: resolver returns bare `type` but the class is a HF model
+    # with extra methods (from_config) that bare `type` doesn't expose.
+    resolved_class: Any = resolved_class_typed
 
     try:
         model = resolved_class(hf_config)
@@ -152,7 +155,7 @@ def _instantiate_parent_model(model_type: str, task: str | None = None) -> nn.Mo
         model = resolved_class.from_config(hf_config)
 
     model.eval()
-    return model
+    return cast("nn.Module", model)
 
 
 def _build_modules(
@@ -599,8 +602,6 @@ def build(
             if no_compile:
                 config_or_configs.compile = None
 
-        is_module_mode = isinstance(config_or_configs, list)
-
         # If --device was explicitly provided, patch compile config and clear
         # quant for CPU/GPU (neither device uses quantization by default).
         if cli_utils.is_cli_provided(ctx, "device") and device:
@@ -631,7 +632,7 @@ def _patch_device(cfg: WinMLBuildConfig) -> None:
                     if patched is not None:
                         cfg.compile = patched
 
-            if is_module_mode:
+            if isinstance(config_or_configs, list):
                 for _cfg in config_or_configs:
                     _patch_device(_cfg)
             else:
@@ -642,7 +643,11 @@ def _patch_device(cfg: WinMLBuildConfig) -> None:
         # surfaces malformed configs immediately and prevents partial
         # scratch state when the user passes the wrong file or a
         # hand-edited config (#P1 UX).
-        _configs_to_validate = config_or_configs if is_module_mode else [config_or_configs]
+        _configs_to_validate: list[WinMLBuildConfig] = (
+            config_or_configs
+            if isinstance(config_or_configs, list)
+            else [config_or_configs]
+        )
         try:
             for _cfg in _configs_to_validate:
                 _cfg.validate()
@@ -666,7 +671,7 @@ def _patch_device(cfg: WinMLBuildConfig) -> None:
         if trust_remote_code:
             extra_kwargs["trust_remote_code"] = True
 
-        if is_module_mode:
+        if isinstance(config_or_configs, list):
             # ---- MODULE MODE: array config, one build per submodule ----
             if use_cache:
                 raise click.UsageError(
@@ -685,7 +690,7 @@ def _patch_device(cfg: WinMLBuildConfig) -> None:
             print_setup(
                 console,
                 model=model_id or "random-init",
-                config=Path(config_file).name,
+                config=Path(config_file).name if config_file else "(auto)",
                 output=str(resolved_dir),
                 source="HuggingFace",
             )
@@ -760,6 +765,9 @@ def _patch_device(cfg: WinMLBuildConfig) -> None:
                     config.generate_cache_key(),
                 )
             else:
+                # Guarded earlier (line ~381: `if not output_dir and not use_cache`).
+                if output_dir is None:
+                    raise click.UsageError("--output-dir is required when --use-cache is not set.")
                 resolved_dir = Path(output_dir)
 
             _run_single_build(
@@ -868,6 +876,7 @@ def _run_single_build(
 
     try:
         if _is_onnx:
+            assert model_id is not None  # _is_onnx implies this
             stage_timings = _build_onnx_pipeline(
                 config=config,
                 onnx_path=Path(model_id),
@@ -937,13 +946,13 @@ def _show_io(sl: Any, config: WinMLBuildConfig) -> None:
         return
     inputs = export_cfg.input_tensors or []
     outputs = export_cfg.output_tensors or []
-    for i, t in enumerate(inputs):
-        name = t.name or "(unnamed)"
-        shape = str(list(t.shape)) if getattr(t, "shape", None) else "dynamic"
-        dtype = getattr(t, "dtype", None) or "?"
+    for i, in_spec in enumerate(inputs):
+        name = in_spec.name or "(unnamed)"
+        shape = str(list(in_spec.shape)) if in_spec.shape else "dynamic"
+        dtype = getattr(in_spec, "dtype", None) or "?"
         sl.io_input(name, shape, dtype, first=(i == 0))
-    for i, t in enumerate(outputs):
-        name = t.name or "(unnamed)"
+    for i, out_spec in enumerate(outputs):
+        name = out_spec.name or "(unnamed)"
         # OutputTensorSpec has name only — show name, no shape/dtype
         label = "Output:       " if i == 0 else "              "
         sl.detail(f"{label}[cyan]{name}[/cyan]")
@@ -998,7 +1007,7 @@ def _run_optimize_stage(
         _ep_bars: dict[str, int] = {}
         _ep_counts: dict[str, dict[str, int]] = {}
         _ep_totals: dict[str, int] = {}
-        _current_ep = [""]
+        _current_ep: EPName | None = None
         _current_iter = [0, 0]  # [iteration, max_iter]
         _header_shown = [False]
 
@@ -1018,7 +1027,8 @@ def _on_iteration_start(iteration: int, max_iter: int) -> None:
         _resolved_device, _ = _resolve_device(device=device or "auto", ep=ep)
 
         def _on_ep_start(ep_name: EPName, operator_counts: dict) -> None:
-            _current_ep[0] = ep_name
+            nonlocal _current_ep
+            _current_ep = ep_name
             _ep_counts[ep_name] = {}
             total = sum(operator_counts.values())
             _ep_totals[ep_name] = total
@@ -1034,7 +1044,9 @@ def _on_ep_start(ep_name: EPName, operator_counts: dict) -> None:
                 _ep_bars[ep_name] = sl.ep_bar_add(ep_name, total=total)
 
         def _on_node_result(pattern_runtime: Any) -> None:
-            ep_name = _current_ep[0]
+            ep_name = _current_ep
+            if ep_name is None:
+                return  # pre-init: _on_ep_start hasn't fired yet
             level = pattern_runtime.result.classification.value
             counts = _ep_counts.setdefault(ep_name, {})
             counts[level] = counts.get(level, 0) + 1
@@ -1134,7 +1146,7 @@ def _run_quantize_stage(
         return current_path
 
     with StageLive("quantize", console) as sl:
-        wt = config.quant.weight_type or "?"
+        wt = config.quant.weight_type
         sl.set_status(f"Quantizing ({wt})...")
         # Calibration info before blocking call
         ds = config.quant.dataset_name or "default"
diff --git a/src/winml/modelkit/commands/config.py b/src/winml/modelkit/commands/config.py
index 489344e68..fe6a5b79e 100644
--- a/src/winml/modelkit/commands/config.py
+++ b/src/winml/modelkit/commands/config.py
@@ -278,6 +278,8 @@ def config(
                 "--module is not supported with ONNX file input. "
                 "Module discovery requires a HuggingFace model."
             )
+        config_obj: WinMLBuildConfig | None = None
+        output_data: dict[str, Any] | list[Any]
         if hf_model and cli_utils.is_onnx_file_path(hf_model):
             config_obj = generate_onnx_build_config(
                 hf_model,
@@ -326,26 +328,26 @@ def config(
                 )
                 return
 
-            # Generate config(s) - returns single or list based on module parameter
-            result = generate_hf_build_config(
-                model_id=hf_model,
-                task=task,
-                model_class=model_class,
-                model_type=model_type,
-                module=module,
-                override=override,
-                shape_config=shape_config,
-                library_name=library_name,
-                device=device,
-                precision=precision,
-                trust_remote_code=trust_remote_code,
-                ep=ep,
-            )
-
-            # Handle output format
+            # Generate config(s) - module parameter selects overload:
+            # module=str → list[WinMLBuildConfig], module=None → WinMLBuildConfig.
+            # ``module`` is the only differing kwarg, so build a shared dict
+            # once and add it only on the list-returning branch. This keeps
+            # the overload dispatch but avoids repeating the other 10 kwargs.
+            _shared_kwargs: dict[str, Any] = {
+                "model_id": hf_model,
+                "task": task,
+                "model_class": model_class,
+                "model_type": model_type,
+                "override": override,
+                "shape_config": shape_config,
+                "library_name": library_name,
+                "device": device,
+                "precision": precision,
+                "trust_remote_code": trust_remote_code,
+                "ep": ep,
+            }
             if module:
-                # Module mode: result is list[WinMLBuildConfig]
-                configs = result
+                configs = generate_hf_build_config(module=module, **_shared_kwargs)
                 for cfg in configs:
                     _apply_stage_overrides(cfg, no_quant=no_quant, no_compile=no_compile)
                 output_data = [cfg.to_dict() for cfg in configs]
@@ -353,8 +355,7 @@ def config(
                 # Use first config for display metadata
                 config_obj = configs[0] if configs else None
             else:
-                # Normal mode: result is WinMLBuildConfig
-                config_obj = result
+                config_obj = generate_hf_build_config(**_shared_kwargs)
                 configs = []
                 _apply_stage_overrides(config_obj, no_quant=no_quant, no_compile=no_compile)
                 output_data = config_obj.to_dict()
diff --git a/src/winml/modelkit/commands/eval.py b/src/winml/modelkit/commands/eval.py
index 1ac039b4a..61647d262 100644
--- a/src/winml/modelkit/commands/eval.py
+++ b/src/winml/modelkit/commands/eval.py
@@ -20,6 +20,7 @@
 
 
 if TYPE_CHECKING:
+    from ..eval import EvalResult, WinMLEvaluationConfig
     from ..utils.constants import EPNameOrAlias
 
 
@@ -114,6 +115,12 @@
 )
 @click.option(
     "--label-mapping",
+    # Distinct Python variable name so ctx.params["label_mapping_path"] does
+    # not collide with ``DatasetConfig.label_mapping`` (which is the *parsed*
+    # ``dict[str, int] | None``, not a Path). ``collect_cli_overrides`` is
+    # name-based, so without the rename the Path would be passed to the dict
+    # field with the wrong type.
+    "label_mapping_path",
     type=click.Path(exists=True, path_type=Path),
     default=None,
     help='Path to a JSON file with label mapping: {"label_name": id}.',
@@ -157,7 +164,7 @@ def eval(
     shuffle: bool,
     streaming: bool,
     column: tuple[str, ...],
-    label_mapping: Path | None,
+    label_mapping_path: Path | None,
     output: Path | None,
     verbose: bool,
     dataset_script: str | None,
@@ -202,7 +209,7 @@ def eval(
     from ..eval import evaluate
 
     # ── 1. Build config: defaults ← config file ← CLI ──
-    cfg = _build_eval_config(ctx, config_file, column, label_mapping)
+    cfg = _build_eval_config(ctx, config_file, column, label_mapping_path)
 
     # ── 2. Resolve in place ──
     _resolve_model(cfg, model, model_id)
@@ -233,8 +240,8 @@ def _build_eval_config(
     ctx: click.Context,
     config_file: Path | None,
     column: tuple[str, ...],
-    label_mapping: Path | None,
-) -> object:
+    label_mapping_path: Path | None,
+) -> WinMLEvaluationConfig:
     """Build a WinMLEvaluationConfig with precedence: defaults ← config file ← CLI.
 
     Reads raw JSON for config-file values so only explicitly-present keys
@@ -244,24 +251,15 @@ def _build_eval_config(
     from ..eval import DatasetConfig, WinMLEvaluationConfig
     from ..utils.config_utils import merge_config
 
-    # Initialize config object from CLI ctx params.
-    p = ctx.params
-    cfg = WinMLEvaluationConfig(
-        task=p.get("task"),
-        device=p.get("device"),
-        precision=p.get("precision"),
-        ep=p.get("ep"),
-        output_path=p.get("output"),
-        dataset=DatasetConfig(
-            path=p.get("dataset_path"),
-            name=p.get("dataset_name"),
-            split=p.get("split"),
-            samples=p.get("samples"),
-            shuffle=p.get("shuffle"),
-            streaming=p.get("streaming"),
-            build_script=p.get("dataset_script"),
-        ),
-    )
+    # Initialize config object from CLI ctx params. ``collect_cli_overrides``
+    # filters to user-provided values and applies the cli_name → field_name
+    # renames declared on the dataclass fields (e.g. output → output_path).
+    # The --label-mapping Click option binds to ``label_mapping_path`` (see the
+    # ``@click.option`` decorator) so it does NOT collide with the
+    # ``DatasetConfig.label_mapping`` field name.
+    eval_kwargs = cli_utils.collect_cli_overrides(ctx, WinMLEvaluationConfig)
+    dataset_kwargs = cli_utils.collect_cli_overrides(ctx, DatasetConfig)
+    cfg = WinMLEvaluationConfig(dataset=DatasetConfig(**dataset_kwargs), **eval_kwargs)
 
     # ── Config file layer (only explicitly-present keys) ──
     if config_file is not None:
@@ -299,8 +297,8 @@ def _build_eval_config(
             columns_mapping[k] = v
         ds_overrides["columns_mapping"] = columns_mapping
 
-    if label_mapping is not None:
-        ds_overrides["label_mapping_file"] = str(label_mapping)
+    if label_mapping_path is not None:
+        ds_overrides["label_mapping_file"] = str(label_mapping_path)
 
     if ds_overrides:
         overrides["dataset"] = ds_overrides
@@ -312,7 +310,7 @@ def _build_eval_config(
 
 
 def _resolve_model(
-    cfg: object,
+    cfg: WinMLEvaluationConfig,
     model: tuple[str, ...],
     model_id: str | None,
 ) -> None:
@@ -322,7 +320,7 @@ def _resolve_model(
     cfg.model_id = resolved_id
 
 
-def _resolve_device(cfg: object) -> None:
+def _resolve_device(cfg: WinMLEvaluationConfig) -> None:
     """Resolve ``'auto'`` → concrete device string on *cfg* in place."""
     if cfg.device and cfg.device.lower() != "auto":
         return
@@ -336,14 +334,14 @@ def _resolve_device(cfg: object) -> None:
     console.print(f"[dim]Using device:[/dim] {resolved}")
 
 
-def _resolve_label_mapping(cfg: object) -> None:
+def _resolve_label_mapping(cfg: WinMLEvaluationConfig) -> None:
     """Load label-mapping JSON file (if any) into ``cfg.dataset.label_mapping``."""
     if cfg.dataset.label_mapping_file:
         with Path(cfg.dataset.label_mapping_file).open() as f:
             cfg.dataset.label_mapping = json.load(f)
 
 
-def _run_dataset_script(cfg: object, trust_remote_code: bool) -> None:
+def _run_dataset_script(cfg: WinMLEvaluationConfig, trust_remote_code: bool) -> None:
     """Run the dataset build script referenced by *cfg*, if any.
 
     The script is invoked with ``--output <dataset.path>`` so the built
@@ -384,7 +382,7 @@ def _run_dataset_script(cfg: object, trust_remote_code: bool) -> None:
         )
 
 
-def _write_and_display(result: object, output_path: Path | None) -> None:
+def _write_and_display(result: EvalResult, output_path: Path | None) -> None:
     """Display evaluation results and optionally save to JSON."""
     console = Console()
     display_eval_report(result, console)
@@ -485,7 +483,7 @@ def _json_default(obj: object) -> object:
     raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")
 
 
-def display_eval_report(result: object, console: object) -> None:
+def display_eval_report(result: EvalResult, console: Console) -> None:
     """Display evaluation results in formatted console output."""
     from rich.panel import Panel
     from rich.table import Table
diff --git a/src/winml/modelkit/commands/inspect.py b/src/winml/modelkit/commands/inspect.py
index 1d3d735e9..b79b1012e 100644
--- a/src/winml/modelkit/commands/inspect.py
+++ b/src/winml/modelkit/commands/inspect.py
@@ -330,7 +330,7 @@ def _inspect_model_v2(
     # =========================================================================
     # STEP 2: Shared loader resolution (same call as config command)
     # =========================================================================
-    from huggingface_hub.utils import RepositoryNotFoundError
+    from huggingface_hub.errors import RepositoryNotFoundError
 
     try:
         loader_config, hf_config, _resolved_class = resolve_loader_config(
@@ -362,6 +362,10 @@ def _inspect_model_v2(
 
     model_type = loader_config.model_type
     task = loader_config.task
+    if model_type is None:
+        raise InspectError("Could not resolve model_type from loader config")
+    if task is None:
+        raise InspectError("Could not resolve task from loader config")
     architectures = getattr(parent_hf_config, "architectures", []) or []
 
     # =========================================================================
@@ -409,7 +413,7 @@ def _inspect_model_v2(
         export_cfg = registered.export
         input_tensors = [
             TensorInfo(name=s.name or "unknown", dtype=s.dtype, shape=s.shape)
-            for s in export_cfg.input_tensors
+            for s in (export_cfg.input_tensors or [])
         ]
         output_tensors = [
             TensorInfo(name=s.name or "unknown") for s in (export_cfg.output_tensors or [])
@@ -515,7 +519,9 @@ def _inspect_model_v2(
     #   2. parent_hf_config     — pre-narrowing config (only when model_id was
     #                             provided and AutoConfig succeeded in step 1)
     #   3. model_type           — narrowed loader_config.model_type (fallback)
-    display_model_type = model_type_override or getattr(parent_hf_config, "model_type", model_type)
+    display_model_type: str = (
+        model_type_override or getattr(parent_hf_config, "model_type", None) or model_type
+    )
 
     return InspectResult(
         model_id=model_id or display_model_type or model_class_override or "unknown",
diff --git a/src/winml/modelkit/commands/optimize.py b/src/winml/modelkit/commands/optimize.py
index b41a6ac10..7d78e391f 100644
--- a/src/winml/modelkit/commands/optimize.py
+++ b/src/winml/modelkit/commands/optimize.py
@@ -23,7 +23,7 @@
 import json
 import logging
 from pathlib import Path
-from typing import TYPE_CHECKING, Any
+from typing import TYPE_CHECKING, Any, TypeVar
 
 import click
 from rich.console import Console
@@ -35,6 +35,8 @@
 if TYPE_CHECKING:
     from collections.abc import Callable
 
+F = TypeVar("F", bound="Callable[..., Any]")
+
 
 logger = logging.getLogger(__name__)
 console = Console()
@@ -106,7 +108,7 @@ def _load_json(path: Path) -> dict[str, Any]:
         raise click.ClickException(f"Invalid JSON in config file: {e}") from e
 
 
-def capability_options(func: Callable) -> Callable:
+def capability_options(func: F) -> F:
     """Decorator that adds CLI options for all registered capabilities.
 
     This decorator auto-generates CLI options from the capability registry,
@@ -185,7 +187,7 @@ def capability_options(func: Callable) -> Callable:
     help="Enable verbose output",
 )
 @capability_options
-@click.pass_context
+@click.pass_context  # type: ignore[arg-type]  # capability_options widens the signature; click stubs want positional-only ctx but we keep it keyword-callable for back-compat
 def optimize(
     ctx: click.Context,
     list_capabilities: bool,
diff --git a/src/winml/modelkit/commands/perf.py b/src/winml/modelkit/commands/perf.py
index ac511fcb6..841c2c98b 100644
--- a/src/winml/modelkit/commands/perf.py
+++ b/src/winml/modelkit/commands/perf.py
@@ -288,6 +288,7 @@ def run(self) -> BenchmarkResult:
         # [1] Load model
         logger.info("Loading model: %s", self.config.model_id)
         self._load_model()
+        assert self._model is not None
 
         # [2] Generate inputs
         logger.info("Generating benchmark inputs")
@@ -335,7 +336,7 @@ def _load_model(self) -> None:
         use_cache = not self.config.ignore_cache
         force_rebuild = self.config.rebuild or self.config.ignore_cache
 
-        common_kwargs = {
+        common_kwargs: dict[str, Any] = {
             "task": self.config.task,
             "config": override,
             "device": self.config.device,
@@ -359,6 +360,7 @@ def _load_model(self) -> None:
 
     def _generate_inputs(self) -> None:
         """Generate random inputs based on model io_config."""
+        assert self._model is not None
         io_config = self._model.io_config
         self._inputs = generate_random_inputs(
             io_config=io_config,
@@ -373,6 +375,8 @@ def _run_benchmark(self) -> PerfStats:
 
     def _run_benchmark_simple(self) -> PerfStats:
         """Execute benchmark without live monitoring."""
+        assert self._model is not None
+        assert self._inputs is not None
         session = self._model._session
         total_iterations = self.config.warmup + self.config.iterations
 
@@ -393,6 +397,8 @@ def _run_benchmark_monitored(self) -> PerfStats:
         from ..session.monitor.hw_monitor import HWMonitor
         from ..session.monitor.vitisai_monitor import VitisAIMonitor
 
+        assert self._model is not None
+        assert self._inputs is not None
         session = self._model._session
         total_iterations = self.config.warmup + self.config.iterations
 
@@ -416,9 +422,9 @@ def _run_benchmark_monitored(self) -> PerfStats:
 
         # EP-specific proof-of-execution monitor.
         # When QNN/OpenVINO monitors become real, add entries here.
-        _ep_monitors = {"vitisai": VitisAIMonitor}
-        ep = self.config.ep
-        monitor_cls = _ep_monitors.get(ep)
+        _ep_monitors: dict[EPName, Any] = {"VitisAIExecutionProvider": VitisAIMonitor}
+        monitor_cls = _ep_monitors.get(session.ep_name) if session.ep_name else None
+        ep_monitor: Any
         if monitor_cls and monitor_cls.is_available():
             ep_monitor = monitor_cls()
         else:
@@ -450,6 +456,7 @@ def _run_benchmark_monitored(self) -> PerfStats:
 
     def _collect_results(self, stats: PerfStats) -> BenchmarkResult:
         """Collect benchmark results from PerfStats."""
+        assert self._model is not None
         io_config = self._model.io_config
 
         # Calculate throughput
@@ -1409,7 +1416,13 @@ def perf(
             # Determine the ONNX model path from the benchmark flow.
             # For HF models the ONNX is built internally by PerfBenchmark.
             try:
-                onnx_for_trace = model_path if is_onnx else benchmark._model._onnx_path
+                onnx_for_trace = (
+                    model_path
+                    if is_onnx
+                    else (benchmark._model._onnx_path if benchmark._model else None)
+                )
+                if onnx_for_trace is None:
+                    raise AttributeError("benchmark._model not initialized")
             except AttributeError:
                 console.print(
                     "[red]Error:[/red] Could not determine ONNX model path for op-tracing"
diff --git a/src/winml/modelkit/commands/quantize.py b/src/winml/modelkit/commands/quantize.py
index 9b4539165..9f9b5cfca 100644
--- a/src/winml/modelkit/commands/quantize.py
+++ b/src/winml/modelkit/commands/quantize.py
@@ -21,6 +21,7 @@
 
 import logging
 from pathlib import Path
+from typing import TYPE_CHECKING, cast
 
 import click
 from rich.console import Console
@@ -29,6 +30,10 @@
 from ..utils.logging import configure_logging
 
 
+if TYPE_CHECKING:
+    from typing import Literal
+
+
 logger = logging.getLogger(__name__)
 console = Console()
 
@@ -198,12 +203,14 @@ def quantize(
     console.print(f"[bold blue]Samples:[/bold blue] {samples}")
     console.print(f"[bold blue]Method:[/bold blue] {method}")
 
-    # Create config (output_path is passed separately to API)
+    # Create config (output_path is passed separately to API).
+    # Click's Choice validates these strings at parse time, so cast acknowledges
+    # the Literal[] contract that mypy can't see through the str return type.
     config = WinMLQuantizationConfig(
         samples=samples,
-        calibration_method=method,
-        weight_type=resolved_weight,
-        activation_type=resolved_activation,
+        calibration_method=cast('Literal["minmax", "entropy", "percentile"]', method),
+        weight_type=cast('Literal["uint8", "int8", "uint16", "int16"]', resolved_weight),
+        activation_type=cast('Literal["uint8", "int8", "uint16", "int16"]', resolved_activation),
         per_channel=per_channel,
         symmetric=symmetric,
         task=task,
diff --git a/src/winml/modelkit/commands/run.py b/src/winml/modelkit/commands/run.py
index a910398ac..791efb8fb 100644
--- a/src/winml/modelkit/commands/run.py
+++ b/src/winml/modelkit/commands/run.py
@@ -30,7 +30,7 @@
 import logging
 import sys
 from pathlib import Path
-from typing import TYPE_CHECKING, Any
+from typing import TYPE_CHECKING, Any, cast
 
 import click
 
@@ -383,9 +383,9 @@ def _build_example_command(
     # Add a representative -P param with a sample value
     if params:
         for p in params:
-            val = p.get("sample_value")
-            if val is not None:
-                parts.append(f"-P {p['name']}={val}")
+            sample = p.get("sample_value")
+            if sample is not None:
+                parts.append(f"-P {p['name']}={sample}")
                 break
 
     return " ".join(parts)
@@ -561,11 +561,11 @@ def run(
     # Read file bytes (for --file shortcut)
     file_bytes_list: list[bytes] = []
     for fp in files:
-        p = Path(fp)
-        if not p.exists() or not p.is_file():
+        file_path = Path(fp)
+        if not file_path.exists() or not file_path.is_file():
             click.echo(f"Error: file not found: {fp}", err=True)
             ctx.exit(2)
-        file_bytes_list.append(p.read_bytes())
+        file_bytes_list.append(file_path.read_bytes())
 
     if len(file_bytes_list) > 1:
         click.echo(
@@ -643,7 +643,6 @@ def run(
     except click.ClickException as exc:
         click.echo(f"Error: {exc.format_message()}", err=True)
         ctx.exit(2)
-        return
 
     # Merge --file/--text shortcuts with --input
     try:
@@ -651,7 +650,6 @@ def run(
     except click.ClickException as exc:
         click.echo(f"Error: {exc.format_message()}", err=True)
         ctx.exit(2)
-        return
 
     # Check input / -P collision (after shortcuts are resolved so that
     # --file and --text shortcut keys are included in the check)
@@ -666,12 +664,12 @@ def run(
         ctx.exit(2)
 
     try:
-        result = engine.predict(inputs=inputs, **pipeline_kwargs)
+        prediction = engine.predict(inputs=inputs, **pipeline_kwargs)
     except (ValueError, TypeError, RuntimeError, OSError) as exc:
         click.echo(f"Error during inference: {exc}", err=True)
         ctx.exit(4)
 
-    _print_result(result.model_dump(), output_format=output_format, output_path=output)
+    _print_result(prediction.model_dump(), output_format=output_format, output_path=output)
 
 
 # ---------------------------------------------------------------------------
@@ -695,7 +693,7 @@ def _resolve_text_field_via_schema(client: Any, base_url: str) -> str:
             user_inputs = schema.get("user_inputs", [])
             text_fields = [f for f in user_inputs if f.get("type") == "text"]
             if len(text_fields) == 1:
-                return text_fields[0]["name"]
+                return str(text_fields[0]["name"])
     except Exception:
         logger.debug("Schema probe failed; falling back to field name 'text'", exc_info=True)
     return "text"
@@ -760,7 +758,7 @@ def _try_server_predict(
                     )
                 resp.raise_for_status()
                 logger.debug("Auto-connected to winml serve at %s", base_url)
-                return resp.json()
+                return cast("dict[Any, Any]", resp.json())
 
             # Route 2: no file → JSON /v1/predict with named inputs
             #   Coerce raw CLI strings (JSON arrays, numbers, booleans)
@@ -782,7 +780,7 @@ def _try_server_predict(
             )
             resp.raise_for_status()
             logger.debug("Auto-connected to winml serve at %s", base_url)
-            return resp.json()
+            return cast("dict[Any, Any]", resp.json())
     except (httpx.HTTPError, OSError, json.JSONDecodeError, KeyError, ValueError) as exc:
         logger.debug("Auto-connect failed (%s) — using embedded inference", exc)
         return None
diff --git a/src/winml/modelkit/commands/sys.py b/src/winml/modelkit/commands/sys.py
index da0397057..083a5ff89 100644
--- a/src/winml/modelkit/commands/sys.py
+++ b/src/winml/modelkit/commands/sys.py
@@ -31,7 +31,7 @@
 import sys
 from concurrent.futures import ThreadPoolExecutor
 from importlib.metadata import PackageNotFoundError, version
-from typing import TYPE_CHECKING, Any
+from typing import TYPE_CHECKING, Any, cast
 
 import click
 
@@ -39,8 +39,12 @@
 
 
 if TYPE_CHECKING:
+    from collections.abc import Sequence
+
     from rich.console import Console
 
+    from ..utils.constants import EPName
+
 
 logger = logging.getLogger(__name__)
 
@@ -281,7 +285,7 @@ def _check_openvino() -> dict[str, Any]:
     info: dict[str, Any] = {"installed": False}
 
     try:
-        import openvino  # type: ignore[import-not-found]
+        import openvino
 
         info["installed"] = True
         info["version"] = openvino.__version__
@@ -493,7 +497,7 @@ def _gather_device_info() -> list[dict[str, Any]]:
     from ..sysinfo import CPU, GPU, NPU
 
     # NPU > GPU > CPU priority order.
-    hw_queries: list[tuple[str, type]] = [
+    hw_queries: list[tuple[str, type[NPU] | type[GPU] | type[CPU]]] = [
         ("NPU", NPU),
         ("GPU", GPU),
         ("CPU", CPU),
@@ -501,10 +505,15 @@ def _gather_device_info() -> list[dict[str, Any]]:
 
     with ThreadPoolExecutor(max_workers=len(hw_queries)) as pool:
         futures = [(label, pool.submit(cls.get_all)) for label, cls in hw_queries]
-        ordered_results: list[tuple[str, list[Any] | Exception]] = []
+        # Sequence (not list) because list is invariant in its element type:
+        # fut.result() at runtime is list[CPU] | list[GPU] | list[NPU], none
+        # of which are list[Any]. Sequence is covariant, so this accepts
+        # all three. The cast at .result() is needed because pool.submit
+        # collapses the union-typed `cls.get_all` callable to Future[object].
+        ordered_results: list[tuple[str, Sequence[Any] | Exception]] = []
         for label, fut in futures:
             try:
-                ordered_results.append((label, fut.result()))
+                ordered_results.append((label, cast("Sequence[Any]", fut.result())))
             except Exception as e:  # noqa: PERF203 - per-future error capture
                 ordered_results.append((label, e))
 
@@ -586,7 +595,7 @@ def _gather_ep_info() -> list[dict[str, Any]]:
         List of EP dicts with name, device, and optional path.
     """
     eps: list[dict[str, Any]] = []
-    winml_eps: dict[str, str] = {}
+    winml_eps: dict[EPName, str] = {}
 
     # Try WinML EP Registry first
     try:
@@ -608,18 +617,23 @@ def _gather_ep_info() -> list[dict[str, Any]]:
 
     # Merge: WinML EPs first (they have paths), then ORT-only EPs
     ep_device_map = get_ep_device_map()
-    seen: set[str] = set()
+    seen: set[EPName] = set()
 
     for ep_name, ep_path in winml_eps.items():
         device = ep_device_map.get(ep_name, "unknown").upper()
         eps.append({"name": ep_name, "device": device, "path": ep_path})
         seen.add(ep_name)
 
-    for ep_name in ort_providers:
-        if ep_name not in seen:
-            device = ep_device_map.get(ep_name, "unknown").upper()
-            eps.append({"name": ep_name, "device": device, "path": None})
-            seen.add(ep_name)
+    for raw_name in ort_providers:
+        # ORT returns arbitrary strings; cast acknowledges that downstream
+        # storage and lookup treat them as EPName. Unknown names fall through
+        # to the "unknown" device default.
+        ep_name = cast("EPName", raw_name)
+        if ep_name in seen:
+            continue
+        device = ep_device_map.get(ep_name, "unknown").upper()
+        eps.append({"name": ep_name, "device": device, "path": None})
+        seen.add(ep_name)
 
     return eps
 
@@ -641,8 +655,8 @@ def _output_ep_text(eps: list[dict[str, Any]]) -> None:
             console.print("    [dim](built-in)[/dim]")
 
 
-@click.command()  # type: ignore[misc]
-@click.option(  # type: ignore[misc]
+@click.command()
+@click.option(
     "--format",
     "-f",
     "output_format",
@@ -650,26 +664,26 @@ def _output_ep_text(eps: list[dict[str, Any]]) -> None:
     default="text",
     help="Output format: text (human-readable), json, or compact",
 )
-@click.option(  # type: ignore[misc]
+@click.option(
     "--verbose",
     "-v",
     is_flag=True,
     default=False,
     help="Include additional diagnostic information",
 )
-@click.option(  # type: ignore[misc]
+@click.option(
     "--list-device",
     is_flag=True,
     default=False,
     help="List available devices in priority order",
 )
-@click.option(  # type: ignore[misc]
+@click.option(
     "--list-ep",
     is_flag=True,
     default=False,
     help="List available execution providers",
 )
-@click.pass_context  # type: ignore[misc]
+@click.pass_context
 def sysinfo(
     ctx: click.Context,
     output_format: str,
diff --git a/src/winml/modelkit/compiler/__init__.py b/src/winml/modelkit/compiler/__init__.py
index b911d46f5..99c0eb42d 100644
--- a/src/winml/modelkit/compiler/__init__.py
+++ b/src/winml/modelkit/compiler/__init__.py
@@ -24,6 +24,8 @@
     result = compile_onnx("model.onnx", config)
 """
 
+from typing import TYPE_CHECKING, Any
+
 from .configs import (
     EPConfig,
     WinMLCompileConfig,
@@ -34,7 +36,18 @@
 from .utils import QDQ_OP_TYPES, needs_format_conversion
 
 
-def __getattr__(name: str):
+# Names below are loaded lazily via ``__getattr__`` to avoid pulling in session/
+# torch at import time. The TYPE_CHECKING re-imports give static analyzers
+# (mypy, CodeQL) visibility into what ``__all__`` actually exports without
+# triggering the heavy imports at runtime.
+if TYPE_CHECKING:
+    from .compiler import Compiler, compile_onnx, list_compilers
+    from .stages.compile import CompileStage
+    from .stages.optimize import OptimizeStage
+    from .stages.qformat import QFormatConvertStage
+
+
+def __getattr__(name: str) -> Any:
     """Lazy-load heavy symbols that pull in session/torch to speed up import."""
     if name in {"Compiler", "compile_onnx", "list_compilers"}:
         from .compiler import Compiler, compile_onnx, list_compilers
diff --git a/src/winml/modelkit/compiler/compiler.py b/src/winml/modelkit/compiler/compiler.py
index c6cbe2978..6267daca6 100644
--- a/src/winml/modelkit/compiler/compiler.py
+++ b/src/winml/modelkit/compiler/compiler.py
@@ -93,7 +93,7 @@ def compile(
         if config is None:
             return CompileResult(
                 success=True,
-                output_path=str(model_path),
+                output_path=model_path,
                 errors=[],
                 warnings=["No compile config provided, skipping compilation (passthrough)"],
             )
diff --git a/src/winml/modelkit/compiler/configs.py b/src/winml/modelkit/compiler/configs.py
index 60cb40e29..2059c9528 100644
--- a/src/winml/modelkit/compiler/configs.py
+++ b/src/winml/modelkit/compiler/configs.py
@@ -21,6 +21,8 @@
 
 
 if TYPE_CHECKING:
+    from collections.abc import Callable
+
     from ..utils.constants import EPNameOrAlias
 
 
@@ -115,7 +117,7 @@ def for_provider(
         canonical = normalize_ep_name(provider)
         if canonical is None:
             return None
-        factories: dict[EPName, Any] = {
+        factories: dict[EPName, Callable[[], WinMLCompileConfig]] = {
             "QNNExecutionProvider": lambda: cls.for_qnn(device=device),
             "DmlExecutionProvider": cls.for_dml,
             "CUDAExecutionProvider": cls.for_cuda,
diff --git a/src/winml/modelkit/compiler/context.py b/src/winml/modelkit/compiler/context.py
index ead5ab980..18fb22679 100644
--- a/src/winml/modelkit/compiler/context.py
+++ b/src/winml/modelkit/compiler/context.py
@@ -9,7 +9,7 @@
 import logging
 from dataclasses import dataclass, field
 from pathlib import Path
-from typing import TYPE_CHECKING, Any
+from typing import TYPE_CHECKING, Any, cast
 
 import onnx
 import onnxruntime as ort
@@ -89,14 +89,14 @@ def get_config(self, key: str, default: Any = None) -> Any:
     @property
     def execution_provider(self) -> EPAlias:
         """Get target execution provider."""
-        return self.config.get("execution_provider", "qnn")
+        return cast("EPAlias", self.config.get("execution_provider", "qnn"))
 
     @property
     def enable_ep_context(self) -> bool:
         """Whether to generate EPContext model."""
-        return self.config.get("enable_ep_context", True)
+        return bool(self.config.get("enable_ep_context", True))
 
     @property
     def validate(self) -> bool:
         """Whether to validate compiled model."""
-        return self.config.get("validate", True)
+        return bool(self.config.get("validate", True))
diff --git a/src/winml/modelkit/compiler/stages/compile.py b/src/winml/modelkit/compiler/stages/compile.py
index 961572773..94b476956 100644
--- a/src/winml/modelkit/compiler/stages/compile.py
+++ b/src/winml/modelkit/compiler/stages/compile.py
@@ -184,7 +184,7 @@ def _validate_model(self, session: ort.InferenceSession, context: CompileContext
 
     def _generate_dummy_inputs(self, session: ort.InferenceSession) -> dict[str, np.ndarray]:
         """Generate dummy inputs for validation using all-ones data."""
-        inputs = {}
+        inputs: dict[str, np.ndarray] = {}
 
         for input_meta in session.get_inputs():
             name = input_meta.name
diff --git a/src/winml/modelkit/models/winml/composite_model.py b/src/winml/modelkit/models/winml/composite_model.py
index 4907c5f71..9b7aa20a6 100644
--- a/src/winml/modelkit/models/winml/composite_model.py
+++ b/src/winml/modelkit/models/winml/composite_model.py
@@ -62,7 +62,7 @@
 
 # Maps (model_type, task) → pipeline class with _SUB_MODEL_CONFIG.
 # Used by `winml config` to generate one config file per sub-component.
-COMPOSITE_MODEL_REGISTRY: dict[tuple[str, str], type] = {}
+COMPOSITE_MODEL_REGISTRY: dict[tuple[str, str], type[WinMLCompositeModel]] = {}
 
 
 def register_composite_model(model_type: str, task: str):
diff --git a/src/winml/modelkit/optracing/base.py b/src/winml/modelkit/optracing/base.py
index 386fe466f..7fd1f875c 100644
--- a/src/winml/modelkit/optracing/base.py
+++ b/src/winml/modelkit/optracing/base.py
@@ -7,6 +7,7 @@
 from __future__ import annotations
 
 from abc import ABC, abstractmethod
+from pathlib import Path
 from typing import TYPE_CHECKING
 
 
@@ -22,14 +23,27 @@ class OpTracer(ABC):
 
     Concrete implementations receive the model path and output directory
     at construction time, then call ``run()`` to execute profiling.
+
+    Subclasses overriding ``__init__`` MUST call ``super().__init__(...)`` so
+    that ``onnx_path``, ``output_dir``, and ``level`` are stored on ``self``.
     """
 
+    def __init__(self, onnx_path: Path, *, output_dir: Path, level: str = "basic") -> None:
+        """Construct an OpTracer for an ONNX model.
+
+        Args:
+            onnx_path: Path to the ONNX model to trace.
+            output_dir: Directory for profiling artifacts.
+            level: Profiling level ("basic" or "detail").
+        """
+        self.onnx_path = Path(onnx_path)
+        self.output_dir = Path(output_dir)
+        self.level = level
+
     @abstractmethod
     def run(self, iterations: int = 5, warmup: int = 2) -> OpTraceResult:
         """Run operator-level tracing and return structured results."""
-        ...
 
     @abstractmethod
     def is_available(self) -> bool:
         """Check if this tracer's runtime dependencies are available."""
-        ...
diff --git a/src/winml/modelkit/optracing/qnn/profiler.py b/src/winml/modelkit/optracing/qnn/profiler.py
index 2cc26503e..12bb44403 100644
--- a/src/winml/modelkit/optracing/qnn/profiler.py
+++ b/src/winml/modelkit/optracing/qnn/profiler.py
@@ -88,9 +88,7 @@ def __init__(
         output_dir: Path,
         level: str = "basic",
     ) -> None:
-        self.onnx_path = Path(onnx_path)
-        self.output_dir = Path(output_dir)
-        self.level = level
+        super().__init__(onnx_path, output_dir=output_dir, level=level)
 
     def is_available(self) -> bool:
         """Check if QNN EP is available for profiling."""
diff --git a/src/winml/modelkit/serve/app.py b/src/winml/modelkit/serve/app.py
index 8f54c044d..ea05f52eb 100644
--- a/src/winml/modelkit/serve/app.py
+++ b/src/winml/modelkit/serve/app.py
@@ -825,7 +825,7 @@ def print_startup_banner(
     *,
     host: str,
     port: int,
-    model_path: str,
+    model_path: str | None,
     task: str | None,
     device: str,
     ep: EPNameOrAlias | None,
diff --git a/src/winml/modelkit/utils/cli.py b/src/winml/modelkit/utils/cli.py
index 5d5d07ca9..99c8661d9 100644
--- a/src/winml/modelkit/utils/cli.py
+++ b/src/winml/modelkit/utils/cli.py
@@ -8,7 +8,7 @@
 
 import json
 from pathlib import Path
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Any, TypeVar
 
 import click
 from rich.console import Console
@@ -17,9 +17,15 @@
 
 
 if TYPE_CHECKING:
+    from collections.abc import Callable
+
     from ..config import WinMLBuildConfig
 
 
+# TypeVar for signature-preserving Click decorators.
+F = TypeVar("F", bound="Callable[..., Any]")
+
+
 # Shared stderr console for security/diagnostic messages emitted from utils.
 # Mirrors the module-level ``console = Console()`` pattern used by individual
 # command modules, but targets stderr so messages survive ``-q/--quiet``.
@@ -54,7 +60,7 @@ def warn_trust_remote_code() -> None:
     )
 
 
-def model_path_option(required=True):
+def model_path_option(required: bool = True) -> Callable[[F], F]:
     """Add --model option that accepts a local ONNX file path.
 
     The path is validated for existence on disk.
@@ -74,7 +80,7 @@ def model_path_option(required=True):
     )
 
 
-def model_option(required=True, optional_message=None):
+def model_option(required: bool = True, optional_message: str | None = None) -> Callable[[F], F]:
     """Add --model option that accepts any model reference.
 
     Accepts a HuggingFace model ID, build output directory, or .onnx file path.
@@ -98,7 +104,7 @@ def model_option(required=True, optional_message=None):
     )
 
 
-def output_option(help_text: str, required: bool = False):
+def output_option(help_text: str, required: bool = False) -> Callable[[F], F]:
     """Add ``-o/--output`` option that accepts a file path.
 
     The path is delivered to the callback as a :class:`pathlib.Path`.
@@ -118,7 +124,7 @@ def output_option(help_text: str, required: bool = False):
     return click.option("--output", "-o", **kwargs)
 
 
-def ep_option(required=True, optional_message=None):
+def ep_option(required: bool = True, optional_message: str | None = None) -> Callable[[F], F]:
     """Add --ep (execution provider) option to a Click command.
 
     Args:
@@ -150,7 +156,12 @@ def ep_option(required=True, optional_message=None):
     )
 
 
-def device_option(required=True, optional_message=None, default="NPU", include_auto=False):
+def device_option(
+    required: bool = True,
+    optional_message: str | None = None,
+    default: str | None = "NPU",
+    include_auto: bool = False,
+) -> Callable[[F], F]:
     """Add --device option to a Click command.
 
     Args:
@@ -182,7 +193,7 @@ def device_option(required=True, optional_message=None, default="NPU", include_a
     )
 
 
-def verbosity_options(f):
+def verbosity_options(f: F) -> F:
     """Add verbose and quiet logging options to a Click command.
 
     Adds --verbose/-v (stackable: -v, -vv, -vvv) and --quiet/-q flags.
@@ -213,7 +224,7 @@ def verbosity_options(f):
     return f  # noqa: RET504
 
 
-def build_config_option(help: str | None = None):
+def build_config_option(help: str | None = None) -> Callable[[F], F]:
     """Add -c/--config option for WinMLBuildConfig JSON file."""
     if help is None:
         help = (
@@ -230,7 +241,7 @@ def build_config_option(help: str | None = None):
     )
 
 
-def trust_remote_code_option(optional_message: str | None = None):
+def trust_remote_code_option(optional_message: str | None = None) -> Callable[[F], F]:
     """Add shared --trust-remote-code option to a Click command.
 
     Args:
@@ -316,7 +327,7 @@ def is_cli_provided(ctx: click.Context, param_name: str) -> bool:
     return source == click.core.ParameterSource.COMMANDLINE
 
 
-def collect_cli_overrides(ctx: click.Context, cls: type) -> dict[str, object]:
+def collect_cli_overrides(ctx: click.Context, cls: type) -> dict[str, Any]:
     """Collect CLI-provided values that match fields on a dataclass.
 
     Iterates ``ctx.params`` and returns ``{field_name: value}`` for every
@@ -343,7 +354,7 @@ def collect_cli_overrides(ctx: click.Context, cls: type) -> dict[str, object]:
         if cli_name:
             rename[cli_name] = f.name
 
-    overrides: dict[str, object] = {}
+    overrides: dict[str, Any] = {}
     for cli_name, value in ctx.params.items():
         field_name = rename.get(cli_name, cli_name)
         if field_name in valid_fields and is_cli_provided(ctx, cli_name):
diff --git a/tests/unit/analyze/test_analyze_onnx.py b/tests/unit/analyze/test_analyze_onnx.py
index 5177d534d..818628003 100644
--- a/tests/unit/analyze/test_analyze_onnx.py
+++ b/tests/unit/analyze/test_analyze_onnx.py
@@ -118,26 +118,6 @@ def test_has_errors_false_when_no_lint_errors(self) -> None:
         result = AnalyzeResult(lint=lint, optimization_config=WinMLOptimizationConfig())
         assert result.has_errors is False
 
-    def test_autoconf_truthy_with_nonempty_config(self) -> None:
-        """autoconf is truthy when config has flags."""
-        lint = _make_lint_result()
-        config = WinMLOptimizationConfig(gelu_fusion=True)
-        result = AnalyzeResult(lint=lint, optimization_config=config)
-        assert result.autoconf  # truthy — has flags
-        assert result.autoconf["gelu_fusion"] is True
-
-    def test_autoconf_falsy_with_empty_config(self) -> None:
-        """autoconf is falsy when config is empty."""
-        lint = _make_lint_result()
-        result = AnalyzeResult(lint=lint, optimization_config=WinMLOptimizationConfig())
-        assert not result.autoconf  # falsy — empty dict
-
-    def test_autoconf_falsy_when_none(self) -> None:
-        """autoconf is falsy when optimization_config is None."""
-        lint = _make_lint_result()
-        result = AnalyzeResult(lint=lint, optimization_config=None)
-        assert not result.autoconf  # falsy — None
-
     def test_lint_field_accessible(self) -> None:
         """lint field is directly accessible."""
         lint = _make_lint_result(errors=1, warnings=2, info=3)
@@ -364,8 +344,8 @@ def test_autoconf_with_gelu_pattern(self, tmp_path) -> None:
 
             result = analyze_onnx(str(model_file), ep="QNNExecutionProvider", device="NPU")
 
-        assert result.autoconf  # truthy — has flags
-        assert result.autoconf["gelu_fusion"] is True
+        assert result.optimization_config  # truthy — has flags
+        assert result.optimization_config["gelu_fusion"] is True
 
     def test_autoconf_with_multiple_patterns(self, tmp_path) -> None:
         """Autoconf picks up multiple fusion flags from action items."""
diff --git a/tests/unit/commands/test_eval.py b/tests/unit/commands/test_eval.py
index c3f5f7587..99b021006 100644
--- a/tests/unit/commands/test_eval.py
+++ b/tests/unit/commands/test_eval.py
@@ -385,6 +385,87 @@ def to_dict(self):
         assert cfg.dataset.samples == 33
 
 
+# ---------------------------------------------------------------------------
+# --label-mapping wiring (Click Path → label_mapping_file str)
+# ---------------------------------------------------------------------------
+
+
+class TestLabelMappingWiring:
+    """``--label-mapping`` is a Click ``Path`` that must land in
+    ``cfg.dataset.label_mapping_file`` (a ``str``), NOT in
+    ``cfg.dataset.label_mapping`` (the *parsed* ``dict[str, int] | None``).
+
+    The Click param name is ``label_mapping_path`` (distinct from the
+    ``DatasetConfig.label_mapping`` field) precisely so
+    ``cli_utils.collect_cli_overrides`` doesn't accidentally pass a Path
+    into the dict field. This test locks in that wiring.
+    """
+
+    def test_label_mapping_path_routes_to_file_field_not_dict_field(
+        self,
+        runner: CliRunner,
+        tmp_path,
+    ):
+        """--label-mapping <file> must set cfg.dataset.label_mapping_file (str)
+        and leave cfg.dataset.label_mapping (dict) untouched at this stage."""
+        from winml.modelkit.commands.eval import eval as eval_cmd
+
+        # Sentinel mapping file; existence matters because Click validates the path.
+        label_file = tmp_path / "labels.json"
+        label_file.write_text(json.dumps({"cat": 0, "dog": 1}), encoding="utf-8")
+
+        captured_cfg: dict = {}
+
+        def _fake_evaluate(cfg):
+            captured_cfg["cfg"] = cfg
+
+            class _R:
+                config = cfg
+                metrics = {"accuracy": 1.0}  # noqa: RUF012
+
+                def to_dict(self):
+                    return {"metrics": self.metrics, "config": cfg.to_dict()}
+
+            return _R()
+
+        with (
+            patch("winml.modelkit.eval.evaluate", side_effect=_fake_evaluate),
+            patch("winml.modelkit.commands.eval._resolve_device", return_value=None),
+            patch(
+                "winml.modelkit.commands.eval._resolve_label_mapping",
+                return_value=None,
+            ),
+            patch("winml.modelkit.commands.eval._write_and_display", return_value=None),
+        ):
+            result = runner.invoke(
+                eval_cmd,
+                [
+                    "-m",
+                    "microsoft/resnet-50",
+                    "--task",
+                    "image-classification",
+                    "--label-mapping",
+                    str(label_file),
+                ],
+                obj={"debug": False},
+            )
+
+        assert result.exit_code == 0, result.output
+        cfg = captured_cfg["cfg"]
+
+        # The CLI Path must land in label_mapping_file as a str — the field
+        # is serialized via to_dict(), so a Path would break JSON output.
+        assert cfg.dataset.label_mapping_file == str(label_file)
+        assert isinstance(cfg.dataset.label_mapping_file, str)
+
+        # label_mapping is the *parsed* dict and must stay at its default
+        # (None) until _resolve_label_mapping loads it at eval time. If the
+        # Click Path ever leaks into this field, this assertion fails — that
+        # was the bug introduced when ``collect_cli_overrides`` saw a Click
+        # param named ``label_mapping`` matching a same-named dataclass field.
+        assert cfg.dataset.label_mapping is None
+
+
 # ---------------------------------------------------------------------------
 # Per-task default dataset resolution
 # ---------------------------------------------------------------------------

From d96aa0cecae853a914b12848c09fb54458f95242 Mon Sep 17 00:00:00 2001
From: Zhipeng Wang <zhiwang@microsoft.com>
Date: Wed, 3 Jun 2026 11:17:57 +0800
Subject: [PATCH 027/143] refactor(task): relocate map_task_synonym to
 loader.task as to_optimum_task (#801)

## What

PR1 of #800. Relocate `map_task_synonym` ->
`loader/task.py::to_optimum_task` to establish a single WinML->Optimum
task-collapse boundary.

## Changes

- `loader/task.py`: add `to_optimum_task` + `TASK_SYNONYM_EXTENSIONS`
(moved from `export/io.py`); exported via `loader/__init__.py`.
- `export/io.py`: local implementation removed; `map_task_synonym` kept
as a backward-compatible alias (`= to_optimum_task`); internal use
repointed.
- Optimum-boundary call sites repointed to `to_optimum_task`:
`commands/inspect.py`, `export/htp/exporter.py`, `inspect/resolver.py`.
- `commands/build.py`: `TASK_SYNONYM_EXTENSIONS` now imported from
`loader`.
- New `tests/unit/loader/test_task_boundary.py` pins the collapse
contract.

## Behavior

No behavior change. `map_task_synonym` stays importable from
`export.io`; the collapse semantics (`image-feature-extraction` ->
`feature-extraction`, WinML extensions preserved) are byte-identical.
Existing synonym and #777/#782 regression tests stay green.

Sets up PR2 (#800), which adds the modality-aware `detect_task` and
relies on this single collapse boundary.
---
 src/winml/modelkit/commands/build.py      |  4 +-
 src/winml/modelkit/commands/inspect.py    |  7 ++-
 src/winml/modelkit/export/htp/exporter.py |  6 +--
 src/winml/modelkit/export/io.py           | 57 ++++-------------------
 src/winml/modelkit/inspect/resolver.py    |  6 +--
 src/winml/modelkit/loader/__init__.py     |  4 ++
 src/winml/modelkit/loader/task.py         | 38 +++++++++++++++
 tests/unit/loader/test_task_boundary.py   | 34 ++++++++++++++
 8 files changed, 96 insertions(+), 60 deletions(-)
 create mode 100644 tests/unit/loader/test_task_boundary.py

diff --git a/src/winml/modelkit/commands/build.py b/src/winml/modelkit/commands/build.py
index 19272e7dc..df5a8d388 100644
--- a/src/winml/modelkit/commands/build.py
+++ b/src/winml/modelkit/commands/build.py
@@ -276,8 +276,8 @@ def _validate_task_supported_for_model(
     Raises:
         ValueError: If the task is not supported for the model architecture.
     """
-    from ..export.io import TASK_SYNONYM_EXTENSIONS, ensure_hf_models_registered
-    from ..loader.task import get_supported_tasks, normalize_task
+    from ..export.io import ensure_hf_models_registered
+    from ..loader.task import TASK_SYNONYM_EXTENSIONS, get_supported_tasks, normalize_task
 
     if hf_config is None:
         from transformers import AutoConfig
diff --git a/src/winml/modelkit/commands/inspect.py b/src/winml/modelkit/commands/inspect.py
index b79b1012e..38a1f1769 100644
--- a/src/winml/modelkit/commands/inspect.py
+++ b/src/winml/modelkit/commands/inspect.py
@@ -428,14 +428,13 @@ def _inspect_model_v2(
             import optimum.exporters.onnx.model_configs  # noqa: F401
             from optimum.exporters.tasks import TasksManager
 
-            # TasksManager expects normalized task names
-            from ..export.io import map_task_synonym
-            from ..loader import resolve_optimum_library
+            # TasksManager expects Optimum-canonical task names
+            from ..loader import resolve_optimum_library, to_optimum_task
 
             onnx_config_cls = TasksManager.get_exporter_config_constructor(
                 exporter="onnx",
                 model_type=model_type,
-                task=map_task_synonym(task),
+                task=to_optimum_task(task),
                 library_name=resolve_optimum_library(model_type),
             )
             if onnx_config_cls:
diff --git a/src/winml/modelkit/export/htp/exporter.py b/src/winml/modelkit/export/htp/exporter.py
index ce8109572..c960deafa 100644
--- a/src/winml/modelkit/export/htp/exporter.py
+++ b/src/winml/modelkit/export/htp/exporter.py
@@ -470,14 +470,14 @@ def _get_optimum_patcher(model: nn.Module, task: str | None) -> Any:
             logger.debug("Model has no config.model_type; skipping Optimum patcher.")
             return contextlib.nullcontext()
 
-        # TasksManager expects normalized task names
-        from ..io import map_task_synonym
+        # TasksManager expects Optimum-canonical task names
+        from ...loader import to_optimum_task
 
         try:
             cfg_cls = TasksManager.get_exporter_config_constructor(
                 "onnx",
                 model_type=model_type,
-                task=map_task_synonym(task),
+                task=to_optimum_task(task),
                 library_name="transformers",
             )
             return cfg_cls(model_config).patch_model_for_export(model)
diff --git a/src/winml/modelkit/export/io.py b/src/winml/modelkit/export/io.py
index dcb374452..bc8011660 100644
--- a/src/winml/modelkit/export/io.py
+++ b/src/winml/modelkit/export/io.py
@@ -40,6 +40,7 @@
     DummyTextInputGenerator,
 )
 
+from ..loader import to_optimum_task
 from .value_range import intercept_value_ranges
 
 
@@ -79,53 +80,13 @@ def ensure_hf_models_registered() -> None:
 
 
 # =============================================================================
-# Task Synonym Extensions (extends Optimum's TasksManager.map_from_synonym)
+# Task Synonym Extensions (relocated to loader.task — single source of truth)
 # =============================================================================
-
-# Extends Optimum's built-in task synonym mapping for tasks it doesn't recognize.
-# Optimum's map_from_synonym handles known synonyms like:
-#   - "image-feature-extraction" → "feature-extraction"
-# This dict adds mappings for tasks Optimum doesn't support at all.
-TASK_SYNONYM_EXTENSIONS: dict[str, str] = {
-    # next-sentence-prediction has same I/O as text-classification: input_ids → logits
-    "next-sentence-prediction": "text-classification",
-    # mask-generation is registered via register_onnx_overwrite for SAM2.
-    # Optimum incorrectly maps it to "feature-extraction"; preserve as-is.
-    "mask-generation": "mask-generation",
-}
-
-
-def map_task_synonym(task: str) -> str:
-    """Map task name to canonical form, extending Optimum's synonym mapping.
-
-    Our extensions take priority over Optimum's built-in synonym map.
-    If a task is found in ``TASK_SYNONYM_EXTENSIONS``, return immediately
-    without passing through Optimum (which may incorrectly normalize
-    custom-registered tasks like ``mask-generation``).
-
-    Args:
-        task: Task name (e.g., "next-sentence-prediction", "image-feature-extraction")
-
-    Returns:
-        Canonical task name (e.g., "text-classification", "feature-extraction")
-
-    Example:
-        >>> map_task_synonym("next-sentence-prediction")  # Our extension
-        'text-classification'
-        >>> map_task_synonym("mask-generation")  # Preserved (not Optimum-normalized)
-        'mask-generation'
-        >>> map_task_synonym("image-feature-extraction")  # Optimum's synonym
-        'feature-extraction'
-        >>> map_task_synonym("text-classification")  # Already canonical
-        'text-classification'
-    """
-    # Our extensions take priority — return early to prevent Optimum from
-    # incorrectly normalizing custom-registered tasks.
-    if task in TASK_SYNONYM_EXTENSIONS:
-        return TASK_SYNONYM_EXTENSIONS[task]
-
-    # Fallback: normalize via Optimum's built-in synonym mapping
-    return TasksManager.map_from_synonym(task)
+# ``TASK_SYNONYM_EXTENSIONS`` and the WinML -> Optimum collapse now live in
+# ``loader.task`` as ``to_optimum_task``. Both are imported above and re-exported
+# here; ``map_task_synonym`` is kept as a backward-compatible alias for existing
+# importers (and is identical to ``to_optimum_task``).
+map_task_synonym = to_optimum_task
 
 
 # =============================================================================
@@ -195,7 +156,7 @@ def _get_onnx_config(
 
     Args:
         model_type: HF model type (e.g., "bert", "clip_vision_model")
-        task: Task name (will be normalized via map_task_synonym)
+        task: Task name (will be collapsed to Optimum-canonical via to_optimum_task)
         hf_config: HuggingFace PretrainedConfig for OnnxConfig instantiation
         library_name: Source library (default: "transformers")
         exporter: Export backend (default: "onnx")
@@ -208,7 +169,7 @@ def _get_onnx_config(
     """
     ensure_hf_models_registered()
 
-    normalized_task = map_task_synonym(task)
+    normalized_task = to_optimum_task(task)
 
     # Route model_types whose Optimum OnnxConfig is registered under another
     # library (e.g. timm via "timm_wrapper" -> "timm") so the lookup succeeds
diff --git a/src/winml/modelkit/inspect/resolver.py b/src/winml/modelkit/inspect/resolver.py
index e42afdc5f..91a791bb9 100644
--- a/src/winml/modelkit/inspect/resolver.py
+++ b/src/winml/modelkit/inspect/resolver.py
@@ -355,15 +355,15 @@ def resolve_exporter(
         import optimum.exporters.onnx.model_configs  # noqa: F401
         from optimum.exporters.tasks import TasksManager
 
-        # TasksManager expects normalized task names
-        from ..export.io import map_task_synonym
+        # TasksManager expects Optimum-canonical task names
+        from ..loader import to_optimum_task
 
         # TasksManager uses underscores (sam2_video), not hyphens (sam2-video)
         # Use original model_type for TasksManager lookup
         onnx_config_cls = TasksManager.get_exporter_config_constructor(
             exporter="onnx",
             model_type=model_type,
-            task=map_task_synonym(task),
+            task=to_optimum_task(task),
             library_name=resolve_optimum_library(model_type),
         )
         if onnx_config_cls:
diff --git a/src/winml/modelkit/loader/__init__.py b/src/winml/modelkit/loader/__init__.py
index 89a8a0f55..13c54dfef 100644
--- a/src/winml/modelkit/loader/__init__.py
+++ b/src/winml/modelkit/loader/__init__.py
@@ -29,17 +29,20 @@
 from .task import (
     HF_TASK_DEFAULTS,
     KNOWN_TASKS,
+    TASK_SYNONYM_EXTENSIONS,
     get_supported_tasks,
     get_task_abbrev,
     normalize_task,
     resolve_optimum_library,
     resolve_task_and_model_class,
+    to_optimum_task,
 )
 
 
 __all__ = [
     "HF_TASK_DEFAULTS",
     "KNOWN_TASKS",
+    "TASK_SYNONYM_EXTENSIONS",
     "WinMLLoaderConfig",
     "get_supported_tasks",
     "get_task_abbrev",
@@ -49,6 +52,7 @@
     "resolve_loader_config",
     "resolve_optimum_library",
     "resolve_task_and_model_class",
+    "to_optimum_task",
 ]
 
 
diff --git a/src/winml/modelkit/loader/task.py b/src/winml/modelkit/loader/task.py
index 08eb141aa..ca24b85ce 100644
--- a/src/winml/modelkit/loader/task.py
+++ b/src/winml/modelkit/loader/task.py
@@ -10,6 +10,7 @@
     resolve_task_and_model_class  - Main orchestrator (3 resolution cases)
     resolve_optimum_library      - Route a model_type to the Optimum export library
     normalize_task               - Map task aliases to canonical names
+    to_optimum_task              - Collapse a WinMLTask to its Optimum-canonical form
     get_task_abbrev              - Abbreviated task name for cache keys
     get_supported_tasks          - List ONNX-exportable tasks for a model type
 
@@ -507,6 +508,43 @@ def normalize_task(task: str) -> str:
     return TasksManager.map_from_synonym(task)
 
 
+# WinML task-synonym extensions — extend Optimum's ``TasksManager.map_from_synonym``
+# for tasks it does not recognize or mis-maps. Entries here take priority over Optimum.
+TASK_SYNONYM_EXTENSIONS: dict[str, str] = {
+    # next-sentence-prediction has the same I/O as text-classification: input_ids -> logits
+    "next-sentence-prediction": "text-classification",
+    # mask-generation is registered via register_onnx_overwrite for SAM2.
+    # Optimum incorrectly maps it to "feature-extraction"; preserve as-is.
+    "mask-generation": "mask-generation",
+}
+
+
+def to_optimum_task(task: str) -> str:
+    """Map a task name to its Optimum-canonical form, extending Optimum's synonyms.
+
+    This is the single WinML -> Optimum boundary translation: call it only at the
+    moment of an Optimum API call (e.g. ``TasksManager.get_exporter_config_constructor``).
+    The result is lossy — modality-aware names collapse
+    (``image-feature-extraction`` -> ``feature-extraction``).
+
+    WinML extensions in ``TASK_SYNONYM_EXTENSIONS`` take priority and short-circuit
+    before Optimum, which may otherwise mis-normalize custom-registered tasks such as
+    ``mask-generation``.
+
+    Args:
+        task: Task name (a WinMLTask or an alias).
+
+    Returns:
+        Optimum-canonical task name.
+    """
+    if task in TASK_SYNONYM_EXTENSIONS:
+        return TASK_SYNONYM_EXTENSIONS[task]
+
+    from optimum.exporters.tasks import TasksManager
+
+    return TasksManager.map_from_synonym(task)
+
+
 def get_task_abbrev(task: str) -> str:
     """Get abbreviated task name for cache keys.
 
diff --git a/tests/unit/loader/test_task_boundary.py b/tests/unit/loader/test_task_boundary.py
new file mode 100644
index 000000000..df183cfe0
--- /dev/null
+++ b/tests/unit/loader/test_task_boundary.py
@@ -0,0 +1,34 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+"""Unit tests for ``to_optimum_task`` — the single WinML -> Optimum collapse boundary.
+
+``to_optimum_task`` is the relocated/renamed ``export.io.map_task_synonym``. It must
+reproduce that behavior verbatim: WinML extensions short-circuit before Optimum, and
+everything else passes through ``TasksManager.map_from_synonym`` (which collapses
+modality-aware names like ``image-feature-extraction`` to ``feature-extraction``).
+"""
+
+from __future__ import annotations
+
+import pytest
+
+from winml.modelkit.loader import to_optimum_task
+
+
+@pytest.mark.parametrize(
+    "task, expected",
+    [
+        # Optimum collapses modality (image-feature-extraction -> feature-extraction).
+        ("image-feature-extraction", "feature-extraction"),
+        # WinML extension: routed to its Optimum-canonical target.
+        ("next-sentence-prediction", "text-classification"),
+        # WinML extension preserved as-is (Optimum would mis-map it otherwise).
+        ("mask-generation", "mask-generation"),
+        # Already-canonical task passes through unchanged.
+        ("text-classification", "text-classification"),
+    ],
+)
+def test_to_optimum_task(task: str, expected: str) -> None:
+    assert to_optimum_task(task) == expected

From c047cf39b5914be263da9724b36f86c8f70cac23 Mon Sep 17 00:00:00 2001
From: xieofxie <xieofxie@126.com>
Date: Wed, 3 Jun 2026 14:15:55 +0800
Subject: [PATCH 028/143] fix: unify verbosity declaration and stderr-only
 logger output (#566) (#793)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fixes #566.

## Problem

- Top-level group declared ``-v/--verbose`` (count) and ``-q/--quiet``,
but 12 of 13 subcommands redeclared ``--verbose`` as ``is_flag=True``,
so ``winml export -vv …`` errored with ``extra argument``.
- No subcommand exposed ``-q/--quiet``, so ``winml export --quiet …``
failed with ``no such option``.
- Each command wired logging differently; DEBUG/INFO lines interleaved
with Rich tables on stdout, breaking ``cmd > out 2> log.txt``.

## Changes

- ``utils/cli.py``: ``verbosity_options`` decorator (``-v`` count,
``-q`` flag) + new ``resolve_verbosity(ctx, verbose, quiet)`` helper
that merges top-level and subcommand-level values (max of verbose, OR of
quiet). Honors the legacy ``ctx.obj[""debug""]`` so tests that bypass
``main()`` still raise the verbosity floor.
- ``utils/logging.py``: format ``[%(asctime)s %(levelname)-7s %(name)s]
%(message)s`` with ``datefmt=%H:%M:%S``, ``stream=sys.stderr``.
Idempotent — re-creates the WinML handler bound to the current
``sys.stderr`` on each call so Click ``CliRunner`` stream redirection
keeps working, and leaves non-WinML handlers (notably pytest ``caplog``)
intact.
- ``cli.py``: top-level group uses ``@verbosity_options`` (replaces
inline declarations); ``--debug`` alias preserved.
- 12 subcommands (``build``, ``compile``, ``config``, ``eval``,
``export``, ``inspect``, ``optimize``, ``perf``, ``quantize``, ``sys``,
plus ``analyze`` cleanup): replace ad-hoc ``--verbose``
(``is_flag=True``) with ``@cli_utils.verbosity_options``, add ``quiet:
bool`` param, call ``configure_logging(verbosity=verbose, quiet=quiet)``
after ``resolve_verbosity``. Removes the legacy ``if
ctx.obj.get(""debug""): verbose = True`` blocks (folded into the
helper).
- ``serve/app.py``: pre-existing latent bug — module-level
``logging.getLogger(""winml.modelkit"").setLevel(INFO)`` ran at import,
which muted DEBUG capture in unrelated tests that got collected
alongside the serve test module. Split into ``_attach_log_handler()``
(idempotent, called from ``_register_routes``) and a paired
``_ensure_log_capture_level`` / ``_restore_log_capture_level`` invoked
from the production lifespan. Tests that build the app via
``_register_routes`` + a mock lifespan no longer leak global logger
state.

## Behavior

Both flag positions work; subcommand value wins when both are passed
(max/OR merge):

```text
winml -v export -m … -o …            # top-level: works
winml export -vv -m … -o …            # subcommand: now works (was: extra argument)
winml --quiet export -m … -o …        # top-level: works
winml export --quiet -m … -o …        # subcommand: now works (was: no such option)
winml inspect -vv -m … --format json > out 2> log.txt   # clean stdout/stderr split
```

## Tests

- ``tests/cli/`` (23): pass
- ``tests/unit/`` (5061 collected): **5058 pass**, 3 fail — all 3
pre-existing on main and unrelated to this change:
-
``test_winml_session.py::TestOpenVINODeviceRouting::test_compile_openvino_cpu_device_succeeds``
-
``test_winml_session.py::TestOpenVINODeviceRouting::test_compile_openvino_cpu_provider_not_npu``
(both env: no OpenVINO EP installed)
-
``test_config_utils.py::TestMergeConfigNoneHandling::test_none_to_value_transition``
(test isolation, passes alone)

---------

Co-authored-by: hualxie <hualxie@microsoft.com>
---
 src/winml/modelkit/cli.py               |  15 +-
 src/winml/modelkit/commands/analyze.py  |   6 +-
 src/winml/modelkit/commands/build.py    |  21 +--
 src/winml/modelkit/commands/compile.py  |  33 ++--
 src/winml/modelkit/commands/config.py   |  18 +-
 src/winml/modelkit/commands/eval.py     |  16 +-
 src/winml/modelkit/commands/export.py   |  26 ++-
 src/winml/modelkit/commands/inspect.py  |  33 ++--
 src/winml/modelkit/commands/optimize.py |  22 +--
 src/winml/modelkit/commands/perf.py     |  20 +--
 src/winml/modelkit/commands/quantize.py |  19 +-
 src/winml/modelkit/commands/sys.py      | 230 +++++++++++-------------
 src/winml/modelkit/serve/app.py         |  56 +++++-
 src/winml/modelkit/utils/cli.py         |  67 +++++--
 src/winml/modelkit/utils/logging.py     |  41 ++++-
 15 files changed, 330 insertions(+), 293 deletions(-)

diff --git a/src/winml/modelkit/cli.py b/src/winml/modelkit/cli.py
index 4504856b7..2e4745950 100644
--- a/src/winml/modelkit/cli.py
+++ b/src/winml/modelkit/cli.py
@@ -29,6 +29,7 @@
 from . import __version__
 from .telemetry import ActionGroup
 from .telemetry import telemetry as _telemetry_mod
+from .utils.cli import verbosity_options
 from .utils.logging import configure_logging, flush_ort_startup_logs
 
 
@@ -240,19 +241,7 @@ def format_commands(self, ctx: click.Context, formatter: click.HelpFormatter) ->
     context_settings={"help_option_names": ["-h", "--help"]},
 )
 @click.version_option(version=__version__, prog_name="winml")
-@click.option(
-    "--verbose",
-    "-v",
-    count=True,
-    help="Increase verbosity (-v=INFO, -vv=DEBUG)",
-)
-@click.option(
-    "--quiet",
-    "-q",
-    is_flag=True,
-    default=False,
-    help="Quiet mode - errors only",
-)
+@verbosity_options()
 @click.option(
     "--debug",
     is_flag=True,
diff --git a/src/winml/modelkit/commands/analyze.py b/src/winml/modelkit/commands/analyze.py
index 29f8b0d23..b05463a58 100644
--- a/src/winml/modelkit/commands/analyze.py
+++ b/src/winml/modelkit/commands/analyze.py
@@ -655,7 +655,7 @@ def _ep_name_device_display_name(ep_name: str, device_name: str) -> str:
         "all = all rule-data-backed devices; auto = infer from local availability"
     ),
 )
-@cli_utils.verbosity_options
+@cli_utils.verbosity_options()
 @cli_utils.build_config_option()
 @cli_utils.output_option("Save JSON output to file")
 @click.option(
@@ -730,7 +730,9 @@ def analyze(
         if not cli_utils.is_cli_provided(ctx, "ep") and "execution_provider" in cc:
             ep = cc["execution_provider"]
 
-    # Configure logging
+    # Configure logging — merge with top-level group so `winml -v analyze …`
+    # and `winml analyze -v …` are equivalent.
+    verbose, quiet = cli_utils.resolve_verbosity(ctx, verbose, quiet)
     configure_logging(verbosity=verbose, quiet=quiet)
 
     try:
diff --git a/src/winml/modelkit/commands/build.py b/src/winml/modelkit/commands/build.py
index df5a8d388..bc21b274f 100644
--- a/src/winml/modelkit/commands/build.py
+++ b/src/winml/modelkit/commands/build.py
@@ -37,6 +37,7 @@
     print_stage_skip,
     print_stages_header,
 )
+from ..utils.logging import configure_logging
 
 
 if TYPE_CHECKING:
@@ -488,13 +489,7 @@ def _validate_loader_tasks_for_model(
 @cli_utils.trust_remote_code_option(
     optional_message="Trust remote code for custom model architectures (e.g., Mu2)."
 )
-@click.option(
-    "-v",
-    "--verbose",
-    is_flag=True,
-    default=False,
-    help="Enable verbose logging",
-)
+@cli_utils.verbosity_options()
 @click.pass_context
 def build(
     ctx: click.Context,
@@ -511,7 +506,8 @@ def build(
     no_analyze: bool,
     max_optim_iterations: int | None,
     trust_remote_code: bool,
-    verbose: bool,
+    verbose: int,
+    quiet: bool,
 ) -> None:
     r"""Build a WinML-optimized ONNX model from a HuggingFace model or .onnx file.
 
@@ -541,12 +537,9 @@ def build(
         # Force rebuild
         winml build -c config.json -m microsoft/resnet-50 -o output/ --rebuild
     """
-    # Inherit debug flag from parent context
-    if ctx.obj and ctx.obj.get("debug"):
-        verbose = True
-
-    if verbose:
-        logging.basicConfig(level=logging.DEBUG)
+    # Merge top-level -v/-q with subcommand-level flags so either position works.
+    verbose, quiet = cli_utils.resolve_verbosity(ctx, verbose, quiet)
+    configure_logging(verbosity=verbose, quiet=quiet)
 
     # Validate mutual exclusion
     if output_dir and use_cache:
diff --git a/src/winml/modelkit/commands/compile.py b/src/winml/modelkit/commands/compile.py
index 7cdbfa819..22c1e0097 100644
--- a/src/winml/modelkit/commands/compile.py
+++ b/src/winml/modelkit/commands/compile.py
@@ -72,13 +72,6 @@
     default=True,
     help="Validate compiled model (default: enabled)",
 )
-@click.option(
-    "--verbose",
-    "-v",
-    is_flag=True,
-    default=False,
-    help="Enable verbose output",
-)
 @click.option(
     "--compiler",
     type=click.Choice(["ort", "qairt"]),
@@ -105,6 +98,7 @@
     help="List available compilers for the selected device and exit",
 )
 @cli_utils.build_config_option()
+@cli_utils.verbosity_options()
 @click.pass_context
 def compile(
     ctx: click.Context,
@@ -114,7 +108,8 @@ def compile(
     device: str,
     ep: EPNameOrAlias | None,
     validate: bool,
-    verbose: bool,
+    verbose: int,
+    quiet: bool,
     compiler: str,
     qnn_sdk_root: Path | None,
     embed: bool,
@@ -140,9 +135,8 @@ def compile(
         # Compile using QAIRT SDK
         winml compile -m model.onnx --compiler qairt --qnn-sdk-root /path/to/sdk
     """
-    # Inherit debug mode from parent
-    if ctx.obj and ctx.obj.get("debug"):
-        verbose = True
+    # Merge top-level -v/-q with subcommand-level flags so either position works.
+    verbose, quiet = cli_utils.resolve_verbosity(ctx, verbose, quiet)
 
     # Apply build config defaults (CLI explicit options take precedence).
     # Read raw JSON so missing keys are distinguishable from dataclass defaults.
@@ -157,10 +151,17 @@ def compile(
             embed = cc["embed_context"]
         if not cli_utils.is_cli_provided(ctx, "validate") and "validate" in cc:
             validate = cc["validate"]
-        if not cli_utils.is_cli_provided(ctx, "verbose") and "verbose" in cc:
-            verbose = cc["verbose"]
-
-    configure_logging(verbose=verbose)
+        # Config-file verbosity fallback. CLI flags always win: only honor the
+        # build config's `verbose` when the user gave no verbosity on either CLI
+        # position (resolve_verbosity above already merged top-level + subcommand
+        # -v, so a merged 0 means "none on the CLI") and did not ask for --quiet.
+        # ``int`` maps both `true`->1 (INFO) and an explicit count (e.g. 2->DEBUG).
+        # Currently compile-only; tracked for all commands in
+        # https://github.com/microsoft/winml-cli/issues/799
+        if verbose == 0 and not quiet and "verbose" in cc:
+            verbose = int(cc["verbose"])
+
+    configure_logging(verbosity=verbose, quiet=quiet)
 
     try:
         resolved_device, _ = resolve_device(device, ep=ep)
@@ -200,7 +201,7 @@ def compile(
         )
 
     config.validate = validate
-    config.verbose = verbose
+    config.verbose = bool(verbose)
 
     # Set compiler options
     config.ep_config.compiler = compiler
diff --git a/src/winml/modelkit/commands/config.py b/src/winml/modelkit/commands/config.py
index fe6a5b79e..203d9aa03 100644
--- a/src/winml/modelkit/commands/config.py
+++ b/src/winml/modelkit/commands/config.py
@@ -30,6 +30,7 @@
 import click
 
 from ..utils import cli as cli_utils
+from ..utils.logging import configure_logging
 
 
 if TYPE_CHECKING:
@@ -124,13 +125,6 @@ def _apply_stage_overrides(cfg: Any, *, no_quant: bool, no_compile: bool) -> Non
     default="transformers",
     help="Source library for TasksManager (default: transformers)",
 )
-@click.option(
-    "-v",
-    "--verbose",
-    is_flag=True,
-    default=False,
-    help="Enable verbose logging",
-)
 @click.option(
     "--no-quant",
     is_flag=True,
@@ -144,7 +138,10 @@ def _apply_stage_overrides(cfg: Any, *, no_quant: bool, no_compile: bool) -> Non
     help="Exclude compilation from generated config (sets compile=None). Default: exclude.",
 )
 @cli_utils.trust_remote_code_option()
+@cli_utils.verbosity_options()
+@click.pass_context
 def config(
+    ctx: click.Context,
     model: str | None,
     task: str | None,
     model_class: str | None,
@@ -157,7 +154,8 @@ def config(
     precision: str,
     output: Path | None,
     library_name: str,
-    verbose: bool,
+    verbose: int,
+    quiet: bool,
     no_quant: bool,
     no_compile: bool,
     trust_remote_code: bool,
@@ -206,8 +204,8 @@ def config(
         # Generate configs for submodules
         winml config -m microsoft/resnet-50 --module ResNetConvLayer
     """
-    if verbose:
-        logging.basicConfig(level=logging.DEBUG)
+    verbose, quiet = cli_utils.resolve_verbosity(ctx, verbose, quiet)
+    configure_logging(verbosity=verbose, quiet=quiet)
 
     hf_model = model  # rename for clarity in this function
     # Validate: at least one of -m, --model-type, or --model-class is required
diff --git a/src/winml/modelkit/commands/eval.py b/src/winml/modelkit/commands/eval.py
index 61647d262..c719a8ecc 100644
--- a/src/winml/modelkit/commands/eval.py
+++ b/src/winml/modelkit/commands/eval.py
@@ -17,6 +17,7 @@
 
 from ..utils import cli as cli_utils
 from ..utils.eval_utils import TASK_SCHEMAS, TaskSchema
+from ..utils.logging import configure_logging
 
 
 if TYPE_CHECKING:
@@ -126,13 +127,6 @@
     help='Path to a JSON file with label mapping: {"label_name": id}.',
 )
 @cli_utils.output_option("Output JSON file path.")
-@click.option(
-    "-v",
-    "--verbose",
-    is_flag=True,
-    default=False,
-    help="Enable verbose output.",
-)
 @click.option(
     "--dataset-script",
     type=str,
@@ -148,6 +142,7 @@
     help="Print expected dataset schema for the given --task and exit.",
 )
 @cli_utils.build_config_option()
+@cli_utils.verbosity_options()
 @click.pass_context
 def eval(
     ctx: click.Context,
@@ -166,7 +161,8 @@ def eval(
     column: tuple[str, ...],
     label_mapping_path: Path | None,
     output: Path | None,
-    verbose: bool,
+    verbose: int,
+    quiet: bool,
     dataset_script: str | None,
     trust_remote_code: bool,
     show_schema: bool,
@@ -203,8 +199,8 @@ def eval(
         _print_schema(task_arg, schema)
         return
 
-    if verbose or (ctx.obj and ctx.obj.get("debug")):
-        logging.getLogger("winml.modelkit").setLevel(logging.DEBUG)
+    verbose, quiet = cli_utils.resolve_verbosity(ctx, verbose, quiet)
+    configure_logging(verbosity=verbose, quiet=quiet)
 
     from ..eval import evaluate
 
diff --git a/src/winml/modelkit/commands/export.py b/src/winml/modelkit/commands/export.py
index b58c9c84c..729a41802 100644
--- a/src/winml/modelkit/commands/export.py
+++ b/src/winml/modelkit/commands/export.py
@@ -32,6 +32,7 @@
 from rich.console import Console
 
 from ..utils import cli as cli_utils
+from ..utils.logging import configure_logging
 
 
 logger = logging.getLogger(__name__)
@@ -69,13 +70,6 @@ def _delete_onnx_with_external_data(onnx_path: Path) -> None:
     help="HuggingFace model name or local path (e.g., prajjwal1/bert-tiny)",
 )
 @cli_utils.output_option("Output ONNX file path (e.g., model.onnx)", required=True)
-@click.option(
-    "--verbose",
-    "-v",
-    is_flag=True,
-    default=False,
-    help="Enable verbose console output (8-step format)",
-)
 @click.option(
     "--with-report",
     is_flag=True,
@@ -129,12 +123,14 @@ def _delete_onnx_with_external_data(onnx_path: Path) -> None:
     help='JSON with shape overrides (e.g., {"sequence_length": 2048, "height": 640}).',
 )
 @cli_utils.build_config_option()
+@cli_utils.verbosity_options()
 @click.pass_context
 def export(
     ctx: click.Context,
     model: str,
     output: Path,
-    verbose: bool,
+    verbose: int,
+    quiet: bool,
     with_report: bool,
     no_hierarchy: bool,
     dynamo: bool,
@@ -186,9 +182,8 @@ def export(
         # Custom ONNX export configuration
         winml export -m bert-base-uncased -o bert.onnx --export-config config.json
     """
-    # Inherit debug mode from parent
-    if ctx.obj.get("debug"):
-        verbose = True
+    # Merge top-level -v/-q with subcommand-level flags so either position works.
+    verbose, quiet = cli_utils.resolve_verbosity(ctx, verbose, quiet)
 
     # Apply build config defaults (CLI explicit options take precedence).
     # Read raw JSON so missing keys are distinguishable from dataclass defaults.
@@ -209,9 +204,8 @@ def export(
     from ..export import export_pytorch as export_onnx
     from ..loader import load_hf_model
 
-    # Configure logging based on verbose flag
-    if verbose:
-        logging.getLogger("winml.modelkit").setLevel(logging.DEBUG)
+    # Configure logging — stderr only, shared format with the rest of the CLI.
+    configure_logging(verbosity=verbose, quiet=quiet)
 
     # Show export info
     console.print(f"[bold blue]Model:[/bold blue] {model}")
@@ -341,7 +335,7 @@ def export(
     if cli_utils.is_cli_provided(ctx, "no_hierarchy"):
         config_kwargs["enable_hierarchy_tags"] = not no_hierarchy
     if cli_utils.is_cli_provided(ctx, "verbose"):
-        config_kwargs["verbose"] = verbose
+        config_kwargs["verbose"] = bool(verbose)
     if cli_utils.is_cli_provided(ctx, "dynamo"):
         config_kwargs["dynamo"] = dynamo
 
@@ -401,7 +395,7 @@ def export(
             export_config=cfg,
             model_id=model,
             task=detected_task,
-            verbose=verbose,
+            verbose=bool(verbose),
             enable_reporting=with_report,
         )
         logger.debug("Export stats: %s", export_stats)
diff --git a/src/winml/modelkit/commands/inspect.py b/src/winml/modelkit/commands/inspect.py
index 38a1f1769..6947c6c19 100644
--- a/src/winml/modelkit/commands/inspect.py
+++ b/src/winml/modelkit/commands/inspect.py
@@ -26,6 +26,9 @@
 import click
 from rich.console import Console
 
+from ..utils import cli as cli_utils
+from ..utils.logging import configure_logging
+
 
 logger = logging.getLogger(__name__)
 # `console` is stdout-bound — table/JSON output goes here.
@@ -94,13 +97,6 @@ def _looks_like_local_path(model_id: str) -> bool:
     default="table",
     help="Output format (default: table)",
 )
-@click.option(
-    "-v",
-    "--verbose",
-    is_flag=True,
-    default=False,
-    help="Show full configuration details",
-)
 @click.option(
     "-t",
     "--task",
@@ -134,12 +130,14 @@ def _looks_like_local_path(model_id: str) -> bool:
     default=None,
     help="Override model class (e.g., BertForMaskedLM) — can be used without --model",
 )
+@cli_utils.verbosity_options()
 @click.pass_context
 def inspect(
     ctx: click.Context,
     model_id: str | None,
     output_format: str,
-    verbose: bool,
+    verbose: int,
+    quiet: bool,
     task: str | None,
     hierarchy: bool,
     list_tasks: bool,
@@ -203,13 +201,18 @@ def inspect(
         if not _p.exists():
             raise click.ClickException(f"Local path '{model_id}' does not exist.")
 
+    # Merge top-level -v/-q with subcommand-level flags so either position
+    # works, once and up front. The banner decision below needs the merged
+    # --quiet (so both `winml --quiet inspect …` and `winml inspect -q`
+    # suppress it); configure_logging needs both. Single source of truth.
+    verbose, quiet = cli_utils.resolve_verbosity(ctx, verbose, quiet)
+
     # Print a banner BEFORE the heavy import chain / network calls so users
     # see immediate feedback instead of ~14 s of silence and assume the
     # command hung (see #543). Banner + spinner go to stderr so `--format
     # json` consumers still get clean stdout. Suppressed in --quiet mode
     # and in JSON mode (Click 8.4 mixes stderr into CliRunner.result.output,
     # and JSON consumers expect clean stdout regardless).
-    quiet = bool(ctx.obj and ctx.obj.get("quiet"))
     json_mode = output_format.lower() == "json"
     target = model_id or model_type or model_class
     if not quiet and not json_mode:
@@ -218,13 +221,7 @@ def inspect(
     from ..inspect import InspectError, ModelNotFoundError, NetworkError
     from ..inspect.formatter import output_json, output_table
 
-    # Inherit debug mode from parent context
-    if ctx.obj and ctx.obj.get("debug"):
-        verbose = True
-
-    # Configure logging based on verbosity
-    if verbose:
-        logging.getLogger("winml.modelkit").setLevel(logging.DEBUG)
+    configure_logging(verbosity=verbose, quiet=quiet)
 
     try:
         if quiet or json_mode:
@@ -249,9 +246,9 @@ def inspect(
                 )
 
         if output_format.lower() == "json":
-            click.echo(output_json(result, verbose=verbose))
+            click.echo(output_json(result, verbose=bool(verbose)))
         else:
-            output_table(console, result, verbose=verbose)
+            output_table(console, result, verbose=bool(verbose))
 
     except ModelNotFoundError as e:
         raise click.ClickException(f"Model not found: {e}") from e
diff --git a/src/winml/modelkit/commands/optimize.py b/src/winml/modelkit/commands/optimize.py
index 7d78e391f..287f9a423 100644
--- a/src/winml/modelkit/commands/optimize.py
+++ b/src/winml/modelkit/commands/optimize.py
@@ -30,6 +30,7 @@
 
 from ..onnx import load_onnx, save_onnx
 from ..utils import cli as cli_utils
+from ..utils.logging import configure_logging
 
 
 if TYPE_CHECKING:
@@ -179,13 +180,7 @@ def capability_options(func: F) -> F:
     default=None,
     help="Configuration file (YAML/JSON)",
 )
-@click.option(
-    "--verbose",
-    "-v",
-    is_flag=True,
-    default=False,
-    help="Enable verbose output",
-)
+@cli_utils.verbosity_options()
 @capability_options
 @click.pass_context  # type: ignore[arg-type]  # capability_options widens the signature; click stubs want positional-only ctx but we keep it keyword-callable for back-compat
 def optimize(
@@ -195,7 +190,8 @@ def optimize(
     model: Path | None,
     output: Path | None,
     config: Path | None,
-    verbose: bool,
+    verbose: int,
+    quiet: bool,
     **kwargs: Any,
 ) -> None:
     r"""Optimize ONNX model with capability-driven optimizer.
@@ -339,13 +335,9 @@ def optimize(
     if model is None:
         raise click.UsageError("Missing option '--model' / '-m'.")
 
-    # Inherit debug mode from parent
-    if ctx.obj and ctx.obj.get("debug"):
-        verbose = True
-
-    # Configure logging
-    if verbose:
-        logging.getLogger("winml.modelkit").setLevel(logging.DEBUG)
+    # Merge top-level -v/-q with subcommand-level flags so either position works.
+    verbose, quiet = cli_utils.resolve_verbosity(ctx, verbose, quiet)
+    configure_logging(verbosity=verbose, quiet=quiet)
 
     # Import optimizer
     from ..optim import Optimizer
diff --git a/src/winml/modelkit/commands/perf.py b/src/winml/modelkit/commands/perf.py
index 841c2c98b..5b531f6f0 100644
--- a/src/winml/modelkit/commands/perf.py
+++ b/src/winml/modelkit/commands/perf.py
@@ -28,6 +28,7 @@
 
 from ..utils import cli as cli_utils
 from ..utils.constants import EPName, EPNameOrAlias
+from ..utils.logging import configure_logging
 from ._live_chart import LiveMonitorDisplay
 
 
@@ -1196,14 +1197,8 @@ def _run_onnx_benchmark(
     help="Enable operator-level profiling (requires onnxruntime-qnn)",
     hidden=True,  # Not ready, so hide from --help for now
 )
-@click.option(
-    "--verbose",
-    "-v",
-    is_flag=True,
-    default=False,
-    help="Enable verbose output",
-)
 @cli_utils.build_config_option()
+@cli_utils.verbosity_options()
 @click.pass_context
 def perf(
     ctx: click.Context,
@@ -1223,7 +1218,8 @@ def perf(
     module_class: str | None,
     monitor: bool,
     op_tracing: str | None,
-    verbose: bool,
+    verbose: int,
+    quiet: bool,
     config_file: Path | None,
 ) -> None:
     r"""Benchmark model inference performance.
@@ -1270,9 +1266,9 @@ def perf(
         if not cli_utils.is_cli_provided(ctx, "ep") and "execution_provider" in cc:
             ep = cc["execution_provider"]
 
-    # Setup logging
-    if verbose or (ctx.obj and ctx.obj.get("debug")):
-        logging.getLogger("winml.modelkit").setLevel(logging.DEBUG)
+    # Merge top-level -v/-q with subcommand-level flags so either position works.
+    verbose, quiet = cli_utils.resolve_verbosity(ctx, verbose, quiet)
+    configure_logging(verbosity=verbose, quiet=quiet)
 
     console = Console()
 
@@ -1305,7 +1301,7 @@ def perf(
             batch_size=batch_size,
             no_quantize=no_quantize,
             output=output,
-            verbose=verbose,
+            verbose=bool(verbose),
             console=console,
             monitor=monitor,
             device=device.lower(),
diff --git a/src/winml/modelkit/commands/quantize.py b/src/winml/modelkit/commands/quantize.py
index 9f9b5cfca..90fa794bb 100644
--- a/src/winml/modelkit/commands/quantize.py
+++ b/src/winml/modelkit/commands/quantize.py
@@ -105,14 +105,8 @@
     help="HuggingFace model name (e.g., 'microsoft/resnet-50'). When provided "
     "with --task, enables task-aware calibration datasets using the model's preprocessor.",
 )
-@click.option(
-    "--verbose",
-    "-v",
-    is_flag=True,
-    default=False,
-    help="Enable verbose output",
-)
 @cli_utils.build_config_option()
+@cli_utils.verbosity_options()
 @click.pass_context
 def quantize(
     ctx: click.Context,
@@ -127,7 +121,8 @@ def quantize(
     symmetric: bool,
     task: str | None,
     model_name: str | None,
-    verbose: bool,
+    verbose: int,
+    quiet: bool,
     config_file: Path | None,
 ) -> None:
     r"""Quantize ONNX model by inserting QDQ nodes.
@@ -153,11 +148,9 @@ def quantize(
         # Explicit types with entropy calibration
         winml quantize -m model.onnx --weight-type int8 --method entropy
     """
-    # Inherit debug mode from parent
-    if ctx.obj and ctx.obj.get("debug"):
-        verbose = True
-
-    configure_logging(verbose=verbose)
+    # Merge top-level -v/-q with subcommand-level flags so either position works.
+    verbose, quiet = cli_utils.resolve_verbosity(ctx, verbose, quiet)
+    configure_logging(verbosity=verbose, quiet=quiet)
 
     # Apply build config defaults (CLI explicit options take precedence).
     # Only read the JSON for what explicitly specified in config file.
diff --git a/src/winml/modelkit/commands/sys.py b/src/winml/modelkit/commands/sys.py
index 083a5ff89..312397008 100644
--- a/src/winml/modelkit/commands/sys.py
+++ b/src/winml/modelkit/commands/sys.py
@@ -36,6 +36,8 @@
 import click
 
 from ..sysinfo import OS, get_ep_device_map
+from ..utils import cli as cli_utils
+from ..utils.logging import configure_logging
 
 
 if TYPE_CHECKING:
@@ -664,13 +666,6 @@ def _output_ep_text(eps: list[dict[str, Any]]) -> None:
     default="text",
     help="Output format: text (human-readable), json, or compact",
 )
-@click.option(
-    "--verbose",
-    "-v",
-    is_flag=True,
-    default=False,
-    help="Include additional diagnostic information",
-)
 @click.option(
     "--list-device",
     is_flag=True,
@@ -683,11 +678,13 @@ def _output_ep_text(eps: list[dict[str, Any]]) -> None:
     default=False,
     help="List available execution providers",
 )
+@cli_utils.verbosity_options()
 @click.pass_context
 def sysinfo(
     ctx: click.Context,
     output_format: str,
-    verbose: bool,
+    verbose: int,
+    quiet: bool,
     list_device: bool,
     list_ep: bool,
 ) -> None:
@@ -720,133 +717,114 @@ def sysinfo(
         # List execution providers as JSON
         winml sys --list-ep --format json
     """
-    # Inherit debug mode from parent
-    if ctx.obj.get("debug"):
-        verbose = True
-
-    # Route winml.modelkit logs through Rich so they never interleave with CLI output.
-    # In normal mode suppress everything below WARNING; in debug mode show all levels.
-    # Restore logger state on exit so tests using caplog are not affected.
-    #
-    # For --format json, send log records to stderr so DEBUG/WARNING lines do
-    # not corrupt the JSON payload on stdout (verbose+json was unparseable).
-    from rich.console import Console as _RichConsole
-    from rich.logging import RichHandler
+    # Merge top-level -v/-q with subcommand-level flags so either position works.
+    verbose, quiet = cli_utils.resolve_verbosity(ctx, verbose, quiet)
+
+    # Standard verbosity contract: stderr-only logs in the shared format.
+    # `-v` here keeps its `sys`-specific second job of expanding the displayed
+    # diagnostics; see the table-DEBUG audit follow-up for fully decoupling
+    # them.
+    configure_logging(verbosity=verbose, quiet=quiet)
 
     use_json = output_format.lower() == "json"
 
-    log_level = logging.DEBUG if verbose else logging.WARNING
-    pkg_logger = logging.getLogger("winml.modelkit")
-    _saved_handlers = pkg_logger.handlers[:]
-    _saved_level = pkg_logger.level
-    _saved_propagate = pkg_logger.propagate
-    pkg_logger.handlers = [h for h in pkg_logger.handlers if not isinstance(h, RichHandler)]
-    log_console = _RichConsole(stderr=True) if use_json else _get_console()
-    rich_handler = RichHandler(console=log_console, show_path=False)
-    rich_handler.setLevel(log_level)
-    pkg_logger.setLevel(log_level)
-    pkg_logger.addHandler(rich_handler)
-    pkg_logger.propagate = False
+    # Logging is configured via the shared, idempotent configure_logging above;
+    # no per-command logger snapshot/restore is needed (every other command
+    # relies on the same contract for test isolation).
 
-    try:
-        # Handle --list-device and/or --list-ep (combinable)
-        if list_device or list_ep:
-            if use_json:
-                # Combine both into a single JSON object so output is always valid JSON
-                result: dict[str, Any] = {}
-                if list_device:
-                    try:
-                        result["devices"] = _gather_device_info()
-                    except Exception as e:
-                        logger.exception("Failed to detect devices")
-                        raise click.ClickException(f"Error detecting devices: {e}") from e
-                if list_ep:
-                    try:
-                        result["executionProviders"] = _gather_ep_info()
-                    except Exception as e:
-                        logger.exception("Failed to detect execution providers")
-                        msg = f"Error detecting execution providers: {e}"
-                        raise click.ClickException(msg) from e
-                click.echo(json.dumps(result, indent=2))
-            elif output_format.lower() == "compact":
-                if list_device:
-                    try:
-                        devices = _gather_device_info()
-                        parts = [f"{d['type']}: {d['name'].strip()}" for d in devices]
-                        click.echo(" | ".join(parts) if parts else "No devices found")
-                    except Exception as e:
-                        logger.exception("Failed to detect devices")
-                        raise click.ClickException(f"Error detecting devices: {e}") from e
-                if list_ep:
-                    try:
-                        eps = _gather_ep_info()
-                        parts = [f"{ep['name']}({ep['device']})" for ep in eps]
-                        click.echo("EPs: " + ", ".join(parts) if parts else "EPs: none")
-                    except Exception as e:
-                        logger.exception("Failed to detect execution providers")
-                        msg = f"Error detecting execution providers: {e}"
-                        raise click.ClickException(msg) from e
-            else:
-                if list_device:
-                    try:
-                        devices = _gather_device_info()
-                        _output_device_text(devices)
-                    except Exception as e:
-                        _get_console().print(f"[bold red]Error detecting devices:[/bold red] {e}")
-                        logger.exception("Failed to detect devices")
-                        raise click.ClickException(f"Error detecting devices: {e}") from e
-                if list_ep:
-                    try:
-                        eps = _gather_ep_info()
-                        _output_ep_text(eps)
-                    except Exception as e:
-                        _get_console().print(
-                            f"[bold red]Error detecting execution providers:[/bold red] {e}"
-                        )
-                        logger.exception("Failed to detect execution providers")
-                        msg = f"Error detecting execution providers: {e}"
-                        raise click.ClickException(msg) from e
-            return
-
-        # Default: full sysinfo including devices and EPs
-        try:
-            info = _gather_system_info(verbose=verbose)
-
-            if use_json:
-                # Add devices and EPs to JSON output
+    # Handle --list-device and/or --list-ep (combinable)
+    if list_device or list_ep:
+        if use_json:
+            # Combine both into a single JSON object so output is always valid JSON
+            result: dict[str, Any] = {}
+            if list_device:
                 try:
-                    info["devices"] = _gather_device_info()
-                except Exception:
-                    info["devices"] = []
+                    result["devices"] = _gather_device_info()
+                except Exception as e:
+                    logger.exception("Failed to detect devices")
+                    raise click.ClickException(f"Error detecting devices: {e}") from e
+            if list_ep:
                 try:
-                    info["executionProviders"] = _gather_ep_info()
-                except Exception:
-                    info["executionProviders"] = []
-                _output_json(info)
-            elif output_format.lower() == "compact":
-                _output_compact(info)
-            else:
-                _output_text(info, verbose=verbose)
-                # Append devices and EPs to text output
-                _get_console().print()
+                    result["executionProviders"] = _gather_ep_info()
+                except Exception as e:
+                    logger.exception("Failed to detect execution providers")
+                    msg = f"Error detecting execution providers: {e}"
+                    raise click.ClickException(msg) from e
+            click.echo(json.dumps(result, indent=2))
+        elif output_format.lower() == "compact":
+            if list_device:
+                try:
+                    devices = _gather_device_info()
+                    parts = [f"{d['type']}: {d['name'].strip()}" for d in devices]
+                    click.echo(" | ".join(parts) if parts else "No devices found")
+                except Exception as e:
+                    logger.exception("Failed to detect devices")
+                    raise click.ClickException(f"Error detecting devices: {e}") from e
+            if list_ep:
+                try:
+                    eps = _gather_ep_info()
+                    parts = [f"{ep['name']}({ep['device']})" for ep in eps]
+                    click.echo("EPs: " + ", ".join(parts) if parts else "EPs: none")
+                except Exception as e:
+                    logger.exception("Failed to detect execution providers")
+                    msg = f"Error detecting execution providers: {e}"
+                    raise click.ClickException(msg) from e
+        else:
+            if list_device:
                 try:
                     devices = _gather_device_info()
                     _output_device_text(devices)
-                except Exception:
-                    logger.debug("Device detection failed in default output")
-                _get_console().print()
+                except Exception as e:
+                    _get_console().print(f"[bold red]Error detecting devices:[/bold red] {e}")
+                    logger.exception("Failed to detect devices")
+                    raise click.ClickException(f"Error detecting devices: {e}") from e
+            if list_ep:
                 try:
                     eps = _gather_ep_info()
                     _output_ep_text(eps)
-                except Exception:
-                    logger.debug("EP detection failed in default output")
-
-        except Exception as e:
-            _get_console().print(f"[bold red]Error gathering system information:[/bold red] {e}")
-            logger.exception("Failed to gather system information")
-            raise click.ClickException(f"Error gathering system information: {e}") from e
-
-    finally:
-        pkg_logger.handlers = _saved_handlers
-        pkg_logger.setLevel(_saved_level)
-        pkg_logger.propagate = _saved_propagate
+                except Exception as e:
+                    _get_console().print(
+                        f"[bold red]Error detecting execution providers:[/bold red] {e}"
+                    )
+                    logger.exception("Failed to detect execution providers")
+                    msg = f"Error detecting execution providers: {e}"
+                    raise click.ClickException(msg) from e
+        return
+
+    # Default: full sysinfo including devices and EPs
+    try:
+        info = _gather_system_info(verbose=bool(verbose))
+
+        if use_json:
+            # Add devices and EPs to JSON output
+            try:
+                info["devices"] = _gather_device_info()
+            except Exception:
+                info["devices"] = []
+            try:
+                info["executionProviders"] = _gather_ep_info()
+            except Exception:
+                info["executionProviders"] = []
+            _output_json(info)
+        elif output_format.lower() == "compact":
+            _output_compact(info)
+        else:
+            _output_text(info, verbose=bool(verbose))
+            # Append devices and EPs to text output
+            _get_console().print()
+            try:
+                devices = _gather_device_info()
+                _output_device_text(devices)
+            except Exception:
+                logger.debug("Device detection failed in default output")
+            _get_console().print()
+            try:
+                eps = _gather_ep_info()
+                _output_ep_text(eps)
+            except Exception:
+                logger.debug("EP detection failed in default output")
+
+    except Exception as e:
+        _get_console().print(f"[bold red]Error gathering system information:[/bold red] {e}")
+        logger.exception("Failed to gather system information")
+        raise click.ClickException(f"Error gathering system information: {e}") from e
diff --git a/src/winml/modelkit/serve/app.py b/src/winml/modelkit/serve/app.py
index ea05f52eb..d19ee96bd 100644
--- a/src/winml/modelkit/serve/app.py
+++ b/src/winml/modelkit/serve/app.py
@@ -92,9 +92,51 @@ def since(self, after_seq: int) -> list[dict]:
 
 _log_handler = _RingHandler()
 _log_handler.setFormatter(logging.Formatter("%(message)s"))
-# Attach to modelkit root logger so all sub-loggers feed into the ring
-logging.getLogger("winml.modelkit").addHandler(_log_handler)
-logging.getLogger("winml.modelkit").setLevel(logging.INFO)
+# Bound the records the ring receives independently of the package logger's
+# level. This lets us attach the handler without raising the package logger
+# level at import time — which would otherwise mute DEBUG capture in unrelated
+# tests that get collected alongside the serve test module.
+_log_handler.setLevel(logging.INFO)
+
+
+def _attach_log_handler() -> None:
+    """Idempotently attach the ring handler to the modelkit logger tree.
+
+    Does NOT touch the package logger's level — that is a per-process side
+    effect we only want during an actual ``winml serve`` run (handled by the
+    lifespan startup hook), not at module-import time and not for tests that
+    only need ``_register_routes`` wired up.
+    """
+    pkg_logger = logging.getLogger("winml.modelkit")
+    if _log_handler not in pkg_logger.handlers:
+        pkg_logger.addHandler(_log_handler)
+
+
+def _ensure_log_capture_level() -> int | None:
+    """Raise the package logger level to INFO if needed; return prior level.
+
+    Returns ``None`` when no change was made (so the caller knows there is
+    nothing to restore). Pair with :func:`_restore_log_capture_level`.
+    """
+    pkg_logger = logging.getLogger("winml.modelkit")
+    if pkg_logger.level == logging.NOTSET or pkg_logger.level > logging.INFO:
+        prior = pkg_logger.level
+        pkg_logger.setLevel(logging.INFO)
+        return prior
+    return None
+
+
+def _restore_log_capture_level(prior: int | None) -> None:
+    if prior is None:
+        return
+    pkg_logger = logging.getLogger("winml.modelkit")
+    # Only restore if the value still matches what we set. If anyone else
+    # (middleware, debug hook, a nested ``configure_logging`` from a CLI
+    # invocation routed through /v1/cli) has since changed it, defer to them
+    # rather than silently rolling their change back.
+    if pkg_logger.level == logging.INFO:
+        pkg_logger.setLevel(prior)
+
 
 # ---------------------------------------------------------------------------
 # App factory
@@ -126,6 +168,11 @@ def create_app(
     @asynccontextmanager
     async def lifespan(app: FastAPI):
         app.state.start_time = time.time()
+        # Raise the modelkit logger to INFO so the ring handler receives
+        # operational records during `winml serve`. Tests that build the app
+        # via ``_register_routes`` + their own mock lifespan never reach this
+        # branch, so the global level stays unchanged for them.
+        app.state._log_capture_prior = _ensure_log_capture_level()
         if mode == "multi":
             mgr = ModelSlotManager(
                 memory_budget_mb=memory_budget_mb,
@@ -147,6 +194,7 @@ async def lifespan(app: FastAPI):
         logger.info("Model ready")
         yield
         app.state.manager.shutdown()
+        _restore_log_capture_level(getattr(app.state, "_log_capture_prior", None))
 
     app = FastAPI(
         title="WinML CLI Inference Server",
@@ -183,6 +231,8 @@ async def demo_ui() -> FileResponse:
 
 
 def _register_routes(app: FastAPI, *, mode: str) -> None:
+    _attach_log_handler()
+
     # ------------------------------------------------------------------
     # Local helpers (closure over app)
     # ------------------------------------------------------------------
diff --git a/src/winml/modelkit/utils/cli.py b/src/winml/modelkit/utils/cli.py
index 99c8661d9..b6e48e11b 100644
--- a/src/winml/modelkit/utils/cli.py
+++ b/src/winml/modelkit/utils/cli.py
@@ -193,7 +193,7 @@ def device_option(
     )
 
 
-def verbosity_options(f: F) -> F:
+def verbosity_options():
     """Add verbose and quiet logging options to a Click command.
 
     Adds --verbose/-v (stackable: -v, -vv, -vvv) and --quiet/-q flags.
@@ -202,26 +202,59 @@ def verbosity_options(f: F) -> F:
 
     See :mod:`winml.modelkit.utils.logging` for the verbosity convention.
 
+    Returns:
+        Decorator function adding verbose and quiet options.
+    """
+
+    def decorator(f):
+        f = click.option(
+            "--quiet",
+            "-q",
+            is_flag=True,
+            default=False,
+            help="Quiet mode - errors only to stderr",
+        )(f)
+        return click.option(
+            "--verbose",
+            "-v",
+            count=True,
+            help="Increase verbosity (-v=INFO, -vv=DEBUG)",
+        )(f)
+
+    return decorator
+
+
+def resolve_verbosity(ctx: click.Context, verbose: int, quiet: bool) -> tuple[int, bool]:
+    """Merge subcommand ``--verbose``/``--quiet`` with the parent group's values.
+
+    The top-level ``winml`` group also accepts ``-v``/``-q`` and stores the
+    resolved values in ``ctx.obj``. Both positions are equally valid:
+    ``winml -v export …`` and ``winml export -v …`` should behave the same.
+    This helper takes the max verbosity and OR of quiet so users can supply
+    the flag at either level (or both).
+
+    Precedence: ``-q``/``--quiet`` always wins over verbosity, including the
+    ``--debug`` alias — ``winml --debug export -q …`` runs at ERROR. ``-q`` is
+    an explicit "shut up" signal and trumps any verbosity raise, so the user
+    is never surprised by debug spam after they asked for quiet.
+
     Args:
-        f: Click command function to decorate
+        ctx: Click context for the current subcommand.
+        verbose: Subcommand-level ``-v`` count.
+        quiet: Subcommand-level ``--quiet`` flag.
 
     Returns:
-        Decorated function with verbose and quiet options
+        Tuple ``(verbose, quiet)`` ready to pass to ``configure_logging``.
     """
-    f = click.option(
-        "--quiet",
-        "-q",
-        is_flag=True,
-        default=False,
-        help="Quiet mode - errors only to stderr",
-    )(f)
-    f = click.option(
-        "--verbose",
-        "-v",
-        count=True,
-        help="Increase verbosity (-v=INFO, -vv=DEBUG)",
-    )(f)
-    return f  # noqa: RET504
+    if ctx.obj:
+        verbose = max(verbose, int(ctx.obj.get("verbosity", 0)))
+        # ``debug`` is the historical backward-compat alias for ``-vv``; keep
+        # honoring it so tests that bypass ``main()`` and stuff ``debug=True``
+        # straight into ctx.obj still raise the verbosity floor.
+        if ctx.obj.get("debug"):
+            verbose = max(verbose, 2)
+        quiet = quiet or bool(ctx.obj.get("quiet", False))
+    return verbose, quiet
 
 
 def build_config_option(help: str | None = None) -> Callable[[F], F]:
diff --git a/src/winml/modelkit/utils/logging.py b/src/winml/modelkit/utils/logging.py
index 656dd8d88..a7d6b3f00 100644
--- a/src/winml/modelkit/utils/logging.py
+++ b/src/winml/modelkit/utils/logging.py
@@ -19,13 +19,22 @@
     Quiet:   level = ERROR (40)
 
 All log output goes to stderr so stdout stays clean for structured data
-(JSON, compact output, piped commands).
+(JSON, compact output, piped commands). Format:
+
+    [%(asctime)s %(levelname)-7s %(name)s] %(message)s
+
+Sample line: ``[14:32:11 INFO    winml.modelkit.export] Loaded config.json``
 """
 
 import logging
 import sys
 
 
+_HANDLER_MARKER = "_winml_cli_handler"
+_LOG_FORMAT = "[%(asctime)s %(levelname)-7s %(name)s] %(message)s"
+_DATE_FORMAT = "%H:%M:%S"
+
+
 def configure_logging(
     verbosity: int = 0,
     quiet: bool = False,
@@ -35,6 +44,11 @@ def configure_logging(
 ) -> None:
     """Configure root logger based on verbosity level.
 
+    Idempotent: subcommands re-call this after merging top-level + subcommand
+    ``-v``/``-q``. The first call installs the WinML stderr handler; later
+    calls only adjust the level. Existing non-WinML handlers (notably pytest's
+    ``caplog`` propagate-handler) are preserved.
+
     Args:
         verbosity: Number of ``-v`` flags (0=WARNING, 1=INFO, 2+=DEBUG).
         quiet: If True, override to ERROR level regardless of verbosity.
@@ -49,13 +63,24 @@ def configure_logging(
     # Clamp between DEBUG (10) and WARNING (30); quiet overrides to ERROR
     log_level = logging.ERROR if quiet else max(logging.DEBUG, logging.WARNING - verbosity * 10)
 
-    logging.basicConfig(
-        level=log_level,
-        format="[%(asctime)s] %(levelname)s: %(message)s",
-        datefmt="%Y-%m-%dT%H:%M:%S",
-        stream=sys.stderr,
-        force=True,
-    )
+    root = logging.getLogger()
+    # Drop any prior WinML handler and install a fresh one bound to the
+    # *current* ``sys.stderr``. Click's ``CliRunner.invoke()`` swaps the
+    # process stderr for each test, so a cached handler from an earlier
+    # invocation would write to a stream the test no longer captures.
+    # We leave non-WinML handlers (notably pytest's caplog handler) alone.
+    for h in list(root.handlers):
+        if getattr(h, _HANDLER_MARKER, False):
+            root.removeHandler(h)
+    own_handler = logging.StreamHandler(sys.stderr)
+    own_handler.setFormatter(logging.Formatter(_LOG_FORMAT, datefmt=_DATE_FORMAT))
+    setattr(own_handler, _HANDLER_MARKER, True)
+    root.addHandler(own_handler)
+    # The root level is the sole gate: it already filters every record before
+    # it reaches any handler, so the handler is left at NOTSET (passes through)
+    # to avoid a redundant double-filter at the same threshold. This mirrors
+    # the prior ``logging.basicConfig`` behavior, which never set a handler level.
+    root.setLevel(log_level)
 
 
 def flush_ort_startup_logs() -> None:

From cd951a9a368640250341b63c4b85a6a87429e0d0 Mon Sep 17 00:00:00 2001
From: "Qiong Wu (qiowu)" <qiowu@microsoft.com>
Date: Wed, 3 Jun 2026 21:35:19 +0800
Subject: [PATCH 029/143] fix: align analyzer API EP list with CLI (#803)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary
- Replace hardcoded 4-EP list in `analyze_from_proto(ep=None)` with
dynamic lookup from `EP_SUPPORTED_DEVICES`, filtered by target device
- Remove `max_length=4` constraint on `AnalysisOutput.results` to
support more than 4 EPs per device
- Change uniqueness validator from IHV type to EP type (multiple EPs can
share the same IHV, e.g. CUDA and DML both map to MICROSOFT)

**Before:** `analyzer.analyze(ep=None)` always analyzed QNN, OpenVINO,
VitisAI, NvTensorRTRTX regardless of device — NvTensorRTRTX was analyzed
on NPU even though it only supports GPU.

**After:** EP list is derived from `EP_SUPPORTED_DEVICES` filtered by
the target device, matching the CLI `--ep all` behavior exactly.
---
 src/winml/modelkit/analyze/analyzer.py      | 29 ++++++++++-----------
 src/winml/modelkit/analyze/models/output.py | 14 +++++-----
 tests/unit/analyze/models/test_output.py    |  4 +--
 tests/unit/analyze/test_analyzer.py         | 18 ++++++-------
 4 files changed, 31 insertions(+), 34 deletions(-)

diff --git a/src/winml/modelkit/analyze/analyzer.py b/src/winml/modelkit/analyze/analyzer.py
index 1bc8b60c8..f10a5e34e 100644
--- a/src/winml/modelkit/analyze/analyzer.py
+++ b/src/winml/modelkit/analyze/analyzer.py
@@ -18,7 +18,7 @@
 from typing import TYPE_CHECKING
 
 from ..optim.config import WinMLOptimizationConfig
-from ..utils.constants import EPName, EPNameOrAlias, normalize_ep_name
+from ..utils.constants import EP_SUPPORTED_DEVICES, EPName, EPNameOrAlias, normalize_ep_name
 from .models.information import Information
 from .models.support_level import SupportLevel
 from .utils.timing_utils import make_timing_logger
@@ -684,20 +684,6 @@ def analyze_from_proto(
 
         logger.info("Analyzing model from ModelProto")
 
-        # Determine which EPs to analyze
-        eps_to_analyze: list[EPName] = []
-        if ep_normalized is None:
-            # Analyze all supported EPs
-            eps_to_analyze = [
-                "QNNExecutionProvider",
-                "OpenVINOExecutionProvider",
-                "VitisAIExecutionProvider",
-                "NvTensorRTRTXExecutionProvider",
-            ]
-            logger.info("No EP specified, analyzing all supported EPs: %s", eps_to_analyze)
-        else:
-            eps_to_analyze = [ep_normalized]
-
         # Resolve device — rule files are device-specific (CPU/GPU/NPU).
         if device is not None and device.lower() == "auto":
             from ..sysinfo import resolve_device
@@ -709,6 +695,19 @@ def analyze_from_proto(
             device_to_use = device if device is not None else "NPU"
             logger.info("Using device: %s", device_to_use)
 
+        # Determine which EPs to analyze
+        eps_to_analyze: list[EPName] = []
+        if ep_normalized is None:
+            # Analyze all EPs that support the target device
+            eps_to_analyze = [
+                ep_name
+                for ep_name, supported_devices in EP_SUPPORTED_DEVICES.items()
+                if device_to_use.lower() in supported_devices
+            ]
+            logger.info("No EP specified, analyzing all supported EPs: %s", eps_to_analyze)
+        else:
+            eps_to_analyze = [ep_normalized]
+
         # Step 1: Create ONNXModel and extract patterns (once)
         extraction_start = time.perf_counter()
         logger.info("Loading model and extracting patterns...")
diff --git a/src/winml/modelkit/analyze/models/output.py b/src/winml/modelkit/analyze/models/output.py
index a19a8e7b1..ace56ccfe 100644
--- a/src/winml/modelkit/analyze/models/output.py
+++ b/src/winml/modelkit/analyze/models/output.py
@@ -112,17 +112,15 @@ class AnalysisOutput(BaseModel):
         default_factory=datetime.now, description="Analysis timestamp"
     )
     metadata: ModelStats = Field(..., description="Model metadata and statistics")
-    results: list[EPSupport] = Field(
-        ..., max_length=4, description="Execution Provider support results (max 4)"
-    )
+    results: list[EPSupport] = Field(..., description="Execution Provider support results")
 
     @field_validator("results")
     @classmethod
-    def validate_ihv_types_unique(cls, v: list[EPSupport]) -> list[EPSupport]:
-        """Validate that IHV types are unique in the list."""
-        ihv_types = [item.ihv_type for item in v]
-        if len(ihv_types) != len(set(ihv_types)):
-            raise ValueError(f"Duplicate IHV types found: {ihv_types}")
+    def validate_ep_types_unique(cls, v: list[EPSupport]) -> list[EPSupport]:
+        """Validate that EP types are unique in the list."""
+        ep_types = [item.ep_type for item in v]
+        if len(ep_types) != len(set(ep_types)):
+            raise ValueError(f"Duplicate EP types found: {ep_types}")
         return v
 
     def model_dump_json(self, **kwargs: object) -> str:
diff --git a/tests/unit/analyze/models/test_output.py b/tests/unit/analyze/models/test_output.py
index d207ca6e0..1882ebbcb 100644
--- a/tests/unit/analyze/models/test_output.py
+++ b/tests/unit/analyze/models/test_output.py
@@ -228,8 +228,8 @@ def test_unique_ihv_types_validation(self):
         )
         assert len(output.results) == 2
 
-        # Invalid: duplicate IHV types
-        with pytest.raises(ValidationError, match="Duplicate IHV types found"):
+        # Invalid: duplicate EP types
+        with pytest.raises(ValidationError, match="Duplicate EP types found"):
             AnalysisOutput(
                 metadata=ModelStats(
                     model_path="/test.onnx",
diff --git a/tests/unit/analyze/test_analyzer.py b/tests/unit/analyze/test_analyzer.py
index 410de461d..e12fa54d3 100644
--- a/tests/unit/analyze/test_analyzer.py
+++ b/tests/unit/analyze/test_analyzer.py
@@ -899,17 +899,17 @@ def test_analyze_from_proto_multi_ep(
 
         # Assertions
         assert isinstance(result, AnalysisResult)
-        # Should have results for all 4 EPs: QNN, OpenVINO, VitisAI, NvTensorRTRTX
-        assert len(result.output.results) == 4
+        # Should have results for all NPU-capable EPs: QNN, OpenVINO, VitisAI
+        # (NvTensorRTRTX only supports GPU, so it's excluded for device=NPU)
+        assert len(result.output.results) == 3
 
-        ihv_types = {r.ihv_type for r in result.output.results}
-        assert IHVType.QC in ihv_types
-        assert IHVType.INTEL in ihv_types
-        assert IHVType.AMD in ihv_types
-        assert IHVType.NVIDIA in ihv_types
+        ep_types = {r.ep_type for r in result.output.results}
+        assert "QNNExecutionProvider" in ep_types
+        assert "OpenVINOExecutionProvider" in ep_types
+        assert "VitisAIExecutionProvider" in ep_types
 
-        # Verify RuntimeChecker was called 4 times (once per EP)
-        assert mock_runtime_checker_cls.call_count == 4
+        # Verify RuntimeChecker was called 3 times (once per NPU-capable EP)
+        assert mock_runtime_checker_cls.call_count == 3
 
     @patch("winml.modelkit.analyze.utils.ep_utils.has_rule_data_for_ep", return_value=True)
     @patch("winml.modelkit.analyze.core.onnx_loader.ONNXLoader")

From 17fdfc384e3cdf0dfc3036ee047d39d06f1377aa Mon Sep 17 00:00:00 2001
From: Hyunsik Jeon <hyunsikjeon@microsoft.com>
Date: Wed, 3 Jun 2026 11:37:29 -0700
Subject: [PATCH 030/143] feat(eval): add depth-estimation evaluator (#326)
 (#437)

Resolves #326.

Adds `WinMLDepthEstimationEvaluator` and `DepthMetric` (Absolute
Relative error, RMSE, delta-1) following the NYU/KITTI evaluation
protocol. HuggingFace `evaluate` doesn't ship a depth-estimation
evaluator, so the metric loop is implemented manually.

### Background

Depth-estimation models fall into a few groups, and the same input image
gives wildly different prediction scales depending on which group the
model belongs to.

- Metric-depth models (ZoeDepth, DepthPro) predict depth in meters
directly.
- Relative-depth models (Depth-Anything, Marigold) predict depth up to
an unknown scale and shift.
- Disparity models (DPT, MiDaS) predict `1 / depth` (inverse depth) up
to scale and shift.

Comparing predictions against the NYU ground truth therefore requires
(1) optionally inverting disparity into depth and (2) aligning the
prediction to the ground truth before computing metrics. This is what
AbsRel/RMSE/delta-1 benchmarks in the literature do, and what this PR
adds as user-selectable options.

### Options

Two `columns_mapping` keys, both overridable via `--column`, and both
visible in `winml eval --schema --task depth-estimation`.

`align` controls how each prediction is rescaled against the ground
truth depth map before metrics are computed:

- `affine` (default): per-image least-squares fit of `pred_aligned = s *
pred + t`, where `s` is a scalar scale and `t` is a scalar shift, solved
on the valid pixels (those passing the depth range mask). Suitable for
relative-depth and disparity models.
- `median`: scale-only alignment, `pred_aligned = (median(gt) /
median(pred)) * pred`. No shift. Cheaper but less accurate when the
model has a non-zero offset.
- `none`: use the prediction as-is. Suitable for metric-depth models
that already output meters.

`depth_kind` indicates what the model outputs:

- `depth` (default): prediction is interpreted as depth.
- `disparity`: prediction is interpreted as inverse depth, so it is
inverted (`pred := 1 / pred`) before alignment. Needed for
DPT/MiDaS-style outputs.

The depth range used for the valid-pixel mask is also overridable:
`min_depth` (default 1e-3, NYU convention) and `max_depth` (default 10.0
meters, NYU convention). Only pixels with `min_depth <= gt <= max_depth`
contribute to the metrics.

### Default dataset and testset

Default dataset is `sayakpaul/nyu_depth_v2`. All 11 depth-estimation
entries from `models_all.json` are added to `models_with_acc.json`, with
per-model overrides only where the defaults don't match the model
family:

- `Intel/zoedepth-nyu-kitti` and `apple/DepthPro-hf` set `align=none`
(metric-depth).
- `Intel/dpt-hybrid-midas` and `Intel/dpt-large` set
`depth_kind=disparity`.
- The remaining 7 entries (Depth-Anything family, Marigold, etc.) rely
on the defaults (`align=affine`, `depth_kind=depth`).

### Tests

Unit tests cover the new evaluator and the metric, including the
affine-fit path and the disparity inversion path. The slow/network
integration test runs the full pipeline end-to-end on Depth-Anything V2,
ZoeDepth, and DPT.
---
 scripts/e2e_eval/cache/baseline_cache.json    |  10 +
 scripts/e2e_eval/run_eval.py                  |   4 +
 scripts/e2e_eval/run_pytorch_baseline.py      |   8 +
 .../e2e_eval/testsets/models_with_acc.json    | 191 ++++++++++
 src/winml/modelkit/commands/eval.py           |   9 +
 src/winml/modelkit/datasets/__init__.py       |   4 +
 .../modelkit/datasets/depth_estimation.py     | 169 +++++++++
 src/winml/modelkit/datasets/image.py          |  13 +-
 src/winml/modelkit/eval/__init__.py           |   8 +
 src/winml/modelkit/eval/base_evaluator.py     |   1 +
 src/winml/modelkit/eval/config.py             |   7 +
 .../eval/depth_estimation_evaluator.py        | 148 ++++++++
 src/winml/modelkit/eval/evaluate.py           |   9 +
 src/winml/modelkit/eval/metrics/__init__.py   |   3 +
 src/winml/modelkit/eval/metrics/depth.py      | 180 +++++++++
 .../modelkit/models/hf/depth_anything.py      |  42 ++-
 src/winml/modelkit/models/winml/__init__.py   |   5 +
 .../modelkit/models/winml/depth_estimation.py |  51 +++
 src/winml/modelkit/utils/eval_utils.py        |  40 ++
 .../datasets/test_depth_estimation.py         | 352 ++++++++++++++++++
 .../integration/eval/test_depth_estimation.py |  60 +++
 tests/unit/datasets/test_image_streaming.py   |   4 +-
 .../eval/test_depth_estimation_evaluator.py   | 335 +++++++++++++++++
 tests/unit/eval/test_depth_metric.py          | 346 +++++++++++++++++
 tests/unit/eval/test_eval.py                  |  91 +++++
 .../models/depth_anything/test_onnx_config.py | 151 ++++++++
 .../models/winml/test_depth_estimation.py     |  89 +++++
 27 files changed, 2324 insertions(+), 6 deletions(-)
 create mode 100644 src/winml/modelkit/datasets/depth_estimation.py
 create mode 100644 src/winml/modelkit/eval/depth_estimation_evaluator.py
 create mode 100644 src/winml/modelkit/eval/metrics/depth.py
 create mode 100644 src/winml/modelkit/models/winml/depth_estimation.py
 create mode 100644 tests/integration/datasets/test_depth_estimation.py
 create mode 100644 tests/integration/eval/test_depth_estimation.py
 create mode 100644 tests/unit/eval/test_depth_estimation_evaluator.py
 create mode 100644 tests/unit/eval/test_depth_metric.py
 create mode 100644 tests/unit/models/depth_anything/test_onnx_config.py
 create mode 100644 tests/unit/models/winml/test_depth_estimation.py

diff --git a/scripts/e2e_eval/cache/baseline_cache.json b/scripts/e2e_eval/cache/baseline_cache.json
index 9e3f5f03c..de6796dd5 100644
--- a/scripts/e2e_eval/cache/baseline_cache.json
+++ b/scripts/e2e_eval/cache/baseline_cache.json
@@ -938,5 +938,15 @@
     },
     "elapsed": 124.3,
     "command": "python.exe run_pytorch_baseline.py --model deberta-xlarge-mnli --task text-classification --device cpu --num-samples 100 --dataset glue --split validation_matched --dataset-config mnli --columns-mapping {\"input_column\": \"premise\", \"second_input_column\": \"hypothesis\"} --winml-metric-key accuracy"
+  },
+  "depth-anything/Depth-Anything-V2-Small-hf|depth-estimation|sayakpaul/nyu_depth_v2||validation|1000": {
+    "status": "PASS",
+    "metric": {
+      "metric": "abs_rel",
+      "value": 0.154906,
+      "num_samples": 1000
+    },
+    "elapsed": 794.0,
+    "command": "python.exe run_pytorch_baseline.py --model Depth-Anything-V2-Small-hf --task depth-estimation --device cpu --num-samples 1000 --dataset nyu_depth_v2 --split validation --dataset-revision parquet --columns-mapping {\"input_column\": \"image\", \"depth_column\": \"depth_map\"} --winml-metric-key abs_rel"
   }
 }
diff --git a/scripts/e2e_eval/run_eval.py b/scripts/e2e_eval/run_eval.py
index ae50351d1..d47b6d3b1 100644
--- a/scripts/e2e_eval/run_eval.py
+++ b/scripts/e2e_eval/run_eval.py
@@ -795,6 +795,8 @@ def _run_winml_eval(
     args += ["--samples", str(num_samples)]
     if ds_config.get("dataset_config"):
         args += ["--dataset-name", ds_config["dataset_config"]]
+    if ds_config.get("revision"):
+        args += ["--dataset-revision", ds_config["revision"]]
     for k, v in ds_config.get("columns_mapping", {}).items():
         args += ["--column", f"{k}={v}"]
     if ds_config.get("label_mapping_file"):
@@ -917,6 +919,8 @@ def _run_pytorch_baseline(entry: ModelEntry, device: str, timeout: int) -> dict:
         args += ["--split", ds_config["split"]]
     if ds_config.get("dataset_config"):
         args += ["--dataset-config", ds_config["dataset_config"]]
+    if ds_config.get("revision"):
+        args += ["--dataset-revision", ds_config["revision"]]
     if ds_config.get("columns_mapping"):
         args += ["--columns-mapping", json.dumps(ds_config["columns_mapping"])]
     if ds_config.get("label_mapping_file"):
diff --git a/scripts/e2e_eval/run_pytorch_baseline.py b/scripts/e2e_eval/run_pytorch_baseline.py
index cf5fd4d3a..579b99dc0 100644
--- a/scripts/e2e_eval/run_pytorch_baseline.py
+++ b/scripts/e2e_eval/run_pytorch_baseline.py
@@ -193,6 +193,7 @@ def _build_dataset_config(ds_dict: dict, num_samples: int):
         samples=num_samples,
         columns_mapping=columns_mapping,
         label_mapping=label_mapping,
+        revision=ds_dict.get("revision"),
     )
 
 
@@ -226,6 +227,12 @@ def parse_args() -> argparse.Namespace:
     parser.add_argument(
         "--dataset-config", default=None, help="HuggingFace dataset config/subset name"
     )
+    parser.add_argument(
+        "--dataset-revision",
+        default=None,
+        help="Git revision (branch, tag, or commit) for the dataset (e.g. "
+        "'refs/convert/parquet' to load the parquet mirror of a script-based dataset).",
+    )
     parser.add_argument(
         "--columns-mapping",
         default=None,
@@ -297,6 +304,7 @@ def main() -> None:
             "dataset": args.dataset,
             "split": args.split or "validation",
             **({"dataset_config": args.dataset_config} if args.dataset_config else {}),
+            **({"revision": args.dataset_revision} if args.dataset_revision else {}),
             **({"columns_mapping": columns_mapping} if columns_mapping else {}),
             **({"label_mapping_file": args.label_mapping_file} if args.label_mapping_file else {}),
             "winml_metric_key": args.winml_metric_key,
diff --git a/scripts/e2e_eval/testsets/models_with_acc.json b/scripts/e2e_eval/testsets/models_with_acc.json
index dc636e4a3..dadfa6edd 100644
--- a/scripts/e2e_eval/testsets/models_with_acc.json
+++ b/scripts/e2e_eval/testsets/models_with_acc.json
@@ -1638,5 +1638,196 @@
         "label_column": "text"
       }
     }
+  },
+  {
+    "hf_id": "Intel/dpt-hybrid-midas",
+    "task": "depth-estimation",
+    "model_type": "dpt",
+    "group": "Top200",
+    "priority": "P2",
+    "dataset_config": {
+      "path": "sayakpaul/nyu_depth_v2",
+      "revision": "refs/convert/parquet",
+      "split": "validation",
+      "metric": "abs_rel",
+      "columns_mapping": {
+        "input_column": "image",
+        "depth_column": "depth_map",
+        "depth_kind": "disparity"
+      }
+    }
+  },
+  {
+    "hf_id": "Intel/dpt-large",
+    "task": "depth-estimation",
+    "model_type": "dpt",
+    "group": "Top200",
+    "priority": "P2",
+    "dataset_config": {
+      "path": "sayakpaul/nyu_depth_v2",
+      "revision": "refs/convert/parquet",
+      "split": "validation",
+      "metric": "abs_rel",
+      "columns_mapping": {
+        "input_column": "image",
+        "depth_column": "depth_map",
+        "depth_kind": "disparity"
+      }
+    }
+  },
+  {
+    "hf_id": "Intel/zoedepth-nyu-kitti",
+    "task": "depth-estimation",
+    "model_type": "zoedepth",
+    "group": "Top200",
+    "priority": "P2",
+    "dataset_config": {
+      "path": "sayakpaul/nyu_depth_v2",
+      "revision": "refs/convert/parquet",
+      "split": "validation",
+      "metric": "abs_rel",
+      "columns_mapping": {
+        "input_column": "image",
+        "depth_column": "depth_map",
+        "align": "none"
+      }
+    }
+  },
+  {
+    "hf_id": "LiheYoung/depth-anything-base-hf",
+    "task": "depth-estimation",
+    "model_type": "depth_anything",
+    "group": "Top200",
+    "priority": "P2",
+    "dataset_config": {
+      "path": "sayakpaul/nyu_depth_v2",
+      "revision": "refs/convert/parquet",
+      "split": "validation",
+      "metric": "abs_rel",
+      "columns_mapping": {
+        "input_column": "image",
+        "depth_column": "depth_map"
+      }
+    }
+  },
+  {
+    "hf_id": "LiheYoung/depth-anything-large-hf",
+    "task": "depth-estimation",
+    "model_type": "depth_anything",
+    "group": "Top200",
+    "priority": "P2",
+    "dataset_config": {
+      "path": "sayakpaul/nyu_depth_v2",
+      "revision": "refs/convert/parquet",
+      "split": "validation",
+      "metric": "abs_rel",
+      "columns_mapping": {
+        "input_column": "image",
+        "depth_column": "depth_map"
+      }
+    }
+  },
+  {
+    "hf_id": "LiheYoung/depth-anything-small-hf",
+    "task": "depth-estimation",
+    "model_type": "depth_anything",
+    "group": "Top200",
+    "priority": "P2",
+    "dataset_config": {
+      "path": "sayakpaul/nyu_depth_v2",
+      "revision": "refs/convert/parquet",
+      "split": "validation",
+      "metric": "abs_rel",
+      "columns_mapping": {
+        "input_column": "image",
+        "depth_column": "depth_map"
+      }
+    }
+  },
+  {
+    "hf_id": "apple/DepthPro-hf",
+    "task": "depth-estimation",
+    "model_type": "depth_pro",
+    "group": "Top200",
+    "priority": "P2",
+    "dataset_config": {
+      "path": "sayakpaul/nyu_depth_v2",
+      "revision": "refs/convert/parquet",
+      "split": "validation",
+      "metric": "abs_rel",
+      "columns_mapping": {
+        "input_column": "image",
+        "depth_column": "depth_map",
+        "align": "none"
+      }
+    }
+  },
+  {
+    "hf_id": "depth-anything/Depth-Anything-V2-Base-hf",
+    "task": "depth-estimation",
+    "model_type": "depth_anything",
+    "group": "Top200",
+    "priority": "P2",
+    "dataset_config": {
+      "path": "sayakpaul/nyu_depth_v2",
+      "revision": "refs/convert/parquet",
+      "split": "validation",
+      "metric": "abs_rel",
+      "columns_mapping": {
+        "input_column": "image",
+        "depth_column": "depth_map"
+      }
+    }
+  },
+  {
+    "hf_id": "depth-anything/Depth-Anything-V2-Large-hf",
+    "task": "depth-estimation",
+    "model_type": "depth_anything",
+    "group": "Top200",
+    "priority": "P2",
+    "dataset_config": {
+      "path": "sayakpaul/nyu_depth_v2",
+      "revision": "refs/convert/parquet",
+      "split": "validation",
+      "metric": "abs_rel",
+      "columns_mapping": {
+        "input_column": "image",
+        "depth_column": "depth_map"
+      }
+    }
+  },
+  {
+    "hf_id": "depth-anything/Depth-Anything-V2-Small-hf",
+    "task": "depth-estimation",
+    "model_type": "depth_anything",
+    "group": "Top200",
+    "priority": "P2",
+    "dataset_config": {
+      "path": "sayakpaul/nyu_depth_v2",
+      "revision": "refs/convert/parquet",
+      "split": "validation",
+      "metric": "abs_rel",
+      "columns_mapping": {
+        "input_column": "image",
+        "depth_column": "depth_map"
+      }
+    }
+  },
+  {
+    "hf_id": "xingyang1/Distill-Any-Depth-Large-hf",
+    "task": "depth-estimation",
+    "model_type": "depth_anything",
+    "group": "Top200",
+    "priority": "P3",
+    "dataset_config": {
+      "path": "sayakpaul/nyu_depth_v2",
+      "revision": "refs/convert/parquet",
+      "split": "validation",
+      "metric": "abs_rel",
+      "columns_mapping": {
+        "input_column": "image",
+        "depth_column": "depth_map"
+      }
+    }
   }
 ]
diff --git a/src/winml/modelkit/commands/eval.py b/src/winml/modelkit/commands/eval.py
index c719a8ecc..fe5eb3f6b 100644
--- a/src/winml/modelkit/commands/eval.py
+++ b/src/winml/modelkit/commands/eval.py
@@ -60,6 +60,14 @@
     default=None,
     help="Dataset config name for multi-config datasets (e.g. 'mrpc').",
 )
+@click.option(
+    "--dataset-revision",
+    "revision",
+    type=str,
+    default=None,
+    help="Git revision (branch, tag, or commit) to load. Useful for script-based "
+    "datasets that have a parquet mirror at 'refs/convert/parquet'.",
+)
 @click.option(
     "--task",
     type=str,
@@ -150,6 +158,7 @@ def eval(
     model_id: str | None,
     dataset_path: str,
     dataset_name: str | None,
+    revision: str | None,
     task: str | None,
     device: str,
     precision: str,
diff --git a/src/winml/modelkit/datasets/__init__.py b/src/winml/modelkit/datasets/__init__.py
index e7caa4ab6..aac765008 100644
--- a/src/winml/modelkit/datasets/__init__.py
+++ b/src/winml/modelkit/datasets/__init__.py
@@ -17,6 +17,7 @@
 
 from .base import BaseTaskDataset
 from .data_utils import format_data
+from .depth_estimation import DEFAULT_DEPTH_ESTIMATION_SIZE, DepthEstimationDataset
 from .image import ImageDataset
 from .image_segmentation import ImageSegmentationDataset
 from .object_detection import DEFAULT_OBJECT_DETECTION_SIZE, ObjectDetectionDataset
@@ -46,6 +47,7 @@
     "fill-mask": TextDataset,
     "zero-shot-classification": TextDataset,
     "image-segmentation": ImageSegmentationDataset,
+    "depth-estimation": DepthEstimationDataset,
     "random": RandomDataset,
     # Add more task types as needed
 }
@@ -286,8 +288,10 @@ def __len__(self) -> int:
     "DatasetCalibrationReader",
     # Config
     "DEFAULT_OBJECT_DETECTION_SIZE",
+    "DEFAULT_DEPTH_ESTIMATION_SIZE",
     # Dataset classes
     "BaseTaskDataset",
+    "DepthEstimationDataset",
     "ImageDataset",
     "ImageSegmentationDataset",
     "ObjectDetectionDataset",
diff --git a/src/winml/modelkit/datasets/depth_estimation.py b/src/winml/modelkit/datasets/depth_estimation.py
new file mode 100644
index 000000000..6aa3f7d62
--- /dev/null
+++ b/src/winml/modelkit/datasets/depth_estimation.py
@@ -0,0 +1,169 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+"""Depth-estimation dataset support for calibration.
+
+This dataset keeps image preprocessing aligned with the exported ONNX model.
+When a model expects a fixed ``pixel_values`` shape, the processor is forced
+to emit that exact size so calibration samples match the model input.
+"""
+
+from __future__ import annotations
+
+import logging
+from typing import Any
+
+from datasets.features import Image
+from transformers import AutoImageProcessor
+
+from .image import ImageDataset
+
+
+logger = logging.getLogger(__name__)
+
+# Default fallback image size for depth-estimation models.
+DEFAULT_DEPTH_ESTIMATION_SIZE = 518
+
+# Default calibration dataset for depth estimation.
+# Using the same dataset family in calibration and evaluation keeps behavior
+# consistent when users rely on the built-in defaults.
+DEFAULT_DEPTH_ESTIMATION_DATASET = "sayakpaul/nyu_depth_v2"
+DEFAULT_DEPTH_ESTIMATION_SPLIT = "validation"
+# Use the parquet mirror revision so the dataset can be loaded reliably
+# through the standard HuggingFace datasets API.
+DEFAULT_DEPTH_ESTIMATION_REVISION = "refs/convert/parquet"
+
+
+class DepthEstimationDataset(ImageDataset):
+    """Depth-estimation dataset with fixed-shape preprocessing.
+
+    This specialization ensures calibration samples follow the input shape of
+    the exported ONNX model and works with datasets whose target is a depth map
+    instead of a class label.
+    """
+
+    _revision: str | None = None
+
+    def _get_default_dataset(self) -> None:
+        """Set the built-in depth-estimation dataset defaults.
+
+        The default points to the NYU depth dataset and a stable revision that
+        can be loaded directly through ``datasets``.
+        """
+        if self._dataset_name is None:
+            self._dataset_name = DEFAULT_DEPTH_ESTIMATION_DATASET
+            self._data_split = DEFAULT_DEPTH_ESTIMATION_SPLIT
+            self._revision = DEFAULT_DEPTH_ESTIMATION_REVISION
+
+    def _derive_overrides(self, io_config: dict[str, Any] | None) -> dict[str, Any]:
+        """Build processor overrides from the ONNX input configuration.
+
+        When the model exposes a fixed ``pixel_values`` shape, that shape is
+        applied to the image processor and variable-size preprocessing is
+        disabled.
+        """
+        overrides: dict[str, Any] = {
+            "keep_aspect_ratio": False,
+            "do_pad": False,
+        }
+
+        if io_config is None:
+            logger.debug("No io_config provided, using default overrides")
+            return overrides
+
+        if "pixel_values" in io_config:
+            shape = io_config["pixel_values"].get("shape", [])
+            # Shape is typically [batch, channels, height, width]
+            if len(shape) >= 4:
+                height = shape[2]
+                width = shape[3]
+                if height is not None and width is not None:
+                    overrides["size"] = {"height": height, "width": width}
+                    logger.debug(
+                        "Extracted size from io_config: height=%d, width=%d",
+                        height,
+                        width,
+                    )
+
+        return overrides
+
+    def _initialize(self) -> None:
+        """Load the dataset and prepare fixed-shape image tensors."""
+        # Use the built-in defaults when the caller does not provide a dataset.
+        if self._dataset_name is None:
+            self._get_default_dataset()
+
+        # Reuse the parent helper so streaming/shuffle/max_samples behavior
+        # stays consistent with ImageDataset and ObjectDetectionDataset.
+        dataset = self._load_and_sample(revision=getattr(self, "_revision", None))
+
+        # Detect the input image column and the depth target column.
+        self._detect_image_column(dataset)
+
+        # Match processor output to the ONNX input shape when available.
+        io_config = self._config.get("io_config")
+        overrides = self._derive_overrides(io_config)
+
+        # Fall back to the default square size when the ONNX shape is absent.
+        if "size" not in overrides:
+            overrides["size"] = {
+                "height": DEFAULT_DEPTH_ESTIMATION_SIZE,
+                "width": DEFAULT_DEPTH_ESTIMATION_SIZE,
+            }
+
+        # Create a processor that emits tensors compatible with calibration.
+        processor = AutoImageProcessor.from_pretrained(
+            self._model_name,
+            use_fast=True,
+            **overrides,
+        )
+
+        logger.debug("Created processor with overrides: %s", overrides)
+
+        # Convert raw images into model-ready tensors.
+        def preprocess_single_sample(example: dict[str, Any]) -> dict[str, Any]:
+            return processor(example[self._image_col].convert("RGB"), return_tensors="pt")
+
+        self._dataset = dataset.map(
+            preprocess_single_sample, remove_columns=[self._image_col]
+        ).with_format("torch", output_all_columns=True)
+
+        logger.info("Dataset initialized with %d samples", len(self._dataset))
+
+    def _detect_image_column(self, dataset: Any) -> None:
+        """Detect the input image column for calibration.
+
+        PTQ calibration only consumes ``pixel_values``; the depth target is
+        not needed and is not read here.
+        """
+        if not hasattr(dataset, "features"):
+            raise ValueError(f"Dataset {self._dataset_name} has no features metadata")
+
+        features = dataset.features
+
+        self._image_col = None
+        for col_name, feature in features.items():
+            if isinstance(feature, Image):
+                self._image_col = col_name
+                break
+
+        if not self._image_col:
+            available_cols = list(features.keys())
+            available_types = [type(f).__name__ for f in features.values()]
+            raise ValueError(
+                f"No Image column found in {self._dataset_name}. "
+                f"Available: {dict(zip(available_cols, available_types, strict=False))}"
+            )
+
+        logger.info("Detected image column: '%s'", self._image_col)
+
+    @property
+    def label_col(self) -> str:
+        """No label column is used for depth-estimation calibration."""
+        return ""
+
+    @property
+    def label_names(self) -> list[str]:
+        """Depth estimation has no class labels."""
+        return []
diff --git a/src/winml/modelkit/datasets/image.py b/src/winml/modelkit/datasets/image.py
index 92c6b885b..251eaec9f 100644
--- a/src/winml/modelkit/datasets/image.py
+++ b/src/winml/modelkit/datasets/image.py
@@ -52,13 +52,17 @@ def _get_default_dataset(self) -> None:
             self._data_split = "train"
             self._config.setdefault("streaming", True)
 
-    def _load_and_sample(self) -> Any:
+    def _load_and_sample(self, revision: str | None = None) -> Any:
         """Load the configured dataset and apply sample/shuffle.
 
         Shared by ImageDataset and ObjectDetectionDataset. Column detection
         is *not* done here — callers run their own detection on the returned
         dataset because the column schema differs by task.
 
+        Args:
+            revision: Optional dataset revision (branch, tag, or commit) for
+                datasets that need to be pinned (e.g. ``refs/convert/parquet``).
+
         Returns:
             A materialized arrow Dataset of up to ``self._max_samples`` rows.
         """
@@ -68,7 +72,12 @@ def _load_and_sample(self) -> Any:
         streaming = self._config.get("streaming", False) and self._max_samples is not None
         logger.info(f"Loading dataset: {self._dataset_name} with split: {self._data_split}")
         try:
-            dataset = load_dataset(self._dataset_name, split=self._data_split, streaming=streaming)
+            dataset = load_dataset(
+                self._dataset_name,
+                split=self._data_split,
+                streaming=streaming,
+                revision=revision,
+            )
         except Exception as e:
             logger.error(f"Failed to load dataset {self._dataset_name}: {e}")
             raise
diff --git a/src/winml/modelkit/eval/__init__.py b/src/winml/modelkit/eval/__init__.py
index 8904dd955..0d3225608 100644
--- a/src/winml/modelkit/eval/__init__.py
+++ b/src/winml/modelkit/eval/__init__.py
@@ -19,12 +19,14 @@
 
 
 if TYPE_CHECKING:
+    from .depth_estimation_evaluator import WinMLDepthEstimationEvaluator
     from .feature_extraction_evaluator import WinMLFeatureExtractionEvaluator
     from .fill_mask_evaluator import WinMLFillMaskEvaluator
     from .image_feature_extraction_evaluator import WinMLImageFeatureExtractionEvaluator
     from .image_segmentation_evaluator import WinMLImageSegmentationEvaluator
     from .image_to_text_evaluator import WinMLImageToTextEvaluator
     from .metrics.classification import ClassificationMetric
+    from .metrics.depth import DepthMetric
     from .metrics.knn_accuracy import KNNAccuracyMetric
     from .metrics.mean_average_precision import MAPMetric
     from .metrics.mean_iou import IGNORE_INDEX, MeanIoUMetric
@@ -41,6 +43,8 @@
 
 _LAZY_ATTRS: dict[str, str] = {
     # Evaluators
+    "WinMLDepthEstimationEvaluator":
+        ".depth_estimation_evaluator:WinMLDepthEstimationEvaluator",
     "WinMLFeatureExtractionEvaluator":
         ".feature_extraction_evaluator:WinMLFeatureExtractionEvaluator",
     "WinMLFillMaskEvaluator":
@@ -66,6 +70,8 @@
     # Metrics (defer numpy / scipy / torch / torchmetrics until first use)
     "ClassificationMetric":
         ".metrics.classification:ClassificationMetric",
+    "DepthMetric":
+        ".metrics.depth:DepthMetric",
     "IGNORE_INDEX":
         ".metrics.mean_iou:IGNORE_INDEX",
     "KNNAccuracyMetric":
@@ -104,6 +110,7 @@ def __dir__() -> list[str]:
     "IGNORE_INDEX",
     "ClassificationMetric",
     "DatasetConfig",
+    "DepthMetric",
     "EvalResult",
     "KNNAccuracyMetric",
     "MAPMetric",
@@ -111,6 +118,7 @@ def __dir__() -> list[str]:
     "PseudoPerplexityMetric",
     "SpearmanCorrelationMetric",
     "TopKAccuracyMetric",
+    "WinMLDepthEstimationEvaluator",
     "WinMLEvaluationConfig",
     "WinMLEvaluator",
     "WinMLFeatureExtractionEvaluator",
diff --git a/src/winml/modelkit/eval/base_evaluator.py b/src/winml/modelkit/eval/base_evaluator.py
index c11a87207..8ca25c091 100644
--- a/src/winml/modelkit/eval/base_evaluator.py
+++ b/src/winml/modelkit/eval/base_evaluator.py
@@ -97,6 +97,7 @@ def prepare_data(self) -> Dataset:
                     name=ds.name,
                     split=ds.split,
                     streaming=ds.streaming,
+                    revision=ds.revision,
                 )
         except Exception as e:
             raise DatasetValidationError(
diff --git a/src/winml/modelkit/eval/config.py b/src/winml/modelkit/eval/config.py
index 5054ef741..eff810bda 100644
--- a/src/winml/modelkit/eval/config.py
+++ b/src/winml/modelkit/eval/config.py
@@ -28,6 +28,9 @@ class DatasetConfig:
         columns_mapping: Column name overrides as key=value pairs.
             If empty, consumer uses its own defaults.
         streaming: Whether to stream dataset (avoids full download).
+        revision: Git revision (branch, tag, or commit) to load. Useful for
+            datasets pinned to a specific snapshot (e.g.
+            ``refs/convert/parquet``).
         build_script: Path to a Python script that builds the dataset locally.
             When set alongside ``path``, the script is invoked with
             ``--output <path>`` before the dataset is loaded.
@@ -44,6 +47,7 @@ class DatasetConfig:
     columns_mapping: dict[str, str] = field(default_factory=dict)
     label_mapping: dict[str, int] | None = None
     streaming: bool = False
+    revision: str | None = field(default=None, metadata={"cli_name": "dataset_revision"})
     build_script: str | None = field(default=None, metadata={"cli_name": "dataset_script"})
     label_mapping_file: str | None = None
 
@@ -65,6 +69,8 @@ def to_dict(self) -> dict[str, Any]:
             result["label_mapping"] = self.label_mapping
         if self.streaming:
             result["streaming"] = self.streaming
+        if self.revision is not None:
+            result["revision"] = self.revision
         if self.build_script is not None:
             result["build_script"] = self.build_script
         if self.label_mapping_file is not None:
@@ -136,6 +142,7 @@ def from_dict(cls, data: dict) -> WinMLEvaluationConfig:
             seed=ds_data.get("seed", 42),
             columns_mapping=ds_data.get("columns_mapping", {}),
             streaming=ds_data.get("streaming", False),
+            revision=ds_data.get("revision"),
             build_script=ds_data.get("build_script"),
             label_mapping_file=ds_data.get("label_mapping_file"),
         )
diff --git a/src/winml/modelkit/eval/depth_estimation_evaluator.py b/src/winml/modelkit/eval/depth_estimation_evaluator.py
new file mode 100644
index 000000000..7b0bd0333
--- /dev/null
+++ b/src/winml/modelkit/eval/depth_estimation_evaluator.py
@@ -0,0 +1,148 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""Monocular depth estimation evaluator.
+
+HF ``evaluate`` has no depth-estimation evaluator, so we run the metric
+loop manually against ``DepthMetric`` (AbsRel, RMSE, delta1).
+"""
+
+from __future__ import annotations
+
+import logging
+from typing import TYPE_CHECKING, Any
+
+from .base_evaluator import WinMLEvaluator
+
+
+if TYPE_CHECKING:
+    import numpy as np
+    from transformers.pipelines.base import Pipeline
+
+    from ..models.winml.base import WinMLPreTrainedModel
+    from .config import WinMLEvaluationConfig
+
+logger = logging.getLogger(__name__)
+
+
+class WinMLDepthEstimationEvaluator(WinMLEvaluator):
+    """Evaluator for monocular depth estimation."""
+
+    def __init__(
+        self,
+        config: WinMLEvaluationConfig,
+        model: WinMLPreTrainedModel,
+    ) -> None:
+        from ..utils.eval_utils import get_default
+
+        mapping = config.dataset.columns_mapping
+        task = "depth-estimation"
+        self._input_col = mapping.get("input_column", get_default(task, "input_column"))
+        self._depth_col = mapping.get("depth_column", get_default(task, "depth_column"))
+        self._align = mapping.get("align", get_default(task, "align"))
+        self._depth_kind = mapping.get("depth_kind", get_default(task, "depth_kind"))
+        self._min_depth = float(mapping.get("min_depth", get_default(task, "min_depth")))
+        max_depth_raw = mapping.get("max_depth", get_default(task, "max_depth"))
+        self._max_depth: float | None
+        if isinstance(max_depth_raw, str) and max_depth_raw.lower() == "none":
+            self._max_depth = None
+        else:
+            self._max_depth = float(max_depth_raw)
+        super().__init__(config, model)
+
+    def prepare_pipeline(self) -> Pipeline:
+        """Create pipeline and match image processor size to ONNX input shape.
+
+        Image processors for depth and detection models often default to
+        aspect-preserving resize and/or padding (e.g. Depth-Anything sets
+        ``keep_aspect_ratio=True`` with ``ensure_multiple_of=14``), which
+        produces a per-image output shape that does not match the static
+        ONNX input shape. We override these flags so the processor produces
+        exactly the target ``(h, w)`` for every input.
+
+        Models without these attributes are unaffected.
+        """
+        pipe = super().prepare_pipeline()
+
+        io_config = getattr(self.model, "io_config", None) or {}
+        input_shapes = io_config.get("input_shapes", [])
+        if input_shapes and len(input_shapes[0]) == 4:
+            _, _, h, w = input_shapes[0]
+            pipe.image_processor.size = {"height": h, "width": w}
+            if hasattr(pipe.image_processor, "keep_aspect_ratio"):
+                pipe.image_processor.keep_aspect_ratio = False
+            if hasattr(pipe.image_processor, "do_pad"):
+                pipe.image_processor.do_pad = False
+
+        return pipe
+
+    def compute(self) -> dict[str, Any]:
+        """Run depth evaluation over all samples."""
+        import numpy as np
+        from tqdm import tqdm
+
+        from .metrics import DepthMetric
+
+        metric = DepthMetric(
+            align=self._align,
+            depth_kind=self._depth_kind,
+            min_depth=self._min_depth,
+            max_depth=self._max_depth,
+        )
+
+        skipped = 0
+        for sample in tqdm(self.data, desc="Evaluating depth"):
+            image = sample.get(self._input_col)
+            depth = sample.get(self._depth_col)
+            if image is None or depth is None:
+                skipped += 1
+                continue
+
+            self._validate_image_input(image)
+            result = self.pipe(image)
+            pred = self._extract_predicted_depth(result)
+            gt = np.asarray(depth, dtype=np.float32)
+            metric.update(pred, gt)
+
+        if skipped:
+            logger.warning("Skipped %d samples with missing image or depth.", skipped)
+
+        return metric.compute()
+
+    def _validate_image_input(self, image: Any) -> None:
+        """Raise a clear error for tensor-formatted image columns."""
+        import numpy as np
+        import torch
+        from PIL import Image as PILImage
+
+        if isinstance(image, PILImage.Image):
+            return
+
+        if isinstance(image, (np.ndarray, torch.Tensor)):
+            raise TypeError(
+                f"Depth-estimation input column {self._input_col!r} must yield PIL "
+                f"images; got {type(image).__name__}. Use a datasets.Image column "
+                "or remove tensor/NumPy formatting before evaluation.",
+            )
+
+    @staticmethod
+    def _extract_predicted_depth(result: Any) -> np.ndarray:
+        """Pull the numeric depth tensor out of an HF pipeline result."""
+        import numpy as np
+        import torch
+
+        if not isinstance(result, dict):
+            raise TypeError(
+                f"Unexpected pipeline output type: {type(result).__name__}; expected dict.",
+            )
+
+        predicted = result.get("predicted_depth")
+        if predicted is None:
+            raise ValueError(
+                f"Pipeline output missing 'predicted_depth'; got keys {list(result)}.",
+            )
+        if isinstance(predicted, torch.Tensor):
+            predicted = predicted.detach().cpu().numpy()
+        return np.asarray(predicted, dtype=np.float32).squeeze()
diff --git a/src/winml/modelkit/eval/evaluate.py b/src/winml/modelkit/eval/evaluate.py
index 0d6a728c7..02139c112 100644
--- a/src/winml/modelkit/eval/evaluate.py
+++ b/src/winml/modelkit/eval/evaluate.py
@@ -57,6 +57,8 @@
         "winml.modelkit.eval.zero_shot_classification_evaluator:WinMLZeroShotClassificationEvaluator",
     "zero-shot-image-classification":
         "winml.modelkit.eval.zero_shot_image_classification_evaluator:WinMLZeroShotImageClassificationEvaluator",
+    "depth-estimation":
+        "winml.modelkit.eval.depth_estimation_evaluator:WinMLDepthEstimationEvaluator",
 }
 
 
@@ -156,6 +158,13 @@ def get_evaluator_class(task: str) -> type[WinMLEvaluator]:
             "label_column": "fine_label",
         },
     },
+    "depth-estimation": {
+        "path": "sayakpaul/nyu_depth_v2",
+        "split": "validation",
+        # Loaded via the parquet-mirror revision so the dataset works without
+        # the legacy `nyu_depth_v2.py` loader script.
+        "revision": "refs/convert/parquet",
+    },
 }
 
 
diff --git a/src/winml/modelkit/eval/metrics/__init__.py b/src/winml/modelkit/eval/metrics/__init__.py
index 0cd0ac0aa..2695c3d84 100644
--- a/src/winml/modelkit/eval/metrics/__init__.py
+++ b/src/winml/modelkit/eval/metrics/__init__.py
@@ -13,6 +13,7 @@
 
 if TYPE_CHECKING:
     from .classification import ClassificationMetric
+    from .depth import DepthMetric
     from .knn_accuracy import KNNAccuracyMetric
     from .mean_average_precision import MAPMetric
     from .mean_iou import IGNORE_INDEX, MeanIoUMetric
@@ -26,6 +27,7 @@
 # that do not actually use the metric in question.
 _LAZY_ATTRS: dict[str, str] = {
     "ClassificationMetric": ".classification:ClassificationMetric",
+    "DepthMetric": ".depth:DepthMetric",
     "IGNORE_INDEX": ".mean_iou:IGNORE_INDEX",
     "KNNAccuracyMetric": ".knn_accuracy:KNNAccuracyMetric",
     "MAPMetric": ".mean_average_precision:MAPMetric",
@@ -55,6 +57,7 @@ def __dir__() -> list[str]:
 __all__ = [
     "IGNORE_INDEX",
     "ClassificationMetric",
+    "DepthMetric",
     "KNNAccuracyMetric",
     "MAPMetric",
     "MeanIoUMetric",
diff --git a/src/winml/modelkit/eval/metrics/depth.py b/src/winml/modelkit/eval/metrics/depth.py
new file mode 100644
index 000000000..d89474440
--- /dev/null
+++ b/src/winml/modelkit/eval/metrics/depth.py
@@ -0,0 +1,180 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""Monocular depth estimation metrics: AbsRel, RMSE, delta1.
+
+Follows the standard NYU/KITTI evaluation protocol (Eigen et al. 2014).
+Metrics are computed only over pixels where ground truth is finite,
+positive, and within an optional ``[min_depth, max_depth]`` range.
+"""
+
+from __future__ import annotations
+
+from typing import Any
+
+import numpy as np
+import torch
+
+
+class DepthMetric:
+    """Per-pixel depth estimation metric (AbsRel, RMSE, delta1).
+
+    Accumulates statistics across calls to :meth:`update` and returns a
+    dict from :meth:`compute`. Predictions and ground truth must be the
+    same 2D shape; resampling is the caller's responsibility.
+    """
+
+    _VALID_ALIGN = ("none", "median", "affine")
+    _VALID_DEPTH_KIND = ("depth", "disparity")
+
+    def __init__(
+        self,
+        align: str = "affine",
+        depth_kind: str = "depth",
+        min_depth: float = 1e-3,
+        max_depth: float | None = 10.0,
+        delta_threshold: float = 1.25,
+    ) -> None:
+        """Initialize depth metric.
+
+        Args:
+            align: Per-image alignment of predictions to ground truth.
+                ``"affine"`` (default) fits ``s * pred + t`` via
+                least-squares — standard for relative-depth models
+                (MiDaS, Depth-Anything, Marigold). ``"median"`` rescales
+                by ``median(gt) / median(pred)`` (scale only, no shift).
+                ``"none"`` evaluates predictions as-is — for metric-depth
+                models like ZoeDepth and DepthPro.
+            depth_kind: Output space of ``prediction``. ``"depth"``
+                (default) treats values as forward depth/distance.
+                ``"disparity"`` first inverts the prediction to depth
+                (``1 / pred``) — for DPT/MiDaS-style models whose
+                output is inverse depth.
+            min_depth: Lower bound (inclusive) for valid ground-truth
+                pixels in the same units as ``gt``.
+            max_depth: Upper bound (inclusive) for valid ground-truth
+                pixels, or ``None`` to disable. Defaults to 10 m
+                (NYU indoor convention).
+            delta_threshold: Threshold for delta1 accuracy.
+        """
+        if align not in self._VALID_ALIGN:
+            raise ValueError(
+                f"align must be one of {self._VALID_ALIGN}, got {align!r}.",
+            )
+        if depth_kind not in self._VALID_DEPTH_KIND:
+            raise ValueError(
+                f"depth_kind must be one of {self._VALID_DEPTH_KIND}, got {depth_kind!r}.",
+            )
+        if delta_threshold <= 1.0:
+            raise ValueError(f"delta_threshold must be > 1, got {delta_threshold}.")
+
+        self._align = align
+        self._depth_kind = depth_kind
+        self._min_depth = float(min_depth)
+        self._max_depth = float(max_depth) if max_depth is not None else None
+        self._delta_threshold = float(delta_threshold)
+
+        self._abs_rel_sum = 0.0
+        self._sq_err_sum = 0.0
+        self._delta_hits = 0
+        self._pixel_count = 0
+        self._image_count = 0
+
+    def update(self, prediction: Any, reference: Any) -> None:
+        """Add one image's prediction and ground-truth depth map.
+
+        Args:
+            prediction: ``(H, W)`` array-like of predicted depth (or
+                disparity, when ``depth_kind="disparity"``).
+                Negative or non-finite values are treated as invalid.
+            reference: ``(H, W)`` array-like of ground-truth depth in
+                the same units as the aligned prediction.
+        """
+        pred = self._to_numpy(prediction)
+        gt = self._to_numpy(reference)
+        if pred.shape != gt.shape:
+            raise ValueError(
+                f"prediction and reference must share shape; got {pred.shape} vs {gt.shape}.",
+            )
+
+        if self._depth_kind == "disparity":
+            with np.errstate(divide="ignore", invalid="ignore"):
+                pred = np.where(pred > 0, 1.0 / pred, np.nan)
+
+        valid = self._valid_mask(pred, gt)
+        if not valid.any():
+            self._image_count += 1
+            return
+
+        pred_v = pred[valid].astype(np.float64)
+        gt_v = gt[valid].astype(np.float64)
+
+        if self._align == "median":
+            scale = np.median(gt_v) / np.median(pred_v)
+            pred_v = pred_v * scale
+        elif self._align == "affine":
+            # Least-squares fit of (s, t) such that s * pred + t ~ gt.
+            # Standard scale-and-shift alignment for relative-depth models
+            # (MiDaS, Depth-Anything, Marigold).
+            ones = np.ones_like(pred_v)
+            a = np.stack([pred_v, ones], axis=1)
+            (scale, shift), *_ = np.linalg.lstsq(a, gt_v, rcond=None)
+            pred_v = pred_v * scale + shift
+            # Affine alignment can introduce non-positive predicted depths.
+            # Re-filter them after scale-and-shift, following Eigen/MiDaS eval
+            # behavior; affine may therefore use fewer pixels than median/none.
+            pos = pred_v > self._min_depth
+            if not pos.any():
+                self._image_count += 1
+                return
+            pred_v = pred_v[pos]
+            gt_v = gt_v[pos]
+
+        diff = pred_v - gt_v
+        ratio = np.maximum(pred_v / gt_v, gt_v / pred_v)
+
+        self._abs_rel_sum += float(np.sum(np.abs(diff) / gt_v))
+        self._sq_err_sum += float(np.sum(diff * diff))
+        self._delta_hits += int(np.sum(ratio < self._delta_threshold))
+        self._pixel_count += int(pred_v.size)
+        self._image_count += 1
+
+    def compute(self) -> dict[str, Any]:
+        """Return aggregated metrics over all updates."""
+        if self._pixel_count == 0:
+            raise ValueError(
+                "DepthMetric.compute() called with no valid pixels; "
+                "check ground-truth ranges and update calls.",
+            )
+        return {
+            "abs_rel": self._abs_rel_sum / self._pixel_count,
+            "rmse": float(np.sqrt(self._sq_err_sum / self._pixel_count)),
+            "delta1": self._delta_hits / self._pixel_count,
+            "num_images": self._image_count,
+            "num_valid_pixels": self._pixel_count,
+        }
+
+    def reset(self) -> None:
+        """Clear accumulated state for a fresh evaluation."""
+        self._abs_rel_sum = 0.0
+        self._sq_err_sum = 0.0
+        self._delta_hits = 0
+        self._pixel_count = 0
+        self._image_count = 0
+
+    def _valid_mask(self, pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
+        """Pixels where both prediction and ground truth are usable."""
+        mask = np.isfinite(gt) & (gt > self._min_depth)
+        if self._max_depth is not None:
+            mask &= gt <= self._max_depth
+        mask &= np.isfinite(pred) & (pred > 0)
+        return mask
+
+    @staticmethod
+    def _to_numpy(arr: Any) -> np.ndarray:
+        """Convert torch.Tensor / PIL / numpy to a 2D float numpy array."""
+        if isinstance(arr, torch.Tensor):
+            return arr.detach().cpu().numpy().squeeze()
+        return np.asarray(arr).squeeze()
diff --git a/src/winml/modelkit/models/hf/depth_anything.py b/src/winml/modelkit/models/hf/depth_anything.py
index c4336903e..3e30a34e6 100644
--- a/src/winml/modelkit/models/hf/depth_anything.py
+++ b/src/winml/modelkit/models/hf/depth_anything.py
@@ -17,12 +17,50 @@
 from __future__ import annotations
 
 from optimum.exporters.onnx import OnnxConfig
-from optimum.utils import NormalizedConfig
+from optimum.utils import DEFAULT_DUMMY_SHAPES, NormalizedConfig
 from optimum.utils.input_generators import DummyVisionInputGenerator
 
 from ...export import register_onnx_overwrite
 
 
+class _DepthAnythingVisionInputGenerator(DummyVisionInputGenerator):
+    """Vision input generator that lets explicit height/width override config.image_size.
+
+    Optimum's DummyVisionInputGenerator prioritizes normalized_config.image_size
+    (resolved here from backbone_config.image_size) over explicit height/width
+    kwargs. When the user supplies a non-default shape via --shape-config (e.g.
+    to match a non-square dataset), this subclass restores the override behavior
+    so user kwargs take precedence. Mirrors the pattern used in
+    `_SegformerVisionInputGenerator`.
+    """
+
+    def __init__(
+        self,
+        task: str,
+        normalized_config,
+        batch_size: int = DEFAULT_DUMMY_SHAPES["batch_size"],
+        num_channels: int = DEFAULT_DUMMY_SHAPES["num_channels"],
+        width: int = DEFAULT_DUMMY_SHAPES["width"],
+        height: int = DEFAULT_DUMMY_SHAPES["height"],
+        **kwargs,
+    ):
+        super().__init__(
+            task,
+            normalized_config,
+            batch_size=batch_size,
+            num_channels=num_channels,
+            width=width,
+            height=height,
+            **kwargs,
+        )
+        # If caller passed non-default height/width (e.g. from --shape-config),
+        # use those instead of the backbone config's pretraining resolution.
+        if height != DEFAULT_DUMMY_SHAPES["height"] or width != DEFAULT_DUMMY_SHAPES["width"]:
+            self.height = height
+            self.width = width
+            self.image_size = (height, width)
+
+
 @register_onnx_overwrite("depth_anything", "depth-estimation", library_name="transformers")
 class DepthAnythingIOConfig(OnnxConfig):
     """ONNX config for Depth Anything depth estimation.
@@ -42,7 +80,7 @@ class DepthAnythingIOConfig(OnnxConfig):
         num_channels="backbone_config.num_channels",
         allow_new=True,
     )
-    DUMMY_INPUT_GENERATOR_CLASSES = (DummyVisionInputGenerator,)
+    DUMMY_INPUT_GENERATOR_CLASSES = (_DepthAnythingVisionInputGenerator,)
 
     @property
     def inputs(self) -> dict[str, dict[int, str]]:
diff --git a/src/winml/modelkit/models/winml/__init__.py b/src/winml/modelkit/models/winml/__init__.py
index b0c3a941c..df3c9d94a 100644
--- a/src/winml/modelkit/models/winml/__init__.py
+++ b/src/winml/modelkit/models/winml/__init__.py
@@ -40,6 +40,7 @@
     "image-segmentation": "WinMLModelForImageSegmentation",
     "semantic-segmentation": "WinMLModelForSemanticSegmentation",
     "object-detection": "WinMLModelForObjectDetection",
+    "depth-estimation": "WinMLModelForDepthEstimation",
     # Not yet implemented — falls back to WinMLModelForGenericTask at runtime
     "token-classification": "WinMLModelForTokenClassification",
     "question-answering": "WinMLModelForQuestionAnswering",
@@ -74,6 +75,7 @@ def _import_winml_class(class_name: str) -> type[WinMLPreTrainedModel]:
         ImportError: If class is not implemented yet
     """
     from .base import WinMLModelForGenericTask
+    from .depth_estimation import WinMLModelForDepthEstimation
     from .feature_extraction import WinMLModelForFeatureExtraction
     from .image_classification import WinMLModelForImageClassification
     from .image_segmentation import (
@@ -86,6 +88,7 @@ def _import_winml_class(class_name: str) -> type[WinMLPreTrainedModel]:
 
     # Map class names to modules
     class_map: dict[str, type] = {
+        "WinMLModelForDepthEstimation": WinMLModelForDepthEstimation,
         "WinMLModelForFeatureExtraction": WinMLModelForFeatureExtraction,
         "WinMLModelForImageClassification": WinMLModelForImageClassification,
         "WinMLModelForImageSegmentation": WinMLModelForImageSegmentation,
@@ -182,6 +185,7 @@ def register_specialization(model_type: str, task: str, class_name: str) -> None
     register_composite_model,
 )
 from .decoder_only import WinMLDecoderOnlyModel
+from .depth_estimation import WinMLModelForDepthEstimation
 from .encoder_decoder import WinMLEncoderDecoderModel
 from .feature_extraction import WinMLModelForFeatureExtraction
 from .image_classification import WinMLModelForImageClassification
@@ -209,6 +213,7 @@ def register_specialization(model_type: str, task: str, class_name: str) -> None
     "WinMLCompositeModel",
     "WinMLDecoderOnlyModel",
     "WinMLEncoderDecoderModel",
+    "WinMLModelForDepthEstimation",
     "WinMLModelForFeatureExtraction",
     "WinMLModelForGenericTask",
     "WinMLModelForImageClassification",
diff --git a/src/winml/modelkit/models/winml/depth_estimation.py b/src/winml/modelkit/models/winml/depth_estimation.py
new file mode 100644
index 000000000..dda6b5c3e
--- /dev/null
+++ b/src/winml/modelkit/models/winml/depth_estimation.py
@@ -0,0 +1,51 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""WinML Model for Depth Estimation.
+
+Thin wrapper for monocular depth estimation inference.
+Pipeline execution (export/optimize/compile) is done by WinMLAutoModel factory.
+"""
+
+from __future__ import annotations
+
+import logging
+from typing import Any
+
+from transformers.modeling_outputs import DepthEstimatorOutput
+
+from .base import WinMLPreTrainedModel
+
+
+logger = logging.getLogger(__name__)
+
+
+class WinMLModelForDepthEstimation(WinMLPreTrainedModel):
+    """WinML model for monocular depth estimation.
+
+    Returns ``DepthEstimatorOutput`` with ``predicted_depth`` so HF's
+    depth-estimation pipeline can run post-processing via
+    ``image_processor.post_process_depth_estimation()``.
+    """
+
+    def forward(self, **kwargs: Any) -> DepthEstimatorOutput:
+        """Run depth estimation inference.
+
+        Accepts all processor outputs via ``**kwargs`` and passes them
+        directly to the ONNX session, keeping the implementation
+        architecture-agnostic.
+
+        Returns:
+            DepthEstimatorOutput with the ``predicted_depth`` tensor populated.
+        """
+        formatted = self._format_inputs(**kwargs)
+        outputs = self._run_inference(formatted)
+
+        predicted_depth = outputs.get("predicted_depth")
+        if predicted_depth is None:
+            # Fall back to first output for non-standard output names.
+            predicted_depth = next(iter(outputs.values()))
+
+        return DepthEstimatorOutput(predicted_depth=predicted_depth)
diff --git a/src/winml/modelkit/utils/eval_utils.py b/src/winml/modelkit/utils/eval_utils.py
index 7807640c7..589331b1d 100644
--- a/src/winml/modelkit/utils/eval_utils.py
+++ b/src/winml/modelkit/utils/eval_utils.py
@@ -242,6 +242,45 @@ class TaskSchema:
     roles=("image-encoder", "text-encoder"),
 )
 
+_DEPTH_ESTIMATION_SCHEMA = TaskSchema(
+    columns=(
+        SchemaItem(
+            "input_column", "input image (PIL.Image)",
+            default="image", remap_hint="<your_image_column>",
+        ),
+        SchemaItem(
+            "depth_column", "single-channel ground-truth depth image",
+            default="depth_map", remap_hint="<your_depth_column>",
+        ),
+    ),
+    params=(
+        SchemaItem(
+            "align",
+            "alignment strategy for predictions",
+            default="affine",
+            remap_hint="<affine|median|none>",
+        ),
+        SchemaItem(
+            "depth_kind",
+            "prediction space",
+            default="depth",
+            remap_hint="<depth|disparity>",
+        ),
+        SchemaItem(
+            "min_depth",
+            "minimum valid ground-truth depth",
+            default="1e-3",
+            remap_hint="<float>",
+        ),
+        SchemaItem(
+            "max_depth",
+            "maximum valid ground-truth depth",
+            default="10.0",
+            remap_hint="<float|none>",
+        ),
+    ),
+)
+
 TASK_SCHEMAS: dict[str, TaskSchema] = {
     "image-classification": _IMAGE_CLASSIFICATION_SCHEMA,
     "text-classification": _TEXT_CLASSIFICATION_SCHEMA,
@@ -258,6 +297,7 @@ class TaskSchema:
     "fill-mask": _FILL_MASK_SCHEMA,
     "zero-shot-classification": _ZERO_SHOT_CLASSIFICATION_SCHEMA,
     "zero-shot-image-classification": _ZERO_SHOT_IMAGE_CLASSIFICATION_SCHEMA,
+    "depth-estimation": _DEPTH_ESTIMATION_SCHEMA,
 }
 
 
diff --git a/tests/integration/datasets/test_depth_estimation.py b/tests/integration/datasets/test_depth_estimation.py
new file mode 100644
index 000000000..12f9f2468
--- /dev/null
+++ b/tests/integration/datasets/test_depth_estimation.py
@@ -0,0 +1,352 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+"""Tests for DepthEstimationDataset."""
+
+from __future__ import annotations
+
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+
+class TestDepthEstimationDatasetDeriveOverrides:
+    """Tests for DepthEstimationDataset._derive_overrides method."""
+
+    @pytest.fixture
+    def dataset_class(self) -> type:
+        """Get DepthEstimationDataset class without instantiation."""
+        from winml.modelkit.datasets import DepthEstimationDataset
+
+        return DepthEstimationDataset
+
+    def test_no_io_config_returns_static_overrides(self, dataset_class: type) -> None:
+        """Even without io_config, static overrides are set."""
+        instance = object.__new__(dataset_class)
+        overrides = instance._derive_overrides(None)
+
+        assert overrides == {"keep_aspect_ratio": False, "do_pad": False}
+
+    def test_always_disables_keep_aspect_ratio(self, dataset_class: type) -> None:
+        """keep_aspect_ratio=False is always set (Depth-Anything default is True)."""
+        instance = object.__new__(dataset_class)
+        io_config = {"pixel_values": {"shape": [1, 3, 518, 518]}}
+
+        overrides = instance._derive_overrides(io_config)
+
+        assert overrides["keep_aspect_ratio"] is False
+
+    def test_always_disables_do_pad(self, dataset_class: type) -> None:
+        """do_pad=False is always set (depth processors may pad otherwise)."""
+        instance = object.__new__(dataset_class)
+        io_config = {"pixel_values": {"shape": [1, 3, 518, 518]}}
+
+        overrides = instance._derive_overrides(io_config)
+
+        assert overrides["do_pad"] is False
+
+    def test_extracts_size_from_pixel_values_shape(self, dataset_class: type) -> None:
+        """Should extract height/width from pixel_values shape."""
+        instance = object.__new__(dataset_class)
+        io_config = {"pixel_values": {"shape": [1, 3, 518, 518]}}
+
+        overrides = instance._derive_overrides(io_config)
+
+        assert overrides["size"] == {"height": 518, "width": 518}
+
+    def test_handles_dynamic_dimensions(self, dataset_class: type) -> None:
+        """Should not set size when dimensions are dynamic (None)."""
+        instance = object.__new__(dataset_class)
+        io_config = {"pixel_values": {"shape": [None, 3, None, None]}}
+
+        overrides = instance._derive_overrides(io_config)
+
+        assert "size" not in overrides
+        # Static overrides still present
+        assert overrides["keep_aspect_ratio"] is False
+        assert overrides["do_pad"] is False
+
+    def test_handles_missing_shape_key(self, dataset_class: type) -> None:
+        """Should handle pixel_values without shape key."""
+        instance = object.__new__(dataset_class)
+        io_config = {"pixel_values": {}}
+
+        overrides = instance._derive_overrides(io_config)
+
+        assert "size" not in overrides
+        assert overrides["keep_aspect_ratio"] is False
+        assert overrides["do_pad"] is False
+
+    def test_handles_short_shape_list(self, dataset_class: type) -> None:
+        """Should handle shape with fewer than 4 dimensions."""
+        instance = object.__new__(dataset_class)
+        io_config = {"pixel_values": {"shape": [518, 518]}}
+
+        overrides = instance._derive_overrides(io_config)
+
+        assert "size" not in overrides
+
+
+class TestDepthEstimationDatasetWithMockedDeps:
+    """Tests with mocked dependencies for faster execution."""
+
+    @patch("winml.modelkit.datasets.image.load_dataset")
+    @patch("winml.modelkit.datasets.depth_estimation.AutoImageProcessor")
+    def test_processor_created_with_static_shape(
+        self,
+        mock_processor_cls: MagicMock,
+        mock_load_dataset: MagicMock,
+    ) -> None:
+        """Processor receives size matching pixel_values shape + static overrides."""
+        from datasets.features import Image
+
+        from winml.modelkit.datasets import DepthEstimationDataset
+
+        mock_processor = MagicMock()
+        mock_processor.return_value = {"pixel_values": MagicMock()}
+        mock_processor_cls.from_pretrained.return_value = mock_processor
+
+        mock_ds = MagicMock()
+        mock_ds.features = {"image": Image(), "depth_map": Image()}
+        mock_ds.__len__ = MagicMock(return_value=2)
+        mock_mapped = MagicMock()
+        mock_mapped.__len__ = MagicMock(return_value=2)
+        mock_mapped.with_format.return_value = mock_mapped
+        mock_ds.map.return_value = mock_mapped
+        mock_ds.select.return_value = mock_ds
+        mock_load_dataset.return_value = mock_ds
+
+        io_config = {"pixel_values": {"shape": [1, 3, 518, 518]}}
+
+        DepthEstimationDataset(
+            model_name="depth-anything/Depth-Anything-V2-Small-hf",
+            dataset_name="mock-dataset",
+            max_samples=2,
+            data_split="validation",
+            io_config=io_config,
+        )
+
+        call_kwargs = mock_processor_cls.from_pretrained.call_args[1]
+        assert call_kwargs.get("size") == {"height": 518, "width": 518}
+        assert call_kwargs.get("keep_aspect_ratio") is False
+        assert call_kwargs.get("do_pad") is False
+
+    @patch("winml.modelkit.datasets.image.load_dataset")
+    @patch("winml.modelkit.datasets.depth_estimation.AutoImageProcessor")
+    def test_uses_default_size_when_no_io_config(
+        self,
+        mock_processor_cls: MagicMock,
+        mock_load_dataset: MagicMock,
+    ) -> None:
+        """Falls back to DEFAULT_DEPTH_ESTIMATION_SIZE when io_config is absent."""
+        from datasets.features import Image
+
+        from winml.modelkit.datasets import (
+            DEFAULT_DEPTH_ESTIMATION_SIZE,
+            DepthEstimationDataset,
+        )
+
+        mock_processor = MagicMock()
+        mock_processor.return_value = {"pixel_values": MagicMock()}
+        mock_processor_cls.from_pretrained.return_value = mock_processor
+
+        mock_ds = MagicMock()
+        mock_ds.features = {"image": Image(), "depth_map": Image()}
+        mock_ds.__len__ = MagicMock(return_value=2)
+        mock_mapped = MagicMock()
+        mock_mapped.__len__ = MagicMock(return_value=2)
+        mock_mapped.with_format.return_value = mock_mapped
+        mock_ds.map.return_value = mock_mapped
+        mock_ds.select.return_value = mock_ds
+        mock_load_dataset.return_value = mock_ds
+
+        DepthEstimationDataset(
+            model_name="depth-anything/Depth-Anything-V2-Small-hf",
+            dataset_name="mock-dataset",
+            max_samples=2,
+            data_split="validation",
+        )
+
+        call_kwargs = mock_processor_cls.from_pretrained.call_args[1]
+        assert call_kwargs.get("size") == {
+            "height": DEFAULT_DEPTH_ESTIMATION_SIZE,
+            "width": DEFAULT_DEPTH_ESTIMATION_SIZE,
+        }
+
+
+class TestDepthEstimationDatasetColumnDetection:
+    """Tests for column detection without ClassLabel."""
+
+    @patch("winml.modelkit.datasets.image.load_dataset")
+    @patch("winml.modelkit.datasets.depth_estimation.AutoImageProcessor")
+    def test_detects_image_column_only(
+        self,
+        mock_processor_cls: MagicMock,
+        mock_load_dataset: MagicMock,
+    ) -> None:
+        """Detects the Image column used for calibration; no label column."""
+        from datasets.features import Image
+
+        from winml.modelkit.datasets import DepthEstimationDataset
+
+        mock_processor = MagicMock()
+        mock_processor.return_value = {"pixel_values": MagicMock()}
+        mock_processor_cls.from_pretrained.return_value = mock_processor
+
+        mock_ds = MagicMock()
+        mock_ds.features = {"image": Image(), "depth_map": Image()}
+        mock_ds.__len__ = MagicMock(return_value=1)
+        mock_mapped = MagicMock()
+        mock_mapped.__len__ = MagicMock(return_value=1)
+        mock_mapped.with_format.return_value = mock_mapped
+        mock_ds.map.return_value = mock_mapped
+        mock_ds.select.return_value = mock_ds
+        mock_load_dataset.return_value = mock_ds
+
+        ds = DepthEstimationDataset(
+            model_name="depth-anything/Depth-Anything-V2-Small-hf",
+            dataset_name="mock-dataset",
+            max_samples=1,
+            data_split="validation",
+        )
+
+        assert ds._image_col == "image"
+        assert ds.label_col == ""
+        assert ds.label_names == []
+
+    @patch("winml.modelkit.datasets.image.load_dataset")
+    @patch("winml.modelkit.datasets.depth_estimation.AutoImageProcessor")
+    def test_raises_when_no_image_column(
+        self,
+        mock_processor_cls: MagicMock,
+        mock_load_dataset: MagicMock,
+    ) -> None:
+        """Raises ValueError when the dataset has no Image column."""
+        from winml.modelkit.datasets import DepthEstimationDataset
+
+        mock_processor = MagicMock()
+        mock_processor_cls.from_pretrained.return_value = mock_processor
+
+        mock_ds = MagicMock()
+        mock_ds.features = {"text": MagicMock()}
+        mock_ds.select.return_value = mock_ds
+        mock_load_dataset.return_value = mock_ds
+
+        with pytest.raises(ValueError, match="No Image column"):
+            DepthEstimationDataset(
+                model_name="depth-anything/Depth-Anything-V2-Small-hf",
+                dataset_name="mock-dataset",
+                max_samples=1,
+                data_split="validation",
+            )
+
+
+class TestDepthEstimationDatasetCalibrationDefaults:
+    """Tests for the calibration path where dataset_name is not provided.
+
+    Quantization's universal_calib_dataset constructs DepthEstimationDataset
+    without dataset_name. The class must therefore apply task-specific
+    defaults (NYU + parquet revision) inside _initialize().
+    """
+
+    @patch("winml.modelkit.datasets.image.load_dataset")
+    @patch("winml.modelkit.datasets.depth_estimation.AutoImageProcessor")
+    def test_uses_nyu_with_parquet_revision_when_no_dataset_name(
+        self,
+        mock_processor_cls: MagicMock,
+        mock_load_dataset: MagicMock,
+    ) -> None:
+        """When dataset_name is None, load_dataset is called with NYU + parquet revision."""
+        from datasets.features import Image
+
+        from winml.modelkit.datasets import DepthEstimationDataset
+        from winml.modelkit.datasets.depth_estimation import (
+            DEFAULT_DEPTH_ESTIMATION_DATASET,
+            DEFAULT_DEPTH_ESTIMATION_REVISION,
+            DEFAULT_DEPTH_ESTIMATION_SPLIT,
+        )
+
+        mock_processor = MagicMock()
+        mock_processor.return_value = {"pixel_values": MagicMock()}
+        mock_processor_cls.from_pretrained.return_value = mock_processor
+
+        mock_ds = MagicMock()
+        mock_ds.features = {"image": Image(), "depth_map": Image()}
+        mock_ds.__len__ = MagicMock(return_value=1)
+        mock_mapped = MagicMock()
+        mock_mapped.__len__ = MagicMock(return_value=1)
+        mock_mapped.with_format.return_value = mock_mapped
+        mock_ds.map.return_value = mock_mapped
+        mock_ds.select.return_value = mock_ds
+        mock_load_dataset.return_value = mock_ds
+
+        DepthEstimationDataset(
+            model_name="depth-anything/Depth-Anything-V2-Small-hf",
+            max_samples=1,
+        )
+
+        # load_dataset must be called with NYU + parquet revision
+        args, kwargs = mock_load_dataset.call_args
+        assert args[0] == DEFAULT_DEPTH_ESTIMATION_DATASET
+        assert kwargs.get("split") == DEFAULT_DEPTH_ESTIMATION_SPLIT
+        assert kwargs.get("revision") == DEFAULT_DEPTH_ESTIMATION_REVISION
+
+    @patch("winml.modelkit.datasets.image.load_dataset")
+    @patch("winml.modelkit.datasets.depth_estimation.AutoImageProcessor")
+    def test_no_revision_when_user_specifies_dataset(
+        self,
+        mock_processor_cls: MagicMock,
+        mock_load_dataset: MagicMock,
+    ) -> None:
+        """When the user explicitly specifies a dataset, no revision is forced."""
+        from datasets.features import Image
+
+        from winml.modelkit.datasets import DepthEstimationDataset
+
+        mock_processor = MagicMock()
+        mock_processor.return_value = {"pixel_values": MagicMock()}
+        mock_processor_cls.from_pretrained.return_value = mock_processor
+
+        mock_ds = MagicMock()
+        mock_ds.features = {"image": Image(), "depth_map": Image()}
+        mock_ds.__len__ = MagicMock(return_value=1)
+        mock_mapped = MagicMock()
+        mock_mapped.__len__ = MagicMock(return_value=1)
+        mock_mapped.with_format.return_value = mock_mapped
+        mock_ds.map.return_value = mock_mapped
+        mock_ds.select.return_value = mock_ds
+        mock_load_dataset.return_value = mock_ds
+
+        DepthEstimationDataset(
+            model_name="depth-anything/Depth-Anything-V2-Small-hf",
+            dataset_name="custom/depth-dataset",
+            data_split="validation",
+            max_samples=1,
+        )
+
+        args, kwargs = mock_load_dataset.call_args
+        assert args[0] == "custom/depth-dataset"
+        assert kwargs.get("revision") is None
+
+
+class TestDepthEstimationDatasetExports:
+    """Tests for module exports and public API."""
+
+    def test_depth_estimation_dataset_in_all(self) -> None:
+        """DepthEstimationDataset should be in __all__."""
+        from winml.modelkit import datasets
+
+        assert "DepthEstimationDataset" in datasets.__all__
+
+    def test_default_size_constant_in_all(self) -> None:
+        """DEFAULT_DEPTH_ESTIMATION_SIZE should be in __all__."""
+        from winml.modelkit import datasets
+
+        assert "DEFAULT_DEPTH_ESTIMATION_SIZE" in datasets.__all__
+
+    def test_task_mapping_uses_depth_estimation_dataset(self) -> None:
+        """TASK_DATASET_MAPPING should map depth-estimation to DepthEstimationDataset."""
+        from winml.modelkit.datasets import TASK_DATASET_MAPPING, DepthEstimationDataset
+
+        assert TASK_DATASET_MAPPING["depth-estimation"] is DepthEstimationDataset
diff --git a/tests/integration/eval/test_depth_estimation.py b/tests/integration/eval/test_depth_estimation.py
new file mode 100644
index 000000000..9d711b9a4
--- /dev/null
+++ b/tests/integration/eval/test_depth_estimation.py
@@ -0,0 +1,60 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+"""End-to-end integration test for depth-estimation evaluation.
+
+Downloads small depth-estimation checkpoints and runs the evaluator
+against a few NYU validation samples. Skipped by default via
+``pytest -m "not slow"``.
+"""
+
+from __future__ import annotations
+
+import pytest
+
+from winml.modelkit.eval import WinMLDepthEstimationEvaluator
+from winml.modelkit.eval.config import DatasetConfig, WinMLEvaluationConfig
+
+
+# Representative checkpoints across the three families listed in issue #326.
+_MODEL_IDS = [
+    "depth-anything/Depth-Anything-V2-Small-hf",
+    "Intel/zoedepth-nyu-kitti",
+    "Intel/dpt-hybrid-midas",
+]
+
+
+@pytest.mark.slow
+@pytest.mark.network
+@pytest.mark.integration
+@pytest.mark.parametrize("model_id", _MODEL_IDS)
+def test_depth_estimation_end_to_end(model_id: str) -> None:
+    from transformers import AutoModelForDepthEstimation
+
+    model = AutoModelForDepthEstimation.from_pretrained(model_id)
+
+    config = WinMLEvaluationConfig(
+        model_id=model_id,
+        task="depth-estimation",
+        dataset=DatasetConfig(
+            path="sayakpaul/nyu_depth_v2",
+            split="validation",
+            samples=3,
+            shuffle=False,
+            columns_mapping={
+                "input_column": "image",
+                "depth_column": "depth_map",
+            },
+        ),
+    )
+
+    results = WinMLDepthEstimationEvaluator(config, model).compute()
+
+    assert "abs_rel" in results
+    assert "rmse" in results
+    assert "delta1" in results
+    assert results["num_images"] == 3
+    assert results["abs_rel"] >= 0.0
+    assert results["rmse"] >= 0.0
+    assert 0.0 <= results["delta1"] <= 1.0
diff --git a/tests/unit/datasets/test_image_streaming.py b/tests/unit/datasets/test_image_streaming.py
index 68d58ce7a..4e0315dab 100644
--- a/tests/unit/datasets/test_image_streaming.py
+++ b/tests/unit/datasets/test_image_streaming.py
@@ -85,7 +85,7 @@ def shuffle(self, *a, **kw):
             def select(self, *a, **kw):
                 return self
 
-        def fake_load(name, split, streaming):
+        def fake_load(name, split, streaming, **kwargs):
             captured["streaming"] = streaming
             return _FakeBulk()
 
@@ -108,7 +108,7 @@ def test_streaming_buffer_is_1000_for_class_diversity(
 
         fake = _FakeStreamingDataset()
 
-        def fake_load(name, split, streaming):
+        def fake_load(name, split, streaming, **kwargs):
             assert streaming is True
             return fake
 
diff --git a/tests/unit/eval/test_depth_estimation_evaluator.py b/tests/unit/eval/test_depth_estimation_evaluator.py
new file mode 100644
index 000000000..74966cc2c
--- /dev/null
+++ b/tests/unit/eval/test_depth_estimation_evaluator.py
@@ -0,0 +1,335 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""Unit tests for WinMLDepthEstimationEvaluator schema validation,
+column-mapping handling, and pipeline-output extraction."""
+
+import numpy as np
+import pytest
+import torch
+from PIL import Image as PILImage
+
+from winml.modelkit.eval import WinMLDepthEstimationEvaluator
+from winml.modelkit.eval.evaluate import _EVALUATOR_REGISTRY
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+class MockModel:
+    def __init__(self):
+        self.config = type("Cfg", (), {})()
+
+    @property
+    def io_config(self):
+        return {"input_shapes": [[1, 3, 224, 224]]}
+
+
+def make_evaluator(
+    input_col: str = "image",
+    depth_col: str = "depth_map",
+    align: str = "affine",
+    depth_kind: str = "depth",
+    min_depth: float = 1e-3,
+    max_depth=10.0,
+):
+    """Create evaluator without triggering __init__ data loading."""
+    ev = object.__new__(WinMLDepthEstimationEvaluator)
+    ev.model = MockModel()
+    ev._input_col = input_col
+    ev._depth_col = depth_col
+    ev._align = align
+    ev._depth_kind = depth_kind
+    ev._min_depth = float(min_depth)
+    ev._max_depth = None if max_depth is None else float(max_depth)
+    return ev
+
+
+def create_rgb_image(width: int, height: int):
+    return PILImage.new("RGB", (width, height), (128, 128, 128))
+
+
+# ---------------------------------------------------------------------------
+# Registry
+# ---------------------------------------------------------------------------
+
+
+class TestRegistry:
+    def test_depth_estimation_registered(self):
+        assert "depth-estimation" in _EVALUATOR_REGISTRY
+        spec = _EVALUATOR_REGISTRY["depth-estimation"]
+        assert spec == (
+            "winml.modelkit.eval.depth_estimation_evaluator:WinMLDepthEstimationEvaluator"
+        )
+
+    def test_depth_estimation_get_evaluator_class(self):
+        from winml.modelkit.eval import get_evaluator_class
+
+        assert get_evaluator_class("depth-estimation") is WinMLDepthEstimationEvaluator
+
+
+# ---------------------------------------------------------------------------
+# Task schema
+# ---------------------------------------------------------------------------
+
+
+class TestTaskSchema:
+    def test_schema_registered_in_task_schemas(self):
+        from winml.modelkit.utils.eval_utils import TASK_SCHEMAS
+
+        assert "depth-estimation" in TASK_SCHEMAS
+
+    def test_schema_columns_have_image_and_depth(self):
+        from winml.modelkit.utils.eval_utils import TASK_SCHEMAS
+
+        schema = TASK_SCHEMAS["depth-estimation"]
+        col_names = [c.name for c in schema.columns]
+        assert "input_column" in col_names
+        assert "depth_column" in col_names
+
+    def test_schema_column_defaults(self):
+        from winml.modelkit.utils.eval_utils import TASK_SCHEMAS
+
+        cols = {c.name: c for c in TASK_SCHEMAS["depth-estimation"].columns}
+        assert cols["input_column"].default == "image"
+        assert cols["depth_column"].default == "depth_map"
+
+    def test_schema_exposes_align_and_depth_kind_params(self):
+        from winml.modelkit.utils.eval_utils import TASK_SCHEMAS
+
+        params = {p.name: p for p in TASK_SCHEMAS["depth-estimation"].params}
+        assert "align" in params
+        assert "depth_kind" in params
+        assert params["align"].default == "affine"
+        assert params["depth_kind"].default == "depth"
+
+
+# ---------------------------------------------------------------------------
+# Pipeline output extraction
+# ---------------------------------------------------------------------------
+
+
+class TestExtractPredictedDepth:
+    def test_torch_tensor_output(self):
+        out = {"predicted_depth": torch.tensor([[1.0, 2.0], [3.0, 4.0]])}
+        arr = WinMLDepthEstimationEvaluator._extract_predicted_depth(out)
+        assert isinstance(arr, np.ndarray)
+        assert arr.shape == (2, 2)
+        assert arr.dtype == np.float32
+
+    def test_singleton_dim_squeezed(self):
+        out = {"predicted_depth": torch.zeros((1, 1, 4, 5))}
+        arr = WinMLDepthEstimationEvaluator._extract_predicted_depth(out)
+        assert arr.shape == (4, 5)
+
+    def test_numpy_predicted_depth(self):
+        out = {"predicted_depth": np.ones((3, 3), dtype=np.float64)}
+        arr = WinMLDepthEstimationEvaluator._extract_predicted_depth(out)
+        assert arr.shape == (3, 3)
+        assert arr.dtype == np.float32
+
+    def test_pil_depth_only_raises(self):
+        # The pipeline's "depth" key is an 8-bit grayscale visualization,
+        # not the numeric tensor. Don't silently use it as a metric input.
+        depth_img = PILImage.new("L", (4, 3), 0)
+        with pytest.raises(ValueError, match="missing"):
+            WinMLDepthEstimationEvaluator._extract_predicted_depth({"depth": depth_img})
+
+    def test_missing_keys_raise(self):
+        with pytest.raises(ValueError, match="missing"):
+            WinMLDepthEstimationEvaluator._extract_predicted_depth({"foo": 1})
+
+    def test_non_dict_raises(self):
+        with pytest.raises(TypeError, match="dict"):
+            WinMLDepthEstimationEvaluator._extract_predicted_depth([1, 2, 3])
+
+
+# ---------------------------------------------------------------------------
+# prepare_pipeline — image processor alignment to ONNX input shape
+# ---------------------------------------------------------------------------
+
+
+class _FakeImageProcessor:
+    """Stand-in for HF AutoImageProcessor with attribute-based knobs."""
+
+    def __init__(
+        self,
+        size=None,
+        keep_aspect_ratio: bool | None = None,
+        do_pad: bool | None = None,
+    ):
+        self.size = size if size is not None else {"height": 0, "width": 0}
+        if keep_aspect_ratio is not None:
+            self.keep_aspect_ratio = keep_aspect_ratio
+        if do_pad is not None:
+            self.do_pad = do_pad
+
+
+class _FakePreparedPipe:
+    """Stand-in for the parent ``prepare_pipeline()`` return value."""
+
+    def __init__(self, image_processor):
+        self.image_processor = image_processor
+
+
+class TestPreparePipeline:
+    """Verify processor is forced to the static ONNX shape exactly."""
+
+    @staticmethod
+    def _patch_super_pipeline(monkeypatch, processor):
+        """Make ``WinMLEvaluator.prepare_pipeline`` return a pipe with `processor`."""
+        from winml.modelkit.eval import base_evaluator
+
+        monkeypatch.setattr(
+            base_evaluator.WinMLEvaluator,
+            "prepare_pipeline",
+            lambda self: _FakePreparedPipe(processor),
+        )
+
+    def test_sets_size_from_io_config(self, monkeypatch):
+        """ONNX (h, w) is written into ``image_processor.size``."""
+        proc = _FakeImageProcessor(size={"height": 0, "width": 0})
+        self._patch_super_pipeline(monkeypatch, proc)
+
+        ev = make_evaluator()
+        ev.model = type("M", (), {"io_config": {"input_shapes": [[1, 3, 518, 518]]}})()
+
+        pipe = ev.prepare_pipeline()
+        assert pipe.image_processor.size == {"height": 518, "width": 518}
+
+    def test_disables_keep_aspect_ratio_when_present(self, monkeypatch):
+        """``keep_aspect_ratio`` is turned off so the resize hits the exact target."""
+        proc = _FakeImageProcessor(
+            size={"height": 0, "width": 0},
+            keep_aspect_ratio=True,
+        )
+        self._patch_super_pipeline(monkeypatch, proc)
+
+        ev = make_evaluator()
+        ev.model = type("M", (), {"io_config": {"input_shapes": [[1, 3, 518, 518]]}})()
+
+        pipe = ev.prepare_pipeline()
+        assert pipe.image_processor.keep_aspect_ratio is False
+
+    def test_disables_do_pad_when_present(self, monkeypatch):
+        """``do_pad`` is turned off so the processor doesn't pad past target."""
+        proc = _FakeImageProcessor(
+            size={"height": 0, "width": 0},
+            do_pad=True,
+        )
+        self._patch_super_pipeline(monkeypatch, proc)
+
+        ev = make_evaluator()
+        ev.model = type("M", (), {"io_config": {"input_shapes": [[1, 3, 384, 384]]}})()
+
+        pipe = ev.prepare_pipeline()
+        assert pipe.image_processor.do_pad is False
+
+    def test_no_attrs_does_not_raise(self, monkeypatch):
+        """Processors without keep_aspect_ratio / do_pad work unchanged."""
+        proc = _FakeImageProcessor(size={"height": 0, "width": 0})
+        # No keep_aspect_ratio, no do_pad attributes.
+        assert not hasattr(proc, "keep_aspect_ratio")
+        assert not hasattr(proc, "do_pad")
+        self._patch_super_pipeline(monkeypatch, proc)
+
+        ev = make_evaluator()
+        ev.model = type("M", (), {"io_config": {"input_shapes": [[1, 3, 224, 224]]}})()
+
+        pipe = ev.prepare_pipeline()
+        assert pipe.image_processor.size == {"height": 224, "width": 224}
+        assert not hasattr(pipe.image_processor, "keep_aspect_ratio")
+        assert not hasattr(pipe.image_processor, "do_pad")
+
+    def test_missing_io_config_is_noop(self, monkeypatch):
+        """If model lacks io_config, processor is not modified."""
+        proc = _FakeImageProcessor(size={"height": 0, "width": 0}, keep_aspect_ratio=True)
+        self._patch_super_pipeline(monkeypatch, proc)
+
+        ev = make_evaluator()
+        ev.model = type("M", (), {})()  # no io_config
+
+        pipe = ev.prepare_pipeline()
+        # Untouched
+        assert pipe.image_processor.size == {"height": 0, "width": 0}
+        assert pipe.image_processor.keep_aspect_ratio is True
+
+
+# ---------------------------------------------------------------------------
+# compute() integration with mocked pipeline
+# ---------------------------------------------------------------------------
+
+
+class _FakePipe:
+    """Minimal pipe stand-in that returns a perfect prediction."""
+
+    def __init__(self):
+        self.calls = 0
+
+    def __call__(self, image):
+        self.calls += 1
+        # Use the image's size to produce a same-shape prediction.
+        h = getattr(image, "height", 4)
+        w = getattr(image, "width", 4)
+        return {"predicted_depth": torch.full((h, w), 5.0)}
+
+
+class TestCompute:
+    def test_compute_perfect_prediction(self):
+        ev = make_evaluator(align="none", min_depth=0.0, max_depth=None)
+        ev.pipe = _FakePipe()
+
+        img = create_rgb_image(4, 4)
+        depth = np.full((4, 4), 5.0, dtype=np.float32)
+        ev.data = [{"image": img, "depth_map": depth}]
+
+        result = ev.compute()
+        assert result["abs_rel"] == pytest.approx(0.0)
+        assert result["rmse"] == pytest.approx(0.0)
+        assert result["delta1"] == pytest.approx(1.0)
+        assert result["num_images"] == 1
+
+    def test_compute_skips_missing_samples(self):
+        ev = make_evaluator(align="none", min_depth=0.0, max_depth=None)
+        ev.pipe = _FakePipe()
+
+        img = create_rgb_image(4, 4)
+        depth = np.full((4, 4), 5.0, dtype=np.float32)
+        ev.data = [
+            {"image": None, "depth_map": depth},  # skipped
+            {"image": img, "depth_map": None},  # skipped
+            {"image": img, "depth_map": depth},  # valid
+        ]
+        result = ev.compute()
+        assert result["num_images"] == 1
+
+    def test_compute_raises_on_pred_gt_shape_mismatch(self):
+        """Pred/GT shape mismatch propagates as a clear error from DepthMetric."""
+
+        class MismatchPipe:
+            def __call__(self, image):
+                return {"predicted_depth": torch.full((8, 8), 5.0)}
+
+        ev = make_evaluator(align="none", min_depth=0.0, max_depth=None)
+        ev.pipe = MismatchPipe()
+        img = create_rgb_image(4, 4)
+        depth = np.full((4, 4), 5.0, dtype=np.float32)
+        ev.data = [{"image": img, "depth_map": depth}]
+
+        with pytest.raises(ValueError, match="share shape"):
+            ev.compute()
+
+    def test_compute_rejects_tensor_image_input(self):
+        """Tensor-formatted image columns fail with a task-specific message."""
+        ev = make_evaluator(align="none", min_depth=0.0, max_depth=None)
+        ev.pipe = _FakePipe()
+        depth = np.full((4, 4), 5.0, dtype=np.float32)
+        ev.data = [{"image": torch.zeros((3, 4, 4)), "depth_map": depth}]
+
+        with pytest.raises(TypeError, match="must yield PIL images"):
+            ev.compute()
+        assert ev.pipe.calls == 0
diff --git a/tests/unit/eval/test_depth_metric.py b/tests/unit/eval/test_depth_metric.py
new file mode 100644
index 000000000..061bf5850
--- /dev/null
+++ b/tests/unit/eval/test_depth_metric.py
@@ -0,0 +1,346 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""Unit tests for DepthMetric (AbsRel, RMSE, delta1)."""
+
+import numpy as np
+import pytest
+import torch
+
+from winml.modelkit.eval import DepthMetric
+
+
+# ---------------------------------------------------------------------------
+# Construction
+# ---------------------------------------------------------------------------
+
+
+class TestConstruction:
+    def test_default_construction(self):
+        m = DepthMetric()
+        # Reset state available for compute after at least one update.
+        with pytest.raises(ValueError, match="no valid pixels"):
+            m.compute()
+
+    def test_invalid_align_raises(self):
+        with pytest.raises(ValueError, match="align must be"):
+            DepthMetric(align="mean")  # type: ignore[arg-type]
+
+    def test_invalid_delta_threshold_raises(self):
+        with pytest.raises(ValueError, match="delta_threshold"):
+            DepthMetric(delta_threshold=1.0)
+
+
+# ---------------------------------------------------------------------------
+# Perfect & known-value cases
+# ---------------------------------------------------------------------------
+
+
+class TestKnownValues:
+    def test_perfect_prediction_align_none(self):
+        gt = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
+        pred = gt.copy()
+        m = DepthMetric(align="none", min_depth=0.0, max_depth=None)
+        m.update(pred, gt)
+        result = m.compute()
+        assert result["abs_rel"] == pytest.approx(0.0)
+        assert result["rmse"] == pytest.approx(0.0)
+        assert result["delta1"] == pytest.approx(1.0)
+        assert result["num_images"] == 1
+        assert result["num_valid_pixels"] == 4
+
+    def test_perfect_prediction_align_median(self):
+        """Scaled-by-constant prediction is perfect after median alignment."""
+        gt = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
+        pred = gt * 7.0  # arbitrary scale
+        m = DepthMetric(align="median", min_depth=0.0, max_depth=None)
+        m.update(pred, gt)
+        result = m.compute()
+        assert result["abs_rel"] == pytest.approx(0.0, abs=1e-6)
+        assert result["rmse"] == pytest.approx(0.0, abs=1e-6)
+        assert result["delta1"] == pytest.approx(1.0)
+
+    def test_known_abs_rel(self):
+        """AbsRel = mean(|pred-gt|/gt) = mean({1, 0.5}) = 0.75."""
+        gt = np.array([[1.0, 2.0]], dtype=np.float32)
+        pred = np.array([[2.0, 3.0]], dtype=np.float32)
+        m = DepthMetric(align="none", min_depth=0.0, max_depth=None)
+        m.update(pred, gt)
+        result = m.compute()
+        assert result["abs_rel"] == pytest.approx(0.75)
+
+    def test_known_rmse(self):
+        """RMSE = sqrt(mean((pred-gt)^2)) = sqrt(mean({1, 1})) = 1."""
+        gt = np.array([[1.0, 2.0]], dtype=np.float32)
+        pred = np.array([[2.0, 3.0]], dtype=np.float32)
+        m = DepthMetric(align="none", min_depth=0.0, max_depth=None)
+        m.update(pred, gt)
+        result = m.compute()
+        assert result["rmse"] == pytest.approx(1.0)
+
+    def test_known_delta1(self):
+        """ratios = {2, 1.5}; both >= 1.25, so delta1 = 0."""
+        gt = np.array([[1.0, 2.0]], dtype=np.float32)
+        pred = np.array([[2.0, 3.0]], dtype=np.float32)
+        m = DepthMetric(align="none", min_depth=0.0, max_depth=None)
+        m.update(pred, gt)
+        result = m.compute()
+        assert result["delta1"] == pytest.approx(0.0)
+
+    def test_delta1_partial_within_threshold(self):
+        """Two pixels: one ratio 1.1 (< 1.25), one 2.0 (>= 1.25). delta1 = 0.5."""
+        gt = np.array([1.0, 1.0], dtype=np.float32).reshape(1, 2)
+        pred = np.array([1.1, 2.0], dtype=np.float32).reshape(1, 2)
+        m = DepthMetric(align="none", min_depth=0.0, max_depth=None)
+        m.update(pred, gt)
+        result = m.compute()
+        assert result["delta1"] == pytest.approx(0.5)
+
+
+# ---------------------------------------------------------------------------
+# Valid mask
+# ---------------------------------------------------------------------------
+
+
+class TestValidMask:
+    def test_zero_gt_pixels_excluded(self):
+        gt = np.array([[0.0, 1.0]], dtype=np.float32)
+        pred = np.array([[5.0, 1.0]], dtype=np.float32)
+        m = DepthMetric(align="none", min_depth=1e-3, max_depth=None)
+        m.update(pred, gt)
+        result = m.compute()
+        # Only the second pixel counted.
+        assert result["abs_rel"] == pytest.approx(0.0)
+        assert result["num_valid_pixels"] == 1
+
+    def test_nan_inf_excluded(self):
+        gt = np.array([[np.nan, np.inf, 2.0]], dtype=np.float32)
+        pred = np.array([[1.0, 1.0, 2.0]], dtype=np.float32)
+        m = DepthMetric(align="none", min_depth=0.0, max_depth=None)
+        m.update(pred, gt)
+        result = m.compute()
+        assert result["num_valid_pixels"] == 1
+        assert result["abs_rel"] == pytest.approx(0.0)
+
+    def test_max_depth_clip(self):
+        gt = np.array([[5.0, 100.0]], dtype=np.float32)
+        pred = np.array([[5.0, 5.0]], dtype=np.float32)
+        m = DepthMetric(align="none", min_depth=0.0, max_depth=10.0)
+        m.update(pred, gt)
+        result = m.compute()
+        assert result["num_valid_pixels"] == 1
+
+    def test_max_depth_none_keeps_all(self):
+        gt = np.array([[5.0, 100.0]], dtype=np.float32)
+        pred = gt.copy()
+        m = DepthMetric(align="none", min_depth=0.0, max_depth=None)
+        m.update(pred, gt)
+        result = m.compute()
+        assert result["num_valid_pixels"] == 2
+
+    def test_negative_predictions_excluded(self):
+        gt = np.array([[1.0, 1.0]], dtype=np.float32)
+        pred = np.array([[-1.0, 1.0]], dtype=np.float32)
+        m = DepthMetric(align="none", min_depth=0.0, max_depth=None)
+        m.update(pred, gt)
+        result = m.compute()
+        assert result["num_valid_pixels"] == 1
+
+    def test_all_invalid_image_counted_no_pixels(self):
+        gt = np.zeros((2, 2), dtype=np.float32)
+        pred = np.ones((2, 2), dtype=np.float32)
+        m = DepthMetric(align="none", min_depth=1e-3, max_depth=None)
+        m.update(pred, gt)
+        with pytest.raises(ValueError, match="no valid pixels"):
+            m.compute()
+
+
+# ---------------------------------------------------------------------------
+# Multi-image accumulation
+# ---------------------------------------------------------------------------
+
+
+class TestAccumulation:
+    def test_multi_image_pixel_weighted(self):
+        gt1 = np.array([[1.0, 1.0]], dtype=np.float32)
+        pred1 = gt1.copy()  # perfect
+        gt2 = np.array([[1.0]], dtype=np.float32)
+        pred2 = np.array([[2.0]], dtype=np.float32)  # error
+        m = DepthMetric(align="none", min_depth=0.0, max_depth=None)
+        m.update(pred1, gt1)
+        m.update(pred2, gt2)
+        result = m.compute()
+        # AbsRel = (0 + 0 + 1) / 3 = 1/3
+        assert result["abs_rel"] == pytest.approx(1.0 / 3.0)
+        assert result["num_images"] == 2
+        assert result["num_valid_pixels"] == 3
+
+    def test_reset_clears_state(self):
+        gt = np.array([[1.0]], dtype=np.float32)
+        pred = np.array([[2.0]], dtype=np.float32)
+        m = DepthMetric(align="none", min_depth=0.0, max_depth=None)
+        m.update(pred, gt)
+        m.reset()
+        with pytest.raises(ValueError, match="no valid pixels"):
+            m.compute()
+
+
+# ---------------------------------------------------------------------------
+# Input types
+# ---------------------------------------------------------------------------
+
+
+class TestInputTypes:
+    def test_torch_tensor_input(self):
+        gt = np.array([[1.0, 2.0]], dtype=np.float32)
+        pred_t = torch.tensor([[1.0, 2.0]])
+        m = DepthMetric(align="none", min_depth=0.0, max_depth=None)
+        m.update(pred_t, gt)
+        result = m.compute()
+        assert result["rmse"] == pytest.approx(0.0)
+
+    def test_extra_singleton_dims_squeezed(self):
+        gt = np.ones((1, 2, 2), dtype=np.float32)
+        pred = np.ones((1, 1, 2, 2), dtype=np.float32)
+        m = DepthMetric(align="none", min_depth=0.0, max_depth=None)
+        m.update(pred, gt)
+        result = m.compute()
+        assert result["num_valid_pixels"] == 4
+
+    def test_shape_mismatch_raises(self):
+        gt = np.ones((2, 2), dtype=np.float32)
+        pred = np.ones((3, 3), dtype=np.float32)
+        m = DepthMetric()
+        with pytest.raises(ValueError, match="shape"):
+            m.update(pred, gt)
+
+
+# ---------------------------------------------------------------------------
+# Median alignment
+# ---------------------------------------------------------------------------
+
+
+class TestMedianAlignment:
+    def test_median_alignment_recovers_perfect(self):
+        rng = np.random.default_rng(0)
+        gt = rng.uniform(1.0, 10.0, size=(8, 8)).astype(np.float32)
+        pred = gt * 0.25  # uniform scale
+        m = DepthMetric(align="median", min_depth=0.0, max_depth=None)
+        m.update(pred, gt)
+        result = m.compute()
+        assert result["abs_rel"] == pytest.approx(0.0, abs=1e-5)
+
+    def test_align_none_keeps_scale_error(self):
+        gt = np.array([[1.0, 2.0]], dtype=np.float32)
+        pred = gt * 0.5
+        m = DepthMetric(align="none", min_depth=0.0, max_depth=None)
+        m.update(pred, gt)
+        result = m.compute()
+        assert result["abs_rel"] > 0.4
+
+
+# ---------------------------------------------------------------------------
+# Affine alignment (scale + shift)
+# ---------------------------------------------------------------------------
+
+
+class TestAffineAlignment:
+    def test_affine_recovers_perfect_under_scale(self):
+        rng = np.random.default_rng(1)
+        gt = rng.uniform(1.0, 10.0, size=(8, 8)).astype(np.float32)
+        pred = gt * 0.25  # scale only
+        m = DepthMetric(align="affine", min_depth=0.0, max_depth=None)
+        m.update(pred, gt)
+        result = m.compute()
+        assert result["abs_rel"] == pytest.approx(0.0, abs=1e-5)
+        assert result["delta1"] == pytest.approx(1.0)
+
+    def test_affine_recovers_perfect_under_scale_and_shift(self):
+        """Affine alignment must recover pred = s * gt + t exactly."""
+        rng = np.random.default_rng(2)
+        gt = rng.uniform(1.0, 10.0, size=(8, 8)).astype(np.float32)
+        pred = gt * 0.3 + 2.5
+        m = DepthMetric(align="affine", min_depth=0.0, max_depth=None)
+        m.update(pred, gt)
+        result = m.compute()
+        assert result["abs_rel"] == pytest.approx(0.0, abs=1e-5)
+        assert result["rmse"] == pytest.approx(0.0, abs=1e-5)
+
+    def test_affine_beats_median_when_shift_present(self):
+        rng = np.random.default_rng(3)
+        gt = rng.uniform(1.0, 10.0, size=(16, 16)).astype(np.float32)
+        pred = gt * 0.5 + 4.0
+        m_aff = DepthMetric(align="affine", min_depth=0.0, max_depth=None)
+        m_med = DepthMetric(align="median", min_depth=0.0, max_depth=None)
+        m_aff.update(pred, gt)
+        m_med.update(pred, gt)
+        assert m_aff.compute()["abs_rel"] < m_med.compute()["abs_rel"]
+
+    def test_affine_is_default(self):
+        m = DepthMetric()
+        assert m._align == "affine"
+
+    def test_invalid_align_lists_all_options(self):
+        with pytest.raises(ValueError, match="align must be one of"):
+            DepthMetric(align="bogus")
+
+
+# ---------------------------------------------------------------------------
+# Disparity prediction
+# ---------------------------------------------------------------------------
+
+
+class TestDisparity:
+    def test_disparity_inverts_then_aligns(self):
+        """pred = k / gt (disparity); after invert + affine should be perfect."""
+        rng = np.random.default_rng(4)
+        gt = rng.uniform(1.0, 10.0, size=(8, 8)).astype(np.float32)
+        disparity = 2.0 / gt  # scale-free disparity
+        m = DepthMetric(
+            align="affine",
+            depth_kind="disparity",
+            min_depth=0.0,
+            max_depth=None,
+        )
+        m.update(disparity, gt)
+        result = m.compute()
+        assert result["abs_rel"] == pytest.approx(0.0, abs=1e-5)
+
+    def test_disparity_with_align_none_is_bad(self):
+        """Sanity: forgetting to invert disparity yields a poor score."""
+        gt = np.array([[1.0, 2.0, 5.0]], dtype=np.float32)
+        disparity = 1.0 / gt
+        m = DepthMetric(
+            align="none",
+            depth_kind="depth",  # WRONG kind on purpose
+            min_depth=0.0,
+            max_depth=None,
+        )
+        m.update(disparity, gt)
+        assert m.compute()["abs_rel"] > 0.5
+
+    def test_disparity_default_is_depth(self):
+        m = DepthMetric()
+        assert m._depth_kind == "depth"
+
+    def test_invalid_depth_kind_raises(self):
+        with pytest.raises(ValueError, match="depth_kind must be one of"):
+            DepthMetric(depth_kind="invalid")
+
+    def test_disparity_zero_pixels_excluded(self):
+        """Zero disparity yields infinite depth — must be excluded as invalid."""
+        gt = np.array([[1.0, 2.0]], dtype=np.float32)
+        disparity = np.array([[0.0, 0.5]], dtype=np.float32)
+        m = DepthMetric(
+            align="none",
+            depth_kind="disparity",
+            min_depth=0.0,
+            max_depth=None,
+        )
+        m.update(disparity, gt)
+        # 1/0.5 = 2.0 matches gt[1] = 2.0; first pixel must be dropped.
+        result = m.compute()
+        assert result["num_valid_pixels"] == 1
+        assert result["abs_rel"] == pytest.approx(0.0)
diff --git a/tests/unit/eval/test_eval.py b/tests/unit/eval/test_eval.py
index c7ca4af6d..568df6dbb 100644
--- a/tests/unit/eval/test_eval.py
+++ b/tests/unit/eval/test_eval.py
@@ -36,6 +36,25 @@ def test_config_roundtrip(self):
         assert restored.dataset.path == config.dataset.path
         assert restored.dataset.columns_mapping == config.dataset.columns_mapping
 
+    def test_config_roundtrip_preserves_revision(self):
+        """DatasetConfig.revision survives to_dict/from_dict roundtrip."""
+        config = WinMLEvaluationConfig(
+            model_id="test/model",
+            task="depth-estimation",
+            dataset=DatasetConfig(
+                path="sayakpaul/nyu_depth_v2",
+                revision="refs/convert/parquet",
+            ),
+        )
+        restored = WinMLEvaluationConfig.from_dict(config.to_dict())
+        assert restored.dataset.revision == "refs/convert/parquet"
+
+    def test_dataset_config_revision_default_is_none(self):
+        """Revision defaults to None when not specified."""
+        ds = DatasetConfig(path="some-dataset")
+        assert ds.revision is None
+        assert "revision" not in ds.to_dict()
+
     def test_eval_result_to_dict(self):
         config = WinMLEvaluationConfig(
             model_id="test/model",
@@ -371,6 +390,78 @@ def test_samples_capped_when_exceeds_dataset_size(
         # config.dataset.samples should NOT be mutated
         assert ev.config.dataset.samples == 100
 
+    @patch("evaluate.evaluator")
+    @patch("transformers.pipeline")
+    @patch("datasets.load_dataset")
+    def test_revision_passed_to_load_dataset(
+        self,
+        mock_load_ds,
+        mock_pipeline,
+        mock_hf_eval,
+    ):
+        """DatasetConfig.revision is forwarded to load_dataset()."""
+        from winml.modelkit.eval import WinMLEvaluator
+
+        mock_ds = MagicMock()
+        mock_ds.__len__ = lambda self: 10
+        mock_ds.shuffle.return_value = mock_ds
+        mock_ds.select.return_value = mock_ds
+        mock_load_ds.return_value = mock_ds
+        mock_pipeline.return_value = MagicMock()
+        mock_hf_eval.return_value = MagicMock(compute=MagicMock(return_value={}))
+
+        model = MagicMock()
+        model.config.label2id = None
+
+        config = WinMLEvaluationConfig(
+            model_id="test/model",
+            task="image-classification",
+            dataset=DatasetConfig(
+                path="some/dataset",
+                samples=5,
+                revision="refs/convert/parquet",
+            ),
+        )
+
+        WinMLEvaluator(config, model)
+
+        mock_load_ds.assert_called_once()
+        assert mock_load_ds.call_args.kwargs["revision"] == "refs/convert/parquet"
+
+    @patch("evaluate.evaluator")
+    @patch("transformers.pipeline")
+    @patch("datasets.load_dataset")
+    def test_revision_defaults_to_none(
+        self,
+        mock_load_ds,
+        mock_pipeline,
+        mock_hf_eval,
+    ):
+        """When revision is unset, load_dataset receives revision=None."""
+        from winml.modelkit.eval import WinMLEvaluator
+
+        mock_ds = MagicMock()
+        mock_ds.__len__ = lambda self: 10
+        mock_ds.shuffle.return_value = mock_ds
+        mock_ds.select.return_value = mock_ds
+        mock_load_ds.return_value = mock_ds
+        mock_pipeline.return_value = MagicMock()
+        mock_hf_eval.return_value = MagicMock(compute=MagicMock(return_value={}))
+
+        model = MagicMock()
+        model.config.label2id = None
+
+        config = WinMLEvaluationConfig(
+            model_id="test/model",
+            task="image-classification",
+            dataset=DatasetConfig(path="some/dataset", samples=5),
+        )
+
+        WinMLEvaluator(config, model)
+
+        mock_load_ds.assert_called_once()
+        assert mock_load_ds.call_args.kwargs["revision"] is None
+
     @patch("evaluate.evaluator")
     @patch("transformers.pipeline")
     @patch("datasets.load_dataset")
diff --git a/tests/unit/models/depth_anything/test_onnx_config.py b/tests/unit/models/depth_anything/test_onnx_config.py
new file mode 100644
index 000000000..e95ddbba7
--- /dev/null
+++ b/tests/unit/models/depth_anything/test_onnx_config.py
@@ -0,0 +1,151 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""Tests for Depth-Anything ONNX export config and input generator.
+
+Depth-Anything's ``backbone_config.image_size`` is the DINOv2 backbone
+pretraining resolution (518). When users provide a non-default shape via
+``--shape-config`` (e.g. to align ONNX input shape with a non-square dataset
+after preprocessing), the explicit ``height`` / ``width`` kwargs must take
+precedence. The ``_DepthAnythingVisionInputGenerator`` enforces this priority,
+mirroring the pattern used by ``_SegformerVisionInputGenerator``.
+"""
+
+from __future__ import annotations
+
+import pytest
+from optimum.utils import NormalizedConfig
+from transformers import DepthAnythingConfig
+
+from winml.modelkit.export.io import _get_onnx_config  # Testing internal implementation
+from winml.modelkit.models.hf.depth_anything import (
+    DepthAnythingIOConfig,
+    _DepthAnythingVisionInputGenerator,
+)
+
+
+# =============================================================================
+# Test Constants
+# =============================================================================
+
+BATCH_SIZE = 1
+BACKBONE_IMAGE_SIZE = 518  # DINOv2 default in DepthAnythingConfig
+
+
+# =============================================================================
+# Fixtures
+# =============================================================================
+
+
+@pytest.fixture(scope="module")
+def depth_anything_config():
+    """DepthAnythingConfig with default DINOv2 backbone (image_size=518)."""
+    return DepthAnythingConfig()
+
+
+@pytest.fixture()
+def normalized_config(depth_anything_config):
+    """NormalizedConfig wrapping DepthAnythingConfig.
+
+    Resolves image_size and num_channels from the nested backbone_config
+    via dotted-path access — the same pattern used by DepthAnythingIOConfig.
+    """
+    nc = NormalizedConfig.with_args(
+        image_size="backbone_config.image_size",
+        num_channels="backbone_config.num_channels",
+        allow_new=True,
+    )
+    return nc(depth_anything_config)
+
+
+# =============================================================================
+# _DepthAnythingVisionInputGenerator — Override priority tests
+# =============================================================================
+
+
+class TestDepthAnythingVisionInputGenerator:
+    """Tests for ``_DepthAnythingVisionInputGenerator`` override priority."""
+
+    def test_explicit_height_width_overrides_backbone_image_size(
+        self, normalized_config
+    ):
+        """User-provided height/width override backbone_config.image_size."""
+        gen = _DepthAnythingVisionInputGenerator(
+            task="depth-estimation",
+            normalized_config=normalized_config,
+            height=518,
+            width=686,
+        )
+
+        assert gen.height == 518
+        assert gen.width == 686
+        assert gen.image_size == (518, 686)
+
+    def test_default_kwargs_keep_backbone_image_size(self, normalized_config):
+        """Without explicit height/width, backbone_config.image_size is used."""
+        gen = _DepthAnythingVisionInputGenerator(
+            task="depth-estimation",
+            normalized_config=normalized_config,
+        )
+
+        assert gen.height == BACKBONE_IMAGE_SIZE
+        assert gen.width == BACKBONE_IMAGE_SIZE
+
+    def test_generated_tensor_uses_override_size(self, normalized_config):
+        """``generate()`` produces a tensor with the user-provided shape."""
+        gen = _DepthAnythingVisionInputGenerator(
+            task="depth-estimation",
+            normalized_config=normalized_config,
+            batch_size=BATCH_SIZE,
+            height=518,
+            width=686,
+        )
+
+        tensor = gen.generate("pixel_values", framework="pt")
+        assert tuple(tensor.shape) == (BATCH_SIZE, 3, 518, 686)
+
+
+# =============================================================================
+# DepthAnythingIOConfig — Registration tests
+# =============================================================================
+
+
+class TestDepthAnythingIOConfig:
+    """Tests for DepthAnythingIOConfig ONNX export registration."""
+
+    def test_onnx_config_registered(self, depth_anything_config):
+        """DepthAnythingIOConfig is registered for depth-estimation."""
+        config = _get_onnx_config(
+            depth_anything_config.model_type,
+            "depth-estimation",
+            depth_anything_config,
+        )
+        assert isinstance(config, DepthAnythingIOConfig)
+
+    def test_inputs_contain_pixel_values(self, depth_anything_config):
+        """Inputs spec includes pixel_values with correct dynamic axes."""
+        config = _get_onnx_config(
+            depth_anything_config.model_type,
+            "depth-estimation",
+            depth_anything_config,
+        )
+        assert "pixel_values" in config.inputs
+        assert config.inputs["pixel_values"][0] == "batch_size"
+
+    def test_outputs_contain_predicted_depth(self, depth_anything_config):
+        """Outputs spec includes predicted_depth."""
+        config = _get_onnx_config(
+            depth_anything_config.model_type,
+            "depth-estimation",
+            depth_anything_config,
+        )
+        assert "predicted_depth" in config.outputs
+
+    def test_uses_overridable_generator(self):
+        """The IO config registers the override-aware generator class."""
+        assert (
+            _DepthAnythingVisionInputGenerator
+            in DepthAnythingIOConfig.DUMMY_INPUT_GENERATOR_CLASSES
+        )
diff --git a/tests/unit/models/winml/test_depth_estimation.py b/tests/unit/models/winml/test_depth_estimation.py
new file mode 100644
index 000000000..c64677376
--- /dev/null
+++ b/tests/unit/models/winml/test_depth_estimation.py
@@ -0,0 +1,89 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""Unit tests for ``WinMLModelForDepthEstimation``.
+
+Verifies the depth-estimation forward pass wraps raw ONNX outputs in
+a ``DepthEstimatorOutput`` so HF's ``DepthEstimationPipeline`` and
+``image_processor.post_process_depth_estimation`` can use attribute access.
+"""
+
+from __future__ import annotations
+
+import torch
+from transformers.modeling_outputs import DepthEstimatorOutput
+
+from winml.modelkit.models.winml import (
+    TASK_TO_WINML_CLASS,
+    WinMLModelForDepthEstimation,
+    get_winml_class,
+)
+
+
+# ---------------------------------------------------------------------------
+# Registry mapping
+# ---------------------------------------------------------------------------
+
+
+class TestRegistry:
+    def test_task_mapped(self):
+        assert (
+            TASK_TO_WINML_CLASS["depth-estimation"]
+            == "WinMLModelForDepthEstimation"
+        )
+
+    def test_get_winml_class_returns_depth_estimation(self):
+        cls = get_winml_class(model_type="depth_anything", task="depth-estimation")
+        assert cls is WinMLModelForDepthEstimation
+
+
+# ---------------------------------------------------------------------------
+# forward() output wrapping
+# ---------------------------------------------------------------------------
+
+
+def _make_model(onnx_outputs: dict[str, torch.Tensor]) -> WinMLModelForDepthEstimation:
+    """Construct a ``WinMLModelForDepthEstimation`` with ``_run_inference`` stubbed."""
+    model = object.__new__(WinMLModelForDepthEstimation)
+    model._format_inputs = lambda **kw: kw
+    model._run_inference = lambda formatted: onnx_outputs
+    return model
+
+
+class TestForward:
+    def test_returns_depth_estimator_output(self):
+        depth = torch.zeros((1, 518, 518))
+        model = _make_model({"predicted_depth": depth})
+
+        out = model.forward(pixel_values=torch.zeros((1, 3, 518, 518)))
+
+        assert isinstance(out, DepthEstimatorOutput)
+
+    def test_predicted_depth_passthrough(self):
+        depth = torch.full((1, 32, 32), 5.0)
+        model = _make_model({"predicted_depth": depth})
+
+        out = model.forward(pixel_values=torch.zeros((1, 3, 32, 32)))
+
+        assert out.predicted_depth is depth
+
+    def test_attribute_and_dict_access(self):
+        """``ModelOutput`` supports both ``.attr`` and ``["key"]`` access."""
+        depth = torch.zeros((1, 16, 16))
+        model = _make_model({"predicted_depth": depth})
+
+        out = model.forward(pixel_values=torch.zeros((1, 3, 16, 16)))
+
+        assert out.predicted_depth is depth
+        assert out["predicted_depth"] is depth
+
+    def test_falls_back_to_first_output_when_name_differs(self):
+        """Non-standard output names use the first tensor (architecture-agnostic)."""
+        depth = torch.full((1, 8, 8), 2.5)
+        model = _make_model({"some_unconventional_name": depth})
+
+        out = model.forward(pixel_values=torch.zeros((1, 3, 8, 8)))
+
+        assert out.predicted_depth is depth

From 4b476883b7cd8d040654b14e1fd75106ee9d1122 Mon Sep 17 00:00:00 2001
From: Zhipeng Wang <zhiwang@microsoft.com>
Date: Thu, 4 Jun 2026 12:05:03 +0800
Subject: [PATCH 031/143] Enable telemetry in shipped wheel and reword consent
 as "unlinked pseudonymized" (#810)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

- **Re-enable telemetry in the shipped wheel.** Restore the
`InstrumentationKey` injection + post-build verify steps in
`modelkit-official-build.yml` (reverting the pipeline portion of #728).
The PyPI **sdist** still ships the empty placeholder; only the **wheel**
carries the injected key, gated on the `INSTRUMENTATION_KEY` build
secret defined in the ADO pipeline.
- **Reword the consent notice** from "anonymous" to "unlinked
pseudonymized" in the first-run prompt (`consent.py`) and
`docs/Privacy.md`. This is the more accurate classification — the
persisted per-machine device-id hash is pseudonymized, not anonymous.
- **Bump `_CONSENT_VERSION` 1 → 2** so already-consented users see and
re-accept the updated notice.

## Notes

- Source `constants.py` keeps the empty iKey placeholder, so dev/source
installs (`pip install -e .`) still never emit.
- Telemetry remains off in CI and non-TTY contexts regardless of stored
consent.
- ⚠️ **Ops dependency:** the injection step reads the
`INSTRUMENTATION_KEY` ADO pipeline secret. #728 removed only the
consuming steps, so the secret should still exist — please confirm it's
present/valid before the next official build, or the build fails closed
with `INSTRUMENTATION_KEY env var is empty or missing`.

Verified: telemetry unit + integration suites pass (185 + 4); pipeline
block confirmed byte-identical to the proven pre-#728 version.
---
 .pipelines/modelkit-official-build.yml  | 76 ++++++++++++++++++++++---
 README.md                               | 18 ++++++
 docs/Privacy.md                         |  4 +-
 src/winml/modelkit/telemetry/consent.py |  4 +-
 4 files changed, 90 insertions(+), 12 deletions(-)

diff --git a/.pipelines/modelkit-official-build.yml b/.pipelines/modelkit-official-build.yml
index 73df877ff..2b7ca7d89 100644
--- a/.pipelines/modelkit-official-build.yml
+++ b/.pipelines/modelkit-official-build.yml
@@ -130,14 +130,74 @@ extends:
               - script: python -m pip install --upgrade build twine packaging
                 displayName: 'Install build tools'
 
-              # Telemetry is disabled in shipped artifacts: the empty iKey
-              # placeholder in src/winml/modelkit/telemetry/constants.py
-              # ships unchanged in both sdist and wheel. The telemetry init
-              # path short-circuits to disabled when iKey is empty (no
-              # events emitted, no LoggerProvider constructed). To re-enable,
-              # restore an iKey-injection step before the wheel build.
-              - script: python -m build --outdir "$(ob_outputDirectory)"
-                displayName: 'Build sdist and wheel'
+              # Build sdist BEFORE iKey injection so the source archive
+              # ships with the empty placeholder. The PyPI sdist must
+              # never carry the real iKey.
+              - script: python -m build --sdist --outdir "$(ob_outputDirectory)"
+                displayName: 'Build sdist (with empty iKey placeholder)'
+
+              - powershell: |
+                  $path = "$(Build.SourcesDirectory)\src\winml\modelkit\telemetry\constants.py"
+                  $key = $env:INSTRUMENTATION_KEY
+                  if (-not $key) { throw "INSTRUMENTATION_KEY env var is empty or missing" }
+                  $content = [System.IO.File]::ReadAllText($path)
+                  $placeholder = 'INSTRUMENTATION_KEY = ""'
+                  if (-not $content.Contains($placeholder)) { throw "placeholder not found in $path" }
+                  $newContent = $content.Replace($placeholder, "INSTRUMENTATION_KEY = ""$key""")
+                  [System.IO.File]::WriteAllText($path, $newContent)
+                  Write-Host "Injected iKey into $path"
+                env:
+                  INSTRUMENTATION_KEY: $(INSTRUMENTATION_KEY)
+                displayName: 'Inject InstrumentationKey into constants.py'
+
+              # Build wheel AFTER injection so the wheel carries the
+              # real iKey.
+              - script: python -m build --wheel --outdir "$(ob_outputDirectory)"
+                displayName: 'Build wheel (with injected iKey)'
+
+              # Verify the wheel carries the real iKey AND the sdist
+              # carries the empty placeholder. Reading constants.py out
+              # of each archive directly catches build-cache / ordering
+              # bugs that disk-only checks would miss.
+              - powershell: |
+                  Add-Type -AssemblyName System.IO.Compression
+                  Add-Type -AssemblyName System.IO.Compression.FileSystem
+
+                  function Read-ConstantsFromArchive($archivePath) {
+                    $zip = [System.IO.Compression.ZipFile]::OpenRead($archivePath)
+                    try {
+                      $entry = $zip.Entries | Where-Object { $_.FullName -like "*telemetry/constants.py" } | Select-Object -First 1
+                      if (-not $entry) { throw "telemetry/constants.py not found in $archivePath" }
+                      $reader = [System.IO.StreamReader]::new($entry.Open())
+                      try { return $reader.ReadToEnd() } finally { $reader.Close() }
+                    } finally { $zip.Dispose() }
+                  }
+
+                  $wheel = Get-ChildItem "$(ob_outputDirectory)\*.whl" | Select-Object -First 1
+                  if (-not $wheel) { throw "no wheel found in $(ob_outputDirectory)" }
+                  $wheelContent = Read-ConstantsFromArchive $wheel.FullName
+                  if ($wheelContent -match 'INSTRUMENTATION_KEY\s*=\s*""') {
+                    throw "wheel contains empty INSTRUMENTATION_KEY placeholder - injection failed"
+                  }
+                  Write-Host "Wheel verified - constants.py has non-empty iKey"
+
+                  $sdist = Get-ChildItem "$(ob_outputDirectory)\*.tar.gz" | Select-Object -First 1
+                  if (-not $sdist) { throw "no sdist found in $(ob_outputDirectory)" }
+                  # tar.gz inside a zip-aware reader: open via SharpCompress would
+                  # add a dependency. Inspect by extracting via tar (Windows 10+
+                  # ships with tar.exe) and reading the file from disk.
+                  $sdistTemp = Join-Path "$(Agent.TempDirectory)" "sdist_verify"
+                  if (Test-Path $sdistTemp) { Remove-Item -Recurse -Force $sdistTemp }
+                  New-Item -ItemType Directory -Path $sdistTemp | Out-Null
+                  tar -xzf $sdist.FullName -C $sdistTemp
+                  $sdistConstants = Get-ChildItem -Path $sdistTemp -Recurse -Filter "constants.py" | Where-Object { $_.FullName -like "*telemetry*" } | Select-Object -First 1
+                  if (-not $sdistConstants) { throw "constants.py not found in sdist" }
+                  $sdistContent = Get-Content $sdistConstants.FullName -Raw
+                  if ($sdistContent -notmatch 'INSTRUMENTATION_KEY\s*=\s*""') {
+                    throw "sdist contains a non-empty iKey - sdist must ship clean"
+                  }
+                  Write-Host "Sdist verified - constants.py has empty placeholder"
+                displayName: 'Verify wheel has iKey, sdist does not'
 
               - script: python -m twine check "$(ob_outputDirectory)\*.whl" "$(ob_outputDirectory)\*.tar.gz"
                 continueOnError: true
diff --git a/README.md b/README.md
index db90d3d99..056013a09 100644
--- a/README.md
+++ b/README.md
@@ -272,6 +272,24 @@ Each arrow is a WinML CLI command. You can enter the pipeline at any stage (for
 
 ---
 
+## :lock: Data / Telemetry
+
+Official WinML CLI releases can collect **unlinked pseudonymized** usage telemetry
+to help improve the product. Telemetry is classified as **Optional**. A one-time
+prompt on your first run asks for consent (default: accept — press **Enter** to
+enable, type `n` to decline).
+
+**Control** — edit `%USERPROFILE%\.winml\config.json`:
+
+- Set `telemetry.consent` to `"disabled"` to opt out
+- Set `telemetry.consent` to `"enabled"` to opt in
+- Delete the file to re-show the first-run prompt on the next run
+
+See [docs/Privacy.md](docs/Privacy.md) for the full list of what is and is not
+collected, event schemas, CI auto-disable behavior, and storage locations.
+
+---
+
 ## :handshake: Contributing
 
 We welcome contributions! Please see the [contribution guidelines](CONTRIBUTING.md).
diff --git a/docs/Privacy.md b/docs/Privacy.md
index 97add4ba1..2dd9094bc 100644
--- a/docs/Privacy.md
+++ b/docs/Privacy.md
@@ -1,7 +1,7 @@
 # WinML CLI Privacy Statement
 
-WinML CLI collects limited, anonymous telemetry to help improve the
-product. This page describes exactly what is collected, what is not,
+WinML CLI collects limited, unlinked pseudonymized telemetry to help
+improve the product. This page describes exactly what is collected, what is not,
 and how to control it.
 
 ## Data category
diff --git a/src/winml/modelkit/telemetry/consent.py b/src/winml/modelkit/telemetry/consent.py
index 359c61b9e..61c6a9343 100644
--- a/src/winml/modelkit/telemetry/consent.py
+++ b/src/winml/modelkit/telemetry/consent.py
@@ -45,10 +45,10 @@ def _default_config_path() -> Path | None:
 # stored records with an older version are treated as unrecorded on
 # read so the user sees the updated notice and re-consents. Records
 # predating the version field are grandfathered as the current version.
-_CONSENT_VERSION: int = 1
+_CONSENT_VERSION: int = 2
 
 _PROMPT_TEXT = """\
-WinML CLI can collect anonymous usage data to help improve the product.
+WinML CLI can collect unlinked pseudonymized usage data to help improve the product.
 
 What is collected:
   - Command name, duration, success/failure

From 881f1dc8d480bbf0f247197fdd0eddd0a2c25439 Mon Sep 17 00:00:00 2001
From: Yue Sun <yuesu@microsoft.com>
Date: Thu, 4 Jun 2026 15:35:02 +0800
Subject: [PATCH 032/143] ci: split scheduled pipelines into weekly Eval Report
 and daily E2E Test (#756)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

Refactor the previous **Modelkit E2E Test** pipeline (which actually
runs the full model registry and produces a markdown report) into two
distinct pipelines with different cadences and scopes.

### Renamed (no behaviour change)

- `Modelkit E2E Test.yml` → `Modelkit Eval Report.yml`
- `templates/e2e-eval-jobs.yml` → `templates/eval-report-jobs.yml`
- Stage `displayName`s: `E2E Eval — {QNN, OV, AMD}` → `Eval Report — …`
- Continues to run **weekly on Friday 08:00 (UTC+8)** against the full
model registry with sharding, `--list-json`, `--continue`,
`--retry-failed`, and report generation.

### New: `Modelkit E2E Test.yml` (daily scheduled)

- **Schedule:** `0 16 * * *` UTC = 00:00 UTC+8 every day, staggered 8 h
away from the weekly Eval Report cron.
- Three parallel stages (QNN / OV / AMD), each running on its dedicated
self-hosted agent.
- Two phases per stage, both gated by queue-time parameters so a one-off
run can be trimmed easily:

1. **`winml perf` phase** — runs `winml perf` once per `(model ×
EP/device pair)` against an inline `models` parameter. Default list
covers one small representative model per supported task (P0 first,
P1/P2 filling the remainder).
2. **pytest e2e phase** — runs a configurable list of
`tests/e2e/test_<name>_e2e.py` suites (default: all 11). Tests use
`require_ep()` to self-skip when the target EP is absent, so the same
list is safe to run on all three agents.

- Each `winml perf` step uses `condition: always()` so every combination
runs and the stage fails on any non-zero exit. No matrix sharding, no
report generation.
- Reuses the eval-report setup helpers (parquet copy, `uv` venv,
`PipAuthenticate`, `pip install -e .[dev]`).

## Why not PR-gating?

E2E runs on self-hosted hardware are too long and too flaky (driver /
firmware variance) to gate every PR. The daily cadence keeps regressions
surfaced within ~24 h without blocking developer throughput. Per-PR
validation continues to rely on the existing unit / integration suites.

## Portal actions (not YAML-controllable)

- Repoint the existing pipeline definition to `Modelkit Eval
Report.yml`.
- Create a new pipeline definition for `Modelkit E2E Test.yml`.
- **Do not** add the new pipeline as a required branch-policy check on
`main` — it is informational only.

## Files

- `.pipelines/Modelkit Eval Report.yml` (renamed)
- `.pipelines/templates/eval-report-jobs.yml` (renamed)
- `.pipelines/Modelkit E2E Test.yml` (new, daily)
- `.pipelines/templates/e2e-test-jobs.yml` (new)
---
 .pipelines/Modelkit E2E Test.yml              | 316 ++++++++++++------
 .pipelines/Modelkit Eval Report.yml           | 102 ++++++
 .pipelines/templates/e2e-test-jobs.yml        | 243 ++++++++++++++
 ...e2e-eval-jobs.yml => eval-report-jobs.yml} |   7 +
 src/winml/modelkit/commands/build.py          |   2 +-
 tests/e2e/test_config_e2e.py                  |  25 +-
 tests/e2e/test_perf_e2e.py                    |   8 +-
 7 files changed, 595 insertions(+), 108 deletions(-)
 create mode 100644 .pipelines/Modelkit Eval Report.yml
 create mode 100644 .pipelines/templates/e2e-test-jobs.yml
 rename .pipelines/templates/{e2e-eval-jobs.yml => eval-report-jobs.yml} (97%)

diff --git a/.pipelines/Modelkit E2E Test.yml b/.pipelines/Modelkit E2E Test.yml
index 0bd2d5744..f794f9ad0 100644
--- a/.pipelines/Modelkit E2E Test.yml	
+++ b/.pipelines/Modelkit E2E Test.yml	
@@ -1,102 +1,214 @@
-trigger: none
-
-schedules:
-  - cron: '0 0 * * Fri'
-    displayName: 'Weekly run — Friday 08:00 (UTC+8)'
-    branches:
-      include:
-        - main
-    always: true
-
-resources:
-  repositories:
-    - repository: ModelKitArtifacts
-      type: github
-      endpoint: github.com_yuesu_microsoft
-      name: gim-home/ModelKitArtifacts
-      ref: main
-
-parameters:
-  - name: evalDate
-    displayName: 'Eval date (auto = today, e.g. 2026-04-01)'
-    type: string
-    default: 'auto'
-  - name: continueRun
-    displayName: 'Skip already-evaluated models (--continue)'
-    type: boolean
-    default: true
-  - name: retryFailed
-    displayName: 'Retry previously failed models (--retry-failed)'
-    type: boolean
-    default: false
-  - name: qnnAgentName
-    displayName: 'QNN: self-hosted agent name (Agent.Name)'
-    type: string
-    default: NPU-QNN
-  - name: qnnPairs
-    displayName: 'QNN: EP/device pairs (any of: qnn_npu, qnn_gpu)'
-    type: object
-    default:
-      - qnn_npu
-      - qnn_gpu
-  - name: ovAgentName
-    displayName: 'OV: self-hosted agent name (Agent.Name)'
-    type: string
-    default: NPU-OV2
-  - name: ovPairs
-    displayName: 'OV: EP/device pairs (any of: dml_gpu, mlas_cpu, ov_cpu, ov_gpu, ov_npu)'
-    type: object
-    default:
-      - dml_gpu
-      - mlas_cpu
-      - ov_cpu
-      - ov_gpu
-      - ov_npu
-  - name: amdAgentName
-    displayName: 'AMD: self-hosted agent name (Agent.Name)'
-    type: string
-    default: NPU-AMD
-  - name: amdPairs
-    displayName: 'AMD: EP/device pairs (any of: vitisai_npu)'
-    type: object
-    default:
-      - vitisai_npu
-
-stages:
-  - stage: NPU_QNN
-    displayName: 'E2E Eval — QNN'
-    jobs:
-      - template: templates/e2e-eval-jobs.yml
-        parameters:
-          agentName: ${{ parameters.qnnAgentName }}
-          agentSuffix: qnn
-          evalDate: ${{ parameters.evalDate }}
-          continueRun: ${{ parameters.continueRun }}
-          retryFailed: ${{ parameters.retryFailed }}
-          pairNames: ${{ parameters.qnnPairs }}
-
-  - stage: NPU_OV
-    displayName: 'E2E Eval — OV'
-    dependsOn: []
-    jobs:
-      - template: templates/e2e-eval-jobs.yml
-        parameters:
-          agentName: ${{ parameters.ovAgentName }}
-          agentSuffix: ov
-          evalDate: ${{ parameters.evalDate }}
-          continueRun: ${{ parameters.continueRun }}
-          retryFailed: ${{ parameters.retryFailed }}
-          pairNames: ${{ parameters.ovPairs }}
-
-  - stage: NPU_AMD
-    displayName: 'E2E Eval — AMD'
-    dependsOn: []
-    jobs:
-      - template: templates/e2e-eval-jobs.yml
-        parameters:
-          agentName: ${{ parameters.amdAgentName }}
-          agentSuffix: amd
-          evalDate: ${{ parameters.evalDate }}
-          continueRun: ${{ parameters.continueRun }}
-          retryFailed: ${{ parameters.retryFailed }}
-          pairNames: ${{ parameters.amdPairs }}
+# Modelkit E2E Test — scheduled daily e2e validation pipeline.
+#
+# Runs two kinds of e2e checks on the three self-hosted agents (QNN, OV,
+# AMD) in parallel stages:
+#
+#   1. winml perf against a small inline list of models (the `models`
+#      parameter, defaults to facebook/convnext-tiny-224), once per
+#      EP/device pair available on each agent. Gated by `runEval`.
+#
+#   2. A configurable list of pytest e2e suites under `tests/e2e/`,
+#      controlled by the `pytestTargets` parameter (default: all 11).
+#      Edit the list at queue time for a minimal run (e.g. keep only
+#      `analyze`). Each entry maps 1:1 to `tests/e2e/test_<name>_e2e.py`.
+#      Tests use `require_ep()` to self-skip when the target EP is not
+#      present on the host, so it is safe to run all suites on all
+#      three agents — irrelevant tests just skip.
+#
+# Trigger model:
+#   * Manual queue-time runs (with selectable parameters) for ad-hoc
+#     validation.
+#   * Daily scheduled run at 00:00 Beijing time (16:00 UTC the prior
+#     day), staggered 8 h away from the weekly Eval Report cron.
+#   * No PR/CI trigger. Failures do not block PR merges.
+
+trigger: none
+
+schedules:
+  - cron: '0 16 * * *'
+    displayName: 'Daily run — 00:00 (UTC+8) every day'
+    branches:
+      include:
+        - main
+    always: true
+
+resources:
+  repositories:
+    - repository: ModelKitArtifacts
+      type: github
+      endpoint: github.com_yuesu_microsoft
+      name: gim-home/ModelKitArtifacts
+      ref: main
+
+parameters:
+  # --------------------------------------------------------------------
+  # winml perf section
+  # --------------------------------------------------------------------
+  - name: runEval
+    displayName: 'Run winml perf eval on inline `models` list'
+    type: boolean
+    default: true
+
+  # Inline model list. Each entry: { hf_id: <str>, task: <str> }.
+  # Curated to cover one small representative model per supported task
+  # (P0 first, P1/P2 supplements to fill missing tasks). Edit at queue
+  # time (delete rows) to trim a one-off run.
+  - name: models
+    displayName: 'Models to test (inline list, used by runEval)'
+    type: object
+    default:
+      # --- P0 ---
+      - hf_id: facebook/convnext-tiny-224
+        task: image-classification
+      - hf_id: openai/clip-vit-base-patch32
+        task: feature-extraction
+      # NOTE: zero-shot-classification ideally would be covered by the
+      # P0 model openai/clip-vit-base-patch32, but CLIP does not yet
+      # pass on every supported EP for this task. cross-encoder/nli-deberta-v3-small
+      # was tried as a temporary substitute but also fails on some EPs, so it is
+      # commented out for now — out of scope for this round. Re-enable once CLIP
+      # (or another candidate) passes on all EPs.
+      # - hf_id: cross-encoder/nli-deberta-v3-small
+      #   task: zero-shot-classification
+      - hf_id: openai/clip-vit-base-patch32
+        task: zero-shot-image-classification
+      # facebook/detr-resnet-50 cannot pass on every supported EP yet — commented
+      # out and treated as out of scope for this round. Re-enable once it passes
+      # on all EPs.
+      # - hf_id: facebook/detr-resnet-50
+      #   task: object-detection
+      - hf_id: google-bert/bert-base-multilingual-cased
+        task: fill-mask
+      - hf_id: google-bert/bert-base-multilingual-cased
+        task: masked-lm
+      # --- P1/P2 to cover remaining supported tasks ---
+      # NOTE: P1/P2 models are commented out — not in scope for this round.
+      # Re-enable individual entries (or the whole block) as coverage expands.
+      # - hf_id: Intel/dpt-hybrid-midas
+      #   task: depth-estimation
+      # - hf_id: facebook/dinov2-small
+      #   task: image-feature-extraction
+      # - hf_id: deepset/roberta-base-squad2
+      #   task: question-answering
+      # - hf_id: sentence-transformers/all-MiniLM-L6-v2
+      #   task: sentence-similarity
+      # - hf_id: cross-encoder/ms-marco-MiniLM-L4-v2
+      #   task: text-classification
+      # - hf_id: dslim/bert-base-NER
+      #   task: token-classification
+
+  - name: modelTimeout
+    displayName: 'Per-model perf timeout (seconds)'
+    type: number
+    default: 1800
+
+  # --------------------------------------------------------------------
+  # pytest e2e section — single editable list of test file short names.
+  # Each entry `<name>` runs `tests/e2e/test_<name>_e2e.py`.
+  # Edit the list at queue time (delete unwanted entries) to do a
+  # minimal run. Setting it to an empty list skips the pytest phase
+  # entirely.
+  # --------------------------------------------------------------------
+  - name: pytestTargets
+    displayName: 'pytest e2e targets (one entry per tests/e2e/test_<name>_e2e.py)'
+    type: object
+    default:
+      - analyze
+      - inspect
+      - build
+      - compile
+      - config
+      - export
+      - optimize
+      - quantize
+      - sys
+      - perf
+      - eval
+
+  - name: pytestTimeout
+    displayName: 'Per-test pytest --timeout (seconds)'
+    type: number
+    default: 1000
+
+  # --------------------------------------------------------------------
+  # Agent / EP-device pair configuration
+  # --------------------------------------------------------------------
+  - name: qnnAgentName
+    displayName: 'QNN: self-hosted agent name (Agent.Name)'
+    type: string
+    default: NPU-QNN
+  - name: qnnPairs
+    displayName: 'QNN: EP/device pairs (any of: qnn_npu, qnn_gpu)'
+    type: object
+    default:
+      - qnn_npu
+      - qnn_gpu
+
+  - name: ovAgentName
+    displayName: 'OV: self-hosted agent name (Agent.Name)'
+    type: string
+    default: NPU-OV2
+  - name: ovPairs
+    displayName: 'OV: EP/device pairs (any of: dml_gpu, mlas_cpu, ov_cpu, ov_gpu, ov_npu)'
+    type: object
+    default:
+      - dml_gpu
+      - mlas_cpu
+      - ov_cpu
+      - ov_gpu
+      - ov_npu
+
+  - name: amdAgentName
+    displayName: 'AMD: self-hosted agent name (Agent.Name)'
+    type: string
+    default: NPU-AMD
+  - name: amdPairs
+    displayName: 'AMD: EP/device pairs (any of: vitisai_npu)'
+    type: object
+    default:
+      - vitisai_npu
+
+stages:
+  - stage: NPU_QNN
+    displayName: 'E2E Test — QNN'
+    jobs:
+      - template: templates/e2e-test-jobs.yml
+        parameters:
+          agentName: ${{ parameters.qnnAgentName }}
+          agentSuffix: qnn
+          pairNames: ${{ parameters.qnnPairs }}
+          models: ${{ parameters.models }}
+          modelTimeout: ${{ parameters.modelTimeout }}
+          runEval: ${{ parameters.runEval }}
+          pytestTargets: ${{ parameters.pytestTargets }}
+          pytestTimeout: ${{ parameters.pytestTimeout }}
+
+  - stage: NPU_OV
+    displayName: 'E2E Test — OV'
+    dependsOn: []
+    jobs:
+      - template: templates/e2e-test-jobs.yml
+        parameters:
+          agentName: ${{ parameters.ovAgentName }}
+          agentSuffix: ov
+          pairNames: ${{ parameters.ovPairs }}
+          models: ${{ parameters.models }}
+          modelTimeout: ${{ parameters.modelTimeout }}
+          runEval: ${{ parameters.runEval }}
+          pytestTargets: ${{ parameters.pytestTargets }}
+          pytestTimeout: ${{ parameters.pytestTimeout }}
+
+  - stage: NPU_AMD
+    displayName: 'E2E Test — AMD'
+    dependsOn: []
+    jobs:
+      - template: templates/e2e-test-jobs.yml
+        parameters:
+          agentName: ${{ parameters.amdAgentName }}
+          agentSuffix: amd
+          pairNames: ${{ parameters.amdPairs }}
+          models: ${{ parameters.models }}
+          modelTimeout: ${{ parameters.modelTimeout }}
+          runEval: ${{ parameters.runEval }}
+          pytestTargets: ${{ parameters.pytestTargets }}
+          pytestTimeout: ${{ parameters.pytestTimeout }}
diff --git a/.pipelines/Modelkit Eval Report.yml b/.pipelines/Modelkit Eval Report.yml
new file mode 100644
index 000000000..1665f2682
--- /dev/null
+++ b/.pipelines/Modelkit Eval Report.yml	
@@ -0,0 +1,102 @@
+trigger: none
+
+schedules:
+  - cron: '0 0 * * Fri'
+    displayName: 'Weekly run — Friday 08:00 (UTC+8)'
+    branches:
+      include:
+        - main
+    always: true
+
+resources:
+  repositories:
+    - repository: ModelKitArtifacts
+      type: github
+      endpoint: github.com_yuesu_microsoft
+      name: gim-home/ModelKitArtifacts
+      ref: main
+
+parameters:
+  - name: evalDate
+    displayName: 'Eval date (auto = today, e.g. 2026-04-01)'
+    type: string
+    default: 'auto'
+  - name: continueRun
+    displayName: 'Skip already-evaluated models (--continue)'
+    type: boolean
+    default: true
+  - name: retryFailed
+    displayName: 'Retry previously failed models (--retry-failed)'
+    type: boolean
+    default: false
+  - name: qnnAgentName
+    displayName: 'QNN: self-hosted agent name (Agent.Name)'
+    type: string
+    default: NPU-QNN
+  - name: qnnPairs
+    displayName: 'QNN: EP/device pairs (any of: qnn_npu, qnn_gpu)'
+    type: object
+    default:
+      - qnn_npu
+      - qnn_gpu
+  - name: ovAgentName
+    displayName: 'OV: self-hosted agent name (Agent.Name)'
+    type: string
+    default: NPU-OV2
+  - name: ovPairs
+    displayName: 'OV: EP/device pairs (any of: dml_gpu, mlas_cpu, ov_cpu, ov_gpu, ov_npu)'
+    type: object
+    default:
+      - dml_gpu
+      - mlas_cpu
+      - ov_cpu
+      - ov_gpu
+      - ov_npu
+  - name: amdAgentName
+    displayName: 'AMD: self-hosted agent name (Agent.Name)'
+    type: string
+    default: NPU-AMD
+  - name: amdPairs
+    displayName: 'AMD: EP/device pairs (any of: vitisai_npu)'
+    type: object
+    default:
+      - vitisai_npu
+
+stages:
+  - stage: NPU_QNN
+    displayName: 'Eval Report — QNN'
+    jobs:
+      - template: templates/eval-report-jobs.yml
+        parameters:
+          agentName: ${{ parameters.qnnAgentName }}
+          agentSuffix: qnn
+          evalDate: ${{ parameters.evalDate }}
+          continueRun: ${{ parameters.continueRun }}
+          retryFailed: ${{ parameters.retryFailed }}
+          pairNames: ${{ parameters.qnnPairs }}
+
+  - stage: NPU_OV
+    displayName: 'Eval Report — OV'
+    dependsOn: []
+    jobs:
+      - template: templates/eval-report-jobs.yml
+        parameters:
+          agentName: ${{ parameters.ovAgentName }}
+          agentSuffix: ov
+          evalDate: ${{ parameters.evalDate }}
+          continueRun: ${{ parameters.continueRun }}
+          retryFailed: ${{ parameters.retryFailed }}
+          pairNames: ${{ parameters.ovPairs }}
+
+  - stage: NPU_AMD
+    displayName: 'Eval Report — AMD'
+    dependsOn: []
+    jobs:
+      - template: templates/eval-report-jobs.yml
+        parameters:
+          agentName: ${{ parameters.amdAgentName }}
+          agentSuffix: amd
+          evalDate: ${{ parameters.evalDate }}
+          continueRun: ${{ parameters.continueRun }}
+          retryFailed: ${{ parameters.retryFailed }}
+          pairNames: ${{ parameters.amdPairs }}
diff --git a/.pipelines/templates/e2e-test-jobs.yml b/.pipelines/templates/e2e-test-jobs.yml
new file mode 100644
index 000000000..4109ba2f2
--- /dev/null
+++ b/.pipelines/templates/e2e-test-jobs.yml
@@ -0,0 +1,243 @@
+# Job template for the Modelkit E2E Test (daily scheduled) pipeline.
+#
+# Two phases run on each agent, after a shared env setup:
+#
+#   1. winml perf loop — gated by `runEval`. One step per (model × pair).
+#   2. pytest e2e suites — one step per entry in `pytestTargets`. Each
+#      writes a junit XML; one PublishTestResults task at the end
+#      aggregates them into the ADO Tests tab.
+#
+# Each test step uses `condition: always()` so later steps run even if
+# an earlier one failed (we want to see every failure in a daily run).
+# Steps still propagate their real exit codes, so the job is marked
+# failed whenever any step fails.
+parameters:
+  - name: agentName
+    type: string
+  - name: agentSuffix
+    type: string
+  # See catalog in `eval-report-jobs.yml`. Valid names:
+  #   qnn_npu, qnn_gpu, vitisai_npu, dml_gpu,
+  #   mlas_cpu, ov_cpu, ov_gpu, ov_npu
+  - name: pairNames
+    type: object
+  # Inline list of models to test. Each entry: { hf_id: <str>, task: <str> }.
+  # `task` is optional — omit (or leave empty) to let winml infer.
+  - name: models
+    type: object
+  - name: modelTimeout
+    type: number
+    default: 1800
+
+  - name: runEval
+    type: boolean
+    default: true
+
+  # List of pytest e2e test short names. Each entry `<name>` runs
+  # `tests/e2e/test_<name>_e2e.py`. Empty list skips the pytest phase.
+  - name: pytestTargets
+    type: object
+    default: []
+
+  - name: pytestTimeout
+    type: number
+    default: 1000
+
+jobs:
+  - job: E2ETest_${{ parameters.agentSuffix }}
+    displayName: 'E2E Test (${{ parameters.agentSuffix }})'
+    pool:
+      name: modelkit-selfhost-pool
+      demands:
+        - Agent.Name -equals ${{ parameters.agentName }}
+    timeoutInMinutes: 360
+    cancelTimeoutInMinutes: 2
+
+    steps:
+      - checkout: self
+        clean: false
+        fetchDepth: 1
+        path: s
+
+      - checkout: ModelKitArtifacts
+        fetchDepth: 1
+        lfs: true
+        path: artifacts
+
+      - powershell: |
+          Write-Host "Agent.BuildDirectory : $(Agent.BuildDirectory)"
+          Write-Host "Build.SourcesDirectory: $(Build.SourcesDirectory)"
+          $repoDir = "$(Agent.BuildDirectory)/artifacts"
+          if (-not (Test-Path "$repoDir/rules")) {
+            $repoDir = "$(Agent.BuildDirectory)/ModelKitArtifacts"
+          }
+          $src = "$repoDir/rules"
+          $dst = "$(Build.SourcesDirectory)/src/winml/modelkit/analyze/rules/runtime_check_rules"
+          if (Test-Path $src) {
+            New-Item -ItemType Directory -Path $dst -Force | Out-Null
+            $parquets = Get-ChildItem "$src" -Filter "*.parquet" -Recurse
+            foreach ($parquet in $parquets) {
+              $relativePath = $parquet.FullName.Substring($src.Length).TrimStart('\\')
+              $targetPath = Join-Path $dst $relativePath
+              New-Item -ItemType Directory -Path (Split-Path $targetPath -Parent) -Force | Out-Null
+              Copy-Item $parquet.FullName -Destination $targetPath -Force
+            }
+            Write-Host "Copied $($parquets.Count) rule parquet file(s) to $dst"
+            $bad = $parquets | Where-Object { $_.Length -lt 1024 }
+            if ($bad) {
+              Write-Error "The following parquet files are suspiciously small (likely unresolved LFS pointers):"
+              $bad | ForEach-Object { Write-Host "  $($_.Name): $($_.Length) bytes" }
+              exit 1
+            }
+          } else {
+            Write-Error "Rules source not found at: $src"
+            Get-ChildItem $repoDir -Recurse -Depth 2 | Select-Object FullName
+            exit 1
+          }
+        displayName: 'Copy runtime check rules from ModelKitArtifacts'
+
+      - powershell: |
+          $uvBin = "$env:USERPROFILE\.local\bin"
+          if (-not (Get-Command uv -ErrorAction SilentlyContinue)) {
+            Invoke-RestMethod https://astral.sh/uv/0.10.12/install.ps1 | Invoke-Expression
+            $env:PATH = "$uvBin;$env:PATH"
+          }
+          uv python install 3.11
+          Remove-Item -Recurse -Force "$(Build.SourcesDirectory)\.venv" -ErrorAction SilentlyContinue
+          uv venv $(Build.SourcesDirectory)\.venv --python 3.11
+          $venvDir = "$(Build.SourcesDirectory)\.venv\Scripts"
+          Write-Host "##vso[task.prependpath]$uvBin"
+          Write-Host "##vso[task.prependpath]$venvDir"
+        displayName: 'Install uv 0.10.12 and Python'
+
+      - script: python --version
+        displayName: 'Check Python version'
+
+      - task: PipAuthenticate@1
+        inputs:
+          artifactFeeds: 'windows.ai.toolkit/Modelkit'
+        displayName: 'Authenticate pip with Azure Artifacts'
+
+      # NOTE: This is the single source of truth for installing project deps
+      # into the venv. Every later step uses `uv run --no-sync` to *skip*
+      # uv's lockfile sync and rely on what was installed here. Do NOT remove
+      # this step assuming `uv run` will install on demand -- it won't, with
+      # `--no-sync` -- and do NOT drop `--no-sync` below assuming it's a
+      # no-op: without it, `uv run` would re-resolve from uv.lock and may
+      # silently replace this editable install.
+      - script: |
+          uv pip install -e .[dev]
+        workingDirectory: $(Build.SourcesDirectory)
+        displayName: 'Install dependencies'
+
+      # ----------------------------------------------------------------------
+      # Phase 1: winml perf (gated by runEval).
+      # One step per (model × pair); template-time expanded for granular UI.
+      # All `uv run` calls below use `--no-sync` -- see install step above.
+      # ----------------------------------------------------------------------
+      - ${{ if eq(parameters.runEval, true) }}:
+          - ${{ each model in parameters.models }}:
+              - ${{ each pairName in parameters.pairNames }}:
+                  - powershell: |
+                      $catalog = @{
+                          'qnn_npu'     = @{ ep = 'qnn';      device = 'npu'; displayName = 'QNN NPU' }
+                          'qnn_gpu'     = @{ ep = 'qnn';      device = 'gpu'; displayName = 'QNN GPU' }
+                          'vitisai_npu' = @{ ep = 'vitisai';  device = 'npu'; displayName = 'VitisAI NPU' }
+                          'dml_gpu'     = @{ ep = 'dml';      device = 'gpu'; displayName = 'DML GPU' }
+                          'mlas_cpu'    = @{ ep = 'cpu';      device = 'cpu'; displayName = 'MLAS CPU' }
+                          'ov_cpu'      = @{ ep = 'openvino'; device = 'cpu'; displayName = 'OV CPU' }
+                          'ov_gpu'      = @{ ep = 'openvino'; device = 'gpu'; displayName = 'OV GPU' }
+                          'ov_npu'      = @{ ep = 'openvino'; device = 'npu'; displayName = 'OV NPU' }
+                      }
+                      $name = '${{ pairName }}'
+                      if (-not $catalog.ContainsKey($name)) {
+                          Write-Error "Unknown EP/device pair name: '$name'. Valid: $($catalog.Keys -join ', ')"
+                          exit 1
+                      }
+                      $pair = $catalog[$name]
+
+                      $hfModel = '${{ model.hf_id }}'
+                      $task    = '${{ model.task }}'
+                      $output  = "$(Agent.TempDirectory)/e2e-test/$name"
+
+                      # Determine if this is the LAST pair for this model so we
+                      # only clean caches once per model -- HF download and WinML
+                      # build cache are reused across EPs for the same model.
+                      # ADO template expressions have no arithmetic so we can't
+                      # compute "last index" at compile time, but `join()` IS
+                      # supported and lets us pass the full pair list as a CSV
+                      # string for runtime PowerShell to inspect. This mirrors
+                      # the pattern in eval-report-jobs.yml (which does the same
+                      # check via a single PowerShell `for` loop over pairs).
+                      $allPairs = @(('${{ join(',', parameters.pairNames) }}' -split ',') |
+                                    Where-Object { $_ -and $_.Trim() -ne '' } |
+                                    ForEach-Object { $_.Trim() })
+                      $isLastPair = ($name -eq $allPairs[-1])
+
+                      Write-Host "============================================================"
+                      Write-Host "Eval: $hfModel  (task='$task')  on  $($pair.displayName) [name=$name, last=$isLastPair]"
+                      Write-Host "============================================================"
+
+                      $uvArgs = @(
+                          "run", "--no-sync", "python", "scripts/e2e_eval/run_eval.py",
+                          "--hf-model", $hfModel,
+                          "--device", $pair.device,
+                          "--ep", $pair.ep,
+                          "--eval-type", "perf",
+                          "--no-report",
+                          "--verbose",
+                          "--timeout", "${{ parameters.modelTimeout }}",
+                          "--output-dir", $output
+                      )
+                      if ($task) {
+                          $uvArgs += @("--task", $task)
+                      }
+                      if ($isLastPair) {
+                          $uvArgs += "--clean-cache"
+                      }
+
+                      & uv @uvArgs
+                      $code = $LASTEXITCODE
+                      if ($code -ne 0) {
+                          Write-Error "Eval FAILED: $hfModel (task='$task') on $($pair.displayName) (exit=$code)"
+                      } else {
+                          Write-Host "Eval PASSED: $hfModel (task='$task') on $($pair.displayName)"
+                      }
+                      exit $code
+                    workingDirectory: $(Build.SourcesDirectory)
+                    # Always run so a failure on one pair does not hide failures on others.
+                    # The job is still marked failed by ADO because of the non-zero exit.
+                    condition: always()
+                    displayName: 'Eval: ${{ model.hf_id }} / ${{ model.task }} on ${{ pairName }}'
+
+      # ----------------------------------------------------------------------
+      # Phase 2: pytest e2e suites. One step per entry in `pytestTargets`.
+      # Each `<name>` runs `tests/e2e/test_<name>_e2e.py` and writes a
+      # per-step junit XML; aggregated by PublishTestResults at the end.
+      # Tests use `require_ep()` to self-skip on irrelevant EPs, so it is
+      # safe to run all of them on every agent.
+      # ----------------------------------------------------------------------
+      - ${{ each target in parameters.pytestTargets }}:
+          - powershell: |
+              uv run --no-sync python -m pytest tests/e2e/test_${{ target }}_e2e.py -m e2e --timeout=${{ parameters.pytestTimeout }} --junitxml="$(Agent.TempDirectory)/junit-${{ parameters.agentSuffix }}-${{ target }}.xml" -v
+              exit $LASTEXITCODE
+            workingDirectory: $(Build.SourcesDirectory)
+            condition: always()
+            displayName: 'pytest: test_${{ target }}_e2e.py'
+
+      # ----------------------------------------------------------------------
+      # Publish all pytest junit XMLs as ADO test results. Runs always so
+      # we see partial results even when an earlier step failed.
+      # failTaskOnFailedTests is false because the pytest step itself
+      # already failed the job on non-zero exit; we don't want a second
+      # failure source from this aggregator.
+      # ----------------------------------------------------------------------
+      - task: PublishTestResults@2
+        condition: always()
+        inputs:
+          testResultsFormat: 'JUnit'
+          testResultsFiles: '$(Agent.TempDirectory)/junit-${{ parameters.agentSuffix }}-*.xml'
+          testRunTitle: 'pytest e2e (${{ parameters.agentSuffix }})'
+          mergeTestResults: true
+          failTaskOnFailedTests: false
+        displayName: 'Publish pytest e2e results'
diff --git a/.pipelines/templates/e2e-eval-jobs.yml b/.pipelines/templates/eval-report-jobs.yml
similarity index 97%
rename from .pipelines/templates/e2e-eval-jobs.yml
rename to .pipelines/templates/eval-report-jobs.yml
index b8ec04bc5..dedb39273 100644
--- a/.pipelines/templates/e2e-eval-jobs.yml
+++ b/.pipelines/templates/eval-report-jobs.yml
@@ -108,6 +108,13 @@ jobs:
           artifactFeeds: 'windows.ai.toolkit/Modelkit'
         displayName: 'Authenticate pip with Azure Artifacts'
 
+      # NOTE: This is the single source of truth for installing project deps
+      # into the venv. Every later step uses `uv run --no-sync` to *skip*
+      # uv's lockfile sync and rely on what was installed here. Do NOT remove
+      # this step assuming `uv run` will install on demand -- it won't, with
+      # `--no-sync` -- and do NOT drop `--no-sync` below assuming it's a
+      # no-op: without it, `uv run` would re-resolve from uv.lock and may
+      # silently replace this editable install.
       - script: |
           uv pip install -e .[dev]
         workingDirectory: $(Build.SourcesDirectory)
diff --git a/src/winml/modelkit/commands/build.py b/src/winml/modelkit/commands/build.py
index bc21b274f..7ba2850c1 100644
--- a/src/winml/modelkit/commands/build.py
+++ b/src/winml/modelkit/commands/build.py
@@ -603,7 +603,7 @@ def build(
             def _patch_device(cfg: WinMLBuildConfig) -> None:
                 from ..config import resolve_quant_compile_config
 
-                resolved_quant, _ = resolve_quant_compile_config(device=device)
+                resolved_quant, _ = resolve_quant_compile_config(device=device, ep=ep)
                 if no_quant or resolved_quant is None:
                     cfg.quant = None
                 elif cfg.quant is None:
diff --git a/tests/e2e/test_config_e2e.py b/tests/e2e/test_config_e2e.py
index 60d10c958..327f7d7da 100644
--- a/tests/e2e/test_config_e2e.py
+++ b/tests/e2e/test_config_e2e.py
@@ -46,19 +46,38 @@
 @pytest.fixture(autouse=True)
 def _mock_resolve_device():
     """Mock hardware detection to avoid failures in CI/test environments."""
+    from winml.modelkit.utils.constants import EP_SUPPORTED_DEVICES, normalize_ep_name
 
     def _resolve_device_mock(
         device: str = "auto", *, ep: str | None = None
     ) -> tuple[str, list[str]]:
         # Keep tests deterministic while preserving explicit device requests.
+        ep_name = normalize_ep_name(ep)
         normalized = (device or "auto").lower()
+        if ep_name in EP_SUPPORTED_DEVICES:
+            supported = list(EP_SUPPORTED_DEVICES[ep_name])
+            # Real resolve_check_device_ep rejects incompatible explicit
+            # combos before this mock runs, so anything other than "auto"
+            # reaching here must already be supported — assert it so a
+            # future change that drops that pre-validation is caught loudly
+            # instead of producing a silently rewritten device.
+            if normalized != "auto":
+                assert normalized in supported, (
+                    f"Incompatible mock combo: ep={ep_name}, device={normalized}. "
+                    f"Supported: {supported}"
+                )
+                return normalized, supported
+            return supported[0], supported
         if normalized in {"cpu", "gpu", "npu"}:
             return normalized, [normalized, "cpu"]
         return "cpu", ["cpu"]
 
-    with patch(
-        "winml.modelkit.sysinfo.resolve_device",
-        side_effect=_resolve_device_mock,
+    # Patch at the definition site so callers using ``from .device import`` —
+    # notably ``resolve_check_device_ep`` inside the same module — see the
+    # mock. Also patch the ``sysinfo`` re-export for direct importers.
+    with (
+        patch("winml.modelkit.sysinfo.device.resolve_device", side_effect=_resolve_device_mock),
+        patch("winml.modelkit.sysinfo.resolve_device", side_effect=_resolve_device_mock),
     ):
         yield
 
diff --git a/tests/e2e/test_perf_e2e.py b/tests/e2e/test_perf_e2e.py
index 48c7da303..de25b3e09 100644
--- a/tests/e2e/test_perf_e2e.py
+++ b/tests/e2e/test_perf_e2e.py
@@ -518,7 +518,9 @@ def test_benchmark_ep_gpu(self, ep: str, tmp_path: Path, model_arg: str):
         assert output_file.exists()
         data = json.loads(output_file.read_text())
         assert data["benchmark_info"]["ep"] == EP_ALIASES[ep]
-        _assert_monitor_result(data, device="gpu")
+        # Not all EPs bump PDH GPU-engine counters (OpenVINO routes via its own
+        # compute path); validate structure only, not utilization magnitude.
+        _assert_monitor_result(data, device="gpu", require_utilization=False)
 
     @pytest.mark.parametrize("ep", NPU_EPS)
     def test_benchmark_ep_npu(self, ep: str, tmp_path: Path, model_arg: str):
@@ -542,7 +544,9 @@ def test_benchmark_ep_npu(self, ep: str, tmp_path: Path, model_arg: str):
         assert output_file.exists()
         data = json.loads(output_file.read_text())
         assert data["benchmark_info"]["ep"] == EP_ALIASES[ep]
-        _assert_monitor_result(data, device="npu")
+        # Not all EPs bump PDH NPU-engine counters reliably for short runs;
+        # validate structure only, not utilization magnitude.
+        _assert_monitor_result(data, device="npu", require_utilization=False)
 
 
 # ===========================================================================

From af1cbe266abafedf2aeb0ca99d5a24cf0da6acc4 Mon Sep 17 00:00:00 2001
From: xieofxie <xieofxie@126.com>
Date: Thu, 4 Jun 2026 15:53:56 +0800
Subject: [PATCH 033/143] feat: show tqdm for _ensure_provider_ready (#788)

Closes https://github.com/microsoft/winml-cli/issues/725

winml sys <

<img width="2082" height="252" alt="image"
src="https://github.com/user-attachments/assets/6520c1e3-3e17-43cb-907c-4696b0fd6256"
/>


Failed

<img width="2078" height="128" alt="image"
src="https://github.com/user-attachments/assets/56de05db-68e2-478a-8326-d1b3bafd605d"
/>

---------

Co-authored-by: hualxie <hualxie@microsoft.com>
---
 src/winml/modelkit/session/ep_registry.py | 208 ++++++++-
 tests/unit/session/test_ep_registry.py    | 514 ++++++++++++++++++++++
 2 files changed, 719 insertions(+), 3 deletions(-)
 create mode 100644 tests/unit/session/test_ep_registry.py

diff --git a/src/winml/modelkit/session/ep_registry.py b/src/winml/modelkit/session/ep_registry.py
index 47810ad59..5426518d1 100644
--- a/src/winml/modelkit/session/ep_registry.py
+++ b/src/winml/modelkit/session/ep_registry.py
@@ -11,7 +11,8 @@
 from __future__ import annotations
 
 import logging
-from typing import TYPE_CHECKING, cast
+import os
+from typing import TYPE_CHECKING, Any, cast
 
 
 if TYPE_CHECKING:
@@ -20,6 +21,197 @@
 
 logger = logging.getLogger(__name__)
 
+
+def _ep_download_timeout_default() -> int:
+    """Read ``WINMLCLI_EP_DOWNLOAD_TIMEOUT`` (seconds) or fall back to 5 minutes.
+
+    Lets users on slow networks raise the cap without code changes. Falls back
+    to the default when the env var is unset, empty, or non-integer.
+    """
+    raw = os.environ.get("WINMLCLI_EP_DOWNLOAD_TIMEOUT")
+    if not raw:
+        return 5 * 60
+    try:
+        return int(raw)
+    except ValueError:
+        logger.warning("Invalid WINMLCLI_EP_DOWNLOAD_TIMEOUT=%r; using default 300s.", raw)
+        return 5 * 60
+
+
+# Evaluated once at module import. Changing WINMLCLI_EP_DOWNLOAD_TIMEOUT
+# after import does NOT take effect for the running process; tests that need
+# a different value should monkeypatch ep_registry.EP_DOWNLOAD_TIMEOUT_SECONDS
+# directly.
+EP_DOWNLOAD_TIMEOUT_SECONDS = _ep_download_timeout_default()
+
+
+class _NoopBar:
+    """No-op stand-in for tqdm when the optional dependency is missing.
+
+    Exposes the attribute (``n``) and methods (``refresh``, ``close``) that
+    ``_ensure_provider_ready`` touches, so the helper can stay branch-free.
+    """
+
+    def __init__(self) -> None:
+        self.n = 0
+
+    def refresh(self) -> None:
+        return None
+
+    def close(self) -> None:
+        return None
+
+
+def _make_progress_bar() -> Any:
+    """Return a tqdm bar if tqdm is installed, else a silent no-op stand-in.
+
+    tqdm is a dev-only optional dep in this package, so production installs
+    without it must still complete EP downloads — they just lose the live bar.
+    The pre-download Console notice is emitted by the caller and is unaffected.
+
+    Format: ``Downloading... ████████████░░░░░░ 62%``
+    """
+    try:
+        from tqdm import tqdm
+    except ImportError:
+        return _NoopBar()
+    return tqdm(
+        total=100,
+        bar_format="Downloading... {bar} {percentage:3.0f}%",
+        ascii="░█",
+        leave=True,
+    )
+
+
+def _parse_ep_metadata_from_path(library_path: str) -> tuple[str, str]:
+    r"""Best-effort ``(version, package_family_name)`` from an EP's install path.
+
+    WinML's ``ExecutionProvider`` handle sometimes returns empty ``version`` /
+    ``package_family_name`` even after the EP is Ready. When the EP is delivered
+    as an MSIX package its ``library_path`` lives under ``WindowsApps`` in a
+    folder named with the full package identity::
+
+        ...\\WindowsApps\\<Name>_<Version>_<Arch>_<ResourceId>_<PublisherId>\\...
+
+    e.g. ``MicrosoftCorporationII.WinML.Intel.OpenVINO.EP.1.8_1.8.79.0_x64__8wekyb3d8bbwe``
+    yields version ``1.8.79.0`` and package family name
+    ``MicrosoftCorporationII.WinML.Intel.OpenVINO.EP.1.8_8wekyb3d8bbwe`` (the
+    package family name is ``<Name>_<PublisherId>``).
+
+    Returns ``("", "")`` when the path is empty or does not match this layout.
+    """
+    import re
+    from itertools import pairwise
+    from pathlib import PurePath
+
+    if not library_path:
+        return "", ""
+
+    parts = PurePath(library_path).parts
+    pkg_folder = next(
+        (child for parent, child in pairwise(parts) if parent.lower() == "windowsapps"),
+        "",
+    )
+    # Full MSIX package name: Name_Version_Arch_ResourceId_PublisherId (ResourceId
+    # is usually empty, giving the doubled "__" before the publisher id).
+    segments = pkg_folder.split("_")
+    if len(segments) < 5:
+        return "", ""
+
+    name, version, publisher = segments[0], segments[1], segments[-1]
+    # Guard against unexpected folder shapes: version must be dotted-numeric.
+    if not re.fullmatch(r"\d+(\.\d+)*", version):
+        version = ""
+    package_family_name = f"{name}_{publisher}" if name and publisher else ""
+    return version, package_family_name
+
+
+def _ensure_provider_ready(provider: Any) -> None:
+    """Ensure an EP is ready, showing a tqdm progress bar when downloading.
+
+    Providers already in the ``Ready`` state take the synchronous fast path so
+    cached EPs do not flash a 0-100% bar. Otherwise drives a tqdm bar from
+    ``ensure_ready_async``'s ``on_progress`` callback (cumulative fraction
+    0.0-1.0, per windowsml docs) and waits for the ``on_complete`` callback
+    via a threading.Event with a ``EP_DOWNLOAD_TIMEOUT_SECONDS`` timeout. On
+    timeout the async op is cancelled and ``TimeoutError`` is raised.
+    """
+    import threading
+
+    from windowsml import EpReadyState
+
+    if provider.ready_state == EpReadyState.Ready:
+        provider.ensure_ready()
+        return
+
+    # Lazy-import to keep ep_registry import cheap (rich pulls in pygments etc.);
+    # this branch only runs on the cold "EP needs download" path.
+    from ..utils.console import get_console
+
+    console = get_console()
+    console.print(f"[WinML] Installing Execution Provider: [bold]{provider.name}[/bold]")
+
+    bar = _make_progress_bar()
+    done = threading.Event()
+
+    def _on_progress(fraction: float) -> None:
+        # Native ops may fire a stale on_progress after on_complete; once done
+        # is set the main thread owns bar.n (forces it to 100 and closes the
+        # bar), so silently drop late callbacks instead of clobbering 100 with
+        # an earlier fraction or writing to a closed bar.
+        if done.is_set():
+            return
+        bar.n = max(0, min(100, int(fraction * 100)))
+        bar.refresh()
+
+    op = None
+    success = False
+    try:
+        op = provider.ensure_ready_async(on_complete=done.set, on_progress=_on_progress)
+        if not done.wait(timeout=EP_DOWNLOAD_TIMEOUT_SECONDS):
+            op.cancel()
+            raise TimeoutError(
+                f"EP {provider.name!r} download did not complete within "
+                f"{EP_DOWNLOAD_TIMEOUT_SECONDS}s; cancelled."
+            )
+        # Surface any native failure (raises OSError on error).
+        op.get_status()
+        # Success: providers usually fire on_progress(1.0) before on_complete,
+        # but force the bar to 100 in case they didn't.
+        bar.n = 100
+        bar.refresh()
+        success = True
+    finally:
+        bar.close()
+        if op is not None:
+            op.close()
+        if not success:
+            # Failure-path notice — kept in finally so it fires for every
+            # non-success exit (launch failure, timeout, get_status OSError).
+            # Printed after bar.close() so it appears below the bar's last frame.
+            console.print(f"[red]❌ Failed to download {provider.name} EP[/red]")
+            console.print("Try:")
+            console.print("  1. Check your internet connection")
+            console.print("  2. Troubleshoot: https://aka.ms/winmlcli/ep-errors")
+
+    console.print(f"{provider.name} EP installed successfully.")
+
+    # The native handle sometimes reports empty version/PFN even once Ready;
+    # fall back to parsing them from the MSIX install path. Skip a line entirely
+    # when its value can't be determined rather than printing a blank field.
+    version = provider.version
+    package_family_name = provider.package_family_name
+    if not version or not package_family_name:
+        parsed_version, parsed_pfn = _parse_ep_metadata_from_path(provider.library_path)
+        version = version or parsed_version
+        package_family_name = package_family_name or parsed_pfn
+    if version:
+        console.print(f"- Version: {version}", soft_wrap=True)
+    if package_family_name:
+        # soft_wrap so long package family names aren't hard-wrapped mid-string.
+        console.print(f"- Package Family Name: {package_family_name}", soft_wrap=True)
+
+
 # Singleton instance
 _winml_ep_registry: WinMLEPRegistry | None = None
 
@@ -80,9 +272,19 @@ def _load_ep_catalog(self) -> None:
         with EpCatalog() as catalog:
             for provider in catalog.find_all_providers():
                 try:
-                    provider.ensure_ready()
+                    _ensure_provider_ready(provider)
+                except OSError as e:
+                    # windowsml maps native HRESULT failures to OSError; surface
+                    # winerror so the HRESULT is grep-able in logs.
+                    logger.info(
+                        "Failed to ensure EP %s is ready: %s (winerror=%s)",
+                        provider.name,
+                        e,
+                        getattr(e, "winerror", None),
+                    )
+                    continue
                 except Exception as e:
-                    logger.debug("Failed to ensure EP %s is ready: %s", provider.name, e)
+                    logger.info("Failed to ensure EP %s is ready: %s", provider.name, e)
                     continue
                 if provider.library_path == "":
                     continue
diff --git a/tests/unit/session/test_ep_registry.py b/tests/unit/session/test_ep_registry.py
new file mode 100644
index 000000000..f7c4b15c8
--- /dev/null
+++ b/tests/unit/session/test_ep_registry.py
@@ -0,0 +1,514 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+"""Tests for ep_registry module helpers."""
+
+from __future__ import annotations
+
+import logging
+import sys
+import types
+from unittest.mock import MagicMock
+
+import pytest
+
+
+def _install_fake_windowsml(monkeypatch: pytest.MonkeyPatch) -> types.SimpleNamespace:
+    """Inject a fake ``windowsml`` module exposing only what the helper needs."""
+
+    class _EpReadyState:
+        Ready = 0
+        NotReady = 1
+        NotPresent = 2
+
+    fake = types.ModuleType("windowsml")
+    fake.EpReadyState = _EpReadyState  # type: ignore[attr-defined]
+    monkeypatch.setitem(sys.modules, "windowsml", fake)
+    return types.SimpleNamespace(EpReadyState=_EpReadyState)
+
+
+def _install_fake_tqdm(monkeypatch: pytest.MonkeyPatch) -> MagicMock:
+    """Inject a fake ``tqdm.tqdm``. Helper writes ``bar.n`` directly + refresh()."""
+
+    fake_bar = MagicMock()
+    fake_bar.n = 0
+
+    tqdm_mod = types.ModuleType("tqdm")
+    tqdm_mod.tqdm = MagicMock(return_value=fake_bar)  # type: ignore[attr-defined]
+    monkeypatch.setitem(sys.modules, "tqdm", tqdm_mod)
+    return fake_bar
+
+
+def test_ensure_provider_ready_skips_progress_when_already_ready(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """Ready providers take the sync fast path and skip the async/progress flow."""
+    from winml.modelkit.session import ep_registry
+
+    ns = _install_fake_windowsml(monkeypatch)
+    provider = MagicMock()
+    provider.ready_state = ns.EpReadyState.Ready
+
+    ep_registry._ensure_provider_ready(provider)
+
+    provider.ensure_ready.assert_called_once_with()
+    provider.ensure_ready_async.assert_not_called()
+
+
+def test_ensure_provider_ready_drives_progress_bar(monkeypatch: pytest.MonkeyPatch) -> None:
+    """NotReady providers go through ensure_ready_async; on_progress drives a tqdm bar."""
+    from winml.modelkit.session import ep_registry
+
+    ns = _install_fake_windowsml(monkeypatch)
+    fake_bar = _install_fake_tqdm(monkeypatch)
+
+    op = MagicMock()
+
+    def fake_ensure_async(on_complete=None, on_progress=None):
+        # Simulate cumulative-fraction progress callbacks, then completion.
+        for fraction in (0.0, 0.25, 0.5, 1.0):
+            on_progress(fraction)
+        on_complete()
+        return op
+
+    provider = MagicMock()
+    provider.name = "FakeEP"
+    provider.ready_state = ns.EpReadyState.NotPresent
+    provider.ensure_ready_async.side_effect = fake_ensure_async
+
+    ep_registry._ensure_provider_ready(provider)
+
+    provider.ensure_ready.assert_not_called()
+    provider.ensure_ready_async.assert_called_once()
+    op.get_status.assert_called_once_with()
+    op.cancel.assert_not_called()
+    op.close.assert_called_once_with()
+    fake_bar.close.assert_called_once_with()
+    # Success path forces bar.n to 100 even though the last fraction was 1.0.
+    assert fake_bar.n == 100
+    fake_bar.refresh.assert_called()
+
+
+def test_ensure_provider_ready_ignores_stale_progress_after_complete(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """A stale on_progress fired after on_complete must NOT clobber bar.n.
+
+    The native layer can fire a late progress callback from its own thread
+    after on_complete; once `done` is set, the main thread owns bar.n and the
+    callback should drop instead of writing an earlier fraction back over the
+    forced 100 (or worse, writing to an already-closed bar)."""
+    from winml.modelkit.session import ep_registry
+
+    ns = _install_fake_windowsml(monkeypatch)
+    fake_bar = _install_fake_tqdm(monkeypatch)
+
+    op = MagicMock()
+    saved_progress: list = []
+
+    def fake_ensure_async(on_complete=None, on_progress=None):
+        on_progress(0.5)
+        on_complete()
+        saved_progress.append(on_progress)  # Fire a stale callback after.
+        return op
+
+    provider = MagicMock()
+    provider.name = "FakeEP"
+    provider.ready_state = ns.EpReadyState.NotPresent
+    provider.ensure_ready_async.side_effect = fake_ensure_async
+
+    ep_registry._ensure_provider_ready(provider)
+
+    # Fire the stale callback as if it arrived after _ensure_provider_ready
+    # already set bar.n = 100 and closed the bar.
+    assert fake_bar.n == 100
+    saved_progress[0](0.62)
+    # Stale callback must have been dropped — bar.n stays at 100.
+    assert fake_bar.n == 100
+
+
+def test_ensure_provider_ready_warns_before_download(
+    monkeypatch: pytest.MonkeyPatch, capsys: pytest.CaptureFixture[str]
+) -> None:
+    """A yellow notice is printed to the stderr Console before download
+    so users know the wait is expected."""
+    from winml.modelkit.session import ep_registry
+
+    ns = _install_fake_windowsml(monkeypatch)
+    _install_fake_tqdm(monkeypatch)
+
+    op = MagicMock()
+
+    def fake_ensure_async(on_complete=None, on_progress=None):
+        on_complete()
+        return op
+
+    provider = MagicMock()
+    provider.name = "FakeEP"
+    provider.ready_state = ns.EpReadyState.NotPresent
+    provider.ensure_ready_async.side_effect = fake_ensure_async
+
+    ep_registry._ensure_provider_ready(provider)
+
+    err = capsys.readouterr().err
+    assert "[WinML] Installing Execution Provider" in err
+    assert "FakeEP" in err
+
+
+def test_ensure_provider_ready_forces_bar_to_100_on_success_without_progress(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """If the async op completes successfully without ever firing on_progress,
+    the success path forces the bar to 100 so the final render shows full."""
+    from winml.modelkit.session import ep_registry
+
+    ns = _install_fake_windowsml(monkeypatch)
+    fake_bar = _install_fake_tqdm(monkeypatch)
+
+    op = MagicMock()
+
+    def fake_ensure_async(on_complete=None, on_progress=None):
+        on_complete()  # No progress firings, but completes immediately.
+        return op
+
+    provider = MagicMock()
+    provider.name = "FakeEP"
+    provider.ready_state = ns.EpReadyState.NotReady
+    provider.ensure_ready_async.side_effect = fake_ensure_async
+
+    ep_registry._ensure_provider_ready(provider)
+
+    assert fake_bar.n == 100
+    fake_bar.close.assert_called_once_with()
+    op.close.assert_called_once_with()
+
+
+def test_ensure_provider_ready_times_out_and_cancels(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """When on_complete never fires within the timeout, cancel and raise TimeoutError."""
+    from winml.modelkit.session import ep_registry
+
+    ns = _install_fake_windowsml(monkeypatch)
+    fake_bar = _install_fake_tqdm(monkeypatch)
+
+    # Shrink the timeout so the test runs in milliseconds, not minutes.
+    monkeypatch.setattr(ep_registry, "EP_DOWNLOAD_TIMEOUT_SECONDS", 0.05)
+
+    op = MagicMock()
+    provider = MagicMock()
+    provider.name = "SlowEP"
+    provider.ready_state = ns.EpReadyState.NotPresent
+    # ensure_ready_async returns op but never calls on_complete -> times out.
+    provider.ensure_ready_async.return_value = op
+
+    with pytest.raises(TimeoutError, match="SlowEP"):
+        ep_registry._ensure_provider_ready(provider)
+
+    op.cancel.assert_called_once_with()
+    op.close.assert_called_once_with()
+    fake_bar.close.assert_called_once_with()
+    # Bar must NOT be force-filled on timeout — it should reflect where the
+    # download stalled (here: 0 because no progress callbacks ever fired).
+    assert fake_bar.n == 0
+
+
+def test_ensure_provider_ready_surfaces_get_status_error(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """A failure surfaced by get_status() propagates after cleanup."""
+    from winml.modelkit.session import ep_registry
+
+    ns = _install_fake_windowsml(monkeypatch)
+    fake_bar = _install_fake_tqdm(monkeypatch)
+
+    op = MagicMock()
+    op.get_status.side_effect = OSError("native error")
+
+    def fake_ensure_async(on_complete=None, on_progress=None):
+        on_complete()
+        return op
+
+    provider = MagicMock()
+    provider.name = "FakeEP"
+    provider.ready_state = ns.EpReadyState.NotPresent
+    provider.ensure_ready_async.side_effect = fake_ensure_async
+
+    with pytest.raises(OSError, match="native error"):
+        ep_registry._ensure_provider_ready(provider)
+
+    fake_bar.close.assert_called_once_with()
+    op.close.assert_called_once_with()
+    # Native error must NOT force-fill the bar — it should reflect where
+    # the download failed (here: 0 because no progress callbacks fired).
+    assert fake_bar.n == 0
+
+
+def test_ensure_provider_ready_prints_success_with_metadata(
+    monkeypatch: pytest.MonkeyPatch, capsys: pytest.CaptureFixture[str]
+) -> None:
+    """After a successful install, print '<EP> EP installed successfully.'
+    followed by Version and Package Family Name lines."""
+    from winml.modelkit.session import ep_registry
+
+    ns = _install_fake_windowsml(monkeypatch)
+    _install_fake_tqdm(monkeypatch)
+
+    op = MagicMock()
+
+    def fake_ensure_async(on_complete=None, on_progress=None):
+        on_complete()
+        return op
+
+    provider = MagicMock()
+    provider.name = "OpenVINOExecutionProvider"
+    provider.version = "1.2.0"
+    provider.package_family_name = "Microsoft.OpenVINOExecutionProvider_8wekyb3d8bbwe"
+    provider.ready_state = ns.EpReadyState.NotPresent
+    provider.ensure_ready_async.side_effect = fake_ensure_async
+
+    ep_registry._ensure_provider_ready(provider)
+
+    err = capsys.readouterr().err
+    assert "OpenVINOExecutionProvider EP installed successfully." in err
+    assert "- Version: 1.2.0" in err
+    assert "- Package Family Name: Microsoft.OpenVINOExecutionProvider_8wekyb3d8bbwe" in err
+
+
+def test_ensure_provider_ready_falls_back_to_path_metadata_when_native_empty(
+    monkeypatch: pytest.MonkeyPatch, capsys: pytest.CaptureFixture[str]
+) -> None:
+    """When the native handle reports empty version/PFN, recover both from the
+    MSIX install path."""
+    from winml.modelkit.session import ep_registry
+
+    ns = _install_fake_windowsml(monkeypatch)
+    _install_fake_tqdm(monkeypatch)
+
+    op = MagicMock()
+
+    def fake_ensure_async(on_complete=None, on_progress=None):
+        on_complete()
+        return op
+
+    provider = MagicMock()
+    provider.name = "OpenVINOExecutionProvider"
+    provider.version = ""
+    provider.package_family_name = ""
+    provider.library_path = (
+        r"C:\Program Files\WindowsApps"
+        r"\MicrosoftCorporationII.WinML.Intel.OpenVINO.EP.1.8_1.8.79.0_x64__8wekyb3d8bbwe"
+        r"\ExecutionProvider\onnxruntime_providers_openvino_plugin.dll"
+    )
+    provider.ready_state = ns.EpReadyState.NotPresent
+    provider.ensure_ready_async.side_effect = fake_ensure_async
+
+    ep_registry._ensure_provider_ready(provider)
+
+    err = capsys.readouterr().err
+    assert "- Version: 1.8.79.0" in err
+    assert (
+        "- Package Family Name: MicrosoftCorporationII.WinML.Intel.OpenVINO.EP.1.8_8wekyb3d8bbwe"
+        in err
+    )
+
+
+def test_ensure_provider_ready_skips_metadata_lines_when_unavailable(
+    monkeypatch: pytest.MonkeyPatch, capsys: pytest.CaptureFixture[str]
+) -> None:
+    """If neither the native handle nor the path yields metadata, omit the
+    Version / Package Family Name lines entirely (no blank fields)."""
+    from winml.modelkit.session import ep_registry
+
+    ns = _install_fake_windowsml(monkeypatch)
+    _install_fake_tqdm(monkeypatch)
+
+    op = MagicMock()
+
+    def fake_ensure_async(on_complete=None, on_progress=None):
+        on_complete()
+        return op
+
+    provider = MagicMock()
+    provider.name = "MysteryEP"
+    provider.version = ""
+    provider.package_family_name = ""
+    provider.library_path = r"C:\some\local\path\provider.dll"
+    provider.ready_state = ns.EpReadyState.NotPresent
+    provider.ensure_ready_async.side_effect = fake_ensure_async
+
+    ep_registry._ensure_provider_ready(provider)
+
+    err = capsys.readouterr().err
+    assert "MysteryEP EP installed successfully." in err
+    assert "- Version:" not in err
+    assert "- Package Family Name:" not in err
+
+
+def test_ensure_provider_ready_prints_failure_message_on_timeout(
+    monkeypatch: pytest.MonkeyPatch, capsys: pytest.CaptureFixture[str]
+) -> None:
+    """A timed-out download prints the ❌ failure notice with retry hints,
+    and does NOT emit the 'installed successfully.' line."""
+    from winml.modelkit.session import ep_registry
+
+    ns = _install_fake_windowsml(monkeypatch)
+    _install_fake_tqdm(monkeypatch)
+    monkeypatch.setattr(ep_registry, "EP_DOWNLOAD_TIMEOUT_SECONDS", 0.05)
+
+    provider = MagicMock()
+    provider.name = "SlowEP"
+    provider.ready_state = ns.EpReadyState.NotPresent
+    provider.ensure_ready_async.return_value = MagicMock()
+
+    with pytest.raises(TimeoutError):
+        ep_registry._ensure_provider_ready(provider)
+
+    err = capsys.readouterr().err
+    assert "installed successfully" not in err
+    assert "Failed to download SlowEP EP" in err
+    assert "Check your internet connection" in err
+    assert "Troubleshoot:" in err
+    assert "https://aka.ms/winmlcli/ep-errors" in err
+
+
+def test_ensure_provider_ready_prints_failure_message_on_async_launch_error(
+    monkeypatch: pytest.MonkeyPatch, capsys: pytest.CaptureFixture[str]
+) -> None:
+    """A failure at the ensure_ready_async() launch also prints the ❌ block."""
+    from winml.modelkit.session import ep_registry
+
+    ns = _install_fake_windowsml(monkeypatch)
+    _install_fake_tqdm(monkeypatch)
+
+    provider = MagicMock()
+    provider.name = "BadEP"
+    provider.ready_state = ns.EpReadyState.NotPresent
+    provider.ensure_ready_async.side_effect = OSError("native launch failed")
+
+    with pytest.raises(OSError, match="native launch failed"):
+        ep_registry._ensure_provider_ready(provider)
+
+    err = capsys.readouterr().err
+    assert "Failed to download BadEP EP" in err
+    assert "installed successfully" not in err
+
+
+def test_ensure_provider_ready_closes_bar_when_ensure_ready_async_raises(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """If ensure_ready_async itself raises, the bar must still be closed
+    (op was never assigned, so op.close() is skipped via the None sentinel)."""
+    from winml.modelkit.session import ep_registry
+
+    ns = _install_fake_windowsml(monkeypatch)
+    fake_bar = _install_fake_tqdm(monkeypatch)
+
+    provider = MagicMock()
+    provider.name = "FakeEP"
+    provider.ready_state = ns.EpReadyState.NotPresent
+    provider.ensure_ready_async.side_effect = RuntimeError("native init failed")
+
+    with pytest.raises(RuntimeError, match="native init failed"):
+        ep_registry._ensure_provider_ready(provider)
+
+    fake_bar.close.assert_called_once_with()
+
+
+def test_ensure_provider_ready_works_without_tqdm(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    """When tqdm (a dev-only optional dep) is missing, the download still
+    completes via the _NoopBar fallback — no ImportError, no progress UI."""
+    from winml.modelkit.session import ep_registry
+
+    ns = _install_fake_windowsml(monkeypatch)
+    # Simulate tqdm being uninstalled: make `from tqdm import tqdm` raise.
+    monkeypatch.setitem(sys.modules, "tqdm", None)
+
+    op = MagicMock()
+
+    def fake_ensure_async(on_complete=None, on_progress=None):
+        # Drive a progress update too — the no-op bar must tolerate bar.n = ...
+        on_progress(0.5)
+        on_complete()
+        return op
+
+    provider = MagicMock()
+    provider.name = "FakeEP"
+    provider.ready_state = ns.EpReadyState.NotPresent
+    provider.ensure_ready_async.side_effect = fake_ensure_async
+
+    ep_registry._ensure_provider_ready(provider)
+
+    op.get_status.assert_called_once_with()
+    op.close.assert_called_once_with()
+
+
+class TestParseEpMetadataFromPath:
+    """`_parse_ep_metadata_from_path` recovers (version, PFN) from install paths."""
+
+    def test_parses_openvino_windowsapps_path(self) -> None:
+        from winml.modelkit.session import ep_registry
+
+        path = (
+            r"C:\Program Files\WindowsApps"
+            r"\MicrosoftCorporationII.WinML.Intel.OpenVINO.EP.1.8_1.8.79.0_x64__8wekyb3d8bbwe"
+            r"\ExecutionProvider\onnxruntime_providers_openvino_plugin.dll"
+        )
+        version, pfn = ep_registry._parse_ep_metadata_from_path(path)
+        assert version == "1.8.79.0"
+        assert pfn == "MicrosoftCorporationII.WinML.Intel.OpenVINO.EP.1.8_8wekyb3d8bbwe"
+
+    def test_empty_path_returns_empty(self) -> None:
+        from winml.modelkit.session import ep_registry
+
+        assert ep_registry._parse_ep_metadata_from_path("") == ("", "")
+
+    def test_non_windowsapps_path_returns_empty(self) -> None:
+        from winml.modelkit.session import ep_registry
+
+        assert ep_registry._parse_ep_metadata_from_path(r"C:\local\ep\provider.dll") == ("", "")
+
+    def test_non_numeric_version_segment_dropped_but_pfn_kept(self) -> None:
+        """A folder that doesn't carry a dotted-numeric version still yields a
+        PFN, but the version is left empty rather than guessed."""
+        from winml.modelkit.session import ep_registry
+
+        path = r"C:\Program Files\WindowsApps\Some.Package_notaversion_x64__pubhash\ep.dll"
+        version, pfn = ep_registry._parse_ep_metadata_from_path(path)
+        assert version == ""
+        assert pfn == "Some.Package_pubhash"
+
+
+class TestEpDownloadTimeoutDefault:
+    """`_ep_download_timeout_default` reads ``WINMLCLI_EP_DOWNLOAD_TIMEOUT``."""
+
+    def test_default_when_unset(self, monkeypatch: pytest.MonkeyPatch) -> None:
+        from winml.modelkit.session import ep_registry
+
+        monkeypatch.delenv("WINMLCLI_EP_DOWNLOAD_TIMEOUT", raising=False)
+        assert ep_registry._ep_download_timeout_default() == 5 * 60
+
+    def test_override_via_env(self, monkeypatch: pytest.MonkeyPatch) -> None:
+        from winml.modelkit.session import ep_registry
+
+        monkeypatch.setenv("WINMLCLI_EP_DOWNLOAD_TIMEOUT", "1800")
+        assert ep_registry._ep_download_timeout_default() == 1800
+
+    def test_falls_back_to_default_on_invalid_value(
+        self, monkeypatch: pytest.MonkeyPatch, caplog: pytest.LogCaptureFixture
+    ) -> None:
+        from winml.modelkit.session import ep_registry
+
+        monkeypatch.setenv("WINMLCLI_EP_DOWNLOAD_TIMEOUT", "not-a-number")
+        with caplog.at_level(logging.WARNING, logger=ep_registry.logger.name):
+            assert ep_registry._ep_download_timeout_default() == 5 * 60
+        assert any("WINMLCLI_EP_DOWNLOAD_TIMEOUT" in r.getMessage() for r in caplog.records)
+
+    def test_empty_string_uses_default(self, monkeypatch: pytest.MonkeyPatch) -> None:
+        from winml.modelkit.session import ep_registry
+
+        monkeypatch.setenv("WINMLCLI_EP_DOWNLOAD_TIMEOUT", "")
+        assert ep_registry._ep_download_timeout_default() == 5 * 60

From be6a9398c6f0fb51f6d2ae36d3b40aa214f3aa1d Mon Sep 17 00:00:00 2001
From: vortex-captain <75063846+vortex-captain@users.noreply.github.com>
Date: Thu, 4 Jun 2026 17:04:06 +0800
Subject: [PATCH 034/143] fix(export): resolve timm image_size from
 pretrained_cfg (#806)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary

- Fix `winml config` emitting `[1, 3, 64, 64]` instead of `[1, 3, 224,
224]` for timm image-classification models (e.g.
`timm/repghostnet_200.in1k`).
- Root cause: `TimmWrapperConfig` stores shape info in
`pretrained_cfg["input_size"]` (a plain dict). Optimum's
`NormalizedConfig` only walks `PretrainedConfig` children, so the value
is invisible to `DummyVisionInputGenerator`, which then defaults to
64x64. timm models also have no `preprocessor_config.json` on the hub,
so winml's existing fallback misses too.
- Fix: when `preprocessor_config.json` is unavailable, synthesize a
preprocessor-style dict from `hf_config.pretrained_cfg.input_size`. The
existing size-parsing block (`size` int / dict-with-height-width /
shortest_edge) is unchanged — the timm concern is isolated at the
data-fetch boundary in `_get_preprocessor_dict` /
`_synthesize_preprocessor_dict`.
- Registry: added `timm/mobilenetv3_small_100.lamb_in1k` and
`timm/repghostnet_200.in1k` to `models_all.json`; both PASS perf on CPU.
Perf output confirms `pixel_values [1, 3, 224, 224] float32`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Yi Ren <reny@microsoft.com>
---
 scripts/e2e_eval/testsets/models_all.json |  22 +++
 src/winml/modelkit/export/io.py           | 111 +++++++++++----
 src/winml/modelkit/inspect/resolver.py    |   8 +-
 tests/unit/export/test_io.py              | 165 ++++++++++++++++++++--
 4 files changed, 266 insertions(+), 40 deletions(-)

diff --git a/scripts/e2e_eval/testsets/models_all.json b/scripts/e2e_eval/testsets/models_all.json
index f2cc97f46..d0710b409 100644
--- a/scripts/e2e_eval/testsets/models_all.json
+++ b/scripts/e2e_eval/testsets/models_all.json
@@ -4839,6 +4839,28 @@
     "optimum_supported": false,
     "order": 6
   },
+  {
+    "hf_id": "timm/mobilenetv3_small_100.lamb_in1k",
+    "task": "image-classification",
+    "model_type": "timm_wrapper",
+    "group": "ISV",
+    "priority": "P1",
+    "downloads": 0,
+    "last_update_time": "2024-01-01T00:00:00+00:00",
+    "optimum_supported": true,
+    "order": 1
+  },
+  {
+    "hf_id": "timm/repghostnet_200.in1k",
+    "task": "image-classification",
+    "model_type": "timm_wrapper",
+    "group": "ISV",
+    "priority": "P1",
+    "downloads": 0,
+    "last_update_time": "2024-01-01T00:00:00+00:00",
+    "optimum_supported": true,
+    "order": 2
+  },
   {
     "hf_id": "timpal0l/mdeberta-v3-base-squad2",
     "task": "question-answering",
diff --git a/src/winml/modelkit/export/io.py b/src/winml/modelkit/export/io.py
index bc8011660..a2111a86b 100644
--- a/src/winml/modelkit/export/io.py
+++ b/src/winml/modelkit/export/io.py
@@ -206,16 +206,22 @@ def _get_onnx_config(
 def _populate_image_size_from_preprocessor(
     model_id: str | None,
     shape_kwargs: dict,
+    hf_config: PretrainedConfig | None = None,
 ) -> None:
-    """Populate height/width in shape_kwargs from preprocessor_config.json.
+    """Populate height/width in shape_kwargs from preprocessor metadata.
 
     Optimum's DummyVisionInputGenerator falls back to 64x64 when model config
-    lacks image_size (e.g., ResNet). This reads the correct size from
-    preprocessor_config.json and injects it into shape_kwargs.
+    lacks image_size (e.g., ResNet, timm). This reads the correct size from
+    a preprocessor_config-style dict obtained via :func:`_get_preprocessor_dict`
+    (which consults the hub's ``preprocessor_config.json`` first and, when that
+    is unavailable, synthesizes one from wrapper-config metadata such as
+    ``TimmWrapperConfig.pretrained_cfg``).
 
     Args:
         model_id: HuggingFace model identifier (e.g., "microsoft/resnet-50")
         shape_kwargs: Mutable dict to update with height/width if found
+        hf_config: HuggingFace PretrainedConfig used to synthesize a
+            preprocessor dict when ``preprocessor_config.json`` is missing.
     """
     if not model_id:
         return
@@ -223,32 +229,89 @@ def _populate_image_size_from_preprocessor(
     if "height" in shape_kwargs or "width" in shape_kwargs:
         return
 
+    config = _get_preprocessor_dict(model_id, hf_config)
+    size = config.get("size")
+
+    if isinstance(size, int):
+        shape_kwargs["height"] = size
+        shape_kwargs["width"] = size
+    elif isinstance(size, dict):
+        if "height" in size:
+            shape_kwargs["height"] = size["height"]
+            shape_kwargs["width"] = size["width"]
+        elif "shortest_edge" in size:
+            shape_kwargs["height"] = size["shortest_edge"]
+            shape_kwargs["width"] = size["shortest_edge"]
+
+    if "height" in shape_kwargs:
+        logger.debug(
+            "Loaded image size from preprocessor dict: %dx%d",
+            shape_kwargs["height"],
+            shape_kwargs["width"],
+        )
+
+
+def _get_preprocessor_dict(
+    model_id: str | None,
+    hf_config: PretrainedConfig | None,
+) -> dict:
+    """Return a ``preprocessor_config.json``-style dict for the model.
+
+    Resolution order:
+
+    1. ``preprocessor_config.json`` fetched from the hub (standard HF vision),
+       used only when it carries a ``size`` key.
+    2. Synthesized from a nested plain-dict attribute on ``hf_config``
+       carrying ``input_size`` or ``image_size`` (e.g.
+       ``TimmWrapperConfig.pretrained_cfg``). Reached when the hub file is
+       unavailable *or* present but missing ``size`` (a partial config).
+
+    Returns the dict in the standard preprocessor schema (``{"size": ...}``)
+    so downstream parsing logic does not need to know which source it came
+    from. Returns an empty dict when neither source yields a usable size.
+    """
     try:
         from transformers.image_processing_utils import ImageProcessingMixin
 
         config, _ = ImageProcessingMixin.get_image_processor_dict(model_id)
-        size = config.get("size")
-
-        if isinstance(size, int):
-            shape_kwargs["height"] = size
-            shape_kwargs["width"] = size
-        elif isinstance(size, dict):
-            if "height" in size:
-                shape_kwargs["height"] = size["height"]
-                shape_kwargs["width"] = size["width"]
-            elif "shortest_edge" in size:
-                shape_kwargs["height"] = size["shortest_edge"]
-                shape_kwargs["width"] = size["shortest_edge"]
-
-        if "height" in shape_kwargs:
-            logger.debug(
-                "Loaded image size from preprocessor_config.json: %dx%d",
-                shape_kwargs["height"],
-                shape_kwargs["width"],
-            )
+        if "size" in config:
+            return config
+        # Partial preprocessor_config.json without a "size" key: fall through
+        # to synthesis so we don't silently use Optimum's 64x64 default.
     except (OSError, ValueError, KeyError) as e:
+        # if model_id is None, OSError is raised
         logger.debug("Could not load preprocessor_config.json for %s: %s", model_id, e)
 
+    if hf_config is not None:
+        return _synthesize_preprocessor_dict(hf_config)
+    return {}
+
+
+def _synthesize_preprocessor_dict(hf_config: PretrainedConfig) -> dict:
+    """Build a ``preprocessor_config.json``-style dict from ``hf_config.pretrained_cfg``.
+
+    timm wrapper configs (``TimmWrapperConfig``) stash shape metadata in a
+    ``pretrained_cfg`` dict carrying ``input_size = [C, H, W]``. Optimum's
+    NormalizedConfig only walks ``PretrainedConfig`` children, so this
+    dict-wrapped value is invisible to the dummy-input generator and it
+    falls back to 64x64.
+
+    Preprocessing keys (``mean``/``std``/``interpolation``/``crop_pct``)
+    don't affect export tensor shapes and are intentionally ignored.
+    """
+    pretrained_cfg = getattr(hf_config, "pretrained_cfg", None)
+    if not isinstance(pretrained_cfg, dict):
+        return {}
+
+    input_size = pretrained_cfg.get("input_size")
+    if isinstance(input_size, (list, tuple)):
+        if len(input_size) == 3:
+            return {"size": {"height": input_size[1], "width": input_size[2]}}
+        if len(input_size) == 1:
+            return {"size": input_size[0]}
+
+    return {}
+
 
 # Practical cap for export dummy input sequence length.
 # LLMs have max_position_embeddings of 40K-131K which would OOM during export.
@@ -339,7 +402,7 @@ def generate_dummy_inputs(
     onnx_config.float_dtype = float_dtype
 
     shape_kwargs["batch_size"] = batch_size
-    _populate_image_size_from_preprocessor(model_id, shape_kwargs)
+    _populate_image_size_from_preprocessor(model_id, shape_kwargs, hf_config)
     _populate_sequence_length_from_config(hf_config, shape_kwargs)
 
     logger.debug(
@@ -402,7 +465,7 @@ def resolve_io_specs(
 
     # Populate shapes from model config / preprocessor
     shape_kwargs["batch_size"] = batch_size
-    _populate_image_size_from_preprocessor(model_id, shape_kwargs)
+    _populate_image_size_from_preprocessor(model_id, shape_kwargs, hf_config)
     _populate_sequence_length_from_config(hf_config, shape_kwargs)
 
     # Generate dummy inputs for concrete shapes and dtypes,
diff --git a/src/winml/modelkit/inspect/resolver.py b/src/winml/modelkit/inspect/resolver.py
index 91a791bb9..49b31aa41 100644
--- a/src/winml/modelkit/inspect/resolver.py
+++ b/src/winml/modelkit/inspect/resolver.py
@@ -812,14 +812,16 @@ def get_config_attr(
         if val is not None:
             extra[attr] = val
 
-    # Step 5: Fallback — read image_size from preprocessor_config.json
-    # for models like ResNet where HF config lacks image_size
+    # Step 5: Fallback — read image_size from a preprocessor-style dict
+    # (preprocessor_config.json on the hub, or synthesized from a nested
+    # dict on hf_config such as TimmWrapperConfig.pretrained_cfg) when the
+    # top-level HF config lacks image_size.
     if image_size is None and model_id is not None:
         try:
             from ..export.io import _populate_image_size_from_preprocessor
 
             shape_kwargs: dict = {}
-            _populate_image_size_from_preprocessor(model_id, shape_kwargs)
+            _populate_image_size_from_preprocessor(model_id, shape_kwargs, config)
             if "height" in shape_kwargs:
                 h, w = shape_kwargs["height"], shape_kwargs["width"]
                 image_size = h if h == w else (h, w)
diff --git a/tests/unit/export/test_io.py b/tests/unit/export/test_io.py
index 36f065f49..d030b5e7a 100644
--- a/tests/unit/export/test_io.py
+++ b/tests/unit/export/test_io.py
@@ -19,7 +19,6 @@
 import pytest
 import torch
 from transformers import (
-    AutoConfig,
     CLIPTextConfig,
     CLIPTextModelWithProjection,
     CLIPVisionConfig,
@@ -676,6 +675,122 @@ def test_no_size_key_in_config(self) -> None:
         assert "height" not in shape_kwargs
         assert "width" not in shape_kwargs
 
+    def test_partial_preprocessor_without_size_falls_back_to_synthesis(self) -> None:
+        """A partial preprocessor_config.json (no ``size``) synthesizes from hf_config.
+
+        Without the fall-through, a hub dict carrying only mean/std would leave
+        ``size`` unresolved and Optimum would default to 64x64.
+        """
+        mock_config = {"mean": [0.5, 0.5, 0.5], "std": [0.5, 0.5, 0.5]}  # no "size"
+        hf_config = SimpleNamespace(pretrained_cfg={"input_size": [3, 224, 224]})
+        shape_kwargs: dict = {}
+
+        with patch(
+            "transformers.image_processing_utils.ImageProcessingMixin.get_image_processor_dict",
+            return_value=(mock_config, {}),
+        ):
+            _populate_image_size_from_preprocessor(
+                "timm/some-model",
+                shape_kwargs,
+                hf_config,
+            )
+
+        assert shape_kwargs["height"] == 224
+        assert shape_kwargs["width"] == 224
+
+    def test_nested_dict_input_size_chw(self) -> None:
+        """``pretrained_cfg.input_size = [C, H, W]`` (timm) synthesizes a size dict."""
+        hf_config = SimpleNamespace(
+            pretrained_cfg={"input_size": [3, 224, 224], "mean": [0.485, 0.456, 0.406]},
+        )
+        shape_kwargs: dict = {}
+
+        # No preprocessor_config.json on the hub -> synthesize from hf_config.
+        with patch(
+            "transformers.image_processing_utils.ImageProcessingMixin.get_image_processor_dict",
+            side_effect=OSError("404"),
+        ):
+            _populate_image_size_from_preprocessor(
+                "timm/some-model",
+                shape_kwargs,
+                hf_config,
+            )
+
+        assert shape_kwargs["height"] == 224
+        assert shape_kwargs["width"] == 224
+
+    def test_preprocessor_takes_precedence_over_nested_dict(self) -> None:
+        """When preprocessor_config.json resolves, nested dict is not consulted."""
+        hf_config = SimpleNamespace(pretrained_cfg={"input_size": [3, 320, 320]})
+        shape_kwargs: dict = {}
+
+        with patch(
+            "transformers.image_processing_utils.ImageProcessingMixin.get_image_processor_dict",
+            return_value=({"size": 384}, {}),
+        ):
+            _populate_image_size_from_preprocessor(
+                "some-model/id",
+                shape_kwargs,
+                hf_config,
+            )
+
+        assert shape_kwargs["height"] == 384
+        assert shape_kwargs["width"] == 384
+
+    def test_nested_dict_input_size_scalar(self) -> None:
+        """``pretrained_cfg.input_size = [side]`` (length-1) maps to a square size."""
+        hf_config = SimpleNamespace(pretrained_cfg={"input_size": [320]})
+        shape_kwargs: dict = {}
+
+        with patch(
+            "transformers.image_processing_utils.ImageProcessingMixin.get_image_processor_dict",
+            side_effect=OSError("404"),
+        ):
+            _populate_image_size_from_preprocessor(
+                "some/model",
+                shape_kwargs,
+                hf_config,
+            )
+
+        assert shape_kwargs["height"] == 320
+        assert shape_kwargs["width"] == 320
+
+    def test_pretrained_cfg_without_input_size_ignored(self) -> None:
+        """``pretrained_cfg`` without ``input_size`` (e.g. only mean/std) is skipped."""
+        hf_config = SimpleNamespace(
+            pretrained_cfg={"mean": [0.5, 0.5, 0.5], "std": [0.5, 0.5, 0.5]},
+        )
+        shape_kwargs: dict = {}
+
+        with patch(
+            "transformers.image_processing_utils.ImageProcessingMixin.get_image_processor_dict",
+            side_effect=OSError("404"),
+        ):
+            _populate_image_size_from_preprocessor(
+                "some/model",
+                shape_kwargs,
+                hf_config,
+            )
+
+        assert shape_kwargs == {}
+
+    def test_existing_height_blocks_nested_dict_too(self) -> None:
+        """If height/width already set, nested-dict path must also be skipped."""
+        hf_config = SimpleNamespace(pretrained_cfg={"input_size": [3, 224, 224]})
+        shape_kwargs = {"height": 128}
+
+        with patch(
+            "transformers.image_processing_utils.ImageProcessingMixin.get_image_processor_dict",
+            side_effect=OSError("404"),
+        ):
+            _populate_image_size_from_preprocessor(
+                "some/model",
+                shape_kwargs,
+                hf_config,
+            )
+
+        assert shape_kwargs == {"height": 128}
+
 
 # =============================================================================
 # PastKeyValueInputGenerator — shared KV cache dummy input generation
@@ -699,18 +814,42 @@ def _make_normalized_config(
 
 @pytest.fixture(scope="module")
 def t5_config():
-    """T5-small config with n_positions overridden to 32 for fast tests."""
-    cfg = AutoConfig.from_pretrained("google-t5/t5-small")
-    cfg.n_positions = 32
-    return cfg
+    """Synthetic T5Config — small dims, no network.
+
+    ``n_positions`` maps to ``max_cache_len`` (decoder static buffer size) via
+    the T5 NormalizedConfig, so it fixes the KV cache length at 32.
+    """
+    from transformers import T5Config
+
+    return T5Config(
+        d_model=32,
+        num_layers=2,
+        num_heads=2,
+        d_kv=16,
+        vocab_size=100,
+        n_positions=32,
+    )
 
 
 @pytest.fixture(scope="module")
 def qwen_config():
-    """Qwen3-0.6B config with max_position_embeddings overridden to 256."""
-    cfg = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")
-    cfg.max_position_embeddings = 256
-    return cfg
+    """Synthetic Qwen3Config — small dims, no network.
+
+    ``max_position_embeddings`` maps to ``max_cache_len`` via the Qwen
+    NormalizedConfig, so it fixes the KV cache length at 256.
+    """
+    from transformers import Qwen3Config
+
+    return Qwen3Config(
+        hidden_size=32,
+        num_hidden_layers=2,
+        num_attention_heads=4,
+        num_key_value_heads=2,
+        head_dim=8,
+        vocab_size=100,
+        intermediate_size=64,
+        max_position_embeddings=256,
+    )
 
 
 class TestPastKeyValueInputGenerator:
@@ -776,7 +915,7 @@ class TestT5DecoderKVInputs:
 
     def test_kv_input_names(self, t5_config) -> None:
         inputs = generate_dummy_inputs("t5", "text2text-generation", t5_config)
-        num_layers = t5_config.num_layers  # 6
+        num_layers = t5_config.num_layers  # 2 (synthetic)
         for i in range(num_layers):
             assert f"past_{i}_key" in inputs
             assert f"past_{i}_value" in inputs
@@ -784,7 +923,7 @@ def test_kv_input_names(self, t5_config) -> None:
     def test_kv_shape(self, t5_config) -> None:
         inputs = generate_dummy_inputs("t5", "text2text-generation", t5_config)
         kv = inputs["past_0_key"]
-        # [batch=1, heads=8, max_cache_len=32, d_kv=64]
+        # [batch=1, heads=num_heads, max_cache_len=32 (n_positions), d_kv]
         assert kv.shape == (1, t5_config.num_heads, 32, t5_config.d_kv)
 
     def test_decoder_attention_mask_matches_cache_len(self, t5_config) -> None:
@@ -802,7 +941,7 @@ class TestQwenPrefillKVInputs:
 
     def test_kv_input_names(self, qwen_config) -> None:
         inputs = generate_dummy_inputs("qwen3", "feature-extraction", qwen_config)
-        num_layers = qwen_config.num_hidden_layers  # 28
+        num_layers = qwen_config.num_hidden_layers  # 2 (synthetic)
         for i in range(num_layers):
             assert f"past_{i}_key" in inputs
             assert f"past_{i}_value" in inputs
@@ -810,7 +949,7 @@ def test_kv_input_names(self, qwen_config) -> None:
     def test_kv_shape(self, qwen_config) -> None:
         inputs = generate_dummy_inputs("qwen3", "feature-extraction", qwen_config)
         kv = inputs["past_0_key"]
-        # [batch=1, kv_heads=8, max_cache_len=256, head_dim=128]
+        # [batch=1, kv_heads, max_cache_len=256 (max_position_embeddings), head_dim]
         assert kv.shape == (1, qwen_config.num_key_value_heads, 256, qwen_config.head_dim)
 
     def test_attention_mask_matches_cache_len(self, qwen_config) -> None:

From cf96c1461f8f2bef3cb641c2baf24e6dc8b6b021 Mon Sep 17 00:00:00 2001
From: xieofxie <xieofxie@126.com>
Date: Thu, 4 Jun 2026 17:25:25 +0800
Subject: [PATCH 035/143] feat: update color for --list-device (#812)

before

<img width="512" height="125" alt="image"
src="https://github.com/user-attachments/assets/dd37f99c-72d9-43fd-aa53-99dbd3f0dca0"
/>

after

<img width="529" height="135" alt="image"
src="https://github.com/user-attachments/assets/e72b80ba-77dd-448a-a71c-5d6f01910750"
/>
---
 src/winml/modelkit/commands/sys.py | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/src/winml/modelkit/commands/sys.py b/src/winml/modelkit/commands/sys.py
index 312397008..e68e2d9b2 100644
--- a/src/winml/modelkit/commands/sys.py
+++ b/src/winml/modelkit/commands/sys.py
@@ -566,16 +566,24 @@ def _output_device_text(devices: list[dict[str, Any]]) -> None:
     console = _get_console()
     console.print("\n[bold blue]Available Devices (priority order)[/bold blue]")
     for dev in devices:
+        # highlight=False keeps Rich's auto-highlighter (which would otherwise
+        # colorize numbers, IDs, paths, etc.) off the device name. Markup styles
+        # ([bold], [cyan]) on the prefix still apply.
         console.print(
-            f"  [bold]#{dev['priority']}[/bold]  [cyan]{dev['type']:5s}[/cyan] {dev['name']}"
+            f"  [bold]#{dev['priority']}[/bold]  [cyan]{dev['type']:5s}[/cyan] {dev['name']}",
+            highlight=False,
         )
         details = dev.get("details", {})
         if "error" in details:
             console.print(f"             [red]Error: {details['error']}[/red]")
         elif dev["type"] in ("NPU", "GPU"):
+            # Explicit [green] markup with highlight=False — Rich's auto-highlighter
+            # mis-styles driver version strings, so opt out of it here and apply
+            # the green via markup we control.
             console.print(
-                f"             Driver: {details.get('driver', 'N/A')} | "
-                f"Manufacturer: {details.get('manufacturer', 'N/A')}"
+                f"             Driver: [bold bright_green]{details.get('driver', 'N/A')}[/] | "
+                f"Manufacturer: {details.get('manufacturer', 'N/A')}",
+                highlight=False,
             )
         elif dev["type"] == "CPU":
             console.print(

From 244f7b80c0764ed589385788d0d0cb1d5093e7a6 Mon Sep 17 00:00:00 2001
From: "Qiong Wu (qiowu)" <qiowu@microsoft.com>
Date: Thu, 4 Jun 2026 17:31:51 +0800
Subject: [PATCH 036/143] update eval scripts: add ONNX size tracking and
 output sanitization (#755)

## Summary
- Add `_compute_onnx_size()` to measure combined ONNX + `.data` file
sizes and include `onnx_size_bytes` in eval results
- Add `_sanitize_output()` to strip CLI chrome (Rich tables, device/IO
banners) from `eval_result.json`, keeping only error-relevant content
- Minor formatting fixes in `reporter.py`
---
 scripts/e2e_eval/run_eval.py       | 109 ++++++++++++++++++++++++++++-
 scripts/e2e_eval/utils/reporter.py |  29 ++++++--
 tests/unit/eval/test_eval.py       |  43 +++++++++++-
 3 files changed, 171 insertions(+), 10 deletions(-)

diff --git a/scripts/e2e_eval/run_eval.py b/scripts/e2e_eval/run_eval.py
index d47b6d3b1..365357c01 100644
--- a/scripts/e2e_eval/run_eval.py
+++ b/scripts/e2e_eval/run_eval.py
@@ -31,8 +31,10 @@
 import argparse
 import contextlib
 import json
+import logging
 import os
 import platform
+import re
 import shutil
 import subprocess
 import sys
@@ -69,6 +71,8 @@
 )
 
 
+logger = logging.getLogger(__name__)
+
 # ---------------------------------------------------------------------------
 # Constants
 # ---------------------------------------------------------------------------
@@ -233,6 +237,82 @@ def _utc_now() -> str:
     return datetime.now(timezone.utc).isoformat()
 
 
+def _compute_onnx_size(onnx_paths: dict[str, str]) -> int | None:
+    """Return combined size in bytes of all ONNX files + their external data companions.
+
+    Parses the ONNX proto to discover all referenced external data files (not just
+    the conventional `.data` suffix). Falls back to the `.data` companion heuristic
+    if proto parsing is unavailable.
+
+    Returns None if onnx_paths is empty or no files exist on disk.
+    """
+    if not onnx_paths:
+        return None
+    total = 0
+    found_any = False
+    for path_str in onnx_paths.values():
+        p = Path(path_str)
+        if not p.exists():
+            continue
+        total += p.stat().st_size
+        found_any = True
+        # Try to enumerate all external data files from the proto
+        try:
+            from winml.modelkit.onnx.external_data import get_external_data_files
+
+            ext_files = get_external_data_files(p)
+            for ext_name in ext_files:
+                ext_path = p.parent / ext_name
+                if ext_path.exists():
+                    total += ext_path.stat().st_size
+        except Exception:
+            # Fallback: check conventional .data companion
+            data_p = p.with_suffix(p.suffix + ".data")
+            if data_p.exists():
+                total += data_p.stat().st_size
+    return total if found_any else None
+
+
+# Lines that carry no diagnostic value in eval_result.json.
+# Matching is case-insensitive, anchored at line start.
+_NOISE_PATTERNS = (
+    "benchmarking onnx",
+    "device:",
+    "task:",
+    "latency (ms)",
+    "throughput:",
+    "results saved to",
+    "inputs:",
+    "outputs:",
+    "samples/sec",
+)
+_NOISE_RE = re.compile("|".join(re.escape(p) for p in _NOISE_PATTERNS), re.IGNORECASE)
+
+# Box-drawing characters used by Rich tables.
+_BOX_CHARS = frozenset("─│┌┐└┘├┤┬┴┼")
+
+
+def _sanitize_output(text: str) -> str:
+    """Strip routine CLI chrome from subprocess output, keeping error content.
+
+    Removes Rich benchmark tables, device/IO banners, and path lines that
+    bloat eval_result.json without aiding failure diagnosis. All classifier
+    patterns (see classifier.py) are error-related and survive this filter.
+    """
+    kept: list[str] = []
+    for line in text.splitlines():
+        stripped = line.strip()
+        if not stripped:
+            continue
+        # Drop box-drawing table rows
+        if stripped[0] in _BOX_CHARS:
+            continue
+        if _NOISE_RE.match(stripped):
+            continue
+        kept.append(stripped)
+    return "\n".join(kept)
+
+
 def _kill_process_tree(pid: int) -> None:
     """Kill a process and all its children.
 
@@ -1028,6 +1108,20 @@ def save_environment_info(path: Path) -> None:
     except (subprocess.TimeoutExpired, FileNotFoundError):
         pass  # git not available or timed out; commit info stays empty
 
+    # `winml sys --format json` captures hardware details (devices, EPs,
+    # backends) that the lightweight package-version probes above miss.
+    try:
+        result = subprocess.run(  # noqa: S603
+            [sys.executable, "-m", "winml", "sys", "--format", "json"],
+            capture_output=True,
+            text=True,
+            timeout=30,
+        )
+        if result.returncode == 0:
+            info["winml_sys"] = json.loads(result.stdout)
+    except (subprocess.TimeoutExpired, FileNotFoundError, json.JSONDecodeError) as exc:
+        logger.debug("winml sys skipped: %s", exc)
+
     path.parent.mkdir(parents=True, exist_ok=True)
     path.write_text(json.dumps(info, indent=2), encoding="utf-8")
 
@@ -1139,6 +1233,11 @@ def parse_args() -> argparse.Namespace:
         help="Skip report generation (useful when running per-model in a pipeline loop)",
     )
     parser.add_argument("--verbose", "-v", action="store_true", help="Verbose output")
+    parser.add_argument(
+        "--raw-output",
+        action="store_true",
+        help="Keep raw subprocess output in eval_result.json without sanitization",
+    )
     parser.add_argument(
         "--continue",
         dest="continue_run",
@@ -1399,6 +1498,7 @@ def main() -> None:
                 ep=args.ep,
             )
             onnx_paths = build_result["onnx_paths"] if build_result["success"] else {}
+            onnx_size = _compute_onnx_size(onnx_paths)
 
             if not build_result["success"]:
                 # Build failed — synthesize failed result for downstream phases
@@ -1443,7 +1543,14 @@ def main() -> None:
             break
 
         result = build_eval_result(
-            entry, perf_proc, args.device, eval_types_run, accuracy_result, ep=args.ep
+            entry,
+            perf_proc,
+            args.device,
+            eval_types_run,
+            accuracy_result,
+            ep=args.ep,
+            onnx_size_bytes=onnx_size,
+            sanitize_fn=None if args.raw_output else _sanitize_output,
         )
         results.append(result)
 
diff --git a/scripts/e2e_eval/utils/reporter.py b/scripts/e2e_eval/utils/reporter.py
index a97fef69b..5da4f75ad 100644
--- a/scripts/e2e_eval/utils/reporter.py
+++ b/scripts/e2e_eval/utils/reporter.py
@@ -23,6 +23,8 @@
 
 
 if TYPE_CHECKING:
+    from collections.abc import Callable
+
     from .registry import ModelEntry
 
 
@@ -38,6 +40,8 @@ def build_eval_result(
     eval_types_run: list[str],
     accuracy_result: dict | None = None,
     ep: str | None = None,
+    onnx_size_bytes: int | None = None,
+    sanitize_fn: Callable[[str], str] | None = None,
 ) -> dict:
     """Build a unified eval_result dict (facts only, no derived fields).
 
@@ -46,16 +50,28 @@ def build_eval_result(
     accuracy_result is the accuracy sub-section dict (or None if not run).
     ep is the explicit execution provider (e.g., "qnn", "dml"), or None when
     not specified (device-to-provider mapping was used).
+    onnx_size_bytes is the combined size of the exported ONNX + .data files.
+    sanitize_fn, when provided, is applied to stdout/stderr to remove noise.
     """
     perf_section: dict | None = None
     if perf_proc is not None:
         passed = perf_proc["exit_code"] == 0
+        raw_stdout = perf_proc["stdout"]
+        raw_stderr = perf_proc["stderr"]
+        if sanitize_fn is not None:
+            stdout = sanitize_fn(raw_stdout)
+            stderr = sanitize_fn(raw_stderr)
+        else:
+            stdout = raw_stdout
+            stderr = raw_stderr
         perf_section = {
             "passed": passed,
             "elapsed": perf_proc["elapsed"],
             "exit_code": perf_proc["exit_code"],
-            "stdout_output": perf_proc["stdout"],
-            "stderr_output": perf_proc["stderr"],
+            "stdout_output": stdout,
+            "stderr_output": stderr,
+            "raw_stdout": raw_stdout,
+            "raw_stderr": raw_stderr,
             "timeout": perf_proc["timeout"],
             "command": perf_proc["command"],
             "error": perf_proc.get("error_summary", ""),
@@ -76,6 +92,8 @@ def build_eval_result(
         "accuracy": accuracy_result,
     }
     # Optional fields: only include when explicitly provided by the user.
+    if onnx_size_bytes is not None:
+        result["onnx_size_bytes"] = onnx_size_bytes
     if ep is not None:
         result["ep"] = ep
     return result
@@ -323,8 +341,9 @@ def generate_html_report(
     output_path: Path,
     registry_path: Path | None = None,
 ) -> None:
-    from .accuracy import format_delta
     """Generate interactive HTML report with Perf and Accuracy tabs."""
+    from .accuracy import format_delta
+
     results = report_data.get("results", [])
 
     # Load registry for enrichment
@@ -366,9 +385,7 @@ def generate_html_report(
                     if acc is not None
                     else None
                 ),
-                "delta_display": (
-                    format_delta(acc) if acc and not acc.get("skipped") else ""
-                ),
+                "delta_display": (format_delta(acc) if acc and not acc.get("skipped") else ""),
                 "metric": (
                     {
                         "name": (acc.get("winml_metric") or {}).get("metric"),
diff --git a/tests/unit/eval/test_eval.py b/tests/unit/eval/test_eval.py
index 568df6dbb..86b5426ff 100644
--- a/tests/unit/eval/test_eval.py
+++ b/tests/unit/eval/test_eval.py
@@ -110,9 +110,7 @@ def test_feature_extraction_mapped_to_hf_image_feature_extraction_for_vision_mod
         fake_onnx_config = MagicMock()
         fake_onnx_config.inputs = {"pixel_values": object()}
 
-        config = WinMLEvaluationConfig(
-            model_id="facebook/dinov2-base", task="feature-extraction"
-        )
+        config = WinMLEvaluationConfig(model_id="facebook/dinov2-base", task="feature-extraction")
         with (
             patch(
                 "transformers.AutoConfig.from_pretrained",
@@ -1158,6 +1156,45 @@ def test_ep_present_when_provided(self):
         )
         assert result["ep"] == "qnn"
 
+    def test_sanitize_fn_preserves_raw_perf_output(self):
+        reporter = self._load_reporter()
+
+        perf_proc = {
+            "exit_code": 0,
+            "stdout": "Latency (ms): 12.5\nThroughput: 80 samples/sec\nsome error line",
+            "stderr": "warning: device busy",
+            "elapsed": 5.0,
+            "timeout": False,
+            "command": "winml perf",
+            "timestamp": "2026-01-01T00:00:00+00:00",
+        }
+
+        def strip_perf(text: str) -> str:
+            return "\n".join(
+                line
+                for line in text.splitlines()
+                if "latency" not in line.lower() and "throughput" not in line.lower()
+            )
+
+        result = reporter.build_eval_result(
+            entry=self._make_entry(),
+            perf_proc=perf_proc,
+            device="cpu",
+            eval_types_run=["perf"],
+            accuracy_result=None,
+            ep=None,
+            sanitize_fn=strip_perf,
+        )
+
+        perf = result["perf"]
+        # sanitized output should not contain latency/throughput lines
+        assert "Latency" not in perf["stdout_output"]
+        assert "Throughput" not in perf["stdout_output"]
+        # raw output preserves the original perf data
+        assert "Latency (ms): 12.5" in perf["raw_stdout"]
+        assert "Throughput: 80 samples/sec" in perf["raw_stdout"]
+        assert perf["raw_stderr"] == "warning: device busy"
+
 
 class TestDefaultDatasetImmutability:
     """Tests that module-level _DEFAULT_DATASETS are not corrupted."""

From b98c818a129e860e9735782b67778a82f96b9238 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 4 Jun 2026 15:00:21 +0800
Subject: [PATCH 037/143] docs: fill Reference, Troubleshooting, and
 Contributing pages

- Reference: full WinMLBuildConfig schema with all fields, types, defaults
- Troubleshooting: 10 common error patterns with symptoms and solutions
- Contributing: dev setup, testing, linting, code structure, PR checklist
---
 docs/contributing.md    | 180 ++++++++++++++++++++++++++++++++-
 docs/reference/index.md | 190 ++++++++++++++++++++++++++++++++++-
 docs/troubleshooting.md | 213 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 576 insertions(+), 7 deletions(-)

diff --git a/docs/contributing.md b/docs/contributing.md
index dba7a8254..9f54ba909 100644
--- a/docs/contributing.md
+++ b/docs/contributing.md
@@ -1,4 +1,180 @@
 # Contributing
 
-!!! note "Coming soon"
-    This page is part of the documentation MVP and will be authored shortly.
+This guide covers the development workflow for contributing to winml-cli.
+
+---
+
+## Prerequisites
+
+| Component | Version |
+|-----------|---------|
+| Python | 3.11 (`requires-python = ">=3.11,<3.12"`) |
+| Package manager | [uv](https://github.com/astral-sh/uv) |
+| OS | Windows 11 (primary target) |
+
+---
+
+## Development Setup
+
+```bash
+git clone https://github.com/microsoft/winml-cli.git
+cd winml-cli
+
+# Install all dependencies including dev tools
+uv sync --extra dev
+
+# Enable pre-commit hooks
+uv run pre-commit install
+```
+
+The pre-commit hooks automatically enforce:
+
+- MIT license headers on all `.py` files
+- Trailing whitespace removal
+- End-of-file newline
+- YAML syntax validation
+- Ruff linting and formatting
+
+---
+
+## Running Tests
+
+```bash
+# All unit tests
+uv run pytest tests/
+
+# Fast CI-like run (excludes hardware-dependent tests)
+uv run pytest tests/ -m "not e2e and not npu and not gpu"
+
+# Specific module
+uv run pytest tests/unit/analyze
+uv run pytest tests/unit/commands
+
+# With coverage
+uv run pytest tests/ --cov=src/winml/modelkit --cov-report=html
+```
+
+**Test markers:**
+
+| Marker | Use |
+|--------|-----|
+| `@pytest.mark.unit` | Fast unit tests (default) |
+| `@pytest.mark.smoke` | Critical-path tests that must always pass |
+| `@pytest.mark.e2e` | End-to-end tests (slow, may need hardware) |
+| `@pytest.mark.npu` | Requires NPU hardware |
+| `@pytest.mark.gpu` | Requires GPU |
+| `@pytest.mark.slow` | Tests taking > 30 seconds |
+
+---
+
+## Linting and Type Checking
+
+```bash
+# Lint (check only)
+uv run ruff check src/ tests/
+
+# Lint and auto-fix
+uv run ruff check src/ tests/ --fix
+
+# Format
+uv run ruff format src/ tests/
+
+# Type check
+uv run mypy src/
+
+# Run all pre-commit hooks manually
+uv run pre-commit run --all-files
+```
+
+---
+
+## Code Structure
+
+```text
+src/winml/modelkit/
+├── cli.py              # Entry point (winml command group)
+├── commands/           # CLI subcommands (export, build, analyze, etc.)
+├── models/             # Model loading from HuggingFace / local
+├── export/             # ONNX export logic and HTP
+├── optimize/           # Optimization pipelines and fusion
+├── analyze/            # Analysis engine and runtime rules
+├── config/             # Build config schema and constants
+├── build/              # Pipeline orchestration
+├── compiler/           # EP compilation (EPContext)
+├── quant/              # Quantization
+├── eval/               # Evaluation metrics
+├── serve/              # FastAPI serving layer
+├── session/            # Session management
+├── core/               # Core graph abstractions
+├── cache/              # Caching utilities
+└── utils/              # Shared utilities
+
+tests/
+├── unit/               # Unit tests (organized by module)
+├── integration/        # Integration tests
+├── e2e/                # End-to-end tests
+├── regression/         # Regression suite
+├── fixtures/           # Test data and mock models
+└── conftest.py         # Shared fixtures
+```
+
+---
+
+## Coding Conventions
+
+- **Line length:** 100 characters
+- **Docstrings:** Google style
+- **Strings:** Double quotes (enforced by Ruff)
+- **Type annotations:** Required for public API functions
+- **License header:** Auto-inserted by pre-commit on all `.py` files
+
+**Import order** (enforced by Ruff isort):
+
+1. `__future__`
+2. Standard library
+3. Third-party (`torch`, `transformers`, `onnx`, etc.)
+4. First-party (`winml.*`)
+5. Relative imports
+
+See the internal naming convention guide for ONNX/EP/QDQ term casing rules.
+
+---
+
+## PR Checklist
+
+Before submitting a pull request:
+
+- [ ] Tests pass: `uv run pytest tests/ -m "not e2e and not npu and not gpu"`
+- [ ] Linting passes: `uv run ruff check src/ tests/`
+- [ ] Formatting is clean: `uv run ruff format --check src/ tests/`
+- [ ] Type checking passes: `uv run mypy src/`
+- [ ] New code includes unit tests (target 80%+ coverage)
+- [ ] Docs updated if public API changed
+
+**CI will run:**
+
+1. **Lint workflow** — license headers + Ruff
+2. **Test workflow** — parallelized test groups on Windows
+3. **CLA bot** — Contributor License Agreement signature
+
+---
+
+## Documentation Development
+
+```bash
+# Live preview (auto-reloads)
+uv run mkdocs serve
+
+# Validate (strict mode, catches broken links)
+uv run mkdocs build --strict
+```
+
+See [docs/README.md](https://github.com/microsoft/winml-cli/blob/main/docs/README.md)
+for authoring conventions, publishing workflow, and site structure.
+
+---
+
+## See also
+
+- [Installation](getting-started/installation.md) — user-facing setup
+- [Commands](commands/overview.md) — CLI reference
diff --git a/docs/reference/index.md b/docs/reference/index.md
index dad8173fe..37ff62336 100644
--- a/docs/reference/index.md
+++ b/docs/reference/index.md
@@ -1,4 +1,188 @@
-# Reference
+# Reference — Build Configuration Schema
 
-!!! note "Coming soon"
-    This page is part of the documentation MVP and will be authored shortly.
+This page documents the full schema for `WinMLBuildConfig`, the JSON configuration
+file that drives `winml build` and related commands. Generate a config with
+`winml config`, then customize it before feeding it to `winml build -c config.json`.
+
+## Top-Level Structure
+
+```json
+{
+  "loader":  { ... },
+  "export":  { ... },
+  "optim":   { ... },
+  "quant":   { ... },
+  "compile": { ... },
+  "eval":    { ... },
+  "auto":    true
+}
+```
+
+Setting `quant` or `compile` to `null` skips that pipeline stage entirely.
+Setting `auto` to `true` (default) lets winml-cli auto-configure downstream
+stages based on the target device and precision.
+
+---
+
+## `loader` — Model Loading
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `task` | `str \| null` | `null` | HuggingFace task (e.g., `image-classification`). Auto-detected if omitted. |
+| `model_class` | `str \| null` | `null` | Override model class (e.g., `AutoModelForCTC`). |
+| `model_type` | `str \| null` | `null` | HuggingFace model type (e.g., `bert`, `resnet`). |
+| `module_path` | `str \| null` | `null` | Dotted path to a submodule for targeted export. |
+| `user_script` | `str \| null` | `null` | Path to custom model class script. |
+| `trust_remote_code` | `bool` | `false` | Trust remote code from HuggingFace. |
+
+---
+
+## `export` — ONNX Export
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `opset_version` | `int` | `17` | ONNX opset version. |
+| `batch_size` | `int` | `1` | Static batch size. Use 1 for QNN compatibility. |
+| `input_tensors` | `list[InputTensorSpec] \| null` | `null` | Input tensor specifications. Auto-inferred if omitted. |
+| `output_tensors` | `list[OutputTensorSpec] \| null` | `null` | Output tensor specifications. |
+| `dynamic_axes` | `dict \| null` | `null` | Dynamic axes mapping. ⚠️ Breaks MatMulAddFusion on QNN. |
+| `export_params` | `bool` | `true` | Include model parameters in ONNX. |
+| `do_constant_folding` | `bool` | `true` | Fold constants during export. |
+| `verbose` | `bool` | `false` | Verbose export logging. |
+| `dynamo` | `bool` | `false` | Use PyTorch 2.x Dynamo exporter. |
+| `enable_hierarchy_tags` | `bool` | `true` | Add module hierarchy tags to ONNX nodes. |
+| `clean_onnx` | `bool` | `false` | Strip hierarchy tags after export. |
+| `hierarchy_tag_format` | `"full" \| "module_only"` | `"full"` | Tag detail level. |
+
+**InputTensorSpec:**
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `name` | `str \| null` | Tensor name (e.g., `pixel_values`). |
+| `dtype` | `str \| null` | Data type (e.g., `float32`, `int64`). |
+| `shape` | `list[int] \| null` | Tensor shape (e.g., `[1, 3, 224, 224]`). |
+| `value_range` | `[float, float] \| null` | Min/max for dummy tensor generation. |
+
+---
+
+## `optim` — Graph Optimization
+
+A dictionary of boolean fusion flags. All default to `false` unless auto-configured.
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `gelu_fusion` | `bool` | Fuse GeLU activation patterns. |
+| `layer_norm_fusion` | `bool` | Fuse LayerNorm patterns. |
+| `matmul_add_fusion` | `bool` | Fuse MatMul + Add (enables BiasGelu). |
+
+Additional fusion flags can be added as key-value pairs.
+
+---
+
+## `quant` — Quantization
+
+Set to `null` to skip quantization.
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `mode` | `"qdq" \| "static" \| "dynamic"` | `"qdq"` | Quantization mode. |
+| `weight_type` | `"uint8" \| "int8" \| "uint16" \| "int16"` | `"uint8"` | Weight data type. |
+| `activation_type` | `"uint8" \| "int8" \| "uint16" \| "int16"` | `"uint8"` | Activation data type. |
+| `calibration_method` | `"minmax" \| "entropy" \| "percentile"` | `"minmax"` | Scale computation method. |
+| `samples` | `int` | `10` | Number of calibration samples. |
+| `per_channel` | `bool` | `false` | Per-channel quantization. |
+| `symmetric` | `bool` | `false` | Symmetric quantization. |
+| `task` | `str \| null` | `null` | Task for dataset-aware calibration. |
+| `model_name` | `str \| null` | `null` | Model ID for calibration dataset resolution. |
+| `dataset_name` | `str \| null` | `null` | Override calibration dataset. |
+| `distribution` | `str` | `"uniform"` | Random distribution for dummy data. |
+| `seed` | `int \| null` | `null` | Random seed for reproducibility. |
+| `calibration_load_path` | `str \| null` | `null` | Load pre-computed calibration scales. |
+| `calibration_save_path` | `str \| null` | `null` | Save calibration scales. |
+| `op_types_to_quantize` | `list[str] \| null` | `null` | Operator types to quantize (all if null). |
+| `nodes_to_exclude` | `list[str] \| null` | `null` | Node names to skip. |
+
+---
+
+## `compile` — EP Compilation
+
+Set to `null` to skip compilation.
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `ep_config.provider` | `str` | `"qnn"` | EP alias: `qnn`, `cpu`, `dml`, `openvino`, `tensorrt`, `vitisai`, `migraphx`. |
+| `ep_config.device` | `str` | `"auto"` | Target device: `npu`, `gpu`, `cpu`, `auto`. |
+| `ep_config.enable_ep_context` | `bool` | `true` | Generate EPContext model. |
+| `ep_config.embed_context` | `bool` | `false` | Embed binary in ONNX (true) or external .bin (false). |
+| `ep_config.compiler` | `str` | `"ort"` | Compiler backend: `ort` or `qairt`. |
+| `ep_config.provider_options` | `dict` | `{}` | EP-specific options. |
+| `ep_config.qnn_sdk_root` | `str \| null` | `null` | QAIRT SDK path (required for `compiler: "qairt"`). |
+| `validate` | `bool` | `true` | Validate compiled model. |
+| `verbose` | `bool` | `false` | Verbose compilation logging. |
+
+---
+
+## `eval` — Evaluation
+
+Set to `null` (default) to skip evaluation.
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `model_id` | `str \| null` | `null` | HuggingFace model ID for config resolution. |
+| `model_path` | `str \| null` | `null` | Path to .onnx file. |
+| `task` | `str \| null` | `null` | Task type. |
+| `device` | `str` | `"auto"` | Inference device. |
+| `precision` | `str` | `"auto"` | Precision (`fp32`, `fp16`, `w8a16`, etc.). |
+| `ep` | `str \| null` | `null` | EP override. |
+| `dataset.path` | `str \| null` | `null` | HuggingFace dataset path. |
+| `dataset.name` | `str \| null` | `null` | Dataset config name. |
+| `dataset.split` | `str` | `"validation"` | Dataset split. |
+| `dataset.samples` | `int` | `100` | Evaluation sample count. |
+| `dataset.shuffle` | `bool` | `true` | Shuffle before sampling. |
+| `dataset.seed` | `int` | `42` | Random seed. |
+| `output_path` | `str \| null` | `null` | Path for JSON results output. |
+
+---
+
+## Example: Full Config
+
+```json
+{
+  "loader": {
+    "task": "image-classification",
+    "model_type": "resnet"
+  },
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1
+  },
+  "optim": {
+    "gelu_fusion": true,
+    "layer_norm_fusion": true,
+    "matmul_add_fusion": true
+  },
+  "quant": {
+    "mode": "qdq",
+    "weight_type": "uint8",
+    "activation_type": "uint8",
+    "samples": 10,
+    "calibration_method": "minmax"
+  },
+  "compile": {
+    "ep_config": {
+      "provider": "qnn",
+      "device": "npu",
+      "enable_ep_context": true,
+      "embed_context": false
+    },
+    "validate": true
+  },
+  "auto": true
+}
+```
+
+## See also
+
+- [winml config](../commands/config.md) — generate a config interactively
+- [winml build](../commands/build.md) — run the pipeline with a config
+- [Config and build](../concepts/config-and-build.md) — conceptual overview
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index 21b5c4c89..71268e03d 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -1,4 +1,213 @@
 # Troubleshooting
 
-!!! note "Coming soon"
-    This page is part of the documentation MVP and will be authored shortly.
+Common issues and solutions when working with winml-cli.
+
+---
+
+## Build and Pipeline Errors
+
+### Config file is empty or invalid JSON
+
+```text
+UsageError: Config file is empty: config.json
+UsageError: Invalid JSON in config: Expecting value: line 1 column 1
+```
+
+**Cause:** The config file passed to `winml build -c` is empty, malformed, or not valid JSON.
+
+**Solution:** Validate the file with `python -m json.tool config.json`, or regenerate it:
+
+```bash
+uv run winml config -m <model> -d <device> -o output/
+```
+
+---
+
+### Cannot enable compilation: no compile section
+
+```text
+UsageError: Cannot enable compilation: no compile section found in the config file
+```
+
+**Cause:** You passed `--compile` but the config JSON has no `"compile"` section (it's `null`).
+
+**Solution:** Regenerate the config with compilation enabled, or add a compile section manually:
+
+```bash
+uv run winml config -m <model> -d npu -o output/
+```
+
+---
+
+### Already a compiled EPContext model
+
+```text
+ClickException: model_ctx.onnx is already a compiled EPContext model and cannot be re-compiled
+```
+
+**Cause:** You're trying to compile a model that is already an EPContext artifact (the `_ctx.onnx` output).
+
+**Solution:** Run compilation on the original (pre-compiled) ONNX file instead:
+
+```bash
+uv run winml compile -m model.onnx -d npu -o output/
+```
+
+---
+
+### Provider does not support EPContext compilation
+
+```text
+ClickException: Provider 'DmlExecutionProvider' does not support EPContext compilation
+```
+
+**Cause:** Not all EPs produce EPContext format. DML and CPU do not support pre-compilation.
+
+**Solution:** EPContext is supported by QNN, OpenVINO, TensorRT, and Vitis AI. For DML/CPU, skip the compile step — the runtime compiles on first load automatically:
+
+```bash
+uv run winml build -c config.json -m model -o output/ --no-compile
+```
+
+---
+
+## Analysis and Compatibility
+
+### Unsupported nodes persist after analysis
+
+```text
+RuntimeError: Unsupported nodes persist after analysis
+```
+
+**Cause:** The model contains operators that the selected EP cannot dispatch natively.
+
+**Solution:** Run `winml analyze` first to identify which operators are problematic:
+
+```bash
+uv run winml analyze -m model.onnx --ep qnn
+```
+
+Then consider:
+
+- Using a different EP (`--ep dml` or `--ep cpu`)
+- Running optimization to fuse unsupported patterns into supported ones
+- Checking if a newer opset version resolves the compatibility gap
+
+---
+
+## Device and EP Issues
+
+### Unknown EP or device mismatch
+
+```text
+UsageError: Unknown EP: invalid_ep
+UsageError: --ep QNNExecutionProvider cannot run on --device gpu
+```
+
+**Cause:** The specified EP doesn't exist or doesn't support the requested device.
+
+**Solution:** Check available EPs on your system:
+
+```bash
+uv run winml sys --list-ep
+```
+
+Valid EP aliases: `qnn`, `openvino`, `dml`, `cpu`, `tensorrt`, `migraphx`, `vitisai`.
+
+---
+
+### No NPU device detected
+
+```text
+Available Devices (priority order)
+  #1  GPU   ...
+  #2  CPU   ...
+```
+
+**Cause:** NPU driver not installed, or Windows version is too old.
+
+**Solution:**
+
+1. Verify Windows 11 24H2 or later
+2. Check for NPU driver updates in Device Manager → Neural processors
+3. Install the latest Qualcomm AI Engine Direct SDK (for Snapdragon NPUs)
+4. Re-run `uv run winml sys` to confirm
+
+!!! note
+    All winml-cli commands work without NPU hardware. Use `--device auto` to fall back to GPU or CPU.
+
+---
+
+## Quantization and Compilation Failures
+
+### Quantization failed
+
+```text
+RuntimeError: Quantization failed: [error details]
+```
+
+**Cause:** Quantization encountered an incompatible graph structure or calibration error.
+
+**Solution:**
+
+1. Add `--verbose` to see detailed error output
+2. Ensure the model has been optimized first (run `winml optimize` before `winml quantize`)
+3. Try a different calibration method: `--calibration-method entropy`
+4. Exclude problematic nodes: use `nodes_to_exclude` in the quant config
+
+---
+
+### No output file produced after compile
+
+```text
+Warning: Compilation finished but no output file was written
+ClickException: No output file produced
+```
+
+**Cause:** The compiler ran but didn't generate the expected `_ctx.onnx` file. Common with DML/CPU (which don't produce EPContext).
+
+**Solution:** Verify you're targeting an EP that supports EPContext:
+
+```bash
+# Correct — QNN supports EPContext
+uv run winml compile -m model.onnx -d npu --ep qnn
+
+# Won't produce output — DML doesn't support EPContext
+uv run winml compile -m model.onnx -d gpu --ep dml
+```
+
+---
+
+## Output and File Issues
+
+### Output path exists but is not a directory
+
+```text
+ValueError: Output path exists but is not a directory: output.onnx
+```
+
+**Cause:** The `-o` flag expects a directory path, but you passed a file path.
+
+**Solution:** Use a directory:
+
+```bash
+uv run winml build -c config.json -m model -o output_dir/
+```
+
+---
+
+## General Tips
+
+| Tip | Command |
+|-----|---------|
+| **Diagnose environment** | `uv run winml sys` |
+| **Check EP compatibility** | `uv run winml analyze -m model.onnx --ep <ep>` |
+| **Verbose output** | Add `-v` or `--verbose` to any command |
+| **Skip a pipeline stage** | `--no-quant`, `--no-compile`, `--no-optimize` |
+| **Regenerate config** | `uv run winml config -m <model> -d <device> -o dir/` |
+
+## See also
+
+- [winml sys](commands/sys.md) — system diagnostics
+- [winml analyze](commands/analyze.md) — EP compatibility analysis
+- [EP and Device](concepts/eps-and-devices.md) — execution provider reference

From 86daf43c813d71abbd0c48af19f17b54943db35c Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 4 Jun 2026 18:37:27 +0800
Subject: [PATCH 038/143] docs: add Python API, Output Layout, and Supported
 Models reference pages

- Python API: WinMLAutoModel, build functions, BuildResult, inference classes
- Output Layout: directory structure, file naming, external data, manifest
- Supported Models: 34 tasks, validated catalog, EP compatibility matrix
- Updated mkdocs.yml nav to expand Reference section
---
 docs/reference/output-layout.md    | 161 ++++++++++++++++++++
 docs/reference/python-api.md       | 235 +++++++++++++++++++++++++++++
 docs/reference/supported-models.md | 175 +++++++++++++++++++++
 mkdocs.yml                         |   6 +-
 4 files changed, 576 insertions(+), 1 deletion(-)
 create mode 100644 docs/reference/output-layout.md
 create mode 100644 docs/reference/python-api.md
 create mode 100644 docs/reference/supported-models.md

diff --git a/docs/reference/output-layout.md b/docs/reference/output-layout.md
new file mode 100644
index 000000000..54aad37bb
--- /dev/null
+++ b/docs/reference/output-layout.md
@@ -0,0 +1,161 @@
+# Output Layout
+
+When you run `winml build`, the tool writes all artifacts to the output
+directory. This page documents what each file is and which ones you need
+for deployment.
+
+---
+
+## Directory Structure
+
+After a full pipeline run (export → optimize → quantize → compile):
+
+```text
+output/
+├── model.onnx                  ← FINAL artifact (deploy this)
+├── model.onnx.data             ← External weights (if model ≥ 100 MiB)
+├── winml_build_config.json     ← Persisted build config
+├── build_manifest.json         ← Build provenance and timing
+├── export.onnx                 ← Intermediate: raw ONNX export
+├── export.onnx.data
+├── optimized.onnx              ← Intermediate: after graph optimization
+├── optimized.onnx.data
+├── quantized.onnx              ← Intermediate: after QDQ insertion
+├── quantized.onnx.data
+├── compiled.onnx               ← Intermediate: after EP compilation
+└── compiled.onnx.data
+```
+
+---
+
+## File Categories
+
+### Final Artifacts (Keep for Deployment)
+
+| File | Purpose |
+|------|---------|
+| `model.onnx` | The deployment-ready model. Always present. |
+| `model.onnx.data` | External weight data (only if model ≥ 100 MiB). Must stay alongside `model.onnx`. |
+| `winml_build_config.json` | The config used for this build (includes auto-discovered flags). Useful for reproducibility. |
+| `build_manifest.json` | Build metadata: stages run, timings, quantization stats. |
+
+### Intermediate Files (Can Delete After Build)
+
+| File | Stage | Contents |
+|------|-------|----------|
+| `export.onnx` | Export | Raw PyTorch → ONNX conversion (float32) |
+| `optimized.onnx` | Optimize | Graph with fused operators, shape inference applied |
+| `quantized.onnx` | Quantize | QDQ nodes inserted, calibrated scales |
+| `compiled.onnx` | Compile | EPContext binary embedded or sidecar |
+
+Each intermediate has a corresponding `.onnx.data` file if the model exceeds
+100 MiB.
+
+---
+
+## What Gets Written at Each Stage
+
+### Export only (`winml export`)
+
+```text
+output/
+├── export.onnx
+└── export.onnx.data          (if ≥ 100 MiB)
+```
+
+### Optimize only (`winml optimize`)
+
+```text
+output/
+├── optimized.onnx
+└── optimized.onnx.data
+```
+
+### Full build (`winml build`)
+
+All stages write their intermediate, and `model.onnx` is a copy of the last
+successful stage output. If you skip quantization (`--no-quant`), the final
+model is a copy of `optimized.onnx`. If you skip compilation too, it's still
+a copy of `optimized.onnx`.
+
+---
+
+## External Data
+
+Models larger than **100 MiB** store weights in a separate `.onnx.data` file.
+Both files must be kept together — the `.onnx` file contains a reference to the
+data file by name.
+
+| Model Size | Files |
+|-----------|-------|
+| < 100 MiB | `model.onnx` only (weights embedded) |
+| ≥ 100 MiB | `model.onnx` + `model.onnx.data` |
+
+!!! warning
+    If you move `model.onnx`, always move `model.onnx.data` alongside it.
+    The ONNX file references the data file by relative path.
+
+---
+
+## Build Manifest
+
+`build_manifest.json` records provenance for every build:
+
+```json
+{
+  "schema_version": 1,
+  "source": "hf",
+  "model_id": "microsoft/resnet-50",
+  "task": "image-classification",
+  "timestamp": "2026-01-15T10:30:00.000000+00:00",
+  "elapsed_seconds": 45.1,
+  "final_artifact": "model.onnx",
+  "stages": [
+    {
+      "name": "export",
+      "status": "completed",
+      "filename": "export.onnx",
+      "elapsed_seconds": 12.5
+    },
+    {
+      "name": "optimize",
+      "status": "completed",
+      "filename": "optimized.onnx",
+      "elapsed_seconds": 8.2
+    },
+    {
+      "name": "quantize",
+      "status": "completed",
+      "filename": "quantized.onnx",
+      "elapsed_seconds": 15.3,
+      "nodes_quantized": 150,
+      "nodes_skipped": 12
+    },
+    {
+      "name": "compile",
+      "status": "completed",
+      "filename": "compiled.onnx",
+      "elapsed_seconds": 9.1
+    }
+  ]
+}
+```
+
+---
+
+## Rebuild Behavior
+
+- If `model.onnx` already exists and `rebuild=False` (default), the build is
+  skipped entirely.
+- Pass `--rebuild` (CLI) or `force_rebuild=True` (Python API) to force a fresh
+  build.
+- On rebuild, all old `.onnx` and `.onnx.data` files are deleted before the
+  pipeline runs.
+
+---
+
+## See also
+
+- [winml build](../commands/build.md) — build command reference
+- [Reference — Build Configuration Schema](index.md) — config file format
+- [How winml-cli Works](../concepts/how-it-works.md) — pipeline stages explained
diff --git a/docs/reference/python-api.md b/docs/reference/python-api.md
new file mode 100644
index 000000000..9672d160b
--- /dev/null
+++ b/docs/reference/python-api.md
@@ -0,0 +1,235 @@
+# Python API
+
+winml-cli can be used as a Python library for programmatic model building and
+inference. This page documents the public API surface.
+
+---
+
+## Quick Example
+
+```python
+from winml.modelkit import WinMLAutoModel
+
+# Build and load in one call
+model = WinMLAutoModel.from_pretrained("microsoft/resnet-50", device="npu")
+output = model(pixel_values=images)
+
+# From a local ONNX file
+model = WinMLAutoModel.from_onnx("model.onnx", task="image-classification")
+```
+
+---
+
+## `WinMLAutoModel`
+
+Factory class for automatic model building and loading. Not instantiable directly —
+use the class methods.
+
+### `from_pretrained()`
+
+Build and load a model from a HuggingFace ID or local path. Runs the full
+pipeline: config → export → optimize → quantize → compile → load.
+
+```python
+WinMLAutoModel.from_pretrained(
+    model_id_or_path: str | Path,
+    *,
+    task: str | None = None,
+    config: WinMLBuildConfig | None = None,
+    device: str = "auto",
+    precision: str = "auto",
+    cache_dir: str | Path | None = None,
+    use_cache: bool = True,
+    force_rebuild: bool = False,
+    trust_remote_code: bool = False,
+    shape_config: dict | None = None,
+    no_compile: bool = False,
+) -> WinMLPreTrainedModel
+```
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `model_id_or_path` | `str \| Path` | required | HuggingFace model ID or path to local model. |
+| `task` | `str \| None` | `None` | Task name. Auto-detected if omitted. |
+| `config` | `WinMLBuildConfig \| None` | `None` | Custom build config. Auto-generated if omitted. |
+| `device` | `str` | `"auto"` | Target device: `"auto"`, `"npu"`, `"gpu"`, `"cpu"`. |
+| `precision` | `str` | `"auto"` | Precision: `"auto"`, `"fp32"`, `"fp16"`, `"w8a8"`, etc. |
+| `cache_dir` | `str \| Path \| None` | `None` | Cache directory for built artifacts. |
+| `use_cache` | `bool` | `True` | Reuse cached build if available. |
+| `force_rebuild` | `bool` | `False` | Force rebuild even if cache exists. |
+| `trust_remote_code` | `bool` | `False` | Trust remote code from HuggingFace. |
+| `no_compile` | `bool` | `False` | Skip the compilation stage. |
+
+**Returns:** A task-specific `WinMLPreTrainedModel` subclass.
+
+---
+
+### `from_onnx()`
+
+Build from a pre-exported ONNX file. Runs: optimize → quantize → compile → load.
+
+```python
+WinMLAutoModel.from_onnx(
+    onnx_path: str | Path | dict[str, str | Path],
+    *,
+    task: str | None = None,
+    config: WinMLBuildConfig | None = None,
+    device: str = "auto",
+    precision: str = "auto",
+    ep: str | None = None,
+    skip_build: bool = False,
+) -> WinMLPreTrainedModel | WinMLCompositeModel
+```
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `onnx_path` | `str \| Path \| dict` | required | ONNX file path, or dict of submodel paths for composite models. |
+| `skip_build` | `bool` | `False` | Load ONNX directly without running optimize/quantize/compile. |
+
+---
+
+### `supported_tasks()`
+
+```python
+WinMLAutoModel.supported_tasks() -> list[str]
+```
+
+Returns all task strings the toolkit supports (34 tasks).
+
+---
+
+## Build Pipeline Functions
+
+Lower-level functions for fine-grained control over the pipeline.
+
+### `build_hf_model()`
+
+```python
+from winml.modelkit.build import build_hf_model
+
+result = build_hf_model(
+    config: WinMLBuildConfig,
+    output_dir: Path,
+    *,
+    model_id: str | None = None,
+    rebuild: bool = False,
+    trust_remote_code: bool = False,
+    cache_key: str | None = None,
+) -> BuildResult
+```
+
+Runs the full pipeline (export → optimize → analyze → quantize → compile) and
+writes all artifacts to `output_dir`.
+
+### `build_onnx_model()`
+
+```python
+from winml.modelkit.build import build_onnx_model
+
+result = build_onnx_model(
+    onnx_path: Path | str,
+    *,
+    config: WinMLBuildConfig,
+    output_dir: Path | str,
+    rebuild: bool = False,
+) -> BuildResult
+```
+
+Builds from an existing ONNX file (skips export).
+
+### `BuildResult`
+
+```python
+@dataclass
+class BuildResult:
+    output_dir: Path           # Directory containing all artifacts
+    final_onnx_path: Path      # Path to final model.onnx
+    config_path: Path          # Path to winml_build_config.json
+    stages_completed: list[str]  # e.g., ["export", "optimize", "quantize"]
+    stages_skipped: list[str]
+    stage_timings: dict[str, float]  # Per-stage seconds
+    elapsed: float             # Total build time (seconds)
+    reused: bool               # True if cache hit, no build ran
+    manifest_path: Path | None # Path to build_manifest.json
+```
+
+---
+
+## Config Generation
+
+### `generate_build_config()`
+
+```python
+from winml.modelkit.config import generate_build_config
+
+config = generate_build_config(
+    model_id: str | None = None,
+    *,
+    task: str | None = None,
+    device: str = "auto",
+    precision: str = "auto",
+    ep: str | None = None,
+    no_compile: bool = False,
+) -> WinMLBuildConfig
+```
+
+Auto-generates a complete build config by probing the model's `config.json`
+(does not download weights). Equivalent to what `winml config` produces.
+
+---
+
+## Inference Model Classes
+
+All inference models inherit from `WinMLPreTrainedModel` and are HuggingFace
+pipeline-compatible.
+
+### `WinMLPreTrainedModel` (Base)
+
+```python
+class WinMLPreTrainedModel:
+    def __call__(self, **kwargs) -> Any: ...
+    def perf(self, warmup: int = 0) -> ContextManager: ...
+
+    @property
+    def device(self) -> str: ...
+    @property
+    def ep_name(self) -> str | None: ...
+    @property
+    def io_config(self) -> dict: ...
+    @property
+    def task(self) -> str | None: ...
+```
+
+### Task-Specific Classes
+
+| Class | Task |
+|-------|------|
+| `WinMLModelForImageClassification` | `image-classification` |
+| `WinMLModelForSequenceClassification` | `text-classification` |
+| `WinMLModelForImageSegmentation` | `image-segmentation` |
+| `WinMLModelForSemanticSegmentation` | `semantic-segmentation` |
+| `WinMLModelForObjectDetection` | `object-detection` |
+| `WinMLModelForFeatureExtraction` | `feature-extraction` |
+| `WinMLModelForQuestionAnswering` | `question-answering` |
+| `WinMLModelForZeroShotImageClassification` | `zero-shot-image-classification` |
+| `WinMLModelForGenericTask` | fallback (raw outputs) |
+
+### Performance Tracking
+
+```python
+model = WinMLAutoModel.from_pretrained("microsoft/resnet-50", device="npu")
+
+with model.perf(warmup=5) as stats:
+    for img in test_images:
+        model(pixel_values=img)
+
+print(f"P99 latency: {stats.p99_ms:.2f} ms")
+```
+
+---
+
+## See also
+
+- [Reference — Build Configuration Schema](index.md) — full config field reference
+- [winml build](../commands/build.md) — CLI equivalent
+- [How winml-cli Works](../concepts/how-it-works.md) — pipeline overview
diff --git a/docs/reference/supported-models.md b/docs/reference/supported-models.md
new file mode 100644
index 000000000..fb57cb4d0
--- /dev/null
+++ b/docs/reference/supported-models.md
@@ -0,0 +1,175 @@
+# Supported Models
+
+winml-cli supports a wide range of model architectures and tasks. This page
+lists what's validated and how to discover model support.
+
+---
+
+## Discovery Commands
+
+```bash
+# Browse the curated catalog (57 validated models)
+uv run winml catalog
+
+# Filter by task
+uv run winml catalog -k image-classification
+
+# Check if a specific model is supported
+uv run winml inspect -m microsoft/resnet-50
+
+# List all known tasks
+uv run winml inspect --list-tasks
+```
+
+---
+
+## Supported Tasks
+
+winml-cli supports **34 tasks** across vision, NLP, audio, and multimodal domains.
+
+### Vision
+
+| Task | Example Models |
+|------|----------------|
+| `image-classification` | ResNet, ConvNeXt, ViT, Swin |
+| `image-segmentation` | Segformer, Mask2Former |
+| `semantic-segmentation` | Segformer |
+| `object-detection` | DETR, YOLOS, Table-Transformer |
+| `depth-estimation` | Depth Anything, ZoeDepth |
+| `image-feature-extraction` | DINOv2, ViT |
+| `zero-shot-image-classification` | CLIP, SigLIP |
+
+### NLP
+
+| Task | Example Models |
+|------|----------------|
+| `text-classification` | BERT, RoBERTa, XLM-RoBERTa |
+| `token-classification` | BERT, RoBERTa (NER) |
+| `question-answering` | BERT, RoBERTa |
+| `fill-mask` | BERT, RoBERTa |
+| `feature-extraction` | BGE, BERT, all-MiniLM |
+| `text-generation` | Qwen3 (composite) |
+| `text2text-generation` | T5, BART, Marian |
+
+### Audio
+
+| Task | Example Models |
+|------|----------------|
+| `automatic-speech-recognition` | Whisper |
+| `audio-classification` | Wav2Vec2 |
+
+### Multimodal
+
+| Task | Example Models |
+|------|----------------|
+| `zero-shot-image-classification` | CLIP (text + vision) |
+| `image-to-text` | VisionEncoderDecoder |
+| `visual-question-answering` | BLIP |
+
+---
+
+## Validated Model Catalog
+
+The following architectures have been validated end-to-end with EP compatibility
+testing. Use `winml catalog` to browse the full list interactively.
+
+### Image Classification
+
+| Model | Architecture | EPs Tested |
+|-------|-------------|------------|
+| `microsoft/resnet-50` | ResNet | CPU, QNN (GPU/NPU), OpenVINO |
+| `facebook/convnext-tiny-224` | ConvNeXt | CPU, QNN (GPU/NPU), OpenVINO |
+| `google/vit-base-patch16-224` | ViT | CPU, QNN (GPU/NPU), OpenVINO |
+
+### Text Classification & NLU
+
+| Model | Architecture | EPs Tested |
+|-------|-------------|------------|
+| `bert-base-uncased` | BERT | CPU, QNN (GPU/NPU), OpenVINO |
+| `FacebookAI/roberta-base` | RoBERTa | CPU, QNN, OpenVINO |
+| `FacebookAI/xlm-roberta-base` | XLM-RoBERTa | CPU, QNN, OpenVINO |
+
+### Feature Extraction & Embeddings
+
+| Model | Architecture | EPs Tested |
+|-------|-------------|------------|
+| `BAAI/bge-base-en-v1.5` | BERT | CPU, QNN (GPU/NPU), OpenVINO |
+| `BAAI/bge-small-en-v1.5` | BERT | CPU, QNN (GPU/NPU), OpenVINO |
+| `sentence-transformers/all-MiniLM-L6-v2` | BERT | CPU, QNN, OpenVINO |
+
+### Vision-Language
+
+| Model | Architecture | EPs Tested |
+|-------|-------------|------------|
+| `openai/clip-vit-base-patch32` | CLIP | CPU, QNN, OpenVINO |
+| `openai/clip-vit-large-patch14` | CLIP | CPU, QNN, OpenVINO |
+
+### Segmentation
+
+| Model | Architecture | EPs Tested |
+|-------|-------------|------------|
+| `nvidia/segformer-b0-finetuned-ade-512-512` | Segformer | CPU, QNN, OpenVINO |
+| `nvidia/segformer-b1-finetuned-cityscapes-1024-1024` | Segformer | CPU, QNN, OpenVINO |
+
+### Object Detection
+
+| Model | Architecture | EPs Tested |
+|-------|-------------|------------|
+| `microsoft/table-transformer-detection` | Table-Transformer | CPU, OpenVINO |
+
+---
+
+## Execution Provider Compatibility
+
+Each validated model is tested against available EPs:
+
+| EP | Alias | Devices | Notes |
+|----|-------|---------|-------|
+| CPUExecutionProvider | `cpu` | CPU | Always available |
+| QNNExecutionProvider | `qnn` | NPU, GPU | Qualcomm Snapdragon; requires QNN SDK |
+| OpenVINOExecutionProvider | `openvino` | CPU, GPU, NPU | Intel hardware; install with `--extra openvino` |
+| DmlExecutionProvider | `dml` | GPU | DirectML; any DirectX 12 GPU |
+| VitisAIExecutionProvider | `vitisai` | NPU | AMD/Xilinx |
+
+---
+
+## Models with Custom Build Logic
+
+These architectures have specialized export handling (multi-component, custom
+tracing, or composite pipelines):
+
+| Architecture | Type | Components |
+|-------------|------|------------|
+| CLIP | Vision-Language | text_encoder + vision_encoder |
+| BLIP | Vision-Language | vision + text + decoder |
+| T5 / BART | Encoder-Decoder | encoder + decoder |
+| Marian | Translation | encoder + decoder |
+| SAM / SAM2 | Segmentation | image_encoder + mask_decoder |
+| Qwen3 | LLM | prefill + generation (composite) |
+| Whisper | Speech | encoder + decoder |
+
+---
+
+## Adding Unsupported Models
+
+If your model architecture isn't in the catalog, winml-cli may still support it
+through auto-detection:
+
+```bash
+# Try inspecting first
+uv run winml inspect -m your-org/your-model
+
+# If "Status: Supported", proceed normally
+uv run winml build -m your-org/your-model -d auto -o output/
+```
+
+For truly custom architectures, use `--trust-remote-code` and optionally provide
+a `--user-script` with your model class definition.
+
+---
+
+## See also
+
+- [winml catalog](../commands/catalog.md) — browse validated models interactively
+- [winml inspect](../commands/inspect.md) — check model compatibility
+- [EP and Device](../concepts/eps-and-devices.md) — execution provider details
diff --git a/mkdocs.yml b/mkdocs.yml
index b6d9ae3bd..ed1da3955 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -116,6 +116,10 @@ nav:
   - Tutorials:
       - Overview: tutorials/index.md
       - ConvNeXt on NPU: tutorials/npu-convnext.md
-  - Reference: reference/index.md
+  - Reference:
+      - Build Config Schema: reference/index.md
+      - Python API: reference/python-api.md
+      - Output Layout: reference/output-layout.md
+      - Supported Models: reference/supported-models.md
   - Troubleshooting: troubleshooting.md
   - Contributing: contributing.md

From c55cf557233c171316ff8dbc4b94ac56566e7d30 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 4 Jun 2026 19:21:47 +0800
Subject: [PATCH 039/143] docs: fix factual errors in reference pages after
 source verification

Python API:
- generate_build_config(): add 8 missing params, fix return type
- build_hf_model(): add missing pytorch_model, random_init, ep, device params
- build_onnx_model(): add missing ep, device params
- from_onnx(): add missing cache_dir, use_cache, force_rebuild, session_options, hf_config

Output Layout:
- Remove phantom 'source' field from manifest example
- Add actual fields: cache_key, config_hash, analyze_iterations, analyze_details

Supported Models:
- Fix task count: 35 not 34
- Remove Whisper (not registered in model build system)
- Add missing models: SigLIP, MU2, VisionEncoderDecoder
---
 docs/reference/output-layout.md    |  6 +++++-
 docs/reference/python-api.md       | 27 +++++++++++++++++++++++++--
 docs/reference/supported-models.md |  6 ++++--
 3 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/docs/reference/output-layout.md b/docs/reference/output-layout.md
index 54aad37bb..01377a254 100644
--- a/docs/reference/output-layout.md
+++ b/docs/reference/output-layout.md
@@ -104,12 +104,16 @@ data file by name.
 ```json
 {
   "schema_version": 1,
-  "source": "hf",
   "model_id": "microsoft/resnet-50",
   "task": "image-classification",
+  "cache_key": "a1b2c3d4e5f6",
+  "config_hash": "f7e8d9c0b1a2",
   "timestamp": "2026-01-15T10:30:00.000000+00:00",
   "elapsed_seconds": 45.1,
   "final_artifact": "model.onnx",
+  "analyze_iterations": 2,
+  "analyze_unsupported_node_count": 0,
+  "analyze_details": { "lint": {}, "autoconf": {} },
   "stages": [
     {
       "name": "export",
diff --git a/docs/reference/python-api.md b/docs/reference/python-api.md
index 9672d160b..8ed7c5a5b 100644
--- a/docs/reference/python-api.md
+++ b/docs/reference/python-api.md
@@ -77,7 +77,13 @@ WinMLAutoModel.from_onnx(
     device: str = "auto",
     precision: str = "auto",
     ep: str | None = None,
+    cache_dir: str | Path | None = None,
+    use_cache: bool = True,
+    force_rebuild: bool = False,
     skip_build: bool = False,
+    session_options: Any | None = None,
+    hf_config: PretrainedConfig | None = None,
+    **kwargs: Any,
 ) -> WinMLPreTrainedModel | WinMLCompositeModel
 ```
 
@@ -85,6 +91,7 @@ WinMLAutoModel.from_onnx(
 |-----------|------|---------|-------------|
 | `onnx_path` | `str \| Path \| dict` | required | ONNX file path, or dict of submodel paths for composite models. |
 | `skip_build` | `bool` | `False` | Load ONNX directly without running optimize/quantize/compile. |
+| `hf_config` | `PretrainedConfig \| None` | `None` | Required for composite models (dict inputs). |
 
 ---
 
@@ -112,9 +119,14 @@ result = build_hf_model(
     output_dir: Path,
     *,
     model_id: str | None = None,
+    pytorch_model: nn.Module | None = None,
     rebuild: bool = False,
     trust_remote_code: bool = False,
+    random_init: bool = False,
     cache_key: str | None = None,
+    ep: str | None = None,
+    device: str | None = None,
+    **kwargs: Any,
 ) -> BuildResult
 ```
 
@@ -132,6 +144,9 @@ result = build_onnx_model(
     config: WinMLBuildConfig,
     output_dir: Path | str,
     rebuild: bool = False,
+    ep: str | None = None,
+    device: str | None = None,
+    **kwargs: Any,
 ) -> BuildResult
 ```
 
@@ -166,15 +181,23 @@ config = generate_build_config(
     model_id: str | None = None,
     *,
     task: str | None = None,
+    model_class: str | None = None,
+    model_type: str | None = None,
+    module: str | None = None,
+    override: WinMLBuildConfig | None = None,
+    shape_config: dict | None = None,
+    library_name: str = "transformers",
     device: str = "auto",
     precision: str = "auto",
+    trust_remote_code: bool = False,
     ep: str | None = None,
-    no_compile: bool = False,
-) -> WinMLBuildConfig
+    onnx_path: str | Path | None = None,
+) -> WinMLBuildConfig | list[WinMLBuildConfig]
 ```
 
 Auto-generates a complete build config by probing the model's `config.json`
 (does not download weights). Equivalent to what `winml config` produces.
+Returns a list when `module` is specified (one config per submodule).
 
 ---
 
diff --git a/docs/reference/supported-models.md b/docs/reference/supported-models.md
index fb57cb4d0..7a2175583 100644
--- a/docs/reference/supported-models.md
+++ b/docs/reference/supported-models.md
@@ -25,7 +25,7 @@ uv run winml inspect --list-tasks
 
 ## Supported Tasks
 
-winml-cli supports **34 tasks** across vision, NLP, audio, and multimodal domains.
+winml-cli supports **35 tasks** across vision, NLP, audio, and multimodal domains.
 
 ### Vision
 
@@ -141,12 +141,14 @@ tracing, or composite pipelines):
 | Architecture | Type | Components |
 |-------------|------|------------|
 | CLIP | Vision-Language | text_encoder + vision_encoder |
+| SigLIP | Vision-Language | text_model + vision_model |
 | BLIP | Vision-Language | vision + text + decoder |
 | T5 / BART | Encoder-Decoder | encoder + decoder |
 | Marian | Translation | encoder + decoder |
+| MU2 | Encoder-Decoder | encoder + decoder |
+| VisionEncoderDecoder | Image-to-Text | encoder + decoder |
 | SAM / SAM2 | Segmentation | image_encoder + mask_decoder |
 | Qwen3 | LLM | prefill + generation (composite) |
-| Whisper | Speech | encoder + decoder |
 
 ---
 

From 03ce1e0dd440c87b85b968a3e2d2bbaeaa08a6a1 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 4 Jun 2026 19:26:57 +0800
Subject: [PATCH 040/143] docs: fix factual errors in supported-models.md

- CLIP components: text_encoder/vision_encoder -> clip_text_model/clip_vision_model
- SigLIP components: text_model/vision_model -> siglip_text_model/siglip_vision_model
- SAM/SAM2: image_encoder + mask_decoder -> vision_encoder + prompt_encoder + mask_decoder
- BLIP: vision + text + decoder -> vision_encoder + decoder
- Qwen3 -> Qwen (source file is qwen.py)
- Remove non-existent --user-script CLI flag
---
 docs/reference/supported-models.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/reference/supported-models.md b/docs/reference/supported-models.md
index 7a2175583..b8bfba4aa 100644
--- a/docs/reference/supported-models.md
+++ b/docs/reference/supported-models.md
@@ -140,15 +140,15 @@ tracing, or composite pipelines):
 
 | Architecture | Type | Components |
 |-------------|------|------------|
-| CLIP | Vision-Language | text_encoder + vision_encoder |
-| SigLIP | Vision-Language | text_model + vision_model |
-| BLIP | Vision-Language | vision + text + decoder |
+| CLIP | Vision-Language | clip_text_model + clip_vision_model |
+| SigLIP | Vision-Language | siglip_text_model + siglip_vision_model |
+| BLIP | Vision-Language | vision_encoder + decoder |
 | T5 / BART | Encoder-Decoder | encoder + decoder |
 | Marian | Translation | encoder + decoder |
 | MU2 | Encoder-Decoder | encoder + decoder |
 | VisionEncoderDecoder | Image-to-Text | encoder + decoder |
-| SAM / SAM2 | Segmentation | image_encoder + mask_decoder |
-| Qwen3 | LLM | prefill + generation (composite) |
+| SAM / SAM2 | Segmentation | vision_encoder + prompt_encoder + mask_decoder |
+| Qwen | LLM | prefill + generation (composite) |
 
 ---
 
@@ -165,8 +165,8 @@ uv run winml inspect -m your-org/your-model
 uv run winml build -m your-org/your-model -d auto -o output/
 ```
 
-For truly custom architectures, use `--trust-remote-code` and optionally provide
-a `--user-script` with your model class definition.
+For truly custom architectures, use `--trust-remote-code` to allow execution of
+model code from the Hugging Face Hub.
 
 ---
 

From fab806900130000601fc5bac5884825dd6cc0ecf Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 4 Jun 2026 19:33:03 +0800
Subject: [PATCH 041/143] docs: remove custom build logic section from
 supported-models

Internal implementation detail not useful for end users.
---
 docs/reference/supported-models.md | 19 -------------------
 1 file changed, 19 deletions(-)

diff --git a/docs/reference/supported-models.md b/docs/reference/supported-models.md
index b8bfba4aa..19c3d8cc0 100644
--- a/docs/reference/supported-models.md
+++ b/docs/reference/supported-models.md
@@ -133,25 +133,6 @@ Each validated model is tested against available EPs:
 
 ---
 
-## Models with Custom Build Logic
-
-These architectures have specialized export handling (multi-component, custom
-tracing, or composite pipelines):
-
-| Architecture | Type | Components |
-|-------------|------|------------|
-| CLIP | Vision-Language | clip_text_model + clip_vision_model |
-| SigLIP | Vision-Language | siglip_text_model + siglip_vision_model |
-| BLIP | Vision-Language | vision_encoder + decoder |
-| T5 / BART | Encoder-Decoder | encoder + decoder |
-| Marian | Translation | encoder + decoder |
-| MU2 | Encoder-Decoder | encoder + decoder |
-| VisionEncoderDecoder | Image-to-Text | encoder + decoder |
-| SAM / SAM2 | Segmentation | vision_encoder + prompt_encoder + mask_decoder |
-| Qwen | LLM | prefill + generation (composite) |
-
----
-
 ## Adding Unsupported Models
 
 If your model architecture isn't in the catalog, winml-cli may still support it

From 81abd2610e4fc34b74af8517aaa066a45560935f Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 4 Jun 2026 19:47:02 +0800
Subject: [PATCH 042/143] docs: add analyze_result.json section to
 output-layout from actual build

Ran winml build -m microsoft/resnet-50 -d cpu --no-quant --no-compile
and used the real analyze_result.json output as the documentation example.

Also:
- Added export_htp_metadata.json to file listing
- Clarified build_manifest.json is Python API only
- Fixed classification levels: supported/partial/unsupported/unknown
- Used actual OP name format: OP/ai.onnx/Conv
---
 docs/reference/output-layout.md | 76 ++++++++++++++++++++++++++++++++-
 1 file changed, 74 insertions(+), 2 deletions(-)

diff --git a/docs/reference/output-layout.md b/docs/reference/output-layout.md
index 01377a254..0f7eeacf7 100644
--- a/docs/reference/output-layout.md
+++ b/docs/reference/output-layout.md
@@ -15,7 +15,9 @@ output/
 ├── model.onnx                  ← FINAL artifact (deploy this)
 ├── model.onnx.data             ← External weights (if model ≥ 100 MiB)
 ├── winml_build_config.json     ← Persisted build config
-├── build_manifest.json         ← Build provenance and timing
+├── analyze_result.json         ← Static analysis (EP compatibility)
+├── build_manifest.json         ← Build provenance (Python API only)
+├── export_htp_metadata.json    ← HTP export metadata (hierarchy info)
 ├── export.onnx                 ← Intermediate: raw ONNX export
 ├── export.onnx.data
 ├── optimized.onnx              ← Intermediate: after graph optimization
@@ -37,7 +39,9 @@ output/
 | `model.onnx` | The deployment-ready model. Always present. |
 | `model.onnx.data` | External weight data (only if model ≥ 100 MiB). Must stay alongside `model.onnx`. |
 | `winml_build_config.json` | The config used for this build (includes auto-discovered flags). Useful for reproducibility. |
-| `build_manifest.json` | Build metadata: stages run, timings, quantization stats. |
+| `analyze_result.json` | Static analysis output: EP compatibility, operator classification, detected patterns. |
+| `build_manifest.json` | Build provenance with stage timings. Only generated via the Python API (`build_hf_model`/`build_onnx_model`). |
+| `export_htp_metadata.json` | HTP export metadata: module hierarchy, tracing info, tagging coverage. |
 
 ### Intermediate Files (Can Delete After Build)
 
@@ -97,6 +101,74 @@ data file by name.
 
 ---
 
+## Analyzer Result
+
+`analyze_result.json` contains the static analysis output from the build pipeline's
+analyze stage. It reports EP compatibility and operator classification:
+
+```json
+{
+  "analysis_timestamp": "2026-06-04T19:45:17.496169",
+  "metadata": {
+    "model_path": "iter.onnx",
+    "opset_version": 17,
+    "producer_name": "pytorch",
+    "producer_version": "2.12.0",
+    "total_operators": 122,
+    "operator_counts": {
+      "Conv": 53,
+      "Relu": 49,
+      "MaxPool": 1,
+      "Add": 16,
+      "GlobalAveragePool": 1,
+      "Flatten": 1,
+      "Gemm": 1
+    },
+    "unique_operator_types": 7,
+    "detected_pattern_count": {}
+  },
+  "results": [
+    {
+      "ihv_type": "Microsoft",
+      "ep_type": "CPUExecutionProvider",
+      "device_type": "cpu",
+      "runtime_support": false,
+      "has_errors": false,
+      "has_warnings": false,
+      "classification": {
+        "supported": [],
+        "partial": [],
+        "unsupported": [],
+        "unknown": [
+          "OP/ai.onnx/Conv",
+          "OP/ai.onnx/Relu",
+          "OP/ai.onnx/MaxPool",
+          "OP/ai.onnx/Add",
+          "OP/ai.onnx/GlobalAveragePool",
+          "OP/ai.onnx/Flatten",
+          "OP/ai.onnx/Gemm"
+        ]
+      },
+      "information": []
+    }
+  ]
+}
+```
+
+Key fields:
+
+| Field | Description |
+|-------|-------------|
+| `metadata.total_operators` | Total ONNX operator nodes in the model graph |
+| `metadata.operator_counts` | Frequency of each operator type |
+| `metadata.detected_pattern_count` | Fused subgraph patterns (GeLU, LayerNorm, etc.) |
+| `results[].ihv_type` | Hardware vendor (`"Microsoft"`, `"QC"`, `"Intel"`, etc.) |
+| `results[].runtime_support` | `true` if the EP can run all operators |
+| `results[].classification` | Operators grouped by support level: `supported`, `partial`, `unsupported`, `unknown` |
+| `results[].has_errors` | `true` if unsupported ops exist (model won't run on that EP) |
+
+---
+
 ## Build Manifest
 
 `build_manifest.json` records provenance for every build:

From e800c35f253690c47a53e3dcd4d68b86174fd15a Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 4 Jun 2026 21:08:11 +0800
Subject: [PATCH 043/143] docs: add 'Build from Your Own ONNX File' tutorial

Covers the workflow for users who already have an .onnx file:
analyze, build with device targeting, skip stages, benchmark,
Python API inference, and config customization.
---
 docs/tutorials/build-from-onnx.md | 169 ++++++++++++++++++++++++++++++
 mkdocs.yml                        |   1 +
 2 files changed, 170 insertions(+)
 create mode 100644 docs/tutorials/build-from-onnx.md

diff --git a/docs/tutorials/build-from-onnx.md b/docs/tutorials/build-from-onnx.md
new file mode 100644
index 000000000..b656670d8
--- /dev/null
+++ b/docs/tutorials/build-from-onnx.md
@@ -0,0 +1,169 @@
+# Build from Your Own ONNX File
+
+This tutorial shows how to use winml-cli to optimize, quantize, and compile a
+model you've already exported to ONNX — without going through HuggingFace or
+PyTorch export.
+
+Use this workflow when:
+
+- You exported a model yourself (via `torch.onnx.export`, ONNX Runtime tools, etc.)
+- You received an `.onnx` file from a teammate or vendor
+- You want to optimize a model from the ONNX Model Zoo
+
+---
+
+## Prerequisites
+
+- **winml-cli** installed — see [Installation](../getting-started/installation.md)
+- An ONNX model file (e.g., `my_model.onnx`)
+- (Optional) QNN SDK for NPU compilation, or OpenVINO for Intel NPU
+
+---
+
+## Step 1: Analyze your ONNX file
+
+Before building, run the static analyzer to understand EP compatibility:
+
+```bash
+uv run winml analyze --model my_model.onnx
+```
+
+This reports which operators are supported by each EP, whether the model can run
+on NPU/GPU without modification, and which patterns (GeLU, LayerNorm, etc.) were
+detected.
+
+To save the analysis as JSON:
+
+```bash
+uv run winml analyze --model my_model.onnx --output analysis.json
+```
+
+---
+
+## Step 2: Build with `winml build`
+
+Pass your `.onnx` file directly as the model argument. winml-cli auto-detects
+that it's a local ONNX file and skips the export stage:
+
+```bash
+uv run winml build -m my_model.onnx -d cpu -o output/
+```
+
+This runs: **optimize → quantize → compile → model.onnx**.
+
+### Target a specific device
+
+=== "CPU (default)"
+
+    ```bash
+    uv run winml build -m my_model.onnx -d cpu -o output/
+    ```
+
+=== "NPU (QNN)"
+
+    ```bash
+    uv run winml build -m my_model.onnx -d npu -o output/
+    ```
+
+=== "GPU (DirectML)"
+
+    ```bash
+    uv run winml build -m my_model.onnx -d gpu --ep dml -o output/
+    ```
+
+### Skip stages
+
+If you only want optimization (no quantization or compilation):
+
+```bash
+uv run winml build -m my_model.onnx -d cpu --no-quant --no-compile -o output/
+```
+
+---
+
+## Step 3: Inspect the output
+
+After the build, your output directory looks like:
+
+```text
+output/
+├── model.onnx                     ← Deploy this
+├── my_model.onnx                  ← Copy of your input
+├── my_model_optimized.onnx        ← After graph optimization
+├── my_model_quantized.onnx        ← After quantization (if enabled)
+├── my_model_compiled.onnx         ← After EP compilation (if enabled)
+├── winml_build_config.json        ← Build config used
+└── analyze_result.json            ← EP compatibility analysis
+```
+
+!!! tip
+    If your model is ≥ 100 MiB, each `.onnx` file will have a companion
+    `.onnx.data` file containing the external weights.
+
+---
+
+## Step 4: Benchmark
+
+Run the performance benchmark against the final artifact:
+
+```bash
+uv run winml perf -m output/model.onnx -d cpu --warmup 5 --iterations 100
+```
+
+For NPU:
+
+```bash
+uv run winml perf -m output/model.onnx -d npu --warmup 5 --iterations 100
+```
+
+---
+
+## Step 5: Run inference (Python API)
+
+```python
+from winml.modelkit import WinMLAutoModel
+
+model = WinMLAutoModel.from_onnx(
+    "output/model.onnx",
+    task="image-classification",  # set your task
+    skip_build=True,              # already built, just load
+)
+
+output = model(pixel_values=your_input_tensor)
+```
+
+---
+
+## Using a config file
+
+For finer control, generate a config first and customize it:
+
+```bash
+uv run winml config --onnx my_model.onnx -d npu -o config.json
+```
+
+Edit `config.json` (adjust quantization parameters, compilation options, etc.),
+then build with it:
+
+```bash
+uv run winml build -m my_model.onnx -c config.json -o output/
+```
+
+---
+
+## Troubleshooting
+
+| Problem | Solution |
+|---------|----------|
+| "ONNX file not found" | Use an absolute path or ensure the file is in the current directory |
+| Quantization fails | Try `--no-quant` first to confirm optimize + compile works, then investigate calibration |
+| EP compilation fails | Ensure the target EP SDK is installed (`QNN_SDK_ROOT` for QNN, OpenVINO runtime for Intel) |
+| Model too large for memory | Use `--no-compile` and compile on the target device |
+
+---
+
+## See also
+
+- [Output Layout](../reference/output-layout.md) — what each output file contains
+- [Build Config Schema](../reference/index.md) — customize the build config
+- [ConvNeXt on NPU](npu-convnext.md) — full tutorial starting from HuggingFace
diff --git a/mkdocs.yml b/mkdocs.yml
index ed1da3955..2a9e5a91f 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -116,6 +116,7 @@ nav:
   - Tutorials:
       - Overview: tutorials/index.md
       - ConvNeXt on NPU: tutorials/npu-convnext.md
+      - Build from ONNX: tutorials/build-from-onnx.md
   - Reference:
       - Build Config Schema: reference/index.md
       - Python API: reference/python-api.md

From 222227bcb8a288ca6694be06badf2cdf668aecf0 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 4 Jun 2026 21:17:42 +0800
Subject: [PATCH 044/143] docs: rewrite ONNX tutorial as 'Bring Your Own Model'
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Section A — Primitive commands:
  analyze → optimize with recommendations → re-analyze (feedback loop)
  Plus optional quantize + compile for NPU

Section B — One-shot with winml build:
  Auto-runs the analyze-optimize convergence loop
  Config-driven NPU targeting with quantization

Matches the style and depth of the ConvNeXt NPU tutorial.
---
 docs/tutorials/build-from-onnx.md | 254 +++++++++++++++++++++---------
 mkdocs.yml                        |   2 +-
 2 files changed, 181 insertions(+), 75 deletions(-)

diff --git a/docs/tutorials/build-from-onnx.md b/docs/tutorials/build-from-onnx.md
index b656670d8..b300c8122 100644
--- a/docs/tutorials/build-from-onnx.md
+++ b/docs/tutorials/build-from-onnx.md
@@ -1,152 +1,255 @@
-# Build from Your Own ONNX File
+# Bring Your Own Model
 
-This tutorial shows how to use winml-cli to optimize, quantize, and compile a
-model you've already exported to ONNX — without going through HuggingFace or
-PyTorch export.
+This tutorial walks you through the complete workflow for optimizing, analyzing, and deploying an ONNX model you already have — whether you exported it yourself (`torch.onnx.export`, ONNX Runtime tools), received it from a teammate, or downloaded it from the ONNX Model Zoo.
 
-Use this workflow when:
+Unlike the [ConvNeXt on NPU](npu-convnext.md) tutorial which starts from a HuggingFace model ID, this tutorial assumes you already have a `.onnx` file on disk and want to make it run faster on your target hardware.
 
-- You exported a model yourself (via `torch.onnx.export`, ONNX Runtime tools, etc.)
-- You received an `.onnx` file from a teammate or vendor
-- You want to optimize a model from the ONNX Model Zoo
+The tutorial is split into two sections. Section A walks through the analyze → optimize → re-analyze loop using primitive commands, teaching you how the optimization feedback cycle works. Section B shows how `winml build` automates that same loop in a single command, optionally targeting NPU with quantization.
 
 ---
 
 ## Prerequisites
 
+- **Windows 11 24H2** — required for NPU stack support
+- **Python 3.11** and **uv** installed (`pip install uv` or follow [astral.sh/uv](https://astral.sh/uv))
 - **winml-cli** installed — see [Installation](../getting-started/installation.md)
-- An ONNX model file (e.g., `my_model.onnx`)
-- (Optional) QNN SDK for NPU compilation, or OpenVINO for Intel NPU
+- **An ONNX model file** — this tutorial uses `my_model.onnx` as a placeholder; substitute your own file
+- **For QNN (Snapdragon NPU):** QAIRT SDK installed and `QNN_SDK_ROOT` set to its root directory
+- **For OpenVINO (Intel CPU/GPU/NPU):** OpenVINO runtime installed and registered as an ONNX Runtime EP
+
+> No NPU? Set `--device cpu` wherever you see `--device npu`. Every other flag stays the same.
 
 ---
 
-## Step 1: Analyze your ONNX file
+## Section A — Primitive commands
+
+Working through the primitive commands one at a time reveals how the analyze–optimize feedback cycle works. Each command accepts the output of the previous step as input, and every intermediate artifact is available for inspection.
+
+### Step 1: Analyze the original model
 
-Before building, run the static analyzer to understand EP compatibility:
+Before any optimization, run the static analyzer to understand your model's EP compatibility and operator profile:
 
 ```bash
-uv run winml analyze --model my_model.onnx
+uv run winml analyze --model my_model.onnx --ep qnn --device npu
 ```
 
-This reports which operators are supported by each EP, whether the model can run
-on NPU/GPU without modification, and which patterns (GeLU, LayerNorm, etc.) were
-detected.
+The analyzer classifies every operator in the graph as **supported**, **partial**, **unsupported**, or **unknown** for the target EP. It also detects fusible subgraph patterns (GeLU, LayerNorm, Attention, etc.) that the optimizer can collapse.
 
-To save the analysis as JSON:
+Save the results to a file for reference:
 
 ```bash
-uv run winml analyze --model my_model.onnx --output analysis.json
+uv run winml analyze --model my_model.onnx --ep qnn --device npu --output before_optim.json
+```
+
+A representative output looks like:
+
+```text
+Model:              my_model.onnx
+Opset:              17
+Total operators:    245
+Unique op types:    12
+
+EP:                 QNNExecutionProvider (NPU)
+Runtime support:    ✓ (all operators supported)
+Patterns detected:  SUBGRAPH/GELU_Erf (12), SUBGRAPH/LayerNorm (6)
+
+Recommendation:     Enable gelu_fusion and layer_norm_fusion to reduce
+                    node count and improve NPU throughput.
 ```
 
+!!! note "What we just did"
+    The analyzer performs static analysis — no runtime or hardware required. It tells you two things: (1) can the model run on your target EP at all, and (2) are there graph patterns that the optimizer can fuse to improve performance. The "Recommendation" section is the key output — it tells you exactly which optimization flags to pass to the next step.
+
 ---
 
-## Step 2: Build with `winml build`
+### Step 2: Optimize with recommended options
 
-Pass your `.onnx` file directly as the model argument. winml-cli auto-detects
-that it's a local ONNX file and skips the export stage:
+Take the analyzer's recommendations and pass them as flags to the optimizer:
 
 ```bash
-uv run winml build -m my_model.onnx -d cpu -o output/
+uv run winml optimize -m my_model.onnx -o my_model_optimized.onnx \
+    --enable-gelu-fusion --enable-layer-norm-fusion
 ```
 
-This runs: **optimize → quantize → compile → model.onnx**.
+The optimizer reports how many nodes were reduced. Typical output:
+
+```text
+Input:     245 nodes (12 unique op types)
+Fusions:   gelu_fusion (12 matches), layer_norm_fusion (6 matches)
+Output:    209 nodes (8 unique op types)
+Saved:     my_model_optimized.onnx
+```
+
+To see all available optimization flags:
+
+```bash
+uv run winml optimize --list-capabilities
+```
+
+You can also drive optimization from a config file rather than CLI flags:
+
+```bash
+uv run winml optimize -m my_model.onnx -o my_model_optimized.onnx -c config.json
+```
+
+!!! note "What we just did"
+    Graph optimization fuses multi-node patterns (like the 5-node GeLU/Erf sequence) into single high-level operators that EPs can execute more efficiently. The optimizer is purely a graph transformation — it doesn't change the model's numerical behavior or require calibration data. Running it before quantization is important: calibration should be performed on the already-fused topology, not the verbose original graph.
+
+---
+
+### Step 3: Re-analyze the optimized model
+
+Run the analyzer again on the optimized output to confirm that the fusions resolved and no new issues appeared:
+
+```bash
+uv run winml analyze --model my_model_optimized.onnx --ep qnn --device npu --output after_optim.json
+```
+
+Compare the results:
+
+```text
+Model:              my_model_optimized.onnx
+Opset:              17
+Total operators:    209
+Unique op types:    8
+
+EP:                 QNNExecutionProvider (NPU)
+Runtime support:    ✓ (all operators supported)
+Patterns detected:  (none remaining — all fused)
+
+No further optimizations recommended.
+```
+
+!!! note "What we just did"
+    The analyze → optimize → re-analyze cycle is the fundamental feedback loop in winml-cli. In Section B you'll see that `winml build` automates this loop — it calls the analyzer, applies recommendations, re-analyzes, and repeats until convergence (typically 1–3 iterations). Doing it manually here teaches you what the automation is actually doing under the hood.
+
+---
+
+### Step 4: Benchmark the optimized model
+
+Measure the performance improvement from optimization:
 
-### Target a specific device
+```bash
+uv run winml perf -m my_model_optimized.onnx --device cpu --warmup 5 --iterations 50
+```
 
-=== "CPU (default)"
+For NPU (if you have the compiled model from a later step):
 
-    ```bash
-    uv run winml build -m my_model.onnx -d cpu -o output/
-    ```
+```bash
+uv run winml perf -m my_model_optimized.onnx --device npu --warmup 5 --iterations 50
+```
 
-=== "NPU (QNN)"
+---
 
-    ```bash
-    uv run winml build -m my_model.onnx -d npu -o output/
-    ```
+### Step 5 (optional): Quantize and compile for NPU
 
-=== "GPU (DirectML)"
+If your target is NPU deployment, continue the pipeline with quantization and compilation:
 
-    ```bash
-    uv run winml build -m my_model.onnx -d gpu --ep dml -o output/
-    ```
+```bash
+# Quantize (INT8, QDQ format)
+uv run winml quantize -m my_model_optimized.onnx -o my_model_int8.onnx --precision int8 --samples 32
 
-### Skip stages
+# Compile for QNN NPU
+uv run winml compile -m my_model_int8.onnx --device npu
+```
 
-If you only want optimization (no quantization or compilation):
+Then benchmark the final compiled artifact:
 
 ```bash
-uv run winml build -m my_model.onnx -d cpu --no-quant --no-compile -o output/
+uv run winml perf -m my_model_int8_npu_ctx.onnx --device npu --iterations 50 --monitor
 ```
 
 ---
 
-## Step 3: Inspect the output
+## Section B — One-shot with `winml build`
 
-After the build, your output directory looks like:
+Once you understand the analyze → optimize → re-analyze loop (which you now do), you can let `winml build` handle everything in one command. When you pass a `.onnx` file, winml-cli auto-detects it and skips the export stage — running the optimization loop, quantization, and compilation automatically.
+
+### CPU target (optimize only)
+
+```bash
+uv run winml build -m my_model.onnx -d cpu -o output/ --no-quant --no-compile
+```
+
+This runs the analyze–optimize convergence loop and writes the optimized model:
 
 ```text
 output/
 ├── model.onnx                     ← Deploy this
 ├── my_model.onnx                  ← Copy of your input
 ├── my_model_optimized.onnx        ← After graph optimization
-├── my_model_quantized.onnx        ← After quantization (if enabled)
-├── my_model_compiled.onnx         ← After EP compilation (if enabled)
-├── winml_build_config.json        ← Build config used
-└── analyze_result.json            ← EP compatibility analysis
+├── winml_build_config.json        ← Auto-generated build config
+└── analyze_result.json            ← Final analysis output
 ```
 
-!!! tip
-    If your model is ≥ 100 MiB, each `.onnx` file will have a companion
-    `.onnx.data` file containing the external weights.
+### NPU target (full pipeline)
 
----
+To get a quantized, compiled model for NPU in one shot, generate a config first:
 
-## Step 4: Benchmark
+```bash
+uv run winml config --onnx my_model.onnx -d npu --precision int8 -o config.json
+```
 
-Run the performance benchmark against the final artifact:
+Then build:
 
 ```bash
-uv run winml perf -m output/model.onnx -d cpu --warmup 5 --iterations 100
+uv run winml build -m my_model.onnx -c config.json -o output/
 ```
 
-For NPU:
+The pipeline runs: **analyze → optimize → (re-analyze → re-optimize if needed) → quantize → compile → model.onnx**.
+
+The output directory for a full NPU build looks like:
+
+```text
+output/
+├── model.onnx                     ← FINAL: compiled NPU artifact
+├── my_model.onnx                  ← Copy of your input
+├── my_model_optimized.onnx        ← After optimization loop converged
+├── my_model_quantized.onnx        ← After INT8 quantization
+├── my_model_compiled.onnx         ← After EP compilation
+├── winml_build_config.json        ← Config used (including auto-detected options)
+└── analyze_result.json            ← Analysis from optimize stage
+```
+
+!!! note "What we just did"
+    `winml build` with an ONNX input runs the same analyze → optimize → re-analyze convergence loop from Section A, but automatically. It reads the analyzer's recommendations, applies them, re-runs the analyzer, and repeats until no new recommendations appear (max 3 iterations by default). The config file specifies device, precision, and EP — so `--device npu --precision int8` in the config causes quantize and compile stages to run automatically.
+
+### Selectively skip stages
+
+- `--no-quant` — skip quantization (produces a floating-point optimized model)
+- `--no-compile` — skip compilation (useful if you'll compile on the target device later)
 
 ```bash
-uv run winml perf -m output/model.onnx -d npu --warmup 5 --iterations 100
+# Optimize + quantize, but skip compilation
+uv run winml build -m my_model.onnx -d npu --no-compile -o output/
 ```
 
 ---
 
-## Step 5: Run inference (Python API)
+## Using the Python API
 
 ```python
 from winml.modelkit import WinMLAutoModel
 
+# Load from a pre-built ONNX (skips the build pipeline)
 model = WinMLAutoModel.from_onnx(
     "output/model.onnx",
     task="image-classification",  # set your task
-    skip_build=True,              # already built, just load
+    skip_build=True,
 )
 
 output = model(pixel_values=your_input_tensor)
 ```
 
----
-
-## Using a config file
-
-For finer control, generate a config first and customize it:
+Or trigger the full build programmatically:
 
-```bash
-uv run winml config --onnx my_model.onnx -d npu -o config.json
-```
-
-Edit `config.json` (adjust quantization parameters, compilation options, etc.),
-then build with it:
+```python
+from winml.modelkit.build import build_onnx_model
+from winml.modelkit.config import generate_build_config
 
-```bash
-uv run winml build -m my_model.onnx -c config.json -o output/
+config = generate_build_config(onnx_path="my_model.onnx", device="npu", precision="int8")
+result = build_onnx_model("my_model.onnx", config=config, output_dir="output/")
+print(f"Final model: {result.final_onnx_path}")
 ```
 
 ---
@@ -156,14 +259,17 @@ uv run winml build -m my_model.onnx -c config.json -o output/
 | Problem | Solution |
 |---------|----------|
 | "ONNX file not found" | Use an absolute path or ensure the file is in the current directory |
-| Quantization fails | Try `--no-quant` first to confirm optimize + compile works, then investigate calibration |
+| Analyzer reports unsupported ops | Check if an optimization fusion resolves them; if not, the model needs modification for that EP |
+| Optimization loop doesn't converge | The default max is 3 iterations; if patterns persist, they may not be fusible — use `--no-quant --no-compile` and inspect |
+| Quantization accuracy regression | Try `--precision int16`, `--per-channel`, or increase `--samples` for better calibration |
 | EP compilation fails | Ensure the target EP SDK is installed (`QNN_SDK_ROOT` for QNN, OpenVINO runtime for Intel) |
 | Model too large for memory | Use `--no-compile` and compile on the target device |
 
 ---
 
-## See also
+## Where to go next
 
-- [Output Layout](../reference/output-layout.md) — what each output file contains
-- [Build Config Schema](../reference/index.md) — customize the build config
-- [ConvNeXt on NPU](npu-convnext.md) — full tutorial starting from HuggingFace
+- [ConvNeXt on NPU](npu-convnext.md) — the same pipeline starting from HuggingFace (includes export stage)
+- [Output Layout](../reference/output-layout.md) — what each output file contains and the `analyze_result.json` schema
+- [Concepts → Analyze and optimize](../concepts/analyze-and-optimize.md) — how the convergence loop works internally
+- [Build Config Schema](../reference/index.md) — customize quantization, compilation, and optimization settings
diff --git a/mkdocs.yml b/mkdocs.yml
index 2a9e5a91f..f3aff7628 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -116,7 +116,7 @@ nav:
   - Tutorials:
       - Overview: tutorials/index.md
       - ConvNeXt on NPU: tutorials/npu-convnext.md
-      - Build from ONNX: tutorials/build-from-onnx.md
+      - Bring Your Own Model: tutorials/build-from-onnx.md
   - Reference:
       - Build Config Schema: reference/index.md
       - Python API: reference/python-api.md

From 66c521df39e95383fc02d18e128f4ab582659806 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 4 Jun 2026 21:18:44 +0800
Subject: [PATCH 045/143] docs: rename tutorial to 'Bring Your Own ONNX Model'

---
 docs/tutorials/build-from-onnx.md | 2 +-
 mkdocs.yml                        | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/tutorials/build-from-onnx.md b/docs/tutorials/build-from-onnx.md
index b300c8122..88b6d5512 100644
--- a/docs/tutorials/build-from-onnx.md
+++ b/docs/tutorials/build-from-onnx.md
@@ -1,4 +1,4 @@
-# Bring Your Own Model
+# Bring Your Own ONNX Model
 
 This tutorial walks you through the complete workflow for optimizing, analyzing, and deploying an ONNX model you already have — whether you exported it yourself (`torch.onnx.export`, ONNX Runtime tools), received it from a teammate, or downloaded it from the ONNX Model Zoo.
 
diff --git a/mkdocs.yml b/mkdocs.yml
index f3aff7628..81a07ae9b 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -116,7 +116,7 @@ nav:
   - Tutorials:
       - Overview: tutorials/index.md
       - ConvNeXt on NPU: tutorials/npu-convnext.md
-      - Bring Your Own Model: tutorials/build-from-onnx.md
+      - Bring Your Own ONNX Model: tutorials/build-from-onnx.md
   - Reference:
       - Build Config Schema: reference/index.md
       - Python API: reference/python-api.md

From 99a80dd8da472c078d5e5eadaef41fe3e1b1e832 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 4 Jun 2026 21:20:09 +0800
Subject: [PATCH 046/143] docs: add QNN GPU (Adreno) device to EP table

---
 docs/concepts/eps-and-devices.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/concepts/eps-and-devices.md b/docs/concepts/eps-and-devices.md
index abf447c27..e735114d7 100644
--- a/docs/concepts/eps-and-devices.md
+++ b/docs/concepts/eps-and-devices.md
@@ -10,7 +10,7 @@ The table below lists every Execution Provider that winml-cli has explicit suppo
 
 | EP | Device | Hardware | When to use |
 |----|--------|----------|-------------|
-| `QNNExecutionProvider` | npu | Qualcomm NPU (Hexagon DSP) | Snapdragon-based Copilot+ PCs; best latency and power efficiency on Qualcomm silicon |
+| `QNNExecutionProvider` | npu / gpu | Qualcomm NPU (Hexagon DSP) / Qualcomm GPU (Adreno) | Snapdragon-based Copilot+ PCs; best latency and power efficiency on Qualcomm silicon |
 | `VitisAIExecutionProvider` | npu | AMD NPU (XDNA) | AMD Ryzen AI platforms; targets the AMD AI Engine via the Vitis AI stack |
 | `OpenVINOExecutionProvider` | npu / gpu / cpu | Intel CPU / GPU / NPU | Intel Core Ultra platforms; flexible device targeting across all three Intel compute types |
 | `DmlExecutionProvider` | gpu | GPU (DirectML) | Any DirectX 12 GPU on Windows; broad compatibility across AMD, Intel, and NVIDIA discrete/integrated graphics |

From 644d83273f1037818548c799c95af6a15c8fc97d Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 4 Jun 2026 21:21:10 +0800
Subject: [PATCH 047/143] docs: add Bring Your Own ONNX Model to tutorials
 overview

---
 docs/tutorials/index.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md
index c5c6838d2..7f4d23713 100644
--- a/docs/tutorials/index.md
+++ b/docs/tutorials/index.md
@@ -7,5 +7,6 @@ Tutorials are linear, prescriptive, end-to-end walkthroughs that guide you throu
 | Tutorial | What you'll build | Hardware |
 |---|---|---|
 | [ConvNeXt on NPU](npu-convnext.md) | A quantized ConvNeXt image classifier compiled for Snapdragon NPU (with CPU/DirectML fallback) | Copilot+PC NPU primary; CPU works as fallback |
+| [Bring Your Own ONNX Model](build-from-onnx.md) | Optimize and deploy an ONNX file you already have, using the analyze → optimize → re-analyze feedback loop | Any (CPU, NPU, GPU) |
 
 More tutorials are coming, covering additional model families, execution providers, and deployment scenarios. Check back as the `winml-cli` documentation expands.

From d92636a728b81d8098670d871b5b2656c75cf4c5 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 4 Jun 2026 21:26:40 +0800
Subject: [PATCH 048/143] docs: fix analyze/optimize workflow in BYOM tutorial

Use --optim-config to output optimization config from analyzer,
then pass it to optimizer with -c flag (matching actual CLI API).
---
 docs/tutorials/build-from-onnx.md | 34 ++++++++++++-------------------
 1 file changed, 13 insertions(+), 21 deletions(-)

diff --git a/docs/tutorials/build-from-onnx.md b/docs/tutorials/build-from-onnx.md
index 88b6d5512..da01c117d 100644
--- a/docs/tutorials/build-from-onnx.md
+++ b/docs/tutorials/build-from-onnx.md
@@ -27,18 +27,18 @@ Working through the primitive commands one at a time reveals how the analyze–o
 
 ### Step 1: Analyze the original model
 
-Before any optimization, run the static analyzer to understand your model's EP compatibility and operator profile:
+Before any optimization, run the static analyzer to understand your model's EP compatibility and get optimization recommendations:
 
 ```bash
-uv run winml analyze --model my_model.onnx --ep qnn --device npu
+uv run winml analyze --model my_model.onnx --optim-config optim_config.json
 ```
 
-The analyzer classifies every operator in the graph as **supported**, **partial**, **unsupported**, or **unknown** for the target EP. It also detects fusible subgraph patterns (GeLU, LayerNorm, Attention, etc.) that the optimizer can collapse.
+The analyzer classifies every operator in the graph as **supported**, **partial**, **unsupported**, or **unknown** for each EP. It also detects fusible subgraph patterns (GeLU, LayerNorm, Attention, etc.) and writes the recommended optimization flags to `optim_config.json`.
 
-Save the results to a file for reference:
+To target a specific EP:
 
 ```bash
-uv run winml analyze --model my_model.onnx --ep qnn --device npu --output before_optim.json
+uv run winml analyze --model my_model.onnx --ep qnn --device npu --optim-config optim_config.json
 ```
 
 A representative output looks like:
@@ -53,25 +53,23 @@ EP:                 QNNExecutionProvider (NPU)
 Runtime support:    ✓ (all operators supported)
 Patterns detected:  SUBGRAPH/GELU_Erf (12), SUBGRAPH/LayerNorm (6)
 
-Recommendation:     Enable gelu_fusion and layer_norm_fusion to reduce
-                    node count and improve NPU throughput.
+Optimization config saved to: optim_config.json
 ```
 
 !!! note "What we just did"
-    The analyzer performs static analysis — no runtime or hardware required. It tells you two things: (1) can the model run on your target EP at all, and (2) are there graph patterns that the optimizer can fuse to improve performance. The "Recommendation" section is the key output — it tells you exactly which optimization flags to pass to the next step.
+    The analyzer performs static analysis — no runtime or hardware required. It tells you two things: (1) can the model run on your target EP at all, and (2) are there graph patterns that the optimizer can fuse to improve performance. The `--optim-config` flag is the key — it outputs a JSON file with the exact optimization settings the optimizer needs to resolve the detected patterns.
 
 ---
 
-### Step 2: Optimize with recommended options
+### Step 2: Optimize with the generated config
 
-Take the analyzer's recommendations and pass them as flags to the optimizer:
+Pass the analyzer's output config directly to the optimizer:
 
 ```bash
-uv run winml optimize -m my_model.onnx -o my_model_optimized.onnx \
-    --enable-gelu-fusion --enable-layer-norm-fusion
+uv run winml optimize -m my_model.onnx -c optim_config.json -o my_model_optimized.onnx
 ```
 
-The optimizer reports how many nodes were reduced. Typical output:
+The optimizer applies the fusions specified in the config and reports how many nodes were reduced. Typical output:
 
 ```text
 Input:     245 nodes (12 unique op types)
@@ -80,18 +78,12 @@ Output:    209 nodes (8 unique op types)
 Saved:     my_model_optimized.onnx
 ```
 
-To see all available optimization flags:
+To see all available optimization capabilities:
 
 ```bash
 uv run winml optimize --list-capabilities
 ```
 
-You can also drive optimization from a config file rather than CLI flags:
-
-```bash
-uv run winml optimize -m my_model.onnx -o my_model_optimized.onnx -c config.json
-```
-
 !!! note "What we just did"
     Graph optimization fuses multi-node patterns (like the 5-node GeLU/Erf sequence) into single high-level operators that EPs can execute more efficiently. The optimizer is purely a graph transformation — it doesn't change the model's numerical behavior or require calibration data. Running it before quantization is important: calibration should be performed on the already-fused topology, not the verbose original graph.
 
@@ -102,7 +94,7 @@ uv run winml optimize -m my_model.onnx -o my_model_optimized.onnx -c config.json
 Run the analyzer again on the optimized output to confirm that the fusions resolved and no new issues appeared:
 
 ```bash
-uv run winml analyze --model my_model_optimized.onnx --ep qnn --device npu --output after_optim.json
+uv run winml analyze --model my_model_optimized.onnx --ep qnn --device npu
 ```
 
 Compare the results:

From d1a3fbad5b4a6f08ac29ea8ae6862d448c534be5 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 4 Jun 2026 21:39:23 +0800
Subject: [PATCH 049/143] docs: fix analyze/optimize output in BYOM tutorial
 from actual runs

- Show real analyze output format (per-EP op tables with S/P/U/Unk)
- Show real optimize output format (Loading/Running/Saving/Success)
- Note that simple models like ResNet have no fusible patterns
- Add tip about transformer models having higher reduction rates
---
 docs/tutorials/build-from-onnx.md | 74 ++++++++++++++++++-------------
 1 file changed, 42 insertions(+), 32 deletions(-)

diff --git a/docs/tutorials/build-from-onnx.md b/docs/tutorials/build-from-onnx.md
index da01c117d..aaae4a401 100644
--- a/docs/tutorials/build-from-onnx.md
+++ b/docs/tutorials/build-from-onnx.md
@@ -33,7 +33,7 @@ Before any optimization, run the static analyzer to understand your model's EP c
 uv run winml analyze --model my_model.onnx --optim-config optim_config.json
 ```
 
-The analyzer classifies every operator in the graph as **supported**, **partial**, **unsupported**, or **unknown** for each EP. It also detects fusible subgraph patterns (GeLU, LayerNorm, Attention, etc.) and writes the recommended optimization flags to `optim_config.json`.
+The analyzer classifies every operator in the graph as **supported**, **partial**, **unsupported**, or **unknown** for each available EP. It also detects fusible subgraph patterns and writes the recommended optimization flags to `optim_config.json`.
 
 To target a specific EP:
 
@@ -41,23 +41,39 @@ To target a specific EP:
 uv run winml analyze --model my_model.onnx --ep qnn --device npu --optim-config optim_config.json
 ```
 
-A representative output looks like:
+The output shows per-EP compatibility results:
 
 ```text
-Model:              my_model.onnx
-Opset:              17
-Total operators:    245
-Unique op types:    12
-
-EP:                 QNNExecutionProvider (NPU)
-Runtime support:    ✓ (all operators supported)
-Patterns detected:  SUBGRAPH/GELU_Erf (12), SUBGRAPH/LayerNorm (6)
-
-Optimization config saved to: optim_config.json
+══════════════════════════════════════════════════════════════════════════
+📊 OP CHECK
+══════════════════════════════════════════════════════════════════════════
+   📚 Model: my_model.onnx
+   🔺 Opset: 17  Producer: pytorch v2.12.0
+   📏 Operators: 122 total, 7 unique types
+   🏗️ Analysis targets: QNNExecutionProvider (NPU), QNNExecutionProvider (GPU)
+────────────────────────────────────────────────────────────────────────
+👻 EP 1: QNNExecutionProvider on NPU
+────────────────────────────────────────────────────────────────────────
+ Op Type                       S/P/U/Unk
+ 🃓 Conv (53)                  53/0/0/0
+ 🃓 Relu (49)                  49/0/0/0
+ 🃓 Add (16)                   16/0/0/0
+ 🃓 MaxPool (1)                1/0/0/0
+ 🃓 GlobalAveragePool (1)      1/0/0/0
+ 🃓 Flatten (1)                1/0/0/0
+ 🃓 Gemm (1)                   1/0/0/0
+ TOTAL (122)                   122/0/0/0
+══════════════════════════════════════════════════════════════════════════
+📊 ANALYSIS SUMMARY
+══════════════════════════════════════════════════════════════════════════
+   🃓 QNNExecutionProvider (NPU): 122/0/0/0
+      Ready to deploy
 ```
 
+If the analyzer detects fusible patterns (GeLU, LayerNorm, etc.), they will appear in the output and the `optim_config.json` will contain the recommended fusion settings. If no patterns are detected (as with simple architectures like ResNet), the config will be empty `{}`.
+
 !!! note "What we just did"
-    The analyzer performs static analysis — no runtime or hardware required. It tells you two things: (1) can the model run on your target EP at all, and (2) are there graph patterns that the optimizer can fuse to improve performance. The `--optim-config` flag is the key — it outputs a JSON file with the exact optimization settings the optimizer needs to resolve the detected patterns.
+    The analyzer performs static analysis — no runtime or hardware required. It tells you two things: (1) can the model run on your target EP at all, and (2) are there graph patterns that the optimizer can fuse to improve performance. The `--optim-config` flag outputs a JSON file with the exact optimization settings the optimizer needs. S/P/U/Unk = Supported/Partial/Unsupported/Unknown.
 
 ---
 
@@ -69,15 +85,22 @@ Pass the analyzer's output config directly to the optimizer:
 uv run winml optimize -m my_model.onnx -c optim_config.json -o my_model_optimized.onnx
 ```
 
-The optimizer applies the fusions specified in the config and reports how many nodes were reduced. Typical output:
+The optimizer applies the fusions specified in the config. Output:
 
 ```text
-Input:     245 nodes (12 unique op types)
-Fusions:   gelu_fusion (12 matches), layer_norm_fusion (6 matches)
-Output:    209 nodes (8 unique op types)
-Saved:     my_model_optimized.onnx
+Input: my_model.onnx
+Output: my_model_optimized.onnx
+Loading model...
+Running optimizer...
+Saving optimized model...
+
+Success! Model optimized: my_model_optimized.onnx
+Nodes: 122 -> 122 (0.0% reduction)
 ```
 
+!!! tip
+    The node reduction depends on your model's architecture. Simple models like ResNet (only Conv, Relu, Add) have no fusible patterns. Transformer-based models (BERT, ViT) typically see 10–30% node reduction from GeLU, LayerNorm, and Attention fusions.
+
 To see all available optimization capabilities:
 
 ```bash
@@ -97,20 +120,7 @@ Run the analyzer again on the optimized output to confirm that the fusions resol
 uv run winml analyze --model my_model_optimized.onnx --ep qnn --device npu
 ```
 
-Compare the results:
-
-```text
-Model:              my_model_optimized.onnx
-Opset:              17
-Total operators:    209
-Unique op types:    8
-
-EP:                 QNNExecutionProvider (NPU)
-Runtime support:    ✓ (all operators supported)
-Patterns detected:  (none remaining — all fused)
-
-No further optimizations recommended.
-```
+If the original analysis found fusible patterns that were optimized away, this run should show zero detected patterns and the same or better EP compatibility score.
 
 !!! note "What we just did"
     The analyze → optimize → re-analyze cycle is the fundamental feedback loop in winml-cli. In Section B you'll see that `winml build` automates this loop — it calls the analyzer, applies recommendations, re-analyzes, and repeats until convergence (typically 1–3 iterations). Doing it manually here teaches you what the automation is actually doing under the hood.

From d9ccc8a10ae03c5a166f74beec8d390a6a2963e7 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 4 Jun 2026 21:47:06 +0800
Subject: [PATCH 050/143] docs: fix build stage defaults in BYOM tutorial

- Compilation OFF by default (use --compile to enable)
- NPU: quantize ON (w8a16), GPU/CPU: quantize OFF (fp16)
- Flag is --no-quant (no --quantize toggle)
- Fix examples to match actual default behavior
---
 docs/tutorials/build-from-onnx.md | 38 +++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/docs/tutorials/build-from-onnx.md b/docs/tutorials/build-from-onnx.md
index aaae4a401..8dddc19c3 100644
--- a/docs/tutorials/build-from-onnx.md
+++ b/docs/tutorials/build-from-onnx.md
@@ -170,10 +170,10 @@ Once you understand the analyze → optimize → re-analyze loop (which you now
 ### CPU target (optimize only)
 
 ```bash
-uv run winml build -m my_model.onnx -d cpu -o output/ --no-quant --no-compile
+uv run winml build -m my_model.onnx -d cpu -o output/
 ```
 
-This runs the analyze–optimize convergence loop and writes the optimized model:
+Since `-d cpu` resolves to fp16 precision (no quantization) and compilation is off by default, this just runs the analyze–optimize convergence loop:
 
 ```text
 output/
@@ -186,15 +186,16 @@ output/
 
 ### NPU target (full pipeline)
 
-To get a quantized, compiled model for NPU in one shot, generate a config first:
+To get a quantized, compiled model for NPU in one shot, pass `--compile`:
 
 ```bash
-uv run winml config --onnx my_model.onnx -d npu --precision int8 -o config.json
+uv run winml build -m my_model.onnx -d npu --compile -o output/
 ```
 
-Then build:
+Or generate a config first for more control:
 
 ```bash
+uv run winml config --onnx my_model.onnx -d npu --precision int8 -o config.json
 uv run winml build -m my_model.onnx -c config.json -o output/
 ```
 
@@ -218,12 +219,31 @@ output/
 
 ### Selectively skip stages
 
-- `--no-quant` — skip quantization (produces a floating-point optimized model)
-- `--no-compile` — skip compilation (useful if you'll compile on the target device later)
+By default when auto-generating config (no `-c` flag):
+
+- **Compilation is OFF** — pass `--compile` to enable it
+- **Quantization depends on device**:
+    - `-d npu` → quantization ON (w8a16 precision by default)
+    - `-d gpu` / `-d cpu` → quantization OFF (fp16, no quantization)
+
+Override flags:
+
+- `--no-quant` — force skip quantization (even on NPU)
+- `--compile` — force enable compilation (requires EP SDK)
+- `--no-compile` — force skip compilation (default when no config file)
 
 ```bash
-# Optimize + quantize, but skip compilation
-uv run winml build -m my_model.onnx -d npu --no-compile -o output/
+# NPU: optimize + quantize (w8a16), skip compilation
+uv run winml build -m my_model.onnx -d npu -o output/
+
+# NPU: full pipeline including compilation
+uv run winml build -m my_model.onnx -d npu --compile -o output/
+
+# NPU: optimize only, no quantize, no compile
+uv run winml build -m my_model.onnx -d npu --no-quant -o output/
+
+# CPU/GPU: optimize only (quantize and compile are already off)
+uv run winml build -m my_model.onnx -d cpu -o output/
 ```
 
 ---

From 441e03d71db83cedba9cc94b6f323bd67476b37b Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 09:59:04 +0800
Subject: [PATCH 051/143] docs: fix 8 factual errors found during audit

- build.md: add missing 'optimize' stage to pipeline description
- build.md: add -d short form for --device flag
- perf.md: add -d short form for --device flag
- analyze-and-optimize.md: update capability count 43->57, add rewrite category
- python-api.md: fix supported_tasks() count 34->16
- supported-models.md: clarify 35 recognized vs 16 inference-supported tasks
- reference/index.md: fix eval model_path type to include dict for composites
---
 docs/commands/build.md                | 8 ++++----
 docs/commands/perf.md                 | 2 +-
 docs/concepts/analyze-and-optimize.md | 3 ++-
 docs/reference/index.md               | 2 +-
 docs/reference/python-api.md          | 2 +-
 docs/reference/supported-models.md    | 2 +-
 6 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/docs/commands/build.md b/docs/commands/build.md
index 857effba8..addab1ee2 100644
--- a/docs/commands/build.md
+++ b/docs/commands/build.md
@@ -1,13 +1,13 @@
 # winml build
 
-> Run the entire winml-cli pipeline (export → quantize → compile) in one command.
+> Run the entire winml-cli pipeline (export → optimize → quantize → compile) in one command.
 
 ## When to use this
 
 Use `winml build` when you want to go from a Hugging Face model ID (or an
 existing `.onnx` file) to a deployment-ready artifact in a single invocation,
-without manually chaining `winml export`, `winml quantize`, and `winml
-compile`. A build config file — generated by `winml config` — controls every
+without manually chaining `winml export`, `winml optimize`, `winml quantize`,
+and `winml compile`. A build config file — generated by `winml config` — controls every
 stage of the pipeline.
 
 ## Synopsis
@@ -29,7 +29,7 @@ $ winml build [options]
 | `--no-compile` / `--compile` | | flag | `None` | Override compilation. `--compile` forces enable (config must have a compile section). `--no-compile` forces skip. Default: inherit from config. |
 | `--no-optimize` | | flag | `false` | Skip the optimization stage (for pre-quantized ONNX input models). |
 | `--ep` | | string | `None` | Target execution provider for the analyzer (e.g., `qnn`). Falls back to the compile config EP if not set. |
-| `--device` | | string | `auto` | Target device for the analyzer (e.g., `npu`, `gpu`). Default: `auto` (auto-detect). |
+| `--device` | `-d` | string | `auto` | Target device for the analyzer (e.g., `npu`, `gpu`). Default: `auto` (auto-detect). |
 | `--no-analyze` | | flag | `false` | Skip the analyzer loop during build. |
 | `--max-optim-iterations` | | integer | `None` | Maximum autoconf re-optimization rounds (3 enforced internally when not set). `--no-analyze` implicitly sets this to 0. |
 | `--trust-remote-code` | | flag | `false` | Allow executing custom code from model repositories. Use only with trusted sources. |
diff --git a/docs/commands/perf.md b/docs/commands/perf.md
index 879f4ba76..f070469ff 100644
--- a/docs/commands/perf.md
+++ b/docs/commands/perf.md
@@ -20,7 +20,7 @@ $ winml perf [options]
 | `--task` | | `TEXT` | auto-detected | Explicit task override (e.g., `image-classification`). Inferred from the model if omitted. |
 | `--iterations` | | `INTEGER` | `100` | Number of timed inference iterations used to compute statistics. |
 | `--warmup` | | `INTEGER` | `10` | Number of warm-up iterations run before timing begins; excluded from statistics. |
-| `--device` | | `auto\|cpu\|gpu\|npu` | `auto` | Device to run the benchmark on. `auto` selects the highest-priority available device. |
+| `--device` | `-d` | `auto\|cpu\|gpu\|npu` | `auto` | Device to run the benchmark on. `auto` selects the highest-priority available device. |
 | `--precision` | | `TEXT` | `auto` | Precision mode applied during model build: `auto`, `fp32`, `fp16`, `int8`, `int16`, or compound forms such as `w8a16`. |
 | `--ep` | | `TEXT` | — | Force a specific execution provider (e.g., `qnn`, `dml`, `vitisai`, `openvino`, `cpu`). Overrides the device-to-provider mapping. |
 | `--output` | `-o` | `PATH` | `~/.cache/winml/perf/<slug>/<timestamp>.json` | Output JSON file path for the benchmark report. |
diff --git a/docs/concepts/analyze-and-optimize.md b/docs/concepts/analyze-and-optimize.md
index 879a24ca1..b72b2b9d9 100644
--- a/docs/concepts/analyze-and-optimize.md
+++ b/docs/concepts/analyze-and-optimize.md
@@ -115,7 +115,7 @@ The model validators run regardless of whether there are runtime check results 
 | **ORTFusionPipe** | ORT Python transformer optimizer: attention, LayerNorm, and RMSNorm fusions  |
 | **SurgeryPipe**   | Post-optimization model surgery (constant clamping, NaN guard removal)       |
 
-Every optimization is a named **capability** toggled via `--enable-<name>` and `--disable-<name>` flags. Run `--list-capabilities` to see all registered optimizations and their defaults. The optimizer currently ships 43 static capabilities across 13 categories:
+Every optimization is a named **capability** toggled via `--enable-<name>` and `--disable-<name>` flags. Run `--list-capabilities` to see all registered optimizations and their defaults. The optimizer currently ships 57 static capabilities across 13 categories:
 
 | Category     | Capabilities | Examples                                        |
 | ------------ | :----------: | ----------------------------------------------- |
@@ -130,6 +130,7 @@ Every optimization is a named **capability** toggled via `--enable-<name>` and `
 | Activation   | 2            | bias-softmax-fusion, bias-dropout-fusion         |
 | Attention    | 1            | attention-fusion                                 |
 | Misc         | 4            | pad-fusion, gather-to-slice-fusion               |
+| Rewrite      | 14           | attention-expandedattention, matmuladd-conv2d4d, layernormalization-singlelayernorm |
 | Surgery      | 2            | clamp-constant-values, remove-isnan-in-attention-mask |
 
 This granularity matters when a specific fusion breaks a downstream step or when you need an exact optimization profile for a given EP. Some capabilities declare dependencies (e.g., `bias-gelu-fusion` requires `gelu-fusion`); the optimizer resolves these automatically when you enable a flag.
diff --git a/docs/reference/index.md b/docs/reference/index.md
index 37ff62336..a7193d1be 100644
--- a/docs/reference/index.md
+++ b/docs/reference/index.md
@@ -129,7 +129,7 @@ Set to `null` (default) to skip evaluation.
 | Field | Type | Default | Description |
 |-------|------|---------|-------------|
 | `model_id` | `str \| null` | `null` | HuggingFace model ID for config resolution. |
-| `model_path` | `str \| null` | `null` | Path to .onnx file. |
+| `model_path` | `str \| dict[str, str] \| null` | `null` | Path to .onnx file, or a `{role: path}` dict for composite models. |
 | `task` | `str \| null` | `null` | Task type. |
 | `device` | `str` | `"auto"` | Inference device. |
 | `precision` | `str` | `"auto"` | Precision (`fp32`, `fp16`, `w8a16`, etc.). |
diff --git a/docs/reference/python-api.md b/docs/reference/python-api.md
index 8ed7c5a5b..be80045bd 100644
--- a/docs/reference/python-api.md
+++ b/docs/reference/python-api.md
@@ -101,7 +101,7 @@ WinMLAutoModel.from_onnx(
 WinMLAutoModel.supported_tasks() -> list[str]
 ```
 
-Returns all task strings the toolkit supports (34 tasks).
+Returns all task strings with dedicated inference classes (16 tasks).
 
 ---
 
diff --git a/docs/reference/supported-models.md b/docs/reference/supported-models.md
index 19c3d8cc0..d9f792575 100644
--- a/docs/reference/supported-models.md
+++ b/docs/reference/supported-models.md
@@ -25,7 +25,7 @@ uv run winml inspect --list-tasks
 
 ## Supported Tasks
 
-winml-cli supports **35 tasks** across vision, NLP, audio, and multimodal domains.
+winml-cli recognizes **35 task types** across vision, NLP, audio, and multimodal domains. Of these, 16 have dedicated inference classes; the remainder are supported via the generic task fallback.
 
 ### Vision
 

From 55ee2233fa69aae7b3c4ed6e657ea7f06db7670b Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 10:06:20 +0800
Subject: [PATCH 052/143] docs: add Agent Skill reference page

Document the use-winml-cli Copilot Skill that enables AI coding agents
to drive the winml pipeline. Covers skill capabilities, usage with
GitHub Copilot and custom agents, key principles, and example flow.
---
 docs/reference/agent-skill.md | 100 ++++++++++++++++++++++++++++++++++
 mkdocs.yml                    |   1 +
 2 files changed, 101 insertions(+)
 create mode 100644 docs/reference/agent-skill.md

diff --git a/docs/reference/agent-skill.md b/docs/reference/agent-skill.md
new file mode 100644
index 000000000..c92a6587b
--- /dev/null
+++ b/docs/reference/agent-skill.md
@@ -0,0 +1,100 @@
+# Agent Skill
+
+winml-cli ships a **Copilot Skill** (`use-winml-cli`) that lets AI coding agents
+drive the entire model-building pipeline on your behalf. When a coding agent has
+this skill attached, it can inspect models, generate configs, run builds, and
+interpret results — without you having to remember exact flags or stage ordering.
+
+---
+
+## What the skill provides
+
+The skill teaches the agent:
+
+| Capability | What the agent learns |
+|---|---|
+| **Pipeline shape** | The stage order (`inspect → export → analyze → optimize → quantize → compile → perf`) and when to enter mid-pipeline |
+| **Flag discovery** | Always run `winml <command> --help` before quoting a command — never fabricate flags |
+| **Output mapping** | Which command's `-o` produces the artifact the user actually needs |
+| **Scope awareness** | Which model architectures are supported (classic DL) vs. out-of-scope (LLMs, diffusion) |
+| **Hardware detection** | Use `winml sys --list-ep` to confirm what's available before targeting an EP |
+| **Two paths** | When to use primitives (debugging, exploring) vs. config + build (production, CI) |
+
+---
+
+## How to use it
+
+### In GitHub Copilot (Chat / Workspace)
+
+The skill is automatically available when working in this repository. Ask
+Copilot to build, benchmark, or debug a model and it will follow the skill's
+guidance:
+
+```
+@workspace Build microsoft/resnet-50 for my NPU and show me the latency
+```
+
+### In other agents (Copilot Extensions, custom MCP)
+
+Point the agent at the skill file:
+
+```
+skills/use-winml-cli/SKILL.md
+```
+
+The skill uses standard Copilot Skill format (YAML front-matter + markdown
+body). Any agent that supports skill ingestion can consume it directly.
+
+---
+
+## Skill location
+
+```
+winml-cli/
+└── skills/
+    └── use-winml-cli/
+        └── SKILL.md          ← the skill definition
+```
+
+---
+
+## Key principles encoded in the skill
+
+1. **Inspect first** — always run `winml inspect` before building to catch
+   unsupported architectures early.
+
+2. **Don't fabricate flags** — if a flag isn't in `--help`, it doesn't exist.
+   The skill enforces this as a hard rule.
+
+3. **Published outputs only** — each command has an explicit `-o` output; never
+   fish artifacts from internal cache.
+
+4. **EP-compiled models are EP-bound** — don't benchmark a QNN-compiled model on
+   the CPU EP. Use the pre-compile optimized ONNX for cross-EP comparison.
+
+5. **Scope gate** — the agent will refuse to attempt generative/decoder-only
+   models (GPT, LLaMA, Phi, Stable Diffusion) and explain they're out of scope.
+
+---
+
+## Example agent interaction
+
+```
+User: Can I run ConvNeXt on my Snapdragon X Elite NPU?
+
+Agent (with skill):
+1. Runs `winml sys --list-ep` → confirms QNNExecutionProvider is registered
+2. Runs `winml inspect -m microsoft/convnext-tiny-224` → confirms supported
+3. Runs `winml config --onnx ... -d npu -o config.json`
+4. Runs `winml build -c config.json -m microsoft/convnext-tiny-224 -o output/`
+5. Runs `winml perf -m output/model.onnx -d npu --monitor`
+6. Reports latency + NPU utilization to user
+```
+
+---
+
+## Updating the skill
+
+The skill lives at `skills/use-winml-cli/SKILL.md` in the repository root.
+When commands or flags change, update both the docs site and the skill file to
+keep agent behavior aligned with the CLI.
diff --git a/mkdocs.yml b/mkdocs.yml
index 81a07ae9b..3f757df6e 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -122,5 +122,6 @@ nav:
       - Python API: reference/python-api.md
       - Output Layout: reference/output-layout.md
       - Supported Models: reference/supported-models.md
+      - Agent Skill: reference/agent-skill.md
   - Troubleshooting: troubleshooting.md
   - Contributing: contributing.md

From 849298d974ab353231645b3ddf818a994664047c Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 10:26:49 +0800
Subject: [PATCH 053/143] docs: reorder quickstart to inspect before export

Follow the 'inspect first' principle: confirm the model is supported
before downloading weights and running export.
---
 docs/getting-started/quickstart.md | 56 ++++++++++++++++--------------
 1 file changed, 29 insertions(+), 27 deletions(-)

diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md
index 398863492..22aefa807 100644
--- a/docs/getting-started/quickstart.md
+++ b/docs/getting-started/quickstart.md
@@ -1,9 +1,9 @@
 # Quickstart
 
-This page proves your winml-cli install works end-to-end. You will export a
-Hugging Face image classifier to ONNX and then inspect the resulting artifact.
-No quantization, no execution-provider selection — just the two commands you
-need to confirm everything is wired up correctly. Estimated time: 5 minutes.
+This page proves your winml-cli install works end-to-end. You will inspect a
+Hugging Face image classifier, then export it to ONNX. No quantization, no
+execution-provider selection — just the commands you need to confirm everything
+is wired up correctly. Estimated time: 5 minutes.
 
 ## Verify the install
 
@@ -19,25 +19,12 @@ skipping SDK versions and Python environment details that plain `winml sys`
 would include. If the command exits without error, your winml-cli install is
 ready. See [`winml sys`](../commands/sys.md) for the full flag reference.
 
-## Export your first model
+## Inspect the model
 
-```bash
-uv run winml export -m microsoft/resnet-50 -o resnet50.onnx
-```
-
-!!! note "What just happened"
-    winml-cli downloaded the `microsoft/resnet-50` weights from Hugging Face,
-    ran the eight-step Hierarchy-preserving Tags Protocol (HTP) to trace the
-    PyTorch module tree, and wrote an ONNX file to `resnet50.onnx`. Each ONNX
-    node carries a `hierarchy_tag` metadata property recording its full PyTorch
-    ancestry, which downstream quantization and compilation steps use to reason
-    about the graph. See [`winml export`](../commands/export.md) for the full
-    flag reference.
-
-## Inspect the artifact
+Before downloading any weights, confirm that winml-cli recognises the model:
 
 ```bash
-uv run winml inspect -m resnet50.onnx
+uv run winml inspect -m microsoft/resnet-50
 ```
 
 ```text
@@ -50,13 +37,28 @@ uv run winml inspect -m resnet50.onnx
 ╰────────────────────────────────────────────────────────────────────────────╯
 ```
 
-When you pass a local `.onnx` file, `winml inspect` reads the embedded model
-metadata directly. When you pass a Hugging Face model ID instead, it reads
-the model's `config.json` from the Hub without downloading weights. In both
-cases it resolves the loader, exporter, and WinML inference class that
-winml-cli will use for this architecture. See
-[`winml inspect`](../commands/inspect.md) for output-format and hierarchy
-options.
+!!! note "What just happened"
+    `winml inspect` read only the model's `config.json` from Hugging Face Hub —
+    no weights downloaded — and confirmed that `microsoft/resnet-50` maps to a
+    supported task, a known model class, and a compatible ONNX exporter. Always
+    inspect before export to catch unsupported architectures early. See
+    [`winml inspect`](../commands/inspect.md) for output-format and hierarchy
+    options.
+
+## Export the model
+
+```bash
+uv run winml export -m microsoft/resnet-50 -o resnet50.onnx
+```
+
+!!! note "What just happened"
+    winml-cli downloaded the `microsoft/resnet-50` weights from Hugging Face,
+    ran the eight-step Hierarchy-preserving Tags Protocol (HTP) to trace the
+    PyTorch module tree, and wrote an ONNX file to `resnet50.onnx`. Each ONNX
+    node carries a `hierarchy_tag` metadata property recording its full PyTorch
+    ancestry, which downstream quantization and compilation steps use to reason
+    about the graph. See [`winml export`](../commands/export.md) for the full
+    flag reference.
 
 ## What's next
 

From b53add9d5da8c9326bac8f0bb70f8e2747cf243b Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 10:32:51 +0800
Subject: [PATCH 054/143] docs: move agent skill page to Getting Started

Better discoverable as 'Use with AI Agent' alongside Installation and
Quickstart, rather than buried in Reference.
---
 docs/{reference => getting-started}/agent-skill.md | 0
 mkdocs.yml                                         | 2 +-
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename docs/{reference => getting-started}/agent-skill.md (100%)

diff --git a/docs/reference/agent-skill.md b/docs/getting-started/agent-skill.md
similarity index 100%
rename from docs/reference/agent-skill.md
rename to docs/getting-started/agent-skill.md
diff --git a/mkdocs.yml b/mkdocs.yml
index 3f757df6e..d68a7f3a5 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -76,6 +76,7 @@ nav:
       - Installation: getting-started/installation.md
       - Quickstart: getting-started/quickstart.md
       - End-to-End Tour: getting-started/end-to-end.md
+      - Use with AI Agent: getting-started/agent-skill.md
   - Concepts:
       - Fundamentals:
           - How winml-cli works: concepts/how-it-works.md
@@ -122,6 +123,5 @@ nav:
       - Python API: reference/python-api.md
       - Output Layout: reference/output-layout.md
       - Supported Models: reference/supported-models.md
-      - Agent Skill: reference/agent-skill.md
   - Troubleshooting: troubleshooting.md
   - Contributing: contributing.md

From acbc074f7b98c3dc06a0e4b9b58435b10a473c24 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 10:35:35 +0800
Subject: [PATCH 055/143] docs: fix agent skill usage section for accuracy

Clarify that automatic skill loading only works with Copilot Coding Agent,
not @workspace Chat. Add guidance for manually attaching the skill to
other agents.
---
 docs/getting-started/agent-skill.md | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/docs/getting-started/agent-skill.md b/docs/getting-started/agent-skill.md
index c92a6587b..756bd766d 100644
--- a/docs/getting-started/agent-skill.md
+++ b/docs/getting-started/agent-skill.md
@@ -24,26 +24,28 @@ The skill teaches the agent:
 
 ## How to use it
 
-### In GitHub Copilot (Chat / Workspace)
+### With GitHub Copilot Coding Agent
 
-The skill is automatically available when working in this repository. Ask
-Copilot to build, benchmark, or debug a model and it will follow the skill's
-guidance:
+The [Copilot Coding Agent](https://docs.github.com/en/copilot/using-github-copilot/using-the-copilot-coding-agent)
+(the cloud agent that creates PRs) automatically reads `skills/use-winml-cli/SKILL.md`
+when working on this repository. No setup needed — assign an issue or ask
+Copilot to build/optimize a model and it will follow the skill's guidance to
+run the correct `winml` commands.
 
-```
-@workspace Build microsoft/resnet-50 for my NPU and show me the latency
-```
-
-### In other agents (Copilot Extensions, custom MCP)
+### With other AI agents
 
-Point the agent at the skill file:
+For agents that support custom instructions (e.g., Copilot Extensions, Claude,
+ChatGPT with file uploads, or custom MCP tool servers), attach the skill file
+as context:
 
 ```
 skills/use-winml-cli/SKILL.md
 ```
 
-The skill uses standard Copilot Skill format (YAML front-matter + markdown
-body). Any agent that supports skill ingestion can consume it directly.
+You can copy the file contents into your agent's system prompt, upload it as a
+reference document, or include it in a `.github/copilot-instructions.md` for
+VS Code Copilot Chat. The skill uses standard markdown with YAML front-matter —
+any agent that accepts text context can benefit from it.
 
 ---
 

From 94c71ea63aa213ccfbbec2fd194891fed0a64f7a Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 10:42:15 +0800
Subject: [PATCH 056/143] docs: add ONNX input entry point to pipeline diagram
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Show both paths: HF model → export → optimize, and existing ONNX → optimize.
---
 docs/concepts/how-it-works.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/concepts/how-it-works.md b/docs/concepts/how-it-works.md
index 91739833d..098d55f57 100644
--- a/docs/concepts/how-it-works.md
+++ b/docs/concepts/how-it-works.md
@@ -16,6 +16,7 @@ programmatic entry point for `WinMLAutoModel.from_pretrained()`.
 ```mermaid
 flowchart TD
     A[PyTorch / HF model] --> B[winml export]
+    O[Existing ONNX file] --> C
     B --> C[winml optimize]
     C --> D[winml quantize]
     D --> E[winml compile]

From f61b20904837fdb4cefd5df9e63f7dc907575828 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 10:58:26 +0800
Subject: [PATCH 057/143] docs: add EP alias column to eps-and-devices table

Show the short CLI alias (qnn, openvino, vitisai, dml, etc.) alongside
the canonical ONNX Runtime EP name. Source: constants.py EP_ALIASES.
---
 docs/concepts/eps-and-devices.md | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/docs/concepts/eps-and-devices.md b/docs/concepts/eps-and-devices.md
index e735114d7..ccfac8c37 100644
--- a/docs/concepts/eps-and-devices.md
+++ b/docs/concepts/eps-and-devices.md
@@ -6,17 +6,17 @@ A **device** is the hardware category that an EP targets — one of `npu`, `gpu`
 
 ## EPs winml-cli supports
 
-The table below lists every Execution Provider that winml-cli has explicit support for. EP names are the canonical ONNX Runtime strings accepted by `--ep`.
-
-| EP | Device | Hardware | When to use |
-|----|--------|----------|-------------|
-| `QNNExecutionProvider` | npu / gpu | Qualcomm NPU (Hexagon DSP) / Qualcomm GPU (Adreno) | Snapdragon-based Copilot+ PCs; best latency and power efficiency on Qualcomm silicon |
-| `VitisAIExecutionProvider` | npu | AMD NPU (XDNA) | AMD Ryzen AI platforms; targets the AMD AI Engine via the Vitis AI stack |
-| `OpenVINOExecutionProvider` | npu / gpu / cpu | Intel CPU / GPU / NPU | Intel Core Ultra platforms; flexible device targeting across all three Intel compute types |
-| `DmlExecutionProvider` | gpu | GPU (DirectML) | Any DirectX 12 GPU on Windows; broad compatibility across AMD, Intel, and NVIDIA discrete/integrated graphics |
-| `NvTensorRTRTXExecutionProvider` | gpu | NVIDIA GPU (TensorRT RTX) | NVIDIA RTX GPUs; maximum throughput via TensorRT graph optimization |
-| `MIGraphXExecutionProvider` | gpu | AMD GPU (MIGraphX) | AMD discrete GPUs; hardware-accelerated inference via the MIGraphX graph engine |
-| `CPUExecutionProvider` | cpu | CPU | Universal fallback; always available regardless of hardware |
+The table below lists every Execution Provider that winml-cli has explicit support for. EP names are the canonical ONNX Runtime strings accepted by `--ep`. You can also use the short **alias** (case-insensitive) anywhere the full name is accepted.
+
+| EP | Alias | Device | Hardware | When to use |
+|----|-------|--------|----------|-------------|
+| `QNNExecutionProvider` | `qnn` | npu / gpu | Qualcomm NPU (Hexagon DSP) / Qualcomm GPU (Adreno) | Snapdragon-based Copilot+ PCs; best latency and power efficiency on Qualcomm silicon |
+| `VitisAIExecutionProvider` | `vitisai` | npu | AMD NPU (XDNA) | AMD Ryzen AI platforms; targets the AMD AI Engine via the Vitis AI stack |
+| `OpenVINOExecutionProvider` | `openvino` | npu / gpu / cpu | Intel CPU / GPU / NPU | Intel Core Ultra platforms; flexible device targeting across all three Intel compute types |
+| `DmlExecutionProvider` | `dml` | gpu | GPU (DirectML) | Any DirectX 12 GPU on Windows; broad compatibility across AMD, Intel, and NVIDIA discrete/integrated graphics |
+| `NvTensorRTRTXExecutionProvider` | `nv_tensorrt_rtx` | gpu | NVIDIA GPU (TensorRT RTX) | NVIDIA RTX GPUs; maximum throughput via TensorRT graph optimization |
+| `MIGraphXExecutionProvider` | `migraphx` | gpu | AMD GPU (MIGraphX) | AMD discrete GPUs; hardware-accelerated inference via the MIGraphX graph engine |
+| `CPUExecutionProvider` | `cpu` | cpu | CPU | Universal fallback; always available regardless of hardware |
 
 To see which EPs are available on the current machine, run:
 

From 05048f291a251a1876e24cf54f8ed6046ddf98fb Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 11:00:55 +0800
Subject: [PATCH 058/143] docs: document auto/all for --device and --ep flags

- --device accepts auto (default) and all (analyze only)
- --ep accepts auto and all (analyze only)
- Add examples for both special values
---
 docs/concepts/eps-and-devices.md | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/docs/concepts/eps-and-devices.md b/docs/concepts/eps-and-devices.md
index ccfac8c37..216b4cbba 100644
--- a/docs/concepts/eps-and-devices.md
+++ b/docs/concepts/eps-and-devices.md
@@ -32,17 +32,24 @@ winml-cli exposes two overlapping flags for targeting hardware. Understanding th
 
 Accepts one of four values: `auto`, `cpu`, `gpu`, or `npu`. When set to `auto` (the default), winml-cli inspects the machine and selects the highest-priority device class that has a compatible EP available, in the order NPU > GPU > CPU. Setting an explicit value such as `--device npu` requests a device category without naming the EP.
 
+For `winml analyze`, `--device` also accepts `all` — this evaluates the model against every device that has rule data, producing a side-by-side compatibility report.
+
 ```bash
 # Let winml-cli pick the best available device
 winml analyze --model model.onnx --device auto
 
 # Target the NPU device class
 winml analyze --model model.onnx --device npu
+
+# Analyze against all devices at once (analyze only)
+winml analyze --model model.onnx --device all
 ```
 
 **`--ep` (low-level override)**
 
-Accepts a valid EP name (for example `qnn`, `vitisai`, `dml`, `openvino`). When `--ep` is provided it takes precedence over `--device` and bypasses device-class resolution entirely. Use `--ep` when you need to pin a specific provider — for instance to compare `QNNExecutionProvider` against `DmlExecutionProvider` on the same machine.
+Accepts a valid EP name or alias (for example `qnn`, `vitisai`, `dml`, `openvino`), or `auto` to let winml-cli resolve the EP from the device. When `--ep` is provided with a specific value it takes precedence over `--device` and bypasses device-class resolution entirely. Use `--ep` when you need to pin a specific provider — for instance to compare `QNNExecutionProvider` against `DmlExecutionProvider` on the same machine.
+
+For `winml analyze`, `--ep` also accepts `all` — this evaluates the model against every registered EP simultaneously.
 
 ```bash
 # Force Qualcomm QNN regardless of device selection
@@ -50,6 +57,9 @@ winml analyze --model model.onnx --ep QNNExecutionProvider --device npu
 
 # Use the short alias; winml-cli normalizes it to the full name
 winml analyze --model model.onnx --ep qnn
+
+# Analyze against all EPs at once (analyze only)
+winml analyze --model model.onnx --ep all
 ```
 
 The `--ep` flag accepts a free-form string and is not restricted to the choices listed above. This allows forward compatibility with EP names that winml-cli does not yet enumerate.

From 5f8760380b70b6362e35e3b03cd1545b41d7be46 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 11:09:31 +0800
Subject: [PATCH 059/143] docs: fix compile validation description to match
 code

Validation uses all-ones dummy inputs and checks for NaN/Inf in outputs.
It does NOT compare against the original model numerically.
---
 docs/concepts/compile-and-epcontext.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/concepts/compile-and-epcontext.md b/docs/concepts/compile-and-epcontext.md
index a4227e2c3..f5c37521c 100644
--- a/docs/concepts/compile-and-epcontext.md
+++ b/docs/concepts/compile-and-epcontext.md
@@ -30,9 +30,9 @@ If you are iterating on quantization settings or ONNX graphs and want to check w
 
 ## Skipping validation
 
-By default `winml compile` runs a validation pass after compilation finishes — it loads the compiled model, feeds it random inputs, and checks that the outputs are numerically consistent with the original. This catches compilation regressions early.
+By default `winml compile` runs a validation pass after compilation finishes — it loads the compiled model into an inference session, feeds it dummy inputs (all-ones tensors), and checks that the outputs do not contain NaN or Inf values. This catches basic compilation failures early (e.g., the EP rejecting the graph or producing garbage outputs).
 
-The `--no-validate` flag skips that pass. It is useful during rapid iteration when you only want to confirm that the EP can accept the model without the overhead of a full inference run. Do not use `--no-validate` for production builds. Shipping an unvalidated compiled artifact risks silent correctness regressions that are difficult to diagnose in the field.
+The `--no-validate` flag skips that pass. It is useful during rapid iteration when you only want to confirm that compilation succeeds without the overhead of a trial inference run.
 
 ## See also
 

From 9551e34d553558e322a45da463b1376d38a7179f Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 11:24:52 +0800
Subject: [PATCH 060/143] docs: fix commands overview factual errors

- Remove --debug from global flags (it's hidden=True)
- Clarify -p is only shorthand for --precision on config/quantize
---
 docs/commands/overview.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/docs/commands/overview.md b/docs/commands/overview.md
index 77cfacf31..34faed201 100644
--- a/docs/commands/overview.md
+++ b/docs/commands/overview.md
@@ -49,7 +49,7 @@ measure speed and accuracy.
 
 ## Global flags
 
-`-v` / `--verbose`, `-q` / `--quiet`, `--debug`, `--version`, and `-h` /
+`-v` / `--verbose`, `-q` / `--quiet`, `--version`, and `-h` /
 `--help` live on the root `winml` group only. Subcommands access them through
 `ctx.obj` and do not redefine them. See
 `src/winml/modelkit/cli.py` for the canonical contract.
@@ -58,9 +58,10 @@ measure speed and accuracy.
 
 Several flags share semantics across the commands that accept them:
 `-m` / `--model`, `-d` / `--device`, `--ep`, `-o` / `--output`,
-`-t` / `--task`, and `-p` / `--precision`. Defaults and accepted values can
-differ per command; check the **Flags** section of each command page rather
-than assuming they transfer.
+`-t` / `--task`, and `--precision`. Defaults and accepted values can
+differ per command (e.g., `-p` is a short form for `--precision` only on
+`config` and `quantize`); check the **Flags** section of each command page
+rather than assuming they transfer.
 
 ## See also
 

From abd8ef2f1c20d62141a0d9b3891d7a827b8d96fd Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 11:26:39 +0800
Subject: [PATCH 061/143] docs: note ort vs qairt compiler backends in compile
 step

Mention --compiler ort (default) and --compiler qairt options so users
know both backends are available.
---
 docs/samples/convnext-primitives.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/samples/convnext-primitives.md b/docs/samples/convnext-primitives.md
index 70f4deb48..d0ff1dc9b 100644
--- a/docs/samples/convnext-primitives.md
+++ b/docs/samples/convnext-primitives.md
@@ -84,7 +84,7 @@ Saved: convnext_int8.onnx
 
 ## Step 5: Compile for each EP
 
-Compilation pre-bakes an EP-specific binary cache into the ONNX graph so the runtime can skip per-session JIT compilation.
+Compilation pre-bakes an EP-specific binary cache into the ONNX graph so the runtime can skip per-session JIT compilation. Two compiler backends are available: `ort` (default, uses ONNX Runtime's built-in compiler) and `qairt` (uses the QAIRT SDK's offline compiler directly). Pass `--compiler qairt` if you have the QAIRT SDK and want direct QNN compilation.
 
 === "CPU"
 

From 1c6ac5c77756734866431541774a005e3d813576 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 11:31:30 +0800
Subject: [PATCH 062/143] write Qwen3 composite model sample from source code

---
 docs/samples/qwen3-composite.md | 145 ++++++++++++++++++++++++++++----
 1 file changed, 130 insertions(+), 15 deletions(-)

diff --git a/docs/samples/qwen3-composite.md b/docs/samples/qwen3-composite.md
index 5875a2e7e..5545ba30c 100644
--- a/docs/samples/qwen3-composite.md
+++ b/docs/samples/qwen3-composite.md
@@ -1,27 +1,142 @@
 # Qwen3 — Composite Models
 
-!!! info "Coming soon"
-    Composite-model support — running models with multiple components like a text encoder + decoder, or a vision encoder + LLM, through a single winml-cli pipeline — is on an in-progress feature branch. This page will be authored once that work merges.
+Qwen3 (`Qwen/Qwen3-0.6B`, `Qwen/Qwen3-1.7B`, etc.) is a decoder-only large language model that uses grouped-query attention and a sliding-window KV cache. winml-cli treats it as a **composite model** — a model that is split into multiple ONNX sub-models that run together at inference time. For Qwen3, the two sub-models are:
 
-## What composite models are
+| Sub-model | Role | Input shape (`input_ids`) | Output KV shape |
+|-----------|------|--------------------------|-----------------|
+| `decoder_prefill` | Processes the full prompt in chunks | `[1, 64]` | `[1, kv_heads, 64, head_dim]` |
+| `decoder_gen` | Generates one token at a time | `[1, 1]` | `[1, kv_heads, 1, head_dim]` |
 
-A composite model is a system made up of two or more distinct ONNX sub-models that work together as a single inference pipeline. A common example is a vision-language model like Qwen3-VL, where a vision encoder processes an image and feeds its output into a separate language model decoder. Another pattern is an encoder-decoder pair — two ONNX files that share a tokenizer configuration and must be executed in sequence at runtime. Multi-stage pipelines generalize this further: the output tensor of one sub-model becomes the input tensor of the next, with each stage potentially targeting a different execution provider or precision. Composite models add coordination complexity beyond what a single ONNX graph requires, so they call for first-class support in the build and inference tooling rather than ad hoc stitching.
+Both sub-models share the same weights and KV cache buffer. Splitting prefill from generation lets each ONNX graph have fully static shapes, which is required for efficient NPU compilation.
 
-## What Qwen3 will demonstrate
+## Prerequisites
 
-The following is a forward-looking sketch of what this sample will cover once the composite-model feature branch lands:
+- winml-cli installed and `winml` on your PATH.
+- A network connection to download Qwen3 weights from HuggingFace on first run.
+- At least 4 GB free disk space (for `Qwen3-0.6B`; larger variants need more).
 
-- How to declare a composite model in a `BuildConfig` — specifying multiple sub-models, their connection points, and a shared tokenizer configuration.
-- How `winml build` orchestrates export and compilation of each sub-model independently, then assembles the composite pipeline.
-- How to run end-to-end inference across the composite pipeline using a single `winml` invocation.
-- How to benchmark each sub-model's latency independently with `winml perf` to identify bottlenecks.
-- This section is a sketch and will be revised once the implementation lands; details may change.
+## Step 1: Generate build configs
 
-## Track progress
+```bash
+winml config -m Qwen/Qwen3-0.6B --task text-generation -o qwen3.json
+```
 
-Follow development and check current status at https://github.com/microsoft/winml-cli.
+Because `(qwen3, text-generation)` is registered as a composite model, this command produces **two** config files — one per sub-model:
+
+- `qwen3_decoder_prefill.json` — export config using `feature-extraction` task
+- `qwen3_decoder_gen.json` — export config using `text2text-generation` task
+
+Each config includes Qwen3-specific optimizations (dynamo export, opset 18, GeLU fusion, RMSNorm fusion, MatMul+Add fusion, clamp constant values, and remove-IsNaN-in-attention-mask).
+
+## Step 2: Build each sub-model
+
+Build both sub-models individually using their config files:
+
+```bash
+# Build the prefill sub-model
+winml build -c qwen3_decoder_prefill.json -m Qwen/Qwen3-0.6B -o output/prefill
+
+# Build the generation sub-model
+winml build -c qwen3_decoder_gen.json -m Qwen/Qwen3-0.6B -o output/gen
+```
+
+Each `winml build` runs the full pipeline: export (via torch dynamo) → optimize → quantize → compile. The output directories contain the final ONNX files ready for inference.
+
+To target a specific execution provider (e.g., QNN for NPU):
+
+```bash
+winml build -c qwen3_decoder_prefill.json -m Qwen/Qwen3-0.6B -o output/prefill --ep qnn
+winml build -c qwen3_decoder_gen.json -m Qwen/Qwen3-0.6B -o output/gen --ep qnn
+```
+
+## Step 3: Benchmark each sub-model
+
+```bash
+winml perf output/prefill -d npu
+winml perf output/gen -d npu
+```
+
+This lets you identify whether the prefill or generation phase is the bottleneck on your target hardware.
+
+## Step 4: Run inference (Python API)
+
+The `WinMLQwen3Model` class combines both sub-models into a single generation pipeline that implements HuggingFace's `GenerationMixin` interface:
+
+```python
+from winml.modelkit.models.hf.qwen import WinMLQwen3Model
+
+# Build and load both sub-models in one call
+model = WinMLQwen3Model.from_pretrained("Qwen/Qwen3-0.6B", task="text-generation")
+
+# Or load pre-built ONNX files (skips re-export/optimization)
+from winml.modelkit.models.auto import WinMLAutoModel
+from transformers import AutoConfig
+
+prefill = WinMLAutoModel.from_pretrained("output/prefill/model.onnx", skip_build=True)
+gen = WinMLAutoModel.from_pretrained("output/gen/model.onnx", skip_build=True)
+config = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")
+
+model = WinMLQwen3Model(
+    sub_models={"decoder_prefill": prefill, "decoder_gen": gen},
+    config=config,
+)
+
+# Generate text using HF's standard generate() API
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
+inputs = tokenizer("Hello, how are you?", return_tensors="pt")
+output_ids = model.generate(**inputs, max_new_tokens=50)
+print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+```
+
+### Customizing shape config per sub-model
+
+You can pass different `shape_config` to each sub-model via `sub_model_kwargs`:
+
+```python
+model = WinMLQwen3Model.from_pretrained(
+    "Qwen/Qwen3-0.6B",
+    task="text-generation",
+    sub_model_kwargs={
+        "decoder_prefill": {"shape_config": {"max_cache_len": 512, "seq_len": 64}},
+        "decoder_gen":     {"shape_config": {"max_cache_len": 512, "seq_len": 1}},
+    },
+)
+```
+
+## How it works internally
+
+The composite model architecture for Qwen3:
+
+```mermaid
+graph LR
+    A[winml config] -->|"(qwen3, text-generation)"| B[Composite Registry]
+    B --> C[decoder_prefill config]
+    B --> D[decoder_gen config]
+    C --> E[winml build → prefill.onnx]
+    D --> F[winml build → gen.onnx]
+    E --> G[WinMLQwen3Model]
+    F --> G
+    G -->|GenerationMixin.generate| H[Token output]
+```
+
+Key design decisions:
+
+- **Dynamo export required** — TorchScript fails for Qwen3; dynamo produces opset 18 graphs.
+- **Sliding-window KV cache** — Uses Slice+Concat (FIFO) instead of index_copy_. New KV tokens are appended at the end of the buffer; oldest tokens are evicted.
+- **Static shapes throughout** — Both sub-models have fixed input/output shapes, enabling ahead-of-time compilation for NPU.
+- **GenericTask registration** — Sub-models are registered as `WinMLModelForGenericTask` so their raw ONNX outputs (logits + KV) are preserved without task-specific post-processing.
+
+## Other composite models
+
+The same composite model pattern is used for:
+
+- **T5** (`google-t5/t5-small`) — encoder + decoder architecture for translation/summarization
+- **Mu2** — encoder-decoder with custom code (`trust_remote_code=True`)
 
 ## See also
 
-- [BERT — Config + Build + Perf](../samples/bert-config-build.md)
-- [Config and build](../concepts/config-and-build.md)
+- [BERT — Config + Build + Perf](bert-config-build.md) — single-model workflow
+- [ConvNeXt — Primitive commands](convnext-primitives.md) — step-by-step pipeline
+- [Config and build](../concepts/config-and-build.md) — concept overview

From 912cc9ffd3d5442534e6cb2960e5ab6d2628aa6e Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 11:34:57 +0800
Subject: [PATCH 063/143] add ORT compiler backend to tutorial compile step

---
 docs/tutorials/npu-convnext.md | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/docs/tutorials/npu-convnext.md b/docs/tutorials/npu-convnext.md
index 8eda5a7d3..91b71ae17 100644
--- a/docs/tutorials/npu-convnext.md
+++ b/docs/tutorials/npu-convnext.md
@@ -142,23 +142,34 @@ The quantizer generates 32 random calibration samples, runs them through the mod
 
 Compilation converts the portable quantized ONNX into an EP-specific binary format that the execution provider can load directly, skipping JIT compilation at inference time. This is the step that produces a device-locked artifact — the output is tied to the specific EP and, for QNN, to the QNN SDK version.
 
-=== "QNN (Snapdragon NPU)"
+Two compiler backends are available:
+
+- **`--compiler ort`** (default) — uses ONNX Runtime's built-in EP context compiler. Works for QNN and OpenVINO targets without needing the vendor SDK on the build machine (the ORT package bundles the necessary libraries).
+- **`--compiler qairt`** — uses the QAIRT SDK's offline compiler directly. Requires `QNN_SDK_ROOT` to point at a local QAIRT SDK installation. Produces the same EPContext output but goes through Qualcomm's native toolchain.
+
+=== "QNN via ORT (default)"
 
     ```bash
-    # Requires QNN_SDK_ROOT env var set to your QAIRT SDK root
     uv run winml compile -m convnext_int8.onnx --device npu
     ```
 
+=== "QNN via QAIRT SDK"
+
+    ```bash
+    # Requires QNN_SDK_ROOT env var set to your QAIRT SDK root
+    uv run winml compile -m convnext_int8.onnx --device npu --compiler qairt
+    ```
+
 === "OpenVINO (Intel CPU/GPU/NPU)"
 
     ```bash
     uv run winml compile -m convnext_int8.onnx --device npu --ep openvino
     ```
 
-The compiled output file appears in the same directory as the input model. For QNN, the file name follows the pattern `convnext_int8_npu_ctx.onnx` (using the resolved device string `npu`, not the EP name) and an accompanying `.bin` context binary is written alongside it. For OpenVINO targeting the NPU, the compiled artifact is also named `convnext_int8_npu_ctx.onnx`. CPU builds do not produce a new artifact — the compile step validates EP compatibility but writes no output file; use `convnext_int8.onnx` directly for CPU inference.
+The compiled output file appears in the same directory as the input model. The file name follows the pattern `convnext_int8_npu_ctx.onnx` (using the resolved device string `npu`, not the EP name) and an accompanying `.bin` context binary is written alongside it (unless `--embed` is passed, which embeds the binary inside the ONNX file). CPU builds do not produce a new artifact — the compile step validates EP compatibility but writes no output file; use `convnext_int8.onnx` directly for CPU inference.
 
 !!! note "What we just did"
-    Compilation embeds EP context — the compiled binary — inside or alongside the ONNX file using the `EPContext` node convention. At inference time the runtime loads the pre-compiled binary directly rather than re-compiling from the ONNX graph, eliminating the 15–60 second JIT penalty on first load. winml-cli locates the QAIRT SDK libraries needed for QNN compilation through `QNN_SDK_ROOT` (set as an environment variable, or passed with `--qnn-sdk-root` on `winml compile`). `winml build` reads only the env var. See [Concepts → Compile and EPContext](../concepts/compile-and-epcontext.md) for the full picture of what gets embedded and how the context is consumed at runtime.
+    Compilation embeds EP context — the compiled binary — inside or alongside the ONNX file using the `EPContext` node convention. At inference time the runtime loads the pre-compiled binary directly rather than re-compiling from the ONNX graph, eliminating the 15–60 second JIT penalty on first load. The default `--compiler ort` backend bundles compilation within ONNX Runtime itself. The `--compiler qairt` backend calls the QAIRT SDK directly and requires `QNN_SDK_ROOT` (set as an environment variable, or passed with `--qnn-sdk-root` on `winml compile`). `winml build` reads only the env var. See [Concepts → Compile and EPContext](../concepts/compile-and-epcontext.md) for the full picture of what gets embedded and how the context is consumed at runtime.
 
 ---
 

From 94e060d8305de070504d2c8164d69d480bf061f7 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 11:37:08 +0800
Subject: [PATCH 064/143] clarify ORT vs QAIRT compiler backends in compile
 sections

---
 docs/samples/convnext-primitives.md | 17 ++++++++++++-----
 docs/tutorials/build-from-onnx.md   |  2 +-
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/docs/samples/convnext-primitives.md b/docs/samples/convnext-primitives.md
index d0ff1dc9b..d556de464 100644
--- a/docs/samples/convnext-primitives.md
+++ b/docs/samples/convnext-primitives.md
@@ -98,16 +98,23 @@ Compilation pre-bakes an EP-specific binary cache into the ONNX graph so the run
     winml compile -m convnext_int8.onnx --output-dir . --device gpu
     ```
 
-=== "NPU"
+=== "NPU (ORT, default)"
 
     ```bash
-    winml compile -m convnext_int8.onnx --output-dir . --device npu --qnn-sdk-root <path-to-qnn-sdk>
+    winml compile -m convnext_int8.onnx --output-dir . --device npu
     ```
 
-!!! note "NPU requires the QNN SDK"
-    Compilation for `--device npu` invokes the Qualcomm QNN offline compiler, which must be installed separately. Pass `--qnn-sdk-root` pointing at the root of your QAIRT SDK installation, or set the `QNN_SDK_ROOT` environment variable to the same path. If the SDK is absent, compile for CPU or GPU instead. For a full explanation of how EPs relate to device targets see [ONNX & Execution Providers](../concepts/eps-and-devices.md).
+=== "NPU (QAIRT SDK)"
 
-Only the NPU invocation writes a new compiled artifact — `convnext_int8_npu_ctx.onnx` — which contains an EPContext node embedding the pre-compiled Hexagon binary. CPU and GPU compile with `enable_ep_context=False` by default: the compile step validates the model against the target EP but does not produce a new file. For CPU and GPU perf benchmarks (Step 6), use the quantized `convnext_int8.onnx` directly.
+    ```bash
+    # Requires QNN_SDK_ROOT env var or --qnn-sdk-root
+    winml compile -m convnext_int8.onnx --output-dir . --device npu --compiler qairt --qnn-sdk-root <path-to-qnn-sdk>
+    ```
+
+!!! note "NPU compiler backends"
+    The default `--compiler ort` backend uses ONNX Runtime's built-in QNN compilation — no separate SDK installation is needed. The `--compiler qairt` backend calls the QAIRT SDK's offline compiler directly and requires either `--qnn-sdk-root` or the `QNN_SDK_ROOT` environment variable. Both produce the same EPContext output format. For a full explanation of how EPs relate to device targets see [ONNX & Execution Providers](../concepts/eps-and-devices.md).
+
+Only the NPU invocation writes a new compiled artifact — `convnext_int8_npu_ctx.onnx` — which contains an EPContext node embedding the pre-compiled binary. CPU and GPU compile with `enable_ep_context=False` by default: the compile step validates the model against the target EP but does not produce a new file. For CPU and GPU perf benchmarks (Step 6), use the quantized `convnext_int8.onnx` directly.
 
 ## Step 6: Benchmark
 
diff --git a/docs/tutorials/build-from-onnx.md b/docs/tutorials/build-from-onnx.md
index 8dddc19c3..fe6898e3c 100644
--- a/docs/tutorials/build-from-onnx.md
+++ b/docs/tutorials/build-from-onnx.md
@@ -151,7 +151,7 @@ If your target is NPU deployment, continue the pipeline with quantization and co
 # Quantize (INT8, QDQ format)
 uv run winml quantize -m my_model_optimized.onnx -o my_model_int8.onnx --precision int8 --samples 32
 
-# Compile for QNN NPU
+# Compile for NPU (default --compiler ort; use --compiler qairt for QAIRT SDK)
 uv run winml compile -m my_model_int8.onnx --device npu
 ```
 

From 274fcaf606eb58b0f3bc93b5ee4b89562eca9c33 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 11:39:52 +0800
Subject: [PATCH 065/143] document auto field behavior and fix convnext compile
 tabs

---
 docs/reference/index.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/docs/reference/index.md b/docs/reference/index.md
index a7193d1be..23c169ecb 100644
--- a/docs/reference/index.md
+++ b/docs/reference/index.md
@@ -181,6 +181,17 @@ Set to `null` (default) to skip evaluation.
 }
 ```
 
+### The `auto` field
+
+The top-level `"auto"` field (default: `true`) controls whether the build pipeline runs the **autoconf loop** — an iterative analyze → discover → re-optimize cycle that automatically detects which additional graph optimizations the model needs for the target EP.
+
+| Value | Behavior |
+|-------|----------|
+| `true` (default) | After initial optimization, the analyzer inspects the graph for unsupported or sub-optimal nodes and proposes additional optimization flags. The pipeline re-optimizes using the discovered flags and repeats (up to `--max-optim-iterations`, default 3). The final optimization result depends on what the analyzer discovers at runtime, so **outputs may vary** if the model or EP support changes between runs. |
+| `false` | The pipeline applies only the explicit `optim` flags from the config — no autoconf discovery, no re-optimization loop. Builds are **fully deterministic** given the same config and input model. Use this for reproducible CI builds or when you have already tuned the optimization flags manually. |
+
+When `auto` is `true` and the autoconf loop discovers additional flags, the final persisted config (written to the output directory) includes the merged result so you can inspect what was discovered.
+
 ## See also
 
 - [winml config](../commands/config.md) — generate a config interactively

From 6ad30d6525021734c1bf47e2a92b8d52a93cfe2c Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 11:42:22 +0800
Subject: [PATCH 066/143] emphasize winml_build_config.json as reproducible
 CI/CD artifact

---
 docs/reference/output-layout.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/reference/output-layout.md b/docs/reference/output-layout.md
index 0f7eeacf7..c560ae8f5 100644
--- a/docs/reference/output-layout.md
+++ b/docs/reference/output-layout.md
@@ -38,7 +38,7 @@ output/
 |------|---------|
 | `model.onnx` | The deployment-ready model. Always present. |
 | `model.onnx.data` | External weight data (only if model ≥ 100 MiB). Must stay alongside `model.onnx`. |
-| `winml_build_config.json` | The config used for this build (includes auto-discovered flags). Useful for reproducibility. |
+| `winml_build_config.json` | The complete pipeline config used for this build (includes auto-discovered optimization flags). This file is a **reproducible pipeline specification** — check it into version control or feed it directly to `winml build -c` in a CI/CD pipeline to guarantee identical model processing across machines and runs (set `"auto": false` for fully deterministic builds). |
 | `analyze_result.json` | Static analysis output: EP compatibility, operator classification, detected patterns. |
 | `build_manifest.json` | Build provenance with stage timings. Only generated via the Python API (`build_hf_model`/`build_onnx_model`). |
 | `export_htp_metadata.json` | HTP export metadata: module hierarchy, tracing info, tagging coverage. |

From 1e5e057291db4f611fc37ecf0e5aa035a9b5f6ad Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 11:45:12 +0800
Subject: [PATCH 067/143] add CI/CD reproducibility callouts for build config

---
 docs/commands/build.md             | 3 +++
 docs/concepts/how-it-works.md      | 2 ++
 docs/getting-started/end-to-end.md | 3 +++
 3 files changed, 8 insertions(+)

diff --git a/docs/commands/build.md b/docs/commands/build.md
index addab1ee2..a825ed4a6 100644
--- a/docs/commands/build.md
+++ b/docs/commands/build.md
@@ -47,6 +47,9 @@ and applies further passes; `--no-analyze` disables it for a deterministic
 single-pass build. Individual stages can be suppressed with `--no-quant`,
 `--no-compile`, and `--no-optimize` without touching the config file.
 
+!!! tip "Reproducible CI/CD builds"
+    The config file is a portable, self-contained pipeline specification. Check it into source control and invoke `winml build -c config.json` in CI to produce identical artifacts without manual flag management. Set `"auto": false` in the config to disable the autoconf discovery loop for fully deterministic output.
+
 ## Examples
 
 ```bash
diff --git a/docs/concepts/how-it-works.md b/docs/concepts/how-it-works.md
index 098d55f57..32a3244af 100644
--- a/docs/concepts/how-it-works.md
+++ b/docs/concepts/how-it-works.md
@@ -113,6 +113,8 @@ which is convenient for one-off experiments.
 
 The config file is written (or updated) to the output directory after the optimize stage
 completes, capturing any autoconf-adjusted fusion flags so the build is reproducible.
+This persisted `winml_build_config.json` is a self-contained pipeline specification that
+you can check into version control and run in CI/CD (`winml build -c winml_build_config.json -m <model> -o output/`) for repeatable, unattended builds across environments.
 
 ## See Also
 
diff --git a/docs/getting-started/end-to-end.md b/docs/getting-started/end-to-end.md
index 44052030c..7d387c8fa 100644
--- a/docs/getting-started/end-to-end.md
+++ b/docs/getting-started/end-to-end.md
@@ -83,6 +83,9 @@ hardware and writes the winning device (NPU, GPU, or CPU) together with
 matching precision and compile settings into `convnext_config.json`. You can
 open the file to see exactly what was picked before committing to a full build.
 
+!!! tip "Config as CI/CD artifact"
+    The generated `convnext_config.json` is a self-contained, reproducible pipeline specification. Check it into version control and use it in CI/CD pipelines (`winml build -c convnext_config.json -m ... -o ...`) to guarantee identical model processing across machines and runs. Set `"auto": false` in the config for fully deterministic builds (disables the autoconf discovery loop). See [Why version a config](../concepts/config-and-build.md#why-version-a-config) for details.
+
 For a field-by-field explanation of every section in the generated JSON and how
 the `quant` and `compile` blocks interact, see
 [Config and build](../concepts/config-and-build.md).

From 526c9c3a68bf56c91a53c4568e2929f04a389a8b Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 11:48:10 +0800
Subject: [PATCH 068/143] fix EP table: add missing EPs, correct device order
 and notes

---
 docs/reference/supported-models.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/docs/reference/supported-models.md b/docs/reference/supported-models.md
index d9f792575..9054bb97f 100644
--- a/docs/reference/supported-models.md
+++ b/docs/reference/supported-models.md
@@ -125,10 +125,13 @@ Each validated model is tested against available EPs:
 
 | EP | Alias | Devices | Notes |
 |----|-------|---------|-------|
-| CPUExecutionProvider | `cpu` | CPU | Always available |
-| QNNExecutionProvider | `qnn` | NPU, GPU | Qualcomm Snapdragon; requires QNN SDK |
-| OpenVINOExecutionProvider | `openvino` | CPU, GPU, NPU | Intel hardware; install with `--extra openvino` |
+| NvTensorRTRTXExecutionProvider | `nvtensorrtrtx`, `nv_tensorrt_rtx` | GPU | NVIDIA TensorRT-RTX; NVIDIA GPU with TensorRT runtime |
+| CUDAExecutionProvider | `cuda` | GPU | NVIDIA CUDA; any CUDA-capable GPU |
+| MIGraphXExecutionProvider | `migraphx` | GPU | AMD ROCm MIGraphX |
+| QNNExecutionProvider | `qnn` | NPU, GPU | Qualcomm Snapdragon; bundled in ORT (`--compiler qairt` needs QNN SDK) |
+| OpenVINOExecutionProvider | `openvino` | NPU, GPU, CPU | Intel hardware; install with `--extra openvino` |
 | DmlExecutionProvider | `dml` | GPU | DirectML; any DirectX 12 GPU |
+| CPUExecutionProvider | `cpu` | CPU | Always available |
 | VitisAIExecutionProvider | `vitisai` | NPU | AMD/Xilinx |
 
 ---

From 16bf4261c7ae3eaf2e004050a8e9fa41082eeedc Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 11:49:44 +0800
Subject: [PATCH 069/143] simplify docs contributing page to reference repo
 CONTRIBUTING.md

---
 docs/contributing.md | 164 ++-----------------------------------------
 1 file changed, 6 insertions(+), 158 deletions(-)

diff --git a/docs/contributing.md b/docs/contributing.md
index 9f54ba909..5d49ab069 100644
--- a/docs/contributing.md
+++ b/docs/contributing.md
@@ -1,179 +1,27 @@
 # Contributing
 
-This guide covers the development workflow for contributing to winml-cli.
+For the full contributing guide — development setup, coding conventions, testing, PR checklist, and CLA — see [`CONTRIBUTING.md`](https://github.com/microsoft/winml-cli/blob/main/CONTRIBUTING.md) in the repository root.
 
----
-
-## Prerequisites
-
-| Component | Version |
-|-----------|---------|
-| Python | 3.11 (`requires-python = ">=3.11,<3.12"`) |
-| Package manager | [uv](https://github.com/astral-sh/uv) |
-| OS | Windows 11 (primary target) |
-
----
-
-## Development Setup
+## Quick Reference
 
 ```bash
+# Clone and set up
 git clone https://github.com/microsoft/winml-cli.git
 cd winml-cli
-
-# Install all dependencies including dev tools
 uv sync --extra dev
-
-# Enable pre-commit hooks
 uv run pre-commit install
-```
-
-The pre-commit hooks automatically enforce:
-
-- MIT license headers on all `.py` files
-- Trailing whitespace removal
-- End-of-file newline
-- YAML syntax validation
-- Ruff linting and formatting
-
----
-
-## Running Tests
-
-```bash
-# All unit tests
-uv run pytest tests/
 
-# Fast CI-like run (excludes hardware-dependent tests)
+# Run tests
 uv run pytest tests/ -m "not e2e and not npu and not gpu"
 
-# Specific module
-uv run pytest tests/unit/analyze
-uv run pytest tests/unit/commands
-
-# With coverage
-uv run pytest tests/ --cov=src/winml/modelkit --cov-report=html
-```
-
-**Test markers:**
-
-| Marker | Use |
-|--------|-----|
-| `@pytest.mark.unit` | Fast unit tests (default) |
-| `@pytest.mark.smoke` | Critical-path tests that must always pass |
-| `@pytest.mark.e2e` | End-to-end tests (slow, may need hardware) |
-| `@pytest.mark.npu` | Requires NPU hardware |
-| `@pytest.mark.gpu` | Requires GPU |
-| `@pytest.mark.slow` | Tests taking > 30 seconds |
-
----
-
-## Linting and Type Checking
-
-```bash
-# Lint (check only)
-uv run ruff check src/ tests/
-
-# Lint and auto-fix
+# Lint and format
 uv run ruff check src/ tests/ --fix
-
-# Format
 uv run ruff format src/ tests/
 
-# Type check
-uv run mypy src/
-
-# Run all pre-commit hooks manually
-uv run pre-commit run --all-files
-```
-
----
-
-## Code Structure
-
-```text
-src/winml/modelkit/
-├── cli.py              # Entry point (winml command group)
-├── commands/           # CLI subcommands (export, build, analyze, etc.)
-├── models/             # Model loading from HuggingFace / local
-├── export/             # ONNX export logic and HTP
-├── optimize/           # Optimization pipelines and fusion
-├── analyze/            # Analysis engine and runtime rules
-├── config/             # Build config schema and constants
-├── build/              # Pipeline orchestration
-├── compiler/           # EP compilation (EPContext)
-├── quant/              # Quantization
-├── eval/               # Evaluation metrics
-├── serve/              # FastAPI serving layer
-├── session/            # Session management
-├── core/               # Core graph abstractions
-├── cache/              # Caching utilities
-└── utils/              # Shared utilities
-
-tests/
-├── unit/               # Unit tests (organized by module)
-├── integration/        # Integration tests
-├── e2e/                # End-to-end tests
-├── regression/         # Regression suite
-├── fixtures/           # Test data and mock models
-└── conftest.py         # Shared fixtures
-```
-
----
-
-## Coding Conventions
-
-- **Line length:** 100 characters
-- **Docstrings:** Google style
-- **Strings:** Double quotes (enforced by Ruff)
-- **Type annotations:** Required for public API functions
-- **License header:** Auto-inserted by pre-commit on all `.py` files
-
-**Import order** (enforced by Ruff isort):
-
-1. `__future__`
-2. Standard library
-3. Third-party (`torch`, `transformers`, `onnx`, etc.)
-4. First-party (`winml.*`)
-5. Relative imports
-
-See the internal naming convention guide for ONNX/EP/QDQ term casing rules.
-
----
-
-## PR Checklist
-
-Before submitting a pull request:
-
-- [ ] Tests pass: `uv run pytest tests/ -m "not e2e and not npu and not gpu"`
-- [ ] Linting passes: `uv run ruff check src/ tests/`
-- [ ] Formatting is clean: `uv run ruff format --check src/ tests/`
-- [ ] Type checking passes: `uv run mypy src/`
-- [ ] New code includes unit tests (target 80%+ coverage)
-- [ ] Docs updated if public API changed
-
-**CI will run:**
-
-1. **Lint workflow** — license headers + Ruff
-2. **Test workflow** — parallelized test groups on Windows
-3. **CLA bot** — Contributor License Agreement signature
-
----
-
-## Documentation Development
-
-```bash
-# Live preview (auto-reloads)
+# Docs preview
 uv run mkdocs serve
-
-# Validate (strict mode, catches broken links)
-uv run mkdocs build --strict
 ```
 
-See [docs/README.md](https://github.com/microsoft/winml-cli/blob/main/docs/README.md)
-for authoring conventions, publishing workflow, and site structure.
-
----
-
 ## See also
 
 - [Installation](getting-started/installation.md) — user-facing setup

From a27d635619df225c339f2483f4a304e7c671e3db Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 11:51:35 +0800
Subject: [PATCH 070/143] add runtime check rules download step to contributing

---
 docs/contributing.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/contributing.md b/docs/contributing.md
index 5d49ab069..b76f8e934 100644
--- a/docs/contributing.md
+++ b/docs/contributing.md
@@ -11,6 +11,13 @@ cd winml-cli
 uv sync --extra dev
 uv run pre-commit install
 
+# Download runtime check rules (required for `winml analyze`)
+gh release download <tag> --repo microsoft/winml-cli --pattern 'rules-v*.zip' --dir .
+# Windows:
+Expand-Archive -Path .\rules-v*.zip -DestinationPath src\winml\modelkit\analyze\rules\runtime_check_rules -Force
+# Linux/macOS:
+# unzip -o rules-v*.zip -d src/winml/modelkit/analyze/rules/runtime_check_rules
+
 # Run tests
 uv run pytest tests/ -m "not e2e and not npu and not gpu"
 

From da7928153262e52f9dec339ef8cf3b705506c197 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 11:53:41 +0800
Subject: [PATCH 071/143] remove trivial 'config file empty' troubleshooting
 entry

---
 docs/troubleshooting.md | 17 -----------------
 1 file changed, 17 deletions(-)

diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index 71268e03d..7d1fd9ef7 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -6,23 +6,6 @@ Common issues and solutions when working with winml-cli.
 
 ## Build and Pipeline Errors
 
-### Config file is empty or invalid JSON
-
-```text
-UsageError: Config file is empty: config.json
-UsageError: Invalid JSON in config: Expecting value: line 1 column 1
-```
-
-**Cause:** The config file passed to `winml build -c` is empty, malformed, or not valid JSON.
-
-**Solution:** Validate the file with `python -m json.tool config.json`, or regenerate it:
-
-```bash
-uv run winml config -m <model> -d <device> -o output/
-```
-
----
-
 ### Cannot enable compilation: no compile section
 
 ```text

From 2ac1e8aafc6e1f09d4dc43fb659177b37b32ff2a Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 11:55:13 +0800
Subject: [PATCH 072/143] note that compile is off by default in winml build

---
 docs/troubleshooting.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index 7d1fd9ef7..61c145ad8 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -12,14 +12,17 @@ Common issues and solutions when working with winml-cli.
 UsageError: Cannot enable compilation: no compile section found in the config file
 ```
 
-**Cause:** You passed `--compile` but the config JSON has no `"compile"` section (it's `null`).
+**Cause:** Compilation is **off by default** in `winml build`. You passed `--compile` to explicitly enable it, but the config JSON has no `"compile"` section (it's `null`). This happens when the config was generated without a device target that supports EPContext (e.g., `--device cpu` or `--device auto` on a machine without NPU).
 
-**Solution:** Regenerate the config with compilation enabled, or add a compile section manually:
+**Solution:** Regenerate the config targeting a device that supports compilation (NPU or GPU with an EP that produces EPContext):
 
 ```bash
 uv run winml config -m <model> -d npu -o output/
 ```
 
+!!! note
+    By default `winml build` skips the compile stage unless `--compile` is passed or the config contains a non-null `"compile"` section. To include compilation in the generated config, specify a device that maps to an EPContext-capable EP (e.g., `-d npu`).
+
 ---
 
 ### Already a compiled EPContext model

From a5f63617acab86fc1294d3e93cc3b510e21ea002 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 11:57:35 +0800
Subject: [PATCH 073/143] expand unsupported nodes guidance with optim-config
 workflow

---
 docs/troubleshooting.md | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index 61c145ad8..14fa507c0 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -67,17 +67,28 @@ RuntimeError: Unsupported nodes persist after analysis
 
 **Cause:** The model contains operators that the selected EP cannot dispatch natively.
 
-**Solution:** Run `winml analyze` first to identify which operators are problematic:
+**Solution:** Run `winml analyze` with `--optim-config` to identify problematic operators and get recommended graph optimizations:
 
 ```bash
-uv run winml analyze -m model.onnx --ep qnn
+# Analyze and output optimization recommendations
+uv run winml analyze -m model.onnx --ep qnn --optim-config optim_config.json
 ```
 
-Then consider:
+This produces `optim_config.json` with the auto-discovered optimization flags. Apply them with `winml optimize`, then re-analyze:
 
-- Using a different EP (`--ep dml` or `--ep cpu`)
-- Running optimization to fuse unsupported patterns into supported ones
-- Checking if a newer opset version resolves the compatibility gap
+```bash
+# Apply recommended optimizations
+uv run winml optimize -m model.onnx -o model_optimized.onnx -c optim_config.json
+
+# Re-analyze to check if unsupported nodes are resolved
+uv run winml analyze -m model_optimized.onnx --ep qnn
+```
+
+If unsupported nodes still remain after optimization, consider:
+
+- **Manually modifying problematic nodes** — use tools like `onnx-graphsurgeon` to replace or remove operators the EP cannot handle
+- **Using a different EP** (`--ep dml` or `--ep cpu`) that supports the operators in question
+- **Checking if a newer opset version** resolves the compatibility gap (re-export with `--opset-version 18`)
 
 ---
 

From 96acb211e5faeaf520d2462d50d7c77571a5600c Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 12:01:44 +0800
Subject: [PATCH 074/143] add normalize-before-analyze troubleshooting tip

---
 docs/troubleshooting.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index 14fa507c0..f8710b4ae 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -92,6 +92,24 @@ If unsupported nodes still remain after optimization, consider:
 
 ---
 
+### Many "unknown" results from constant nodes
+
+When `winml analyze` reports a large number of nodes as "unknown", the model likely hasn't been normalized — it contains raw constant-folding subgraphs, missing shape annotations, or redundant initializer nodes that the analyzer cannot classify.
+
+**Solution:** Run `winml optimize` with no optimization flags to normalize the model (constant folding, shape inference, dead-node elimination), then re-analyze:
+
+```bash
+# Normalize only (no fusion flags)
+uv run winml optimize -m model.onnx -o model_normalized.onnx
+
+# Re-analyze — constant nodes are now folded, shapes are inferred
+uv run winml analyze -m model_normalized.onnx --ep qnn
+```
+
+This baseline pass collapses constant subgraphs into initializers and propagates tensor shapes throughout the graph, giving the analyzer enough information to classify nodes correctly.
+
+---
+
 ## Device and EP Issues
 
 ### Unknown EP or device mismatch

From c5217fa32107842433926c8e1f4c9baf07175354 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 12:03:03 +0800
Subject: [PATCH 075/143] add --compile flag to config example in
 troubleshooting

---
 docs/troubleshooting.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index f8710b4ae..529cf2abd 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -17,7 +17,7 @@ UsageError: Cannot enable compilation: no compile section found in the config fi
 **Solution:** Regenerate the config targeting a device that supports compilation (NPU or GPU with an EP that produces EPContext):
 
 ```bash
-uv run winml config -m <model> -d npu -o output/
+uv run winml config -m <model> -d npu --compile -o output/
 ```
 
 !!! note

From e7d8536071d140b01a2304e8252c302b9367773c Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 12:09:58 +0800
Subject: [PATCH 076/143] remove Device and EP Issues section from
 troubleshooting

---
 docs/troubleshooting.md | 43 -----------------------------------------
 1 file changed, 43 deletions(-)

diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index 529cf2abd..d757206ab 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -110,49 +110,6 @@ This baseline pass collapses constant subgraphs into initializers and propagates
 
 ---
 
-## Device and EP Issues
-
-### Unknown EP or device mismatch
-
-```text
-UsageError: Unknown EP: invalid_ep
-UsageError: --ep QNNExecutionProvider cannot run on --device gpu
-```
-
-**Cause:** The specified EP doesn't exist or doesn't support the requested device.
-
-**Solution:** Check available EPs on your system:
-
-```bash
-uv run winml sys --list-ep
-```
-
-Valid EP aliases: `qnn`, `openvino`, `dml`, `cpu`, `tensorrt`, `migraphx`, `vitisai`.
-
----
-
-### No NPU device detected
-
-```text
-Available Devices (priority order)
-  #1  GPU   ...
-  #2  CPU   ...
-```
-
-**Cause:** NPU driver not installed, or Windows version is too old.
-
-**Solution:**
-
-1. Verify Windows 11 24H2 or later
-2. Check for NPU driver updates in Device Manager → Neural processors
-3. Install the latest Qualcomm AI Engine Direct SDK (for Snapdragon NPUs)
-4. Re-run `uv run winml sys` to confirm
-
-!!! note
-    All winml-cli commands work without NPU hardware. Use `--device auto` to fall back to GPU or CPU.
-
----
-
 ## Quantization and Compilation Failures
 
 ### Quantization failed

From 72c340beaabed14ebe85be8d47c0c25cde67446e Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 12:13:34 +0800
Subject: [PATCH 077/143] remove Quantization/Compilation and Output/File
 sections from troubleshooting

---
 docs/troubleshooting.md | 58 -----------------------------------------
 1 file changed, 58 deletions(-)

diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index d757206ab..3da050c7d 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -110,64 +110,6 @@ This baseline pass collapses constant subgraphs into initializers and propagates
 
 ---
 
-## Quantization and Compilation Failures
-
-### Quantization failed
-
-```text
-RuntimeError: Quantization failed: [error details]
-```
-
-**Cause:** Quantization encountered an incompatible graph structure or calibration error.
-
-**Solution:**
-
-1. Add `--verbose` to see detailed error output
-2. Ensure the model has been optimized first (run `winml optimize` before `winml quantize`)
-3. Try a different calibration method: `--calibration-method entropy`
-4. Exclude problematic nodes: use `nodes_to_exclude` in the quant config
-
----
-
-### No output file produced after compile
-
-```text
-Warning: Compilation finished but no output file was written
-ClickException: No output file produced
-```
-
-**Cause:** The compiler ran but didn't generate the expected `_ctx.onnx` file. Common with DML/CPU (which don't produce EPContext).
-
-**Solution:** Verify you're targeting an EP that supports EPContext:
-
-```bash
-# Correct — QNN supports EPContext
-uv run winml compile -m model.onnx -d npu --ep qnn
-
-# Won't produce output — DML doesn't support EPContext
-uv run winml compile -m model.onnx -d gpu --ep dml
-```
-
----
-
-## Output and File Issues
-
-### Output path exists but is not a directory
-
-```text
-ValueError: Output path exists but is not a directory: output.onnx
-```
-
-**Cause:** The `-o` flag expects a directory path, but you passed a file path.
-
-**Solution:** Use a directory:
-
-```bash
-uv run winml build -c config.json -m model -o output_dir/
-```
-
----
-
 ## General Tips
 
 | Tip | Command |

From e68817b9a06cf234ed42f99181b263a06f7de273 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 12:14:18 +0800
Subject: [PATCH 078/143] add --rebuild tip to troubleshooting general tips

---
 docs/troubleshooting.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index 3da050c7d..dd21e455a 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -118,6 +118,7 @@ This baseline pass collapses constant subgraphs into initializers and propagates
 | **Check EP compatibility** | `uv run winml analyze -m model.onnx --ep <ep>` |
 | **Verbose output** | Add `-v` or `--verbose` to any command |
 | **Skip a pipeline stage** | `--no-quant`, `--no-compile`, `--no-optimize` |
+| **Force rebuild (ignore cache)** | `uv run winml build -c config.json -m <model> -o output/ --rebuild` |
 | **Regenerate config** | `uv run winml config -m <model> -d <device> -o dir/` |
 
 ## See also

From 5bc692760fe4275333cb31b3ba7dfac9275093d5 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 18:16:31 +0800
Subject: [PATCH 079/143] docs: add mike version control plugin for
 multi-version docs

---
 mkdocs.yml | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mkdocs.yml b/mkdocs.yml
index d68a7f3a5..c7ff6ccb3 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -14,6 +14,10 @@ exclude_docs: |
   /pytest-best-practices.md
   /README.md
 
+extra:
+  version:
+    provider: mike
+
 theme:
   name: material
   features:
@@ -45,6 +49,10 @@ theme:
 
 plugins:
   - search
+  - mike:
+      version_selector: true
+      css_dir: css
+      javascript_dir: js
 
 markdown_extensions:
   - admonition

From be316f399980ce1aaf36ba6e8236132aab9aee31 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 19:04:25 +0800
Subject: [PATCH 080/143] docs: reorganize troubleshooting by component,
 replace mermaid with SVG, add analyze section

---
 .github/workflows/docs.yml    | 60 +++++++++++++++++++++++++++++++----
 docs/commands/overview.md     |  2 +-
 docs/concepts/how-it-works.md | 27 ++++++++++------
 docs/troubleshooting.md       | 25 +++++++++++++--
 docs/versions.json            |  3 ++
 mkdocs.yml                    |  3 +-
 6 files changed, 99 insertions(+), 21 deletions(-)
 create mode 100644 docs/versions.json

diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
index d4d9992f5..2b2d9d5e7 100644
--- a/.github/workflows/docs.yml
+++ b/.github/workflows/docs.yml
@@ -1,22 +1,68 @@
 name: Build & Publish Docs
 
 on:
+  push:
+    branches: [main]
+    paths: ["docs/**", "mkdocs.yml"]
+  release:
+    types: [published]
   workflow_dispatch:
+    inputs:
+      version:
+        description: "Version label to deploy (e.g., 0.2). Leave empty to use 'dev'."
+        required: false
 
 permissions:
   contents: write
+  pages: write
 
 jobs:
-  build:
+  deploy:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
       - uses: astral-sh/setup-uv@v3
         with:
-          python-version: "3.10"
+          python-version: "3.11"
+
       - run: uv sync --extra dev
-      - run: uv run mkdocs build --strict
-      - uses: peaceiris/actions-gh-pages@v4
-        with:
-          github_token: ${{ secrets.GITHUB_TOKEN }}
-          publish_dir: ./site
+
+      - name: Configure git for mike
+        run: |
+          git config user.name "github-actions[bot]"
+          git config user.email "github-actions[bot]@users.noreply.github.com"
+
+      - name: Determine version
+        id: version
+        run: |
+          if [[ "${{ github.event_name }}" == "release" ]]; then
+            # Strip 'v' prefix from tag: v0.2.0 → 0.2
+            TAG="${{ github.event.release.tag_name }}"
+            VERSION="${TAG#v}"
+            VERSION="${VERSION%.*}"
+            echo "version=$VERSION" >> "$GITHUB_OUTPUT"
+            echo "alias=latest" >> "$GITHUB_OUTPUT"
+          elif [[ -n "${{ github.event.inputs.version }}" ]]; then
+            echo "version=${{ github.event.inputs.version }}" >> "$GITHUB_OUTPUT"
+            echo "alias=latest" >> "$GITHUB_OUTPUT"
+          else
+            echo "version=dev" >> "$GITHUB_OUTPUT"
+            echo "alias=" >> "$GITHUB_OUTPUT"
+          fi
+
+      - name: Deploy docs with mike
+        run: |
+          VERSION="${{ steps.version.outputs.version }}"
+          ALIAS="${{ steps.version.outputs.alias }}"
+          if [[ -n "$ALIAS" ]]; then
+            uv run mike deploy "$VERSION" "$ALIAS" --update-aliases --push
+          else
+            uv run mike deploy "$VERSION" --push
+          fi
+
+      - name: Set default version
+        if: steps.version.outputs.alias == 'latest'
+        run: uv run mike set-default latest --push
diff --git a/docs/commands/overview.md b/docs/commands/overview.md
index 34faed201..e3ca6d603 100644
--- a/docs/commands/overview.md
+++ b/docs/commands/overview.md
@@ -15,7 +15,7 @@ configuration and tunes the ONNX graph. **Build** (`export`, `quantize`,
 The typical workflow follows that order: run `winml sys` to confirm hardware
 and EPs, then `winml inspect` or `winml catalog` to verify model support. Use
 `winml config` to generate a build configuration, then `winml build` to execute
-the full pipeline — or chain `export` → `optimize` → `quantize` → `compile`
+the full pipeline — or chain `export` → `analyze` → `optimize` → `quantize` → `compile`
 individually for finer control. Close with `winml perf` and `winml eval` to
 measure speed and accuracy.
 
diff --git a/docs/concepts/how-it-works.md b/docs/concepts/how-it-works.md
index 32a3244af..b9785737d 100644
--- a/docs/concepts/how-it-works.md
+++ b/docs/concepts/how-it-works.md
@@ -13,16 +13,7 @@ programmatic entry point for `WinMLAutoModel.from_pretrained()`.
 
 ## The Pipeline at a Glance
 
-```mermaid
-flowchart TD
-    A[PyTorch / HF model] --> B[winml export]
-    O[Existing ONNX file] --> C
-    B --> C[winml optimize]
-    C --> D[winml quantize]
-    D --> E[winml compile]
-    E --> F[EP-ready ONNX]
-    F --> G[winml perf / eval]
-```
+![winml-cli workflow](../assets/workflow-only.svg)
 
 The stages run in order, and each one writes an intermediate ONNX file to the output
 directory. All intermediate artifacts are preserved so you can inspect any stage's output
@@ -30,6 +21,22 @@ or feed a pre-processed file into a later stage directly.
 
 ## Pipeline Stages
 
+### Analyze — `winml analyze`
+
+`winml analyze` performs static compatibility analysis on an ONNX graph against a target
+execution provider. It classifies every node as Supported, Partial, Unsupported, or
+Unknown — without running the model on the device. Use it before building to check if
+your model (or an intermediate artifact from any pipeline stage) will run cleanly on the
+target EP:
+
+```bash
+winml analyze -m model.onnx --ep qnn --device npu
+```
+
+Add `--optim-config optim.json` to output auto-discovered optimization recommendations
+that can be fed directly into `winml optimize`. The same analyzer also drives the
+autoconf feedback loop inside `winml build`.
+
 ### Export — `winml export`
 
 `winml export` loads a Hugging Face model (pretrained or random-weight), traces it with
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index dd21e455a..fbe602375 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -4,7 +4,7 @@ Common issues and solutions when working with winml-cli.
 
 ---
 
-## Build and Pipeline Errors
+## Compile
 
 ### Cannot enable compilation: no compile section
 
@@ -57,7 +57,7 @@ uv run winml build -c config.json -m model -o output/ --no-compile
 
 ---
 
-## Analysis and Compatibility
+## Analyze
 
 ### Unsupported nodes persist after analysis
 
@@ -110,6 +110,26 @@ This baseline pass collapses constant subgraphs into initializers and propagates
 
 ---
 
+## Build / Cache
+
+### Disk full / out of space
+
+Build artifacts (exported ONNX, optimized graphs, quantized models, compiled EPContext files) are cached under:
+
+```
+C:\Users\<user>\.cache\winml
+```
+
+This directory can grow significantly after multiple builds with large models. If you encounter disk-full errors or want to reclaim space, it is safe to delete the entire folder:
+
+```powershell
+Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\winml"
+```
+
+The next `winml build` will re-create the cache as needed. Use `--rebuild` to force a full rebuild without relying on cached intermediates.
+
+---
+
 ## General Tips
 
 | Tip | Command |
@@ -120,6 +140,7 @@ This baseline pass collapses constant subgraphs into initializers and propagates
 | **Skip a pipeline stage** | `--no-quant`, `--no-compile`, `--no-optimize` |
 | **Force rebuild (ignore cache)** | `uv run winml build -c config.json -m <model> -o output/ --rebuild` |
 | **Regenerate config** | `uv run winml config -m <model> -d <device> -o dir/` |
+| **Free disk space** | Delete `C:\Users\<user>\.cache\winml` |
 
 ## See also
 
diff --git a/docs/versions.json b/docs/versions.json
new file mode 100644
index 000000000..7b51a8ab9
--- /dev/null
+++ b/docs/versions.json
@@ -0,0 +1,3 @@
+[
+  {"version": "0.1", "title": "0.1", "aliases": ["latest"]}
+]
diff --git a/mkdocs.yml b/mkdocs.yml
index c7ff6ccb3..3b9aafba2 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -17,6 +17,7 @@ exclude_docs: |
 extra:
   version:
     provider: mike
+    default: latest
 
 theme:
   name: material
@@ -90,8 +91,8 @@ nav:
           - How winml-cli works: concepts/how-it-works.md
           - Graph and IR: concepts/graphs-and-ir.md
           - Weight and Activation: concepts/weight-and-activation.md
-          - EP and Device: concepts/eps-and-devices.md
           - Datatype and Quantization: concepts/quantization.md
+          - EP and Device: concepts/eps-and-devices.md
       - WinML CLI:
           - Primitives and pipeline: concepts/primitives-and-pipeline.md
           - Load and export: concepts/load-and-export.md

From bb3a40c67189f0667b7292b1b322c80709259c1a Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 5 Jun 2026 19:10:37 +0800
Subject: [PATCH 081/143] docs: move Analyze section after Export in
 how-it-works

---
 docs/concepts/how-it-works.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/concepts/how-it-works.md b/docs/concepts/how-it-works.md
index b9785737d..f18fa40ce 100644
--- a/docs/concepts/how-it-works.md
+++ b/docs/concepts/how-it-works.md
@@ -21,6 +21,13 @@ or feed a pre-processed file into a later stage directly.
 
 ## Pipeline Stages
 
+### Export — `winml export`
+
+`winml export` loads a Hugging Face model (pretrained or random-weight), traces it with
+torch.export or an Optimum-based exporter, and writes a portable, device-agnostic ONNX
+file. The output at this stage is a plain ONNX graph with float32 weights and no
+EP-specific nodes.
+
 ### Analyze — `winml analyze`
 
 `winml analyze` performs static compatibility analysis on an ONNX graph against a target
@@ -37,13 +44,6 @@ Add `--optim-config optim.json` to output auto-discovered optimization recommend
 that can be fed directly into `winml optimize`. The same analyzer also drives the
 autoconf feedback loop inside `winml build`.
 
-### Export — `winml export`
-
-`winml export` loads a Hugging Face model (pretrained or random-weight), traces it with
-torch.export or an Optimum-based exporter, and writes a portable, device-agnostic ONNX
-file. The output at this stage is a plain ONNX graph with float32 weights and no
-EP-specific nodes.
-
 ### Optimize — `winml optimize`
 
 `winml optimize` runs graph-level transformations on the exported ONNX: operator fusion

From a7bec60e2c0ab44bf3f01ddf6cf58ef968cd76b4 Mon Sep 17 00:00:00 2001
From: Yi Ren <reny@microsoft.com>
Date: Mon, 8 Jun 2026 11:06:12 +0800
Subject: [PATCH 082/143] docs: replace qwen3 composite sample with clip
 composite sample

Rewrite the composite-models sample around openai/clip-vit-base-patch32
(dual-encoder zero-shot image classification), demo setup/inference/shape
config via WinMLAutoModel, and update the mkdocs nav entry.
---
 docs/samples/clip-composite.md  | 160 ++++++++++++++++++++++++++++++++
 docs/samples/qwen3-composite.md | 142 ----------------------------
 mkdocs.yml                      |   2 +-
 3 files changed, 161 insertions(+), 143 deletions(-)
 create mode 100644 docs/samples/clip-composite.md
 delete mode 100644 docs/samples/qwen3-composite.md

diff --git a/docs/samples/clip-composite.md b/docs/samples/clip-composite.md
new file mode 100644
index 000000000..f0d79015a
--- /dev/null
+++ b/docs/samples/clip-composite.md
@@ -0,0 +1,160 @@
+# CLIP — Composite Models
+
+CLIP (`openai/clip-vit-base-patch32`) is a dual-encoder vision-language model: one tower encodes images, the other encodes text, and both project into a shared embedding space. winml-cli treats it as a **composite model** — a model that is split into multiple ONNX sub-models that run together at inference time. For CLIP, the two sub-models are:
+
+| Sub-model | Role | Input shape | Output (projected) |
+|-----------|------|-------------|--------------------|
+| `image-encoder` | Encodes images into embeddings | `pixel_values` `[1, 3, 224, 224]` | `image_embeds` `[1, 512]` |
+| `text-encoder` | Encodes text labels into embeddings | `input_ids` `[1, 77]` | `text_embeds` `[1, 512]` |
+
+Zero-shot classification is achieved by embedding the image and the candidate text labels, then ranking the labels by the cosine similarity between their embeddings. Splitting the towers into two ONNX graphs lets each encoder have fully static shapes (required for efficient NPU compilation) and lets you build, cache, and benchmark them independently.
+
+## Prerequisites
+
+- winml-cli installed and `winml` on your PATH.
+- A network connection to download CLIP weights from HuggingFace on first run.
+
+## Overall workflow
+
+The composite model architecture for CLIP:
+
+```mermaid
+graph LR
+    A[winml config] -->|"(clip, zero-shot-image-classification)"| B[Composite Registry]
+    B --> C[image-encoder config]
+    B --> D[text-encoder config]
+    C --> E[winml build → image-encoder.onnx]
+    D --> F[winml build → text-encoder.onnx]
+    E --> G[WinMLAutoModel]
+    F --> G
+    G -->|logits_per_image| H[Classification scores]
+```
+
+## Step 1: Generate build configs
+
+```bash
+winml config -m openai/clip-vit-base-patch32 --task zero-shot-image-classification -o clip.json
+```
+
+Because `(clip, zero-shot-image-classification)` is registered as a composite model, this command produces **two** config files — one per sub-model:
+
+- `clip_image-encoder.json` — export config using `image-feature-extraction` task
+- `clip_text-encoder.json` — export config using `feature-extraction` task
+
+Each config includes CLIP-specific optimizations (GELU fusion, LayerNorm fusion, MatMul+Add fusion, and clamp constant values).
+
+## Step 2: Build each sub-model
+
+Build both sub-models individually using their config files:
+
+```bash
+# Build the image encoder
+winml build -c clip_image-encoder.json -m openai/clip-vit-base-patch32 -o output/image-encoder
+
+# Build the text encoder
+winml build -c clip_text-encoder.json -m openai/clip-vit-base-patch32 -o output/text-encoder
+```
+
+Each `winml build` runs the full pipeline: export → optimize → quantize → compile. The output directories contain the final ONNX files ready for inference.
+
+To target a specific execution provider (e.g., QNN for NPU):
+
+```bash
+winml build -c clip_image-encoder.json -m openai/clip-vit-base-patch32 -o output/image-encoder --ep qnn
+winml build -c clip_text-encoder.json -m openai/clip-vit-base-patch32 -o output/text-encoder --ep qnn
+```
+
+## Step 3: Benchmark each sub-model
+
+```bash
+winml perf output/image-encoder -d npu
+winml perf output/text-encoder -d npu
+```
+
+This lets you identify whether the image or text encoder is the bottleneck on your target hardware.
+
+## Step 4: Run inference (Python API)
+
+There are two ways to get a ready-to-run model. Both return the same `WinMLModelForZeroShotImageClassification` — a single object that orchestrates the two encoders and combines their projected embeddings into similarity scores — so the inference code afterward is identical.
+
+**Option 1 — Load the ONNX files built in Step 2** (skips re-export/optimization). Pass a dict mapping each component name to its built `model.onnx`, plus the HF config so the composite registry can resolve `(clip, zero-shot-image-classification)`:
+
+```python
+from transformers import AutoConfig
+
+from winml.modelkit.models import WinMLAutoModel
+
+model = WinMLAutoModel.from_onnx(
+    {
+        "image-encoder": "output/image-encoder/model.onnx",
+        "text-encoder": "output/text-encoder/model.onnx",
+    },
+    task="zero-shot-image-classification",
+    hf_config=AutoConfig.from_pretrained("openai/clip-vit-base-patch32"),
+    skip_build=True,
+)
+```
+
+**Option 2 — Build both encoders from the HuggingFace model in one call.** `WinMLAutoModel.from_pretrained` detects the composite task and runs the full pipeline for each sub-model:
+
+```python
+from winml.modelkit.models import WinMLAutoModel
+
+model = WinMLAutoModel.from_pretrained(
+    "openai/clip-vit-base-patch32",
+    task="zero-shot-image-classification",
+)
+```
+
+Either way, run inference the same way — prepare an image plus candidate labels with the HF processor, then call the model:
+
+```python
+from PIL import Image
+from transformers import CLIPProcessor
+
+processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
+image = Image.open("cat.jpg")
+labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
+inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
+
+# Run both encoders and combine into per-label similarity scores
+outputs = model(**inputs)
+probs = outputs.logits_per_image.softmax(dim=-1)
+for label, p in zip(labels, probs[0].tolist()):
+    print(f"{label}: {p:.4f}")
+```
+
+The text encoder's fixed sequence length (77) is handled for you — the processor's tokens are padded or truncated to match the ONNX graph before each run.
+
+### Customizing shape config per sub-model
+
+Each encoder takes its own `shape_config`, passed through `sub_model_kwargs`. The image encoder accepts vision keys (`height`, `width`); the text encoder accepts text keys (`sequence_length`):
+
+```python
+model = WinMLAutoModel.from_pretrained(
+    "openai/clip-vit-base-patch32",
+    task="zero-shot-image-classification",
+    sub_model_kwargs={
+        "image-encoder": {"shape_config": {"height": 224, "width": 224}},
+        "text-encoder":  {"shape_config": {"sequence_length": 77}},
+    },
+)
+```
+
+## Other composite models
+
+The same composite model pattern is used for:
+
+- **SigLIP** (`google/siglip-base-patch16-224`) — dual-encoder zero-shot image classification; shares the same composite wrapper as CLIP
+- **T5** (`google-t5/t5-small`) — encoder + decoder for translation/summarization
+- **BART** (`facebook/bart-large-cnn`) — encoder + decoder for summarization and table-question-answering (TAPEX)
+- **Marian** (`Helsinki-NLP/opus-mt-en-de`) — encoder + decoder for translation
+- **Qwen3** (`Qwen/Qwen3-0.6B`) — prefill + generation decoders for text generation
+- **BLIP** (`Salesforce/blip-image-captioning-base`) — vision encoder + text decoder for image-to-text captioning
+- **Vision-encoder-decoder** (`microsoft/trocr-base-handwritten`) — vision encoder + text decoder for image-to-text (TrOCR, Donut)
+
+## See also
+
+- [BERT — Config + Build + Perf](bert-config-build.md) — single-model workflow
+- [ConvNeXt — Primitive commands](convnext-primitives.md) — step-by-step pipeline
+- [Config and build](../concepts/config-and-build.md) — concept overview
diff --git a/docs/samples/qwen3-composite.md b/docs/samples/qwen3-composite.md
deleted file mode 100644
index 5545ba30c..000000000
--- a/docs/samples/qwen3-composite.md
+++ /dev/null
@@ -1,142 +0,0 @@
-# Qwen3 — Composite Models
-
-Qwen3 (`Qwen/Qwen3-0.6B`, `Qwen/Qwen3-1.7B`, etc.) is a decoder-only large language model that uses grouped-query attention and a sliding-window KV cache. winml-cli treats it as a **composite model** — a model that is split into multiple ONNX sub-models that run together at inference time. For Qwen3, the two sub-models are:
-
-| Sub-model | Role | Input shape (`input_ids`) | Output KV shape |
-|-----------|------|--------------------------|-----------------|
-| `decoder_prefill` | Processes the full prompt in chunks | `[1, 64]` | `[1, kv_heads, 64, head_dim]` |
-| `decoder_gen` | Generates one token at a time | `[1, 1]` | `[1, kv_heads, 1, head_dim]` |
-
-Both sub-models share the same weights and KV cache buffer. Splitting prefill from generation lets each ONNX graph have fully static shapes, which is required for efficient NPU compilation.
-
-## Prerequisites
-
-- winml-cli installed and `winml` on your PATH.
-- A network connection to download Qwen3 weights from HuggingFace on first run.
-- At least 4 GB free disk space (for `Qwen3-0.6B`; larger variants need more).
-
-## Step 1: Generate build configs
-
-```bash
-winml config -m Qwen/Qwen3-0.6B --task text-generation -o qwen3.json
-```
-
-Because `(qwen3, text-generation)` is registered as a composite model, this command produces **two** config files — one per sub-model:
-
-- `qwen3_decoder_prefill.json` — export config using `feature-extraction` task
-- `qwen3_decoder_gen.json` — export config using `text2text-generation` task
-
-Each config includes Qwen3-specific optimizations (dynamo export, opset 18, GeLU fusion, RMSNorm fusion, MatMul+Add fusion, clamp constant values, and remove-IsNaN-in-attention-mask).
-
-## Step 2: Build each sub-model
-
-Build both sub-models individually using their config files:
-
-```bash
-# Build the prefill sub-model
-winml build -c qwen3_decoder_prefill.json -m Qwen/Qwen3-0.6B -o output/prefill
-
-# Build the generation sub-model
-winml build -c qwen3_decoder_gen.json -m Qwen/Qwen3-0.6B -o output/gen
-```
-
-Each `winml build` runs the full pipeline: export (via torch dynamo) → optimize → quantize → compile. The output directories contain the final ONNX files ready for inference.
-
-To target a specific execution provider (e.g., QNN for NPU):
-
-```bash
-winml build -c qwen3_decoder_prefill.json -m Qwen/Qwen3-0.6B -o output/prefill --ep qnn
-winml build -c qwen3_decoder_gen.json -m Qwen/Qwen3-0.6B -o output/gen --ep qnn
-```
-
-## Step 3: Benchmark each sub-model
-
-```bash
-winml perf output/prefill -d npu
-winml perf output/gen -d npu
-```
-
-This lets you identify whether the prefill or generation phase is the bottleneck on your target hardware.
-
-## Step 4: Run inference (Python API)
-
-The `WinMLQwen3Model` class combines both sub-models into a single generation pipeline that implements HuggingFace's `GenerationMixin` interface:
-
-```python
-from winml.modelkit.models.hf.qwen import WinMLQwen3Model
-
-# Build and load both sub-models in one call
-model = WinMLQwen3Model.from_pretrained("Qwen/Qwen3-0.6B", task="text-generation")
-
-# Or load pre-built ONNX files (skips re-export/optimization)
-from winml.modelkit.models.auto import WinMLAutoModel
-from transformers import AutoConfig
-
-prefill = WinMLAutoModel.from_pretrained("output/prefill/model.onnx", skip_build=True)
-gen = WinMLAutoModel.from_pretrained("output/gen/model.onnx", skip_build=True)
-config = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")
-
-model = WinMLQwen3Model(
-    sub_models={"decoder_prefill": prefill, "decoder_gen": gen},
-    config=config,
-)
-
-# Generate text using HF's standard generate() API
-from transformers import AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
-inputs = tokenizer("Hello, how are you?", return_tensors="pt")
-output_ids = model.generate(**inputs, max_new_tokens=50)
-print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
-```
-
-### Customizing shape config per sub-model
-
-You can pass different `shape_config` to each sub-model via `sub_model_kwargs`:
-
-```python
-model = WinMLQwen3Model.from_pretrained(
-    "Qwen/Qwen3-0.6B",
-    task="text-generation",
-    sub_model_kwargs={
-        "decoder_prefill": {"shape_config": {"max_cache_len": 512, "seq_len": 64}},
-        "decoder_gen":     {"shape_config": {"max_cache_len": 512, "seq_len": 1}},
-    },
-)
-```
-
-## How it works internally
-
-The composite model architecture for Qwen3:
-
-```mermaid
-graph LR
-    A[winml config] -->|"(qwen3, text-generation)"| B[Composite Registry]
-    B --> C[decoder_prefill config]
-    B --> D[decoder_gen config]
-    C --> E[winml build → prefill.onnx]
-    D --> F[winml build → gen.onnx]
-    E --> G[WinMLQwen3Model]
-    F --> G
-    G -->|GenerationMixin.generate| H[Token output]
-```
-
-Key design decisions:
-
-- **Dynamo export required** — TorchScript fails for Qwen3; dynamo produces opset 18 graphs.
-- **Sliding-window KV cache** — Uses Slice+Concat (FIFO) instead of index_copy_. New KV tokens are appended at the end of the buffer; oldest tokens are evicted.
-- **Static shapes throughout** — Both sub-models have fixed input/output shapes, enabling ahead-of-time compilation for NPU.
-- **GenericTask registration** — Sub-models are registered as `WinMLModelForGenericTask` so their raw ONNX outputs (logits + KV) are preserved without task-specific post-processing.
-
-## Other composite models
-
-The same composite model pattern is used for:
-
-- **T5** (`google-t5/t5-small`) — encoder + decoder architecture for translation/summarization
-- **Mu2** — encoder-decoder with custom code (`trust_remote_code=True`)
-
-## See also
-
-- [BERT — Config + Build + Perf](bert-config-build.md) — single-model workflow
-- [ConvNeXt — Primitive commands](convnext-primitives.md) — step-by-step pipeline
-- [Config and build](../concepts/config-and-build.md) — concept overview
diff --git a/mkdocs.yml b/mkdocs.yml
index 3b9aafba2..bd00618cd 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -122,7 +122,7 @@ nav:
   - Samples:
       - ConvNeXt — Primitives Walkthrough: samples/convnext-primitives.md
       - BERT — Config + Build + Perf: samples/bert-config-build.md
-      - Qwen3 — Composite Models: samples/qwen3-composite.md
+      - CLIP — Composite Models: samples/clip-composite.md
   - Tutorials:
       - Overview: tutorials/index.md
       - ConvNeXt on NPU: tutorials/npu-convnext.md

From b5cd5c19f17b061f73119f95409c7e49c6bd1422 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Mon, 8 Jun 2026 14:51:22 +0800
Subject: [PATCH 083/143] docs: remove internal planning files (superpowers/,
 naming-convention, pytest-best-practices, README), add Privacy to nav

---
 docs/README.md                                |  130 -
 docs/naming-convention.md                     |  104 -
 docs/pytest-best-practices.md                 | 2832 -----------------
 .../superpowers/2026-05-26-v3-known-issues.md |  102 -
 .../analyze-and-optimize.md                   |   43 -
 .../2026-05-27-doc-issues/analyze.md          |   44 -
 .../bert-config-build.md                      |   96 -
 .../2026-05-27-doc-issues/build.md            |   37 -
 .../compile-and-epcontext.md                  |   38 -
 .../2026-05-27-doc-issues/compile.md          |   31 -
 .../2026-05-27-doc-issues/config-and-build.md |   57 -
 .../2026-05-27-doc-issues/config.md           |   43 -
 .../convnext-primitives.md                    |   82 -
 .../2026-05-27-doc-issues/end-to-end.md       |   15 -
 .../2026-05-27-doc-issues/eps-and-devices.md  |   26 -
 .../eval-and-datasets.md                      |   18 -
 .../superpowers/2026-05-27-doc-issues/eval.md |   23 -
 .../2026-05-27-doc-issues/export.md           |   32 -
 .../2026-05-27-doc-issues/graphs-and-ir.md    |   21 -
 .../2026-05-27-doc-issues/how-it-works.md     |   22 -
 docs/superpowers/2026-05-27-doc-issues/hub.md |   37 -
 .../2026-05-27-doc-issues/index.md            |   24 -
 .../2026-05-27-doc-issues/inspect.md          |   35 -
 .../2026-05-27-doc-issues/installation.md     |   13 -
 .../2026-05-27-doc-issues/load-and-export.md  |   43 -
 .../2026-05-27-doc-issues/npu-convnext.md     |   18 -
 .../2026-05-27-doc-issues/optimize.md         |   48 -
 .../2026-05-27-doc-issues/overview.md         |   17 -
 .../perf-and-monitoring.md                    |   18 -
 .../superpowers/2026-05-27-doc-issues/perf.md |   41 -
 .../primitives-and-pipeline.md                |   32 -
 .../2026-05-27-doc-issues/quantization.md     |   29 -
 .../2026-05-27-doc-issues/quantize.md         |   33 -
 .../2026-05-27-doc-issues/quickstart.md       |   13 -
 .../2026-05-27-doc-issues/qwen3-composite.md  |   43 -
 docs/superpowers/2026-05-27-doc-issues/sys.md |   33 -
 .../2026-05-27-doc-issues/tutorials-index.md  |   27 -
 .../weight-and-activation.md                  |   18 -
 .../2026-05-27-validated-issues.md            |  151 -
 .../plans/2026-05-20-modelkit-docs-site.md    | 1031 ------
 .../plans/2026-05-24-docs-expansion-v2.md     |  996 ------
 .../2026-05-20-modelkit-docs-site-design.md   |  239 --
 .../2026-05-24-docs-expansion-v2-design.md    |  263 --
 mkdocs.yml                                    |    5 +-
 44 files changed, 1 insertion(+), 7002 deletions(-)
 delete mode 100644 docs/README.md
 delete mode 100644 docs/naming-convention.md
 delete mode 100644 docs/pytest-best-practices.md
 delete mode 100644 docs/superpowers/2026-05-26-v3-known-issues.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/analyze-and-optimize.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/analyze.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/bert-config-build.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/build.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/compile-and-epcontext.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/compile.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/config-and-build.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/config.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/convnext-primitives.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/end-to-end.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/eps-and-devices.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/eval-and-datasets.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/eval.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/export.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/graphs-and-ir.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/how-it-works.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/hub.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/index.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/inspect.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/installation.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/load-and-export.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/npu-convnext.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/optimize.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/overview.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/perf-and-monitoring.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/perf.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/primitives-and-pipeline.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/quantization.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/quantize.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/quickstart.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/qwen3-composite.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/sys.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/tutorials-index.md
 delete mode 100644 docs/superpowers/2026-05-27-doc-issues/weight-and-activation.md
 delete mode 100644 docs/superpowers/2026-05-27-validated-issues.md
 delete mode 100644 docs/superpowers/plans/2026-05-20-modelkit-docs-site.md
 delete mode 100644 docs/superpowers/plans/2026-05-24-docs-expansion-v2.md
 delete mode 100644 docs/superpowers/specs/2026-05-20-modelkit-docs-site-design.md
 delete mode 100644 docs/superpowers/specs/2026-05-24-docs-expansion-v2-design.md

diff --git a/docs/README.md b/docs/README.md
deleted file mode 100644
index 6fa5d68bf..000000000
--- a/docs/README.md
+++ /dev/null
@@ -1,130 +0,0 @@
-# Contributing to winml-cli docs
-
-This folder hosts the source for the [winml-cli](https://github.com/microsoft/winml-cli) documentation site, built with [MkDocs Material](https://squidfunk.github.io/mkdocs-material/).
-
-## Quick reference
-
-| Task | Command |
-|---|---|
-| Install dev deps | `uv sync --extra dev` |
-| Live preview | `uv run mkdocs serve` |
-| Build for CI | `uv run mkdocs build --strict` |
-| Publish (one-shot from laptop) | `uv run mkdocs gh-deploy --force` |
-| Publish (CI workflow) | GitHub Actions → "Build & Publish Docs" → Run workflow |
-
-## What's in here
-
-```
-docs/
-├── index.md                          ← landing page
-├── getting-started/                  ← 3 onboarding pages
-├── concepts/                         ← 12 conceptual pages in two sub-groups
-│   ├── how-it-works.md, graphs-and-ir.md, weight-and-activation.md,
-│   │     eps-and-devices.md, quantization.md         (Fundamentals)
-│   └── primitives-and-pipeline.md, load-and-export.md, analyze-and-optimize.md,
-│         compile-and-epcontext.md, perf-and-monitoring.md, eval-and-datasets.md,
-│         config-and-build.md                         (WinML CLI workflows)
-├── commands/                         ← per-command reference (overview + 12 commands)
-├── samples/                          ← reference-style walkthroughs
-├── tutorials/                        ← classroom-style walkthroughs
-├── reference/                        ← P2 stubs
-├── troubleshooting.md                ← P2 stub
-├── contributing.md                   ← P2 stub
-│
-├── superpowers/                      ← specs, plans, review notes (excluded from build)
-├── design/                           ← internal ADRs and design docs (excluded)
-├── naming-convention.md              ← internal style guide (excluded)
-└── pytest-best-practices.md          ← internal style guide (excluded)
-```
-
-The site config (`mkdocs.yml`) lives at the repo root, not inside `docs/`. The build outputs to `site/` (gitignored).
-
-## Local development
-
-### Prerequisites
-
-Python 3.10+ and [uv](https://github.com/astral-sh/uv).
-
-### Setup and preview
-
-```bash
-# from the repo root
-uv sync --extra dev
-uv run mkdocs serve
-```
-
-Open http://127.0.0.1:8000/ in a browser. The server auto-reloads when you edit any `.md` file under `docs/`. Changes to `mkdocs.yml` (nav, theme, plugins) require a manual server restart.
-
-### Validate before pushing
-
-```bash
-uv run mkdocs build --strict
-```
-
-`--strict` must exit 0 with no `WARNING` lines. Common causes of strict-mode failures:
-
-- A new page added without an entry in `nav:` (gives a "not included in nav" warning)
-- A nav entry pointing at a file that doesn't exist
-- A relative link like `[text](other-page.md)` whose target file is missing
-- A markdown anchor like `[link](#section-heading)` that doesn't match any heading slug
-
-## Publishing
-
-The site publishes to **GitHub Pages** from the `gh-pages` branch. The repo's `Settings → Pages` source is set to "Deploy from a branch" → `gh-pages` → `/ (root)`.
-
-### One-shot publish from your laptop
-
-```bash
-uv run mkdocs gh-deploy --force
-```
-
-This builds the site locally, commits the static HTML to a local `gh-pages` branch, and force-pushes it to `origin/gh-pages`. GitHub Pages picks up the new commit within ~30–60 seconds.
-
-### Publish via CI
-
-The workflow at `.github/workflows/docs.yml` does the same thing in CI:
-
-1. `Settings → Actions → Build & Publish Docs → Run workflow`
-2. Select the branch you want to publish from (typically `main`)
-
-The workflow is `workflow_dispatch` only — there is no automatic publish on push. If you want auto-publish on every push to `main`, change the trigger:
-
-```yaml
-on:
-  push:
-    branches: [main]
-    paths:
-      - 'docs/**'
-      - 'mkdocs.yml'
-      - 'pyproject.toml'
-      - '.github/workflows/docs.yml'
-  workflow_dispatch:
-```
-
-## Authoring conventions
-
-- **Product name**: `winml-cli` (lowercase, hyphenated) throughout user-facing prose. Use `WinML CLI` (or `Windows ML`) only where the broader Microsoft brand is meant.
-- **Command name**: the CLI invocation is always `winml <subcommand>`. Never `wmk`.
-- **Flag verification**: every flag mentioned in docs must exist in `src/winml/modelkit/commands/<cmd>.py`. Run `uv run winml <cmd> --help` to confirm.
-- **Source citations**: when documenting source-grounded behavior (e.g., "the default opset is 17"), cite the file path and ideally the symbol name. Avoid line numbers — they drift fast.
-- **Mermaid diagrams**: use `pymdownx.superfences` syntax (already configured in `mkdocs.yml`).
-- **Tabbed code blocks**: use `pymdownx.tabbed` (`=== "Label"` followed by a blank line and 4-space-indented code block).
-- **Admonitions**: `!!! note "Title"`, `!!! warning "Title"`, `!!! info "Title"`.
-- **No emojis** in pages unless they're part of an external attribution (e.g., a GitHub badge).
-
-## Excluded paths
-
-The following are present in `docs/` but **excluded from the published site** via the `exclude_docs:` block in `mkdocs.yml`. They are kept in-repo for contributors:
-
-- `docs/design/` — internal architecture decision records and design notes
-- `docs/superpowers/` — specs, plans, and review notes accumulated during doc development
-- `docs/naming-convention.md` — internal naming conventions for code review
-- `docs/pytest-best-practices.md` — internal testing style guide
-
-If you add new internal-only content, either place it under one of these excluded paths or add a new entry to `exclude_docs` in `mkdocs.yml`.
-
-## See also
-
-- [MkDocs Material reference](https://squidfunk.github.io/mkdocs-material/reference/)
-- [MkDocs Material navigation setup](https://squidfunk.github.io/mkdocs-material/setup/setting-up-navigation/)
-- [MkDocs Material color palette](https://squidfunk.github.io/mkdocs-material/setup/changing-the-colors/)
diff --git a/docs/naming-convention.md b/docs/naming-convention.md
deleted file mode 100644
index f1cd3a9a5..000000000
--- a/docs/naming-convention.md
+++ /dev/null
@@ -1,104 +0,0 @@
-# WinML CLI Naming Convention
-
-This document defines the naming rules for the WinML CLI codebase. All new code and refactored code must follow these conventions.
-
-## 1. Acronyms in Class Names
-
-Domain acronyms in PascalCase class names **retain their uppercase form**, except for two-letter abbreviations used as generic prefixes.
-
-### Canonical Acronym Table
-
-| Acronym | Meaning | Class Casing | Example |
-|---------|---------|--------------|---------|
-| ONNX | Open Neural Network Exchange | `ONNX` | `ONNXStaticAnalyzer`, `ONNXLoader` |
-| EP | Execution Provider | `EP` | `EPChecker`, `EPConfig`, `EPMonitor` |
-| QDQ | Quantize-Dequantize | `QDQ` | `QDQParameterConfig`, `QDQGenerator` |
-| QNN | Qualcomm Neural Network | `QNN` | `QNNMonitor` |
-| Op | Operator (2-letter prefix) | `Op` | `OpUnsupportedError` |
-| IO | Input/Output | `IO` | `IOConfigInfo` |
-| HTP | Hexagon Tensor Processor | `HTP` | `HTPConfig`, `HTPExporter`, `HTPMetadataBuilder` |
-
-### Why `Op` Not `OP`
-
-Two-letter acronyms used as **class name prefixes** use PascalCase:
-
-- `OPUnsupported` reads ambiguously as three tokens (O-P-Unsupported)
-- `OpUnsupported` reads clearly as two tokens (Op-Unsupported)
-- Consistent with conventions like `Id` vs `ID`
-
-All-caps is acceptable in **constants** (e.g., `SUPPORTED_OPS`).
-
-### Canonical Execution Provider Names
-
-Execution providers appear mainly in constants, EP-name strings, and config keys rather than as class prefixes. Each EP has a fixed canonical short name (used in our code) and an ORT full name (the `*ExecutionProvider` symbol).
-
-| Short name | ORT full name | Device | Vendor / Notes |
-|------------|---------------|--------|----------------|
-| `CPU` | `CPUExecutionProvider` | CPU | Default fallback. |
-| `CUDA` | `CUDAExecutionProvider` | GPU | NVIDIA. All caps. |
-| `DML` | `DmlExecutionProvider` | GPU | DirectML. Use `DML` in our code; do not write `DirectML` as the EP name. |
-| `MIGraphX` | `MIGraphXExecutionProvider` | GPU | AMD. Exact casing (mixed case). |
-| `NvTensorRTRTX` | `NvTensorRTRTXExecutionProvider` | GPU | NVIDIA TensorRT-RTX. Exact casing; do not shorten to `TensorRT`. |
-| `OpenVINO` | `OpenVINOExecutionProvider` | CPU / GPU / NPU | Intel. Exact casing. Alias: `ov`. |
-| `QNN` | `QNNExecutionProvider` | NPU | Qualcomm. All caps. |
-| `VitisAI` | `VitisAIExecutionProvider` | NPU | AMD Ryzen AI. Exact casing. Alias: `vitis`. |
-
-### Other Canonical Identifiers
-
-| Token | Meaning | Notes |
-|-------|---------|-------|
-| `HF_` | HuggingFace (constant/variable prefix) | e.g., `HF_MODEL_CLASS_MAPPING`, `HF_TASK_DEFAULTS`. Not used as a class prefix. |
-
-## 2. Module and Package Names
-
-Follow PEP 8: all lowercase with underscores.
-
-```
-correct:   onnx_op.py, ep_checker.py, qdq_fix.py
-wrong:     OnnxOp.py, EP_Checker.py
-```
-
-## 3. Function and Method Names
-
-Snake_case, lowercase.
-
-```
-correct:   normalize_ep_name(), generate_build_config()
-wrong:     normalizeEPName(), GenerateBuildConfig()
-```
-
-## 4. Constants
-
-UPPER_CASE with underscores.
-
-```
-correct:   SUPPORTED_EPS, EP_ALIASES, DEVICE_TO_DEVICE_TYPE
-wrong:     supportedEps, ep_aliases
-```
-
-## 5. Directory Abbreviation Policy
-
-The codebase uses a mix of abbreviated and full directory names. The established names are frozen — do not rename existing directories for consistency alone. For **new** directories, prefer full names unless the abbreviation is widely recognized in the domain (e.g., `optim`, `eval`, `quant`).
-
-| Established Abbreviation | Full Form |
-|---|---|
-| `optim` | optimization |
-| `quant` | quantization |
-| `eval` | evaluation |
-| `sysinfo` | system information |
-| `optracing` | operator tracing |
-
-## 6. Avoid Name Collisions Across Hierarchy
-
-Do not reuse a parent or sibling package name at a deeper level. When creating new subpackages, verify the name does not already exist elsewhere in the tree.
-
-Known collisions to be aware of:
-
-| Name | Locations | Issue |
-|---|---|---|
-| `winml` | top-level namespace, `modelkit/winml.py`, `models/winml/` | 3-level collision |
-| `core` | `modelkit/core/`, `analyze/core/` | same name, different content |
-| `models` | `modelkit/models/`, `analyze/models/` | ML models vs data models |
-| `utils` | `modelkit/utils/`, `analyze/utils/` | no shared content |
-| `pattern` | `modelkit/pattern/`, `analyze/pattern/` | active vs near-empty |
-| `inspect` | `modelkit/inspect/` | shadows Python stdlib |
diff --git a/docs/pytest-best-practices.md b/docs/pytest-best-practices.md
deleted file mode 100644
index 30142d2df..000000000
--- a/docs/pytest-best-practices.md
+++ /dev/null
@@ -1,2832 +0,0 @@
-# Complete Pytest Best Practices Guide (2025)
-
-A comprehensive guide covering all aspects of pytest, from basic usage to advanced patterns and project organization.
-
-## Table of Contents
-
-1. [Project Structure & Organization](#project-structure--organization)
-2. [Test Discovery & Naming Conventions](#test-discovery--naming-conventions)
-3. [Fixtures: The Heart of Pytest](#fixtures-the-heart-of-pytest)
-4. [Markers & Test Categorization](#markers--test-categorization)
-5. [Parametrization: Data-Driven Testing](#parametrization-data-driven-testing)
-6. [Assertions & Error Handling](#assertions--error-handling)
-7. [Configuration & Settings](#configuration--settings)
-8. [Conftest.py: Shared Test Logic](#conftest-py-shared-test-logic)
-9. [Mocking & Monkeypatching](#mocking--monkeypatching)
-10. [Database Testing Patterns](#database-testing-patterns)
-11. [Performance & Optimization](#performance--optimization)
-12. [CI/CD Integration](#cicd-integration)
-13. [Plugin Ecosystem](#plugin-ecosystem)
-14. [Snapshot & Regression Testing](#snapshot--regression-testing)
-15. [Property-Based Testing with Hypothesis](#property-based-testing-with-hypothesis)
-16. [Test Asset Generation & Management](#test-asset-generation--management)
-17. [Common Patterns & Anti-Patterns](#common-patterns--anti-patterns)
-18. [Debugging & Troubleshooting](#debugging--troubleshooting)
-19. [Best Practices Checklist](#best-practices-checklist)
-
----
-
-## Project Structure & Organization
-
-### Recommended Layout
-
-```
-project/
-├── src/                        # Source code
-│   └── myproject/
-│       ├── __init__.py
-│       ├── core/
-│       │   ├── __init__.py
-│       │   └── engine.py
-│       ├── utils/
-│       │   ├── __init__.py
-│       │   └── helpers.py
-│       └── api/
-│           ├── __init__.py
-│           └── endpoints.py
-├── tests/                      # Test directory
-│   ├── __init__.py            # Makes tests a package (optional - see note below)
-│   ├── conftest.py            # Shared fixtures and configuration
-│   ├── unit/                  # Unit tests
-│   │   ├── __init__.py
-│   │   ├── test_engine.py
-│   │   └── test_helpers.py
-│   ├── integration/           # Integration tests
-│   │   ├── __init__.py
-│   │   └── test_api.py
-│   ├── e2e/                   # End-to-end tests
-│   │   ├── __init__.py
-│   │   └── test_workflows.py
-│   └── fixtures/              # Shared test data/utilities
-│       ├── __init__.py
-│       └── test_data.py
-├── pyproject.toml            # Modern Python project config (preferred)
-├── pytest.ini                 # Legacy pytest configuration (avoid)
-├── .coveragerc               # Coverage configuration
-└── tox.ini                   # Multiple environment testing
-```
-
-### Key Principles
-
-1. **Mirror Source Structure**: Test directory structure should mirror your source code
-2. **Separate Test Types**: Keep unit, integration, and e2e tests in separate directories
-3. **`__init__.py` in Tests**: Optional - use only when you need to import between test modules (see detailed explanation below)
-4. **Centralize Fixtures**: Use `conftest.py` for shared fixtures
-
-### Should You Use `__init__.py` in Test Directories?
-
-The use of `__init__.py` in test directories is **optional** and depends on your specific needs:
-
-#### When to USE `__init__.py` in tests ✅
-
-1. **Cross-test imports**: When you need to import helper functions or classes between test modules
-   ```python
-   # tests/unit/test_user.py
-   from tests.helpers.factories import UserFactory  # Requires __init__.py
-   ```
-
-2. **Test utilities as a package**: When you have reusable test utilities that need to be imported
-   ```
-   tests/
-   ├── __init__.py
-   ├── helpers/
-   │   ├── __init__.py
-   │   ├── factories.py
-   │   └── assertions.py
-   ```
-
-3. **Namespace packages**: When you need to avoid naming conflicts with application modules
-   ```python
-   # Disambiguates tests.models from myapp.models
-   from tests.models import TestUser
-   from myapp.models import User
-   ```
-
-#### When NOT to use `__init__.py` in tests ❌
-
-1. **Simple test structures**: Most projects don't need it - pytest discovers tests without it
-2. **Import mode conflicts**: Can cause issues with pytest's import mechanisms
-3. **Accidental test collection**: May cause pytest to collect non-test files
-
-#### Best Practice Recommendation
-
-**Default approach**: Start WITHOUT `__init__.py` in test directories. Only add it when you have a specific need for cross-test imports or test utilities.
-
-```
-# Recommended minimal structure
-tests/
-├── conftest.py          # Shared fixtures (no __init__.py needed)
-├── unit/
-│   └── test_models.py   # Tests work without __init__.py
-└── integration/
-    └── test_api.py
-```
-
-#### pytest.ini Configuration for Import Issues
-
-If you encounter import issues, configure pytest's import mode instead of adding `__init__.py`:
-
-```ini
-# pytest.ini
-[pytest]
-# Use importlib mode for better import handling
-import_mode = importlib
-
-# Or use prepend mode (default)
-import_mode = prepend
-```
-
-### Alternative Layouts
-
-#### Tests Outside Application Code (Recommended)
-```
-project/
-├── src/myproject/
-└── tests/
-```
-
-#### Tests as Part of Application (Less Common)
-```
-project/
-└── myproject/
-    ├── core/
-    │   ├── engine.py
-    │   └── tests/
-    │       └── test_engine.py
-    └── utils/
-        ├── helpers.py
-        └── tests/
-            └── test_helpers.py
-```
-
----
-
-## Test Discovery & Naming Conventions
-
-### Default Discovery Rules
-
-Pytest automatically discovers tests following these patterns:
-
-- **Test files**: `test_*.py` or `*_test.py`
-- **Test classes**: `Test*` (must not have an `__init__` method)
-- **Test functions**: `test_*`
-- **Test methods**: `test_*` inside `Test*` classes
-
-### Naming Best Practices
-
-```python
-# ❌ Bad: Unclear test names
-def test_1():
-    pass
-
-def test_user():
-    pass
-
-def test_function():
-    pass
-
-# ✅ Good: Descriptive test names
-def test_user_creation_with_valid_email():
-    """Test that a user can be created with a valid email address."""
-    pass
-
-def test_user_creation_fails_with_duplicate_email():
-    """Test that creating a user with an existing email raises an error."""
-    pass
-
-def test_password_reset_sends_email_to_registered_user():
-    """Test that password reset email is sent to registered users."""
-    pass
-```
-
-### Test Class Organization
-
-```python
-class TestUserAuthentication:
-    """Test cases for user authentication functionality."""
-
-    def test_login_with_valid_credentials_returns_token(self):
-        """Test successful login returns authentication token."""
-        pass
-
-    def test_login_with_invalid_password_returns_401(self):
-        """Test login with wrong password returns 401 status."""
-        pass
-
-    def test_login_with_nonexistent_user_returns_404(self):
-        """Test login with non-existent user returns 404 status."""
-        pass
-```
-
-### Custom Discovery Configuration
-
-```ini
-# pytest.ini
-[pytest]
-# Custom patterns for test discovery
-python_files = test_*.py check_*.py
-python_classes = Test* Check*
-python_functions = test_* check_*
-
-# Ignore specific directories
-norecursedirs = .git .tox build dist *.egg
-```
-
----
-
-## Fixtures: The Heart of Pytest
-
-### Basic Fixture Concepts
-
-```python
-import pytest
-
-# Simple fixture
-@pytest.fixture
-def sample_data():
-    """Provide sample data for tests."""
-    return {"name": "John", "age": 30}
-
-# Fixture with teardown
-@pytest.fixture
-def database_connection():
-    """Create database connection and clean up after test."""
-    conn = create_connection()
-    yield conn  # This is where the test runs
-    conn.close()  # Teardown happens after test
-
-# Using fixtures in tests
-def test_user_data(sample_data):
-    assert sample_data["name"] == "John"
-```
-
-### Fixture Scopes
-
-```python
-# Function scope (default) - run once per test function
-@pytest.fixture(scope="function")
-def function_resource():
-    return expensive_setup()
-
-# Class scope - run once per test class
-@pytest.fixture(scope="class")
-def class_resource():
-    return expensive_setup()
-
-# Module scope - run once per module
-@pytest.fixture(scope="module")
-def module_resource():
-    return expensive_setup()
-
-# Session scope - run once per test session
-@pytest.fixture(scope="session")
-def session_resource():
-    return expensive_setup()
-
-# Package scope - run once per package
-@pytest.fixture(scope="package")
-def package_resource():
-    return expensive_setup()
-```
-
-### Advanced Fixture Patterns
-
-#### Factory Fixtures
-```python
-@pytest.fixture
-def make_user():
-    """Factory fixture for creating users."""
-    created_users = []
-
-    def _make_user(name, email=None):
-        user = User(name=name, email=email or f"{name}@example.com")
-        created_users.append(user)
-        return user
-
-    yield _make_user
-
-    # Cleanup all created users
-    for user in created_users:
-        user.delete()
-
-def test_user_interactions(make_user):
-    alice = make_user("alice")
-    bob = make_user("bob", "bob@company.com")
-    assert alice.can_message(bob)
-```
-
-#### Parametrized Fixtures
-```python
-@pytest.fixture(params=["sqlite", "postgresql", "mysql"])
-def database(request):
-    """Test with multiple database backends."""
-    return setup_database(request.param)
-
-def test_query_performance(database):
-    # This test runs three times, once for each database
-    result = database.execute("SELECT * FROM users")
-    assert result.execution_time < 100  # ms
-```
-
-#### Dynamic Fixture Scope
-```python
-def determine_scope(fixture_name, config):
-    """Dynamically determine fixture scope based on config."""
-    if config.getoption("--quick", None):
-        return "session"  # Reuse fixtures for speed
-    return "function"    # Fresh fixtures for isolation
-
-@pytest.fixture(scope=determine_scope)
-def api_client():
-    return APIClient()
-```
-
-#### Fixture Dependencies
-```python
-@pytest.fixture
-def config():
-    return load_config()
-
-@pytest.fixture
-def database(config):
-    return Database(config["db_url"])
-
-@pytest.fixture
-def api_client(config, database):
-    # Fixtures can depend on other fixtures
-    return APIClient(config["api_url"], database)
-```
-
-### Auto-use Fixtures
-
-```python
-@pytest.fixture(autouse=True)
-def reset_global_state():
-    """Automatically run before each test without explicit request."""
-    clear_caches()
-    reset_singletons()
-    yield
-    # Cleanup happens after test
-
-@pytest.fixture(autouse=True, scope="session")
-def configure_test_environment():
-    """Set up test environment once for entire session."""
-    os.environ["TESTING"] = "true"
-    configure_logging("debug")
-```
-
-### Fixture Finalization
-
-```python
-@pytest.fixture
-def resource_with_finalizer(request):
-    """Using request.addfinalizer for cleanup."""
-    resource = acquire_resource()
-
-    def cleanup():
-        release_resource(resource)
-
-    request.addfinalizer(cleanup)
-    return resource
-
-# Equivalent using yield
-@pytest.fixture
-def resource_with_yield():
-    """Using yield for cleanup (preferred)."""
-    resource = acquire_resource()
-    yield resource
-    release_resource(resource)
-```
-
----
-
-## Markers & Test Categorization
-
-### Built-in Markers
-
-```python
-import pytest
-import sys
-
-# Skip marker
-@pytest.mark.skip(reason="Not implemented yet")
-def test_future_feature():
-    pass
-
-# Conditional skip
-@pytest.mark.skipif(sys.version_info < (3, 10), reason="Requires Python 3.10+")
-def test_pattern_matching():
-    match value:
-        case 1: return "one"
-        case _: return "other"
-
-# Expected failure
-@pytest.mark.xfail(reason="Known bug #123")
-def test_known_issue():
-    assert buggy_function() == expected_value
-
-# Strict xfail - fails if test passes
-@pytest.mark.xfail(strict=True, reason="Should be fixed in v2.0")
-def test_upcoming_fix():
-    assert new_feature() == expected
-
-# Platform-specific tests
-@pytest.mark.skipif(sys.platform != "linux", reason="Linux only test")
-def test_linux_specific():
-    pass
-
-# Import skip
-def test_optional_dependency():
-    numpy = pytest.importorskip("numpy", minversion="1.20.0")
-    # Test only runs if numpy >= 1.20.0 is available
-```
-
-### Custom Markers
-
-```ini
-# pytest.ini - Register custom markers
-[pytest]
-markers =
-    slow: marks tests as slow (deselect with '-m "not slow"')
-    smoke: core functionality that must always work
-    integration: requires external services
-    unit: fast isolated unit tests
-    flaky: tests that occasionally fail
-    requires_db: tests that need database access
-    requires_network: tests that need network access
-```
-
-```python
-# Using custom markers
-@pytest.mark.slow
-@pytest.mark.integration
-def test_full_workflow():
-    """Test complete user workflow with external services."""
-    pass
-
-@pytest.mark.smoke
-def test_critical_functionality():
-    """Test that must always pass."""
-    pass
-
-# Multiple markers
-@pytest.mark.unit
-@pytest.mark.smoke
-def test_core_logic():
-    """Fast unit test for critical functionality."""
-    pass
-```
-
-### Marker Expressions
-
-```bash
-# Run only smoke tests
-pytest -m smoke
-
-# Run all tests except slow ones
-pytest -m "not slow"
-
-# Complex expressions
-pytest -m "smoke and not slow"
-pytest -m "(unit or integration) and not flaky"
-
-# List all markers
-pytest --markers
-```
-
-### Applying Markers Dynamically
-
-```python
-# In conftest.py
-def pytest_collection_modifyitems(items):
-    """Dynamically add markers during collection."""
-    for item in items:
-        # Add marker based on test location
-        if "integration" in str(item.fspath):
-            item.add_marker(pytest.mark.integration)
-
-        # Add marker based on test name
-        if "slow" in item.name:
-            item.add_marker(pytest.mark.slow)
-```
-
----
-
-## Parametrization: Data-Driven Testing
-
-### Basic Parametrization
-
-```python
-import pytest
-
-# Single parameter
-@pytest.mark.parametrize("number", [1, 2, 3, 4, 5])
-def test_square(number):
-    assert number ** 2 == number * number
-
-# Multiple parameters
-@pytest.mark.parametrize("input,expected", [
-    (2, 4),
-    (3, 9),
-    (4, 16),
-    (-2, 4),
-])
-def test_square_with_expected(input, expected):
-    assert input ** 2 == expected
-
-# Using test IDs for better output
-@pytest.mark.parametrize("input,expected", [
-    (2, 4),
-    (3, 9),
-    (-2, 4),
-], ids=["positive_2", "positive_3", "negative_2"])
-def test_square_with_ids(input, expected):
-    assert input ** 2 == expected
-
-# ID function
-def idfn(val):
-    return f"num_{val}"
-
-@pytest.mark.parametrize("number", [1, 2, 3], ids=idfn)
-def test_with_id_function(number):
-    assert number > 0
-```
-
-### Advanced Parametrization
-
-```python
-# Nested parametrization
-@pytest.mark.parametrize("x", [1, 2])
-@pytest.mark.parametrize("y", [10, 20])
-def test_multiplication(x, y):
-    # Runs 4 times: (1,10), (1,20), (2,10), (2,20)
-    assert x * y == y * x
-
-# Parametrize with marks
-@pytest.mark.parametrize("test_input,expected", [
-    ("3+5", 8),
-    ("2+4", 6),
-    pytest.param("6*9", 42, marks=pytest.mark.xfail(reason="Hitchhiker's joke")),
-    pytest.param("1/0", 0, marks=pytest.mark.skip(reason="Division by zero")),
-])
-def test_eval(test_input, expected):
-    assert eval(test_input) == expected
-
-# Indirect parametrization (parametrize fixtures)
-@pytest.mark.parametrize("db_name", ["sqlite", "postgres"], indirect=True)
-def test_database_operations(db_name):
-    # db_name fixture receives the parameter value
-    assert db_name.connect()
-```
-
-### Parametrization Patterns
-
-```python
-# Test class parametrization
-@pytest.mark.parametrize("browser", ["chrome", "firefox", "safari"])
-class TestWebApplication:
-    def test_login(self, browser):
-        # Each test method runs with each browser
-        pass
-
-    def test_search(self, browser):
-        pass
-
-# Dynamic parametrization
-def pytest_generate_tests(metafunc):
-    """Dynamically parametrize tests."""
-    if "dynamic_value" in metafunc.fixturenames:
-        values = load_test_values_from_file()
-        metafunc.parametrize("dynamic_value", values)
-
-# Parametrization from fixtures
-@pytest.fixture(params=["admin", "user", "guest"])
-def user_role(request):
-    return create_user_with_role(request.param)
-
-def test_permissions(user_role):
-    # Test runs for each user role
-    assert user_role.can_access("/dashboard") == user_role.is_admin
-```
-
----
-
-## Assertions & Error Handling
-
-### Enhanced Assertions
-
-```python
-# Pytest rewrites assert statements for better output
-def test_assertion_introspection():
-    data = {"name": "Alice", "items": [1, 2, 3]}
-    # Pytest shows detailed diff on failure
-    assert data == {"name": "Bob", "items": [1, 2, 3]}
-
-# Custom assertion messages
-def test_with_message():
-    result = complex_calculation()
-    assert result > 0, f"Expected positive result, got {result}"
-```
-
-### Exception Testing
-
-```python
-import pytest
-
-# Basic exception testing
-def test_raises_exception():
-    with pytest.raises(ValueError):
-        raise ValueError("Invalid value")
-
-# Check exception message
-def test_exception_message():
-    with pytest.raises(ValueError, match="Invalid.*value"):
-        raise ValueError("Invalid value provided")
-
-# Access exception info
-def test_exception_info():
-    with pytest.raises(ValueError) as exc_info:
-        raise ValueError("test error")
-
-    assert str(exc_info.value) == "test error"
-    assert exc_info.type == ValueError
-
-# Test multiple exceptions (ExceptionGroup)
-def test_exception_group():
-    with pytest.raises(ExceptionGroup) as exc_info:
-        raise ExceptionGroup("errors", [
-            ValueError("error 1"),
-            TypeError("error 2")
-        ])
-
-    assert len(exc_info.value.exceptions) == 2
-```
-
-### Warning Testing
-
-```python
-import warnings
-import pytest
-
-def test_warns():
-    with pytest.warns(UserWarning):
-        warnings.warn("This is a warning", UserWarning)
-
-def test_warns_with_match():
-    with pytest.warns(DeprecationWarning, match="deprecated"):
-        warnings.warn("This function is deprecated", DeprecationWarning)
-
-def test_no_warnings():
-    # Ensure no warnings are raised
-    with warnings.catch_warnings():
-        warnings.simplefilter("error")
-        clean_function()  # Should not raise any warnings
-```
-
-### Approximate Comparisons
-
-```python
-import pytest
-
-def test_float_comparison():
-    assert 0.1 + 0.2 == pytest.approx(0.3)
-
-def test_list_approximate():
-    assert [0.1 + 0.2, 0.2 + 0.4] == pytest.approx([0.3, 0.6])
-
-def test_dict_approximate():
-    assert {"a": 0.1 + 0.2} == pytest.approx({"a": 0.3})
-
-# Custom tolerance
-def test_custom_tolerance():
-    assert 1.0001 == pytest.approx(1.0, rel=1e-3)
-    assert 1.0001 == pytest.approx(1.0, abs=1e-3)
-```
-
----
-
-## Configuration & Settings
-
-### Configuration File Priority (Critical Knowledge)
-
-Understanding configuration file priority is essential for debugging pytest configuration issues.
-
-**Priority Order** (first match wins - configurations are NEVER merged):
-
-| Priority | File | Notes |
-|----------|------|-------|
-| 1 (Highest) | `pytest.toml` / `.pytest.toml` | New in pytest 9.0, native TOML |
-| 2 | `pytest.ini` / `.pytest.ini` | Classic pytest config |
-| 3 | `pyproject.toml` | Modern Python project standard |
-| 4 | `tox.ini` | Tox integration |
-| 5 (Lowest) | `setup.cfg` | Legacy, not recommended |
-
-> ⚠️ **Critical Gotcha**: If an empty `pytest.ini` file exists in your project, ALL settings in `pyproject.toml` will be ignored! This is a common source of confusion. Delete any empty `pytest.ini` files.
-
-**Configuration Sections by File Type**:
-
-| File Type | Section Name |
-|-----------|--------------|
-| pytest.ini | `[pytest]` |
-| pyproject.toml (pytest 6.0-8.x) | `[tool.pytest.ini_options]` |
-| pyproject.toml (pytest 9.0+) | `[tool.pytest]` |
-| tox.ini | `[pytest]` |
-| setup.cfg | `[tool:pytest]` |
-
-**Best Practice**: Use `pyproject.toml` as your single source of truth for all Python tooling configuration (pytest, ruff, mypy, etc.).
-
-### pyproject.toml Configuration (Recommended)
-
-Using `pyproject.toml` is the modern, preferred approach for Python project configuration. It consolidates all project metadata and tool configurations in one place.
-
-```toml
-# pyproject.toml
-[tool.pytest.ini_options]
-# Minimum pytest version
-minversion = "7.0"
-
-# Default command line options
-addopts = [
-    "--strict-markers",      # Fail on unknown markers
-    "--strict-config",       # Fail on config errors
-    "--import-mode=importlib",  # Use standard import system (recommended)
-    "--verbose",             # Verbose output
-    "-ra",                   # Show all test outcomes
-    "--cov=myproject",       # Coverage for your project
-    "--cov-report=html",     # HTML coverage report
-    "--cov-report=term-missing",  # Terminal report with missing lines
-]
-
-> 💡 **Recommended**: Always include `--import-mode=importlib` in your `addopts`. This uses Python's standard import system instead of modifying `sys.path`, avoiding common import issues. This has been the default since pytest 6.0 but explicitly setting it ensures consistent behavior.
-
-# Test discovery
-testpaths = ["tests"]
-python_files = ["test_*.py", "*_test.py"]
-python_classes = ["Test*", "*Tests"]
-python_functions = ["test_*"]
-
-# Python path configuration
-pythonpath = ["src"]
-
-# Import mode (importlib is recommended for most projects)
-import_mode = "importlib"
-
-# Custom markers registration
-markers = [
-    "slow: marks tests as slow (deselect with '-m \"not slow\"')",
-    "integration: requires external services",
-    "unit: fast isolated unit tests",
-    "smoke: core functionality that must always work",
-    "flaky: tests that occasionally fail",
-    "requires_network: tests that need network access",
-]
-
-# Output configuration
-console_output_style = "progress"
-
-# Directories to ignore
-norecursedirs = [".git", ".tox", "dist", "build", "*.egg", "__pycache__"]
-
-# Logging configuration
-log_cli = true
-log_cli_level = "INFO"
-log_cli_format = "%(asctime)s [%(levelname)8s] %(message)s"
-log_cli_date_format = "%Y-%m-%d %H:%M:%S"
-
-# Warning filters
-filterwarnings = [
-    "error",                          # Turn warnings into errors
-    "ignore::UserWarning",            # Ignore user warnings
-    "ignore::DeprecationWarning",     # Ignore deprecation warnings
-    "default:.*deprecated.*:DeprecationWarning",  # Show deprecation warnings with "deprecated" in message
-]
-
-# Required plugins
-required_plugins = [
-    "pytest-cov>=4.0",
-]
-
-# Test timeout (requires pytest-timeout)
-timeout = 300
-timeout_method = "thread"
-
-# Strict xfail
-xfail_strict = true
-
-# Asyncio configuration (requires pytest-asyncio)
-asyncio_mode = "auto"
-
-# Coverage configuration (can also be in [tool.coverage])
-[tool.coverage.run]
-source = ["myproject"]
-omit = [
-    "*/tests/*",
-    "*/venv/*",
-    "*/.venv/*",
-    "*/migrations/*",
-    "*/__pycache__/*",
-    "*/.pytest_cache/*",
-]
-
-[tool.coverage.report]
-precision = 2
-show_missing = true
-skip_covered = false
-exclude_lines = [
-    "pragma: no cover",
-    "def __repr__",
-    "raise AssertionError",
-    "raise NotImplementedError",
-    "if __name__ == .__main__.:",
-    "if TYPE_CHECKING:",
-    "if typing.TYPE_CHECKING:",
-]
-
-[tool.coverage.html]
-directory = "htmlcov"
-
-[tool.coverage.xml]
-output = "coverage.xml"
-```
-
-### Complete pyproject.toml Example
-
-Here's a complete `pyproject.toml` that includes project metadata along with pytest configuration:
-
-```toml
-[build-system]
-requires = ["setuptools>=64", "wheel"]
-build-backend = "setuptools.build_meta"
-
-[project]
-name = "myproject"
-version = "1.0.0"
-description = "My awesome project"
-readme = "README.md"
-requires-python = ">=3.8"
-license = {text = "MIT"}
-authors = [
-    {name = "Your Name", email = "you@example.com"},
-]
-dependencies = [
-    "requests>=2.28.0",
-    "pydantic>=2.0.0",
-]
-
-[project.optional-dependencies]
-dev = [
-    "pytest>=7.0.0",
-    "pytest-cov>=4.0.0",
-    "pytest-mock>=3.10.0",
-    "pytest-asyncio>=0.21.0",
-    "pytest-timeout>=2.1.0",
-    "pytest-xdist>=3.0.0",
-    "black>=23.0.0",
-    "ruff>=0.1.0",
-    "mypy>=1.0.0",
-]
-
-[project.urls]
-Homepage = "https://github.com/username/myproject"
-Documentation = "https://myproject.readthedocs.io"
-Repository = "https://github.com/username/myproject.git"
-Issues = "https://github.com/username/myproject/issues"
-
-[tool.setuptools.packages.find]
-where = ["src"]
-
-[tool.pytest.ini_options]
-# ... (configuration from above)
-
-[tool.black]
-line-length = 88
-target-version = ["py38", "py39", "py310", "py311"]
-include = '\.pyi?$'
-
-[tool.ruff]
-line-length = 88
-target-version = "py38"
-select = [
-    "E",   # pycodestyle errors
-    "W",   # pycodestyle warnings
-    "F",   # pyflakes
-    "I",   # isort
-    "N",   # pep8-naming
-    "UP",  # pyupgrade
-]
-
-[tool.mypy]
-python_version = "3.8"
-warn_return_any = true
-warn_unused_configs = true
-disallow_untyped_defs = true
-```
-
-### Migration from pytest.ini to pyproject.toml
-
-If you have an existing `pytest.ini`, here's how to migrate:
-
-```ini
-# OLD: pytest.ini
-[pytest]
-markers =
-    slow: slow tests
-testpaths = tests
-```
-
-Becomes:
-
-```toml
-# NEW: pyproject.toml
-[tool.pytest.ini_options]
-markers = [
-    "slow: slow tests",
-]
-testpaths = ["tests"]
-```
-
-### pytest 9.0+ Native TOML Configuration
-
-Starting with pytest 9.0, you can use the native `[tool.pytest]` table which provides cleaner TOML syntax:
-
-```toml
-# pytest 9.0+ (native TOML arrays - cleaner syntax)
-[tool.pytest]
-minversion = "9.0"
-
-# Test discovery
-testpaths = ["tests"]
-pythonpath = ["."]
-python_files = ["test_*.py", "*_test.py"]
-python_classes = ["Test*"]
-python_functions = ["test_*"]
-norecursedirs = [".git", ".tox", "dist", "build", ".venv", "__pycache__"]
-
-# Command line options (native TOML arrays)
-addopts = [
-    "--strict-markers",
-    "--strict-config",
-    "--import-mode=importlib",
-    "-ra",
-    "--tb=short",
-]
-
-# Markers
-markers = [
-    "slow: marks tests as slow",
-    "integration: integration tests",
-]
-
-# Warning filters
-filterwarnings = [
-    "error",
-    "ignore::DeprecationWarning",
-]
-
-# Required plugins
-required_plugins = [
-    "pytest-cov>=4.0",
-]
-```
-
-**Benefits over `[tool.pytest.ini_options]`**:
-- Native TOML array syntax (clearer than space-separated strings in some cases)
-- Better TOML type support
-- Future-proof configuration format
-- Reserved by pytest team for enhanced features
-
-**Migration**: Simply rename `[tool.pytest.ini_options]` to `[tool.pytest]` when upgrading to pytest 9.0+.
-
-### Legacy pytest.ini (Not Recommended)
-
-While `pytest.ini` still works, it's considered legacy. Use `pyproject.toml` instead for these benefits:
-- Single configuration file for all Python tools
-- Better IDE support
-- TOML format is more readable
-- Standardized by PEP 518 and PEP 621
-
-### Command Line Configuration
-
-```bash
-# Common command line options
-pytest -v                    # Verbose output
-pytest -q                    # Quiet output
-pytest -s                    # No capture, show print statements
-pytest -x                    # Stop on first failure
-pytest --maxfail=3          # Stop after 3 failures
-pytest -k "user"            # Run tests matching "user"
-pytest -m "not slow"        # Run tests not marked as slow
-pytest --lf                 # Run last failed tests
-pytest --ff                 # Run failed tests first
-pytest --tb=short           # Short traceback format
-pytest --tb=no              # No traceback
-pytest --setup-show         # Show fixture setup/teardown
-pytest --fixtures           # Show available fixtures
-pytest --markers            # Show available markers
-pytest --collect-only       # Only collect tests, don't run
-pytest --cache-clear        # Clear cache before run
-pytest --doctest-modules    # Run doctests
-pytest --cov=myproject      # Coverage report
-pytest --cov-report=html    # HTML coverage report
-pytest --durations=10       # Show 10 slowest tests
-pytest --pdb                # Drop to debugger on failure
-pytest --pdbcls=IPython.terminal.debugger:TerminalPdb  # Use IPython debugger
-```
-
----
-
-## Conftest.py: Shared Test Logic
-
-### Fixture Sharing
-
-```python
-# tests/conftest.py - Available to all tests
-import pytest
-import tempfile
-from pathlib import Path
-
-@pytest.fixture(scope="session")
-def test_data_dir():
-    """Shared test data directory."""
-    return Path(__file__).parent / "data"
-
-@pytest.fixture
-def temp_dir():
-    """Create temporary directory for test."""
-    with tempfile.TemporaryDirectory() as tmp:
-        yield Path(tmp)
-
-# tests/unit/conftest.py - Available to unit tests only
-@pytest.fixture
-def mock_database():
-    """Mock database for unit tests."""
-    return MockDatabase()
-
-# tests/integration/conftest.py - Available to integration tests only
-@pytest.fixture(scope="module")
-def real_database():
-    """Real database connection for integration tests."""
-    db = Database()
-    yield db
-    db.cleanup()
-```
-
-### Hooks in conftest.py
-
-```python
-# Modify test collection
-def pytest_collection_modifyitems(config, items):
-    """Modify test collection."""
-    # Add markers based on test file location
-    for item in items:
-        # Add markers based on location
-        if "integration" in str(item.fspath):
-            item.add_marker(pytest.mark.integration)
-
-        # Skip tests based on environment
-        if "requires_gpu" in item.keywords and not has_gpu():
-            item.add_marker(pytest.mark.skip(reason="GPU not available"))
-
-# Custom command line options
-def pytest_addoption(parser):
-    """Add custom command line options."""
-    parser.addoption(
-        "--run-slow",
-        action="store_true",
-        default=False,
-        help="Run slow tests"
-    )
-    parser.addoption(
-        "--integration",
-        action="store_true",
-        default=False,
-        help="Run integration tests"
-    )
-
-# Configure based on options
-def pytest_configure(config):
-    """Configure pytest based on command line options."""
-    if config.getoption("--run-slow"):
-        config.option.markexpr = "slow"
-
-# Custom markers registration
-def pytest_configure(config):
-    config.addinivalue_line(
-        "markers", "slow: marks tests as slow"
-    )
-```
-
-#### Hook Execution Order Control
-
-Control when your hooks run relative to other plugins:
-
-```python
-@pytest.hookimpl(tryfirst=True)
-def pytest_collection_modifyitems(items):
-    """Execute BEFORE other implementations."""
-    # Priority operations here
-    pass
-
-@pytest.hookimpl(trylast=True)
-def pytest_collection_modifyitems(items):
-    """Execute AFTER other implementations."""
-    # Cleanup or final modifications
-    pass
-```
-
-#### Wrapper Hooks (Advanced)
-
-Wrap other hook implementations for cross-cutting concerns:
-
-```python
-@pytest.hookimpl(wrapper=True)
-def pytest_runtest_makereport(item, call):
-    """Wrap report generation for custom handling."""
-    # Code before other hooks run
-    outcome = yield  # Run wrapped hooks
-    report = outcome.get_result()
-
-    # Code after - modify or log report
-    if report.when == "call" and report.failed:
-        # Handle test failure
-        log_failure(item.nodeid, report.longreprtext)
-
-    return report
-
-@pytest.hookimpl(wrapper=True, tryfirst=True)
-def pytest_runtest_setup(item):
-    """Wrap setup with timing."""
-    start = time.time()
-    yield  # Run actual setup
-    duration = time.time() - start
-    item.setup_duration = duration
-```
-
-#### Storing Data Across Hooks
-
-Use `item.stash` for type-safe data storage:
-
-```python
-from pytest import StashKey
-
-# Define typed keys
-phase_report_key = StashKey[dict]()
-timing_key = StashKey[float]()
-
-@pytest.hookimpl(wrapper=True)
-def pytest_runtest_makereport(item, call):
-    """Store reports for fixture access."""
-    outcome = yield
-    report = outcome.get_result()
-
-    # Store in stash (type-safe)
-    item.stash.setdefault(phase_report_key, {})[report.when] = report
-    return report
-
-@pytest.fixture
-def test_outcome(request):
-    """Fixture to access test outcome."""
-    yield
-    report = request.node.stash.get(phase_report_key, {}).get("call")
-    if report and report.failed:
-        # Handle failure in fixture teardown
-        pass
-```
-
-#### Custom Report Sections
-
-Add extra information to test reports:
-
-```python
-@pytest.hookimpl(tryfirst=True, wrapper=True)
-def pytest_runtest_makereport(item, call):
-    outcome = yield
-    report = outcome.get_result()
-
-    # Add custom sections to report
-    if report.when == "call":
-        report.sections.append(
-            ("Custom Info", f"Test: {item.nodeid}\nDuration: {call.duration:.2f}s")
-        )
-
-    return report
-```
-
-### Plugin Registration
-
-```python
-# Register external plugins
-pytest_plugins = [
-    "myproject.testing.fixtures",
-    "myproject.testing.helpers",
-]
-
-# Conditional plugin loading
-import sys
-if sys.platform.startswith("win"):
-    pytest_plugins.append("myproject.testing.windows")
-```
-
----
-
-## Mocking & Monkeypatching
-
-### Using pytest-mock
-
-```python
-# Install: pip install pytest-mock
-
-def test_with_mock(mocker):
-    """Using pytest-mock plugin."""
-    # Mock a module function
-    mock_func = mocker.patch("mymodule.function")
-    mock_func.return_value = 42
-
-    # Mock an object method
-    mock_method = mocker.patch.object(MyClass, "method")
-    mock_method.return_value = "mocked"
-
-    # Spy on a function
-    spy = mocker.spy(mymodule, "function")
-    mymodule.function()
-    spy.assert_called_once()
-
-# Using side effects
-def test_side_effects(mocker):
-    mock = mocker.patch("mymodule.function")
-    mock.side_effect = [1, 2, 3]  # Returns different values each call
-
-    assert mymodule.function() == 1
-    assert mymodule.function() == 2
-    assert mymodule.function() == 3
-
-# Mock with exceptions
-def test_mock_exception(mocker):
-    mock = mocker.patch("mymodule.function")
-    mock.side_effect = ValueError("Error!")
-
-    with pytest.raises(ValueError):
-        mymodule.function()
-```
-
-### Monkeypatch
-
-```python
-def test_monkeypatch_env(monkeypatch):
-    """Monkeypatch environment variables."""
-    monkeypatch.setenv("API_KEY", "test-key")
-    monkeypatch.delenv("OLD_VAR", raising=False)
-
-    assert os.environ["API_KEY"] == "test-key"
-    assert "OLD_VAR" not in os.environ
-
-def test_monkeypatch_attribute(monkeypatch):
-    """Monkeypatch object attributes."""
-    class MyClass:
-        value = 10
-
-    obj = MyClass()
-    monkeypatch.setattr(obj, "value", 20)
-    assert obj.value == 20
-
-def test_monkeypatch_module(monkeypatch):
-    """Monkeypatch module functions."""
-    import time
-
-    def mock_time():
-        return 123456.0
-
-    monkeypatch.setattr(time, "time", mock_time)
-    assert time.time() == 123456.0
-
-def test_monkeypatch_dict(monkeypatch):
-    """Monkeypatch dictionary items."""
-    config = {"url": "production.com"}
-    monkeypatch.setitem(config, "url", "test.com")
-    assert config["url"] == "test.com"
-```
-
-### Advanced Mocking Patterns
-
-```python
-# Context manager mocking
-def test_context_manager(mocker):
-    mock_cm = mocker.MagicMock()
-    mock_cm.__enter__.return_value = "resource"
-    mock_cm.__exit__.return_value = None
-
-    mocker.patch("mymodule.get_resource", return_value=mock_cm)
-
-    with mymodule.get_resource() as resource:
-        assert resource == "resource"
-
-    mock_cm.__enter__.assert_called_once()
-    mock_cm.__exit__.assert_called_once()
-
-# Property mocking
-def test_property_mock(mocker):
-    mock_property = mocker.PropertyMock(return_value=42)
-    mocker.patch("mymodule.MyClass.my_property", new_callable=mock_property)
-
-    obj = mymodule.MyClass()
-    assert obj.my_property == 42
-    mock_property.assert_called_once()
-
-# Async mocking
-async def test_async_mock(mocker):
-    mock_async = mocker.AsyncMock(return_value="async result")
-    mocker.patch("mymodule.async_function", mock_async)
-
-    result = await mymodule.async_function()
-    assert result == "async result"
-    mock_async.assert_awaited_once()
-```
-
----
-
-## Database Testing Patterns
-
-Testing database interactions requires careful isolation and cleanup strategies.
-
-### Transaction-Based Isolation
-
-The most reliable approach is rolling back transactions after each test:
-
-```python
-import pytest
-
-@pytest.fixture
-def db_session(engine):
-    """Create a transactional test session."""
-    connection = engine.connect()
-    transaction = connection.begin()
-    session = Session(bind=connection)
-
-    yield session
-
-    session.close()
-    transaction.rollback()
-    connection.close()
-
-def test_user_creation(db_session):
-    """Test runs in transaction that gets rolled back."""
-    user = User(name="test")
-    db_session.add(user)
-    db_session.flush()
-
-    assert user.id is not None
-    # Transaction rolled back - no cleanup needed
-```
-
-### pytest-django Database Access
-
-```python
-import pytest
-
-# Mark test to enable database access
-@pytest.mark.django_db
-def test_user_creation():
-    User.objects.create(username="testuser")
-    assert User.objects.count() == 1
-
-# Transaction testing (for testing transaction behavior)
-@pytest.mark.django_db(transaction=True)
-def test_atomic_operations():
-    with transaction.atomic():
-        User.objects.create(username="user1")
-        # Test atomic behavior
-
-# Multiple database support
-@pytest.mark.django_db(databases=["default", "secondary"])
-def test_multi_db():
-    User.objects.using("secondary").create(username="remote_user")
-```
-
-### Database Blocker Pattern
-
-Control database access at fixture level:
-
-```python
-@pytest.fixture
-def setup_data(django_db_blocker):
-    """Fixture that needs temporary DB access."""
-    with django_db_blocker.unblock():
-        # Database operations allowed here
-        User.objects.create(username="fixture_user")
-    # Database blocked again outside context
-
-@pytest.fixture
-def no_db_fixture(django_db_blocker):
-    """Ensure no accidental DB access."""
-    with django_db_blocker.block():
-        yield  # DB access will raise error
-```
-
-### Query Count Assertions
-
-Prevent N+1 query issues:
-
-```python
-def test_efficient_queries(django_assert_num_queries):
-    """Assert exact number of queries."""
-    with django_assert_num_queries(3):
-        list(User.objects.all())
-        list(Post.objects.all())
-        list(Comment.objects.all())
-
-def test_max_queries(django_assert_max_num_queries):
-    """Assert maximum query count."""
-    with django_assert_max_num_queries(5):
-        # Complex operation that should be efficient
-        process_users()
-```
-
-### SQLAlchemy Testing Patterns
-
-```python
-import pytest
-from sqlalchemy import create_engine
-from sqlalchemy.orm import sessionmaker
-
-@pytest.fixture(scope="session")
-def engine():
-    """Create test database engine."""
-    return create_engine("sqlite:///:memory:")
-
-@pytest.fixture(scope="session")
-def tables(engine):
-    """Create all tables."""
-    Base.metadata.create_all(engine)
-    yield
-    Base.metadata.drop_all(engine)
-
-@pytest.fixture
-def db_session(engine, tables):
-    """Create a new database session for each test."""
-    connection = engine.connect()
-    transaction = connection.begin()
-    Session = sessionmaker(bind=connection)
-    session = Session()
-
-    yield session
-
-    session.close()
-    transaction.rollback()
-    connection.close()
-```
-
-### Factory Pattern for Test Data
-
-```python
-import pytest
-from factory import Factory, Faker, SubFactory
-
-class UserFactory(Factory):
-    class Meta:
-        model = User
-
-    username = Faker("user_name")
-    email = Faker("email")
-
-class PostFactory(Factory):
-    class Meta:
-        model = Post
-
-    title = Faker("sentence")
-    author = SubFactory(UserFactory)
-
-@pytest.fixture
-def user_factory(db_session):
-    """Factory fixture for creating test users."""
-    def _create_user(**kwargs):
-        user = UserFactory.build(**kwargs)
-        db_session.add(user)
-        db_session.flush()
-        return user
-    return _create_user
-
-def test_user_posts(user_factory):
-    author = user_factory(username="author")
-    post = PostFactory.build(author=author)
-    assert post.author.username == "author"
-```
-
----
-
-## Performance & Optimization
-
-### Parallel Execution with pytest-xdist
-
-```bash
-# Install pytest-xdist
-pip install pytest-xdist
-```
-
-#### Basic Usage
-
-```bash
-pytest -n auto          # Use all available CPUs
-pytest -n 4             # Use 4 workers
-pytest -n logical       # Use logical cores (requires psutil)
-```
-
-#### Distribution Strategies
-
-Understanding distribution strategies is critical for efficient parallel testing:
-
-```bash
-# Load balancing (default) - distributes tests as workers become available
-pytest -n auto --dist load
-
-# Group by scope - keeps tests sharing fixtures on same worker (RECOMMENDED)
-pytest -n auto --dist loadscope
-
-# Group by file - all tests in a file run on same worker
-pytest -n auto --dist loadfile
-
-# Each test runs on every worker (for environment-specific testing)
-pytest -n 2 --dist each
-```
-
-**When to Use Each Strategy**:
-
-| Strategy | Use Case | Performance |
-|----------|----------|-------------|
-| `load` | Independent tests, no shared state | Best parallelization |
-| `loadscope` | Tests sharing expensive fixtures | Balanced (recommended default) |
-| `loadfile` | File-level isolation needed | Good for integration tests |
-| `each` | Multi-environment testing | Multiplies test count |
-
-#### Grouping Tests with xdist_group Marker
-
-Force related tests to run on the same worker:
-
-```python
-import pytest
-
-@pytest.mark.xdist_group(name="database")
-def test_create_user():
-    """Runs on same worker as other 'database' group tests."""
-    db.create_user("alice")
-
-@pytest.mark.xdist_group(name="database")
-def test_query_user():
-    """Guaranteed same worker as test_create_user."""
-    user = db.get_user("alice")
-    assert user is not None
-
-@pytest.mark.xdist_group(name="api")
-def test_api_endpoint():
-    """Runs on potentially different worker."""
-    pass
-```
-
-#### Session-Scoped Fixtures with Parallel Execution
-
-Session-scoped fixtures require special handling in parallel execution to avoid race conditions:
-
-```python
-import json
-from pathlib import Path
-from filelock import FileLock  # pip install filelock
-
-@pytest.fixture(scope="session")
-def expensive_shared_data(tmp_path_factory, worker_id):
-    """Thread-safe session fixture for parallel execution."""
-    # Single worker mode - no synchronization needed
-    if worker_id == "master":
-        return generate_expensive_data()
-
-    # Multi-worker mode - use file locking
-    root_tmp = tmp_path_factory.getbasetemp().parent
-    data_file = root_tmp / "shared_data.json"
-    lock_file = str(data_file) + ".lock"
-
-    with FileLock(lock_file):
-        if data_file.is_file():
-            # Another worker already created the data
-            return json.loads(data_file.read_text())
-        else:
-            # First worker creates the data
-            data = generate_expensive_data()
-            data_file.write_text(json.dumps(data))
-            return data
-
-@pytest.fixture(scope="session")
-def database_url(tmp_path_factory, worker_id):
-    """Per-worker database for parallel isolation."""
-    # Each worker gets its own database
-    return f"sqlite:///test_db_{worker_id}.sqlite"
-```
-
-#### Configuration for Parallel Execution
-
-```toml
-# pyproject.toml
-[tool.pytest.ini_options]
-addopts = [
-    "-n", "auto",
-    "--dist", "loadscope",
-]
-```
-
-> ⚠️ **Warning**: Not all tests are parallelization-safe. Tests that modify global state, shared files, or external services may conflict. Use `xdist_group` or run such tests serially with `-n 0`.
-
-### Test Duration Analysis
-
-```python
-# Show test durations
-pytest --durations=10   # Show 10 slowest tests
-pytest --durations=0    # Show all test durations
-
-# In conftest.py - Custom timing
-import time
-
-@pytest.fixture(autouse=True)
-def measure_test_time(request):
-    start = time.time()
-    yield
-    duration = time.time() - start
-    print(f"\n{request.node.name} took {duration:.2f}s")
-```
-
-### Caching
-
-```python
-# Using pytest cache
-def test_expensive_computation(cache):
-    # Check cache
-    result = cache.get("computation_result", None)
-    if result is None:
-        # Compute and cache
-        result = expensive_computation()
-        cache.set("computation_result", result)
-
-    assert result == expected_value
-
-# Cache command line
-pytest --cache-show     # Show cache contents
-pytest --cache-clear    # Clear cache
-```
-
-### Fixture Optimization
-
-```python
-# Reuse expensive fixtures with broader scope
-@pytest.fixture(scope="session")
-def expensive_resource():
-    """Create once, use many times."""
-    resource = create_expensive_resource()
-    yield resource
-    resource.cleanup()
-
-# Lazy fixture creation
-@pytest.fixture
-def maybe_expensive():
-    """Only created if actually used by test."""
-    return ExpensiveObject()
-
-# Fixture factories for controlled creation
-@pytest.fixture
-def resource_factory():
-    resources = []
-
-    def _make_resource(**kwargs):
-        resource = Resource(**kwargs)
-        resources.append(resource)
-        return resource
-
-    yield _make_resource
-
-    # Cleanup all at once
-    for resource in resources:
-        resource.cleanup()
-```
-
----
-
-## CI/CD Integration
-
-### GitHub Actions Example
-
-```yaml
-# .github/workflows/test.yml
-name: Tests
-
-on: [push, pull_request]
-
-jobs:
-  test:
-    runs-on: ${{ matrix.os }}
-    strategy:
-      matrix:
-        os: [ubuntu-latest, windows-latest, macos-latest]
-        python-version: ["3.9", "3.10", "3.11", "3.12"]
-
-    steps:
-    - uses: actions/checkout@v4
-
-    - name: Set up Python
-      uses: actions/setup-python@v4
-      with:
-        python-version: ${{ matrix.python-version }}
-
-    - name: Install dependencies
-      run: |
-        python -m pip install --upgrade pip
-        pip install -e ".[test]"
-
-    - name: Run tests
-      run: |
-        pytest -v --cov=myproject --cov-report=xml
-
-    - name: Upload coverage
-      uses: codecov/codecov-action@v3
-      with:
-        file: ./coverage.xml
-```
-
-### Test Stages
-
-```yaml
-# Multi-stage testing
-stages:
-  - quick-tests
-  - full-tests
-  - integration-tests
-
-quick-tests:
-  script:
-    - pytest -m "unit and not slow" --fail-fast
-
-full-tests:
-  script:
-    - pytest -m "not integration"
-
-integration-tests:
-  script:
-    - pytest -m integration
-  only:
-    - main
-    - merge_requests
-```
-
-### Coverage Configuration
-
-```ini
-# .coveragerc
-[run]
-source = myproject
-omit =
-    */tests/*
-    */venv/*
-    */migrations/*
-    */__init__.py
-
-[report]
-precision = 2
-show_missing = True
-skip_covered = False
-
-[html]
-directory = htmlcov
-
-[xml]
-output = coverage.xml
-```
-
----
-
-## Plugin Ecosystem
-
-### Essential Plugins
-
-```bash
-# Coverage
-pip install pytest-cov
-
-# Parallel execution
-pip install pytest-xdist
-
-# Mocking
-pip install pytest-mock
-
-# Timeout
-pip install pytest-timeout
-
-# HTML reports
-pip install pytest-html
-
-# BDD
-pip install pytest-bdd
-
-# Benchmarking
-pip install pytest-benchmark
-
-# Django
-pip install pytest-django
-
-# Asyncio
-pip install pytest-asyncio
-
-# Flake8 integration
-pip install pytest-flake8
-
-# Order randomization
-pip install pytest-randomly
-```
-
-### Plugin Usage Examples
-
-```python
-# pytest-timeout
-@pytest.mark.timeout(10)  # 10 second timeout
-def test_slow_operation():
-    perform_slow_operation()
-
-# pytest-benchmark
-def test_performance(benchmark):
-    result = benchmark(my_function, arg1, arg2)
-    assert result == expected
-
-# pytest-randomly (randomize test order)
-# Just install and it works automatically
-# Use --randomly-seed=1234 to reproduce order
-```
-
-### Async Testing with pytest-asyncio
-
-#### Installation and Configuration
-
-```bash
-pip install pytest-asyncio
-```
-
-```toml
-# pyproject.toml
-[tool.pytest.ini_options]
-asyncio_mode = "auto"  # Automatically handle async tests
-```
-
-#### Basic Async Tests
-
-```python
-import pytest
-
-@pytest.mark.asyncio
-async def test_async_function():
-    """Test async function."""
-    result = await async_operation()
-    assert result == expected
-
-@pytest.mark.asyncio
-async def test_async_context_manager():
-    """Test async context manager."""
-    async with AsyncResource() as resource:
-        result = await resource.fetch()
-        assert result is not None
-```
-
-#### Async Fixtures
-
-```python
-@pytest.fixture
-async def async_client():
-    """Async fixture with proper cleanup."""
-    client = await create_async_client()
-    yield client
-    await client.close()
-
-@pytest.fixture(scope="session")
-async def async_database():
-    """Session-scoped async fixture."""
-    db = await Database.connect()
-    yield db
-    await db.disconnect()
-
-@pytest.mark.asyncio
-async def test_with_async_fixtures(async_client, async_database):
-    """Test using async fixtures."""
-    result = await async_client.query(async_database)
-    assert result is not None
-```
-
-#### Fixture Scopes for Async
-
-```python
-# Function scope (default) - new event loop per test
-@pytest.fixture
-async def function_resource():
-    return await create_resource()
-
-# Session scope - shared across tests
-@pytest.fixture(scope="session")
-async def session_resource():
-    resource = await expensive_async_setup()
-    yield resource
-    await resource.cleanup()
-```
-
-> ⚠️ **Deprecation Warning**: Sync tests depending on async fixtures will warn in pytest 8.x and error in future versions. Always use `@pytest.mark.asyncio` for tests using async fixtures.
-
-#### Event Loop Scope (pytest-asyncio 0.21+)
-
-```python
-# Control event loop scope
-@pytest.fixture(scope="session")
-def event_loop_policy():
-    """Use uvloop for faster async."""
-    import uvloop
-    return uvloop.EventLoopPolicy()
-
-# Or via configuration
-# pyproject.toml
-[tool.pytest.ini_options]
-asyncio_default_fixture_loop_scope = "function"
-```
-
----
-
-## Snapshot & Regression Testing
-
-Snapshot testing captures expected output and compares against future runs.
-
-### Using syrupy (Recommended)
-
-```bash
-pip install syrupy
-```
-
-```python
-def test_api_response(snapshot):
-    """Compare API response against snapshot."""
-    response = api_client.get("/users/1")
-    assert response.json() == snapshot
-
-def test_html_output(snapshot):
-    """Compare rendered HTML."""
-    html = render_template("user_profile.html", user=mock_user)
-    assert html == snapshot
-
-def test_complex_object(snapshot):
-    """Snapshot complex data structures."""
-    result = process_data(input_data)
-    assert result == snapshot
-```
-
-### Snapshot Management
-
-```bash
-# Update all snapshots (after intentional changes)
-pytest --snapshot-update
-
-# Review snapshot changes interactively
-pytest --snapshot-warn-unused
-
-# CI mode - fail on snapshot mismatch
-pytest  # Default behavior
-```
-
-### Custom Snapshot Serializers
-
-```python
-from syrupy.extensions.json import JSONSnapshotExtension
-
-@pytest.fixture
-def snapshot_json(snapshot):
-    """Use JSON serialization for snapshots."""
-    return snapshot.use_extension(JSONSnapshotExtension)
-
-def test_json_api(snapshot_json):
-    response = api.get("/data")
-    assert response.json() == snapshot_json
-```
-
-### Inline Snapshots
-
-```python
-def test_inline(snapshot):
-    """Snapshot stored in test file itself."""
-    result = calculate_value()
-    assert result == snapshot(result)  # First run creates snapshot
-```
-
-### Best Practices for Snapshot Testing
-
-1. **Use for stable outputs**: HTML, JSON responses, serialized objects
-2. **Avoid for volatile data**: Timestamps, random IDs, system-specific paths
-3. **Review diffs carefully**: Snapshot updates should be intentional
-4. **Combine with unit tests**: Snapshots complement, not replace, assertions
-5. **Keep snapshots small**: Large snapshots are hard to review
-
----
-
-## Property-Based Testing with Hypothesis
-
-Property-based testing generates random inputs to find edge cases.
-
-### Installation
-
-```bash
-pip install hypothesis
-```
-
-### Basic Property Tests
-
-```python
-from hypothesis import given, strategies as st
-
-@given(st.integers())
-def test_integer_properties(x):
-    """Test properties that should hold for all integers."""
-    assert x + 0 == x
-    assert x * 1 == x
-    assert x - x == 0
-
-@given(st.lists(st.integers()))
-def test_sort_is_idempotent(data):
-    """Sorting twice equals sorting once."""
-    assert sorted(data) == sorted(sorted(data))
-
-@given(st.lists(st.integers()))
-def test_sort_preserves_length(data):
-    """Sorting doesn't change length."""
-    assert len(sorted(data)) == len(data)
-
-@given(st.text())
-def test_string_roundtrip(s):
-    """Encoding and decoding returns original."""
-    assert s.encode("utf-8").decode("utf-8") == s
-```
-
-### Combining with pytest Fixtures
-
-```python
-@given(st.integers(min_value=1, max_value=100))
-def test_with_fixture(db_session, quantity):
-    """Property test with pytest fixture."""
-    order = Order(quantity=quantity)
-    db_session.add(order)
-    db_session.flush()
-
-    assert order.total == order.price * quantity
-
-@pytest.mark.parametrize("discount", [0, 10, 25, 50])
-@given(st.integers(min_value=1))
-def test_parametrized_property(discount, price):
-    """Combine parametrize with hypothesis."""
-    discounted = apply_discount(price, discount)
-    assert discounted <= price
-```
-
-### Custom Strategies
-
-```python
-from hypothesis import strategies as st
-
-# Email strategy
-emails = st.emails()
-
-# Custom composite strategy
-@st.composite
-def user_data(draw):
-    """Generate valid user data."""
-    return {
-        "username": draw(st.text(min_size=3, max_size=20)),
-        "email": draw(st.emails()),
-        "age": draw(st.integers(min_value=18, max_value=120)),
-    }
-
-@given(user_data())
-def test_user_creation(data):
-    user = User(**data)
-    assert user.is_valid()
-```
-
-### Controlling Test Generation
-
-```python
-from hypothesis import given, settings, Verbosity
-
-@given(st.integers())
-@settings(
-    max_examples=500,        # More thorough testing
-    deadline=1000,           # 1 second timeout per example
-    verbosity=Verbosity.verbose,
-)
-def test_thorough(x):
-    assert some_property(x)
-
-@given(st.integers())
-@settings(max_examples=10)  # Quick smoke test
-def test_quick(x):
-    assert basic_property(x)
-```
-
-### Example Database for Reproducibility
-
-```python
-from hypothesis import given, settings, Phase
-
-@given(st.integers())
-@settings(
-    database=None,  # Disable example database
-    phases=[Phase.generate],  # Only generate, don't replay
-)
-def test_stateless(x):
-    pass
-```
-
-### Best Practices
-
-1. **Test properties, not examples**: Focus on invariants that always hold
-2. **Keep tests fast**: Each example should be quick
-3. **Use `@settings(deadline=None)`** for slow operations
-4. **Review failing examples**: Hypothesis shows minimal failing case
-5. **Combine with unit tests**: Property tests find edge cases, unit tests verify specific behavior
-
----
-
-## Test Asset Generation & Management
-
-Dynamic test asset generation ensures tests are self-contained, reproducible, and independent of external files. This is especially critical for ML/ONNX testing where models must be generated programmatically.
-
-### Core Principle: Code-Generated Assets
-
-**CARDINAL RULE**: Never rely on pre-existing files or LLM-generated test data. All test assets must be generated by code during test execution.
-
-```python
-# ❌ BAD: Relying on pre-existing files
-def test_model_optimization():
-    model = onnx.load("tests/fixtures/bert_model.onnx")  # External dependency!
-    optimized = optimize(model)
-    assert optimized is not None
-
-# ✅ GOOD: Generate assets programmatically
-def test_model_optimization(simple_model_fixture):
-    """Model is generated by fixture - no external dependencies."""
-    optimized = optimize(simple_model_fixture)
-    assert optimized is not None
-```
-
-### Fixture-Based Asset Generation
-
-#### Session-Scoped Expensive Assets
-
-For expensive-to-generate assets, use session scope to generate once per test session:
-
-```python
-# conftest.py
-import onnx
-from onnx import helper, TensorProto
-import numpy as np
-
-@pytest.fixture(scope="session")
-def base_model() -> onnx.ModelProto:
-    """Generate a base ONNX model for testing.
-
-    Session-scoped to avoid regenerating for every test.
-    """
-    # Create input
-    X = helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, 128])
-
-    # Create nodes
-    nodes = [
-        helper.make_node("Relu", ["input"], ["relu_out"], name="relu_1"),
-        helper.make_node("Sigmoid", ["relu_out"], ["output"], name="sigmoid_1"),
-    ]
-
-    # Create output
-    Y = helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, 128])
-
-    # Build graph and model
-    graph = helper.make_graph(nodes, "test_graph", [X], [Y])
-    model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
-
-    return model
-```
-
-#### Function-Scoped Mutable Assets
-
-For assets that tests may modify, use function scope:
-
-```python
-@pytest.fixture(scope="function")
-def mutable_model(base_model) -> onnx.ModelProto:
-    """Create a fresh copy for tests that modify the model."""
-    import copy
-    return copy.deepcopy(base_model)
-```
-
-### Pattern-Specific Model Generation
-
-Generate models containing specific patterns for targeted testing:
-
-```python
-# tests/optim/conftest.py
-
-@pytest.fixture(scope="session")
-def gelu_pattern_model() -> onnx.ModelProto:
-    """Generate model with GELU approximation pattern.
-
-    GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x³)))
-    This pattern should be detected and fused by GELU fusion optimizers.
-    """
-    X = helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, 768])
-
-    # Create GELU approximation nodes
-    nodes = [
-        # x³
-        helper.make_node("Pow", ["input", "three"], ["x_cubed"], name="pow_1"),
-        # 0.044715 * x³
-        helper.make_node("Mul", ["x_cubed", "coef"], ["scaled_cube"], name="mul_1"),
-        # x + 0.044715 * x³
-        helper.make_node("Add", ["input", "scaled_cube"], ["sum_1"], name="add_1"),
-        # sqrt(2/π) * (x + 0.044715 * x³)
-        helper.make_node("Mul", ["sum_1", "sqrt_2_pi"], ["tanh_input"], name="mul_2"),
-        # tanh(...)
-        helper.make_node("Tanh", ["tanh_input"], ["tanh_out"], name="tanh_1"),
-        # 1 + tanh(...)
-        helper.make_node("Add", ["one", "tanh_out"], ["one_plus_tanh"], name="add_2"),
-        # 0.5 * x
-        helper.make_node("Mul", ["half", "input"], ["half_x"], name="mul_3"),
-        # 0.5 * x * (1 + tanh(...))
-        helper.make_node("Mul", ["half_x", "one_plus_tanh"], ["output"], name="mul_4"),
-    ]
-
-    # Create initializers for constants
-    initializers = [
-        numpy_helper.from_array(np.array([3.0], dtype=np.float32), "three"),
-        numpy_helper.from_array(np.array([0.044715], dtype=np.float32), "coef"),
-        numpy_helper.from_array(np.array([0.7978845608], dtype=np.float32), "sqrt_2_pi"),
-        numpy_helper.from_array(np.array([1.0], dtype=np.float32), "one"),
-        numpy_helper.from_array(np.array([0.5], dtype=np.float32), "half"),
-    ]
-
-    Y = helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, 768])
-    graph = helper.make_graph(nodes, "gelu_pattern", [X], [Y], initializers)
-
-    return helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
-
-
-@pytest.fixture(scope="session")
-def matmul_add_pattern_model() -> onnx.ModelProto:
-    """Generate model with MatMul+Add pattern for Gemm fusion testing."""
-    X = helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, 512])
-
-    # Weight and bias initializers
-    weight = numpy_helper.from_array(
-        np.random.randn(512, 256).astype(np.float32), "weight"
-    )
-    bias = numpy_helper.from_array(
-        np.random.randn(256).astype(np.float32), "bias"
-    )
-
-    nodes = [
-        helper.make_node("MatMul", ["input", "weight"], ["matmul_out"], name="matmul_1"),
-        helper.make_node("Add", ["matmul_out", "bias"], ["output"], name="add_1"),
-    ]
-
-    Y = helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, 256])
-    graph = helper.make_graph(nodes, "matmul_add_pattern", [X], [Y], [weight, bias])
-
-    return helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
-```
-
-### Multi-Pattern Test Models
-
-For comprehensive testing, generate models with multiple patterns:
-
-```python
-@pytest.fixture(scope="session")
-def all_patterns_model() -> onnx.ModelProto:
-    """Generate model with ALL optimization patterns for comprehensive testing.
-
-    Patterns included (with prefixes for identification):
-    - p01_identity_: Identity elimination pattern
-    - p02_dropout_: Dropout elimination pattern
-    - p03_reshape_: Reshape fusion pattern
-    - p04_transpose_: Transpose optimization pattern
-    - p05_conv_: Conv optimization pattern
-    - p06_matmuladdrelu_: MatMul+Add+Relu fusion pattern
-    - p07_attention_: Attention pattern
-    - p08_biasgelu_: Bias+GELU fusion pattern
-    - p09_skiplayernorm_: SkipLayerNorm pattern
-
-    Node naming convention: {pattern_prefix}{operation}_{index}
-    Example: p06_matmuladdrelu_matmul_1
-    """
-    # Implementation generates all patterns in one model
-    # Each pattern uses consistent naming for verification
-    ...
-```
-
-### Conftest Hierarchy for Asset Sharing
-
-Organize conftest files hierarchically for proper asset sharing:
-
-```
-tests/
-├── conftest.py                    # Root: Core helpers (optimize_at_level, etc.)
-├── optim/
-│   ├── conftest.py               # Optim-wide: Base model fixtures
-│   ├── capabilities/
-│   │   ├── conftest.py           # Capability-specific: Pattern models, ORT names
-│   │   ├── test_gelu_fusion.py
-│   │   └── test_matmul_add.py
-│   ├── pipes/
-│   │   ├── conftest.py           # Pipe-specific: Pipe configs, mock models
-│   │   ├── test_pipe_graph.py
-│   │   └── test_pipe_fusion.py
-│   └── integration/
-│       ├── conftest.py           # Integration: Complex model fixtures
-│       └── test_optimizer.py
-```
-
-#### Root conftest.py - Core Helpers
-
-```python
-# tests/conftest.py
-"""Root conftest - Core testing utilities."""
-
-import onnx
-import onnxruntime as ort
-import tempfile
-from pathlib import Path
-
-def optimize_at_level(
-    model: onnx.ModelProto,
-    level: int = 2,
-    disabled_optimizers: list[str] | None = None,
-) -> onnx.ModelProto:
-    """Apply ORT graph optimization at specified level.
-
-    This is the RAW ORT API helper - does NOT use Pipe classes.
-    Use this in capability tests for isolation testing.
-    """
-    opts = ort.SessionOptions()
-    opts.graph_optimization_level = ort.GraphOptimizationLevel(level)
-
-    if disabled_optimizers:
-        for name in disabled_optimizers:
-            opts.add_session_config_entry(
-                f"session.disable_specified_optimizers",
-                ",".join(disabled_optimizers)
-            )
-
-    with tempfile.TemporaryDirectory() as tmpdir:
-        input_path = Path(tmpdir) / "input.onnx"
-        output_path = Path(tmpdir) / "output.onnx"
-
-        onnx.save(model, str(input_path))
-        opts.optimized_model_filepath = str(output_path)
-
-        # Create session to trigger optimization
-        ort.InferenceSession(str(input_path), opts)
-
-        return onnx.load(str(output_path))
-```
-
-#### Domain conftest.py - Shared Fixtures
-
-```python
-# tests/optim/capabilities/conftest.py
-"""Capability test fixtures - Pattern-specific models."""
-
-import pytest
-from typing import TYPE_CHECKING
-
-if TYPE_CHECKING:
-    import onnx
-
-# Import pattern model generators
-from tests.optim.conftest import (
-    gelu_pattern_model,
-    matmul_add_pattern_model,
-    all_patterns_model,
-)
-
-def get_all_ort_names() -> list[str]:
-    """Get all registered ORT optimizer names for isolation testing."""
-    return [
-        "GeluFusionL2",
-        "BiasGeluFusion",
-        "MatMulAddFusion",
-        "LayerNormFusion",
-        # ... all 49+ ORT optimizer names
-    ]
-
-@pytest.fixture(scope="session")
-def ort_optimizer_names() -> list[str]:
-    """Fixture providing all ORT optimizer names."""
-    return get_all_ort_names()
-```
-
-### Asset Verification Helpers
-
-Create helpers to verify generated assets have expected structure:
-
-```python
-# tests/helpers/model_verification.py
-
-def count_nodes_by_op(model: onnx.ModelProto, op_type: str) -> int:
-    """Count nodes of specific operation type."""
-    return sum(1 for n in model.graph.node if n.op_type == op_type)
-
-def count_nodes_by_prefix(model: onnx.ModelProto, prefix: str) -> int:
-    """Count nodes with name prefix (for pattern identification)."""
-    return sum(1 for n in model.graph.node if n.name.startswith(prefix))
-
-def count_nodes_by_prefix_and_op(
-    model: onnx.ModelProto, prefix: str, op_type: str
-) -> int:
-    """Count nodes matching both prefix and operation type."""
-    return sum(
-        1 for n in model.graph.node
-        if n.name.startswith(prefix) and n.op_type == op_type
-    )
-
-def verify_pattern_exists(
-    model: onnx.ModelProto,
-    pattern_prefix: str,
-    expected_ops: list[str],
-) -> bool:
-    """Verify a pattern exists in the model with expected operations."""
-    for op in expected_ops:
-        if count_nodes_by_prefix_and_op(model, pattern_prefix, op) == 0:
-            return False
-    return True
-```
-
-### Differential Testing with Generated Assets
-
-Test optimization effects by comparing before/after states:
-
-```python
-def test_gelu_fusion_effectiveness(gelu_pattern_model):
-    """Test that GELU fusion actually reduces node count."""
-    from tests.conftest import optimize_at_level
-    from tests.helpers.model_verification import count_nodes_by_op
-
-    # Before optimization
-    before_tanh = count_nodes_by_op(gelu_pattern_model, "Tanh")
-    before_mul = count_nodes_by_op(gelu_pattern_model, "Mul")
-
-    # Apply optimization with GELU fusion enabled
-    optimized = optimize_at_level(
-        gelu_pattern_model,
-        level=2,
-        disabled_optimizers=[]  # All enabled
-    )
-
-    # After optimization - GELU pattern should be fused
-    after_tanh = count_nodes_by_op(optimized, "Tanh")
-    after_mul = count_nodes_by_op(optimized, "Mul")
-
-    # Verify fusion occurred
-    assert after_tanh < before_tanh, "GELU fusion should reduce Tanh nodes"
-    assert after_mul < before_mul, "GELU fusion should reduce Mul nodes"
-```
-
-### Best Practices Summary
-
-1. **Always generate assets in code**: Never rely on external files
-2. **Use appropriate fixture scope**: Session for expensive, function for mutable
-3. **Name patterns consistently**: Use prefixes for pattern identification
-4. **Create verification helpers**: Standardize how you check asset structure
-5. **Document pattern structure**: Explain what each generated model contains
-6. **Test asset generation**: Verify fixtures produce expected structures
-7. **Use conftest hierarchy**: Share assets at appropriate levels
-8. **Prefer RAW APIs in unit tests**: Don't couple to higher-level abstractions
-
----
-
-## Common Patterns & Anti-Patterns
-
-### Patterns ✅
-
-```python
-# Good: Descriptive test names
-def test_user_registration_sends_welcome_email():
-    pass
-
-# Good: Focused tests
-def test_calculate_tax_for_standard_rate():
-    income = 50000
-    assert calculate_tax(income) == 10000
-
-# Good: Using fixtures for setup
-@pytest.fixture
-def authenticated_client(client, user):
-    client.login(username=user.username, password="password")
-    return client
-
-# Good: Parametrize instead of loops
-@pytest.mark.parametrize("value,expected", [
-    (1, 1),
-    (2, 4),
-    (3, 9),
-])
-def test_square(value, expected):
-    assert value ** 2 == expected
-
-# Good: Clear test structure (Arrange-Act-Assert)
-def test_user_creation():
-    # Arrange
-    data = {"username": "john", "email": "john@example.com"}
-
-    # Act
-    user = User.create(**data)
-
-    # Assert
-    assert user.username == "john"
-    assert user.email == "john@example.com"
-```
-
-### Anti-Patterns ❌
-
-```python
-# Bad: Test doing too much
-def test_everything():
-    user = create_user()
-    post = create_post(user)
-    comment = create_comment(post)
-    assert user.is_active
-    assert post.author == user
-    assert comment.post == post
-    # Too many things tested at once
-
-# Bad: Modifying global state
-def test_with_global_state():
-    global CONFIG
-    CONFIG["debug"] = True  # Don't modify globals
-    assert my_function() == expected
-
-# Bad: Tests depending on order
-def test_first():
-    global shared_data
-    shared_data = setup_data()
-
-def test_second():
-    # Depends on test_first running first
-    assert shared_data.value == expected
-
-# Bad: Catching all exceptions
-def test_broad_exception():
-    try:
-        risky_operation()
-    except Exception:  # Too broad
-        pass  # Test passes even if unexpected error
-
-# Bad: No assertion
-def test_without_assertion():
-    result = my_function()
-    # No assert - test always passes
-```
-
----
-
-## Debugging & Troubleshooting
-
-### Debugging Techniques
-
-```python
-# Drop into debugger on failure
-pytest --pdb
-
-# Drop into IPython debugger
-pytest --pdbcls=IPython.terminal.debugger:TerminalPdb
-
-# Set breakpoint in code
-def test_debug():
-    value = calculate()
-    import pdb; pdb.set_trace()  # or breakpoint() in Python 3.7+
-    assert value == expected
-
-# Print debugging (use -s flag)
-def test_with_print():
-    print("Debug info:", value)  # Visible with pytest -s
-    assert value == expected
-
-# Capture logs
-def test_with_logging(caplog):
-    with caplog.at_level(logging.INFO):
-        my_function()
-    assert "Expected message" in caplog.text
-
-# Detailed failure info
-pytest -vv  # Very verbose
-pytest --tb=short  # Short traceback
-pytest --tb=line   # One line per failure
-pytest --tb=no     # No traceback
-```
-
-### Common Issues & Solutions
-
-```python
-# Issue: Import errors
-# Solution: Check PYTHONPATH and use --import-mode
-pytest --import-mode=importlib
-
-# Issue: Fixture not found
-# Solution: Check scope and conftest.py location
-pytest --fixtures  # List available fixtures
-
-# Issue: Tests not discovered
-# Solution: Check naming conventions
-pytest --collect-only  # See what's collected
-
-# Issue: Flaky tests
-# Solution: Use pytest-rerunfailures
-pip install pytest-rerunfailures
-pytest --reruns 3 --reruns-delay 1
-
-# Issue: Test isolation
-# Solution: Use fixtures and avoid global state
-@pytest.fixture(autouse=True)
-def reset_state():
-    cleanup_before_test()
-    yield
-    cleanup_after_test()
-```
-
----
-
----
-
-## Deprecations & Migration Guide
-
-### Deprecated Patterns to Avoid
-
-Understanding deprecated patterns helps maintain forward compatibility.
-
-#### Marker Access (Changed in pytest 4.0+)
-
-```python
-# ❌ DEPRECATED - will be removed
-marker = item.get_marker("slow")
-
-# ✅ CURRENT - use these instead
-marker = item.get_closest_marker("slow")  # Single marker
-markers = list(item.iter_markers("slow"))  # Multiple markers
-```
-
-#### Hook Decorators (Changed in pytest 7.0+)
-
-```python
-# ❌ DEPRECATED
-@pytest.mark.tryfirst
-def pytest_collection_modifyitems(items):
-    pass
-
-# ✅ CURRENT
-@pytest.hookimpl(tryfirst=True)
-def pytest_collection_modifyitems(items):
-    pass
-```
-
-#### pytest_namespace Hook (Removed in pytest 8.0)
-
-```python
-# ❌ REMOVED - no longer works
-def pytest_namespace():
-    return {"my_value": 42}
-
-# ✅ CURRENT - use pytest_configure instead
-def pytest_configure(config):
-    config.my_value = 42
-```
-
-#### Async Fixtures with Sync Tests (Warning in pytest 8.x+)
-
-```python
-# ❌ DEPRECATED - will warn and eventually error
-@pytest.fixture
-async def async_data():
-    return await fetch_data()
-
-def test_sync(async_data):  # Sync test using async fixture
-    assert async_data is not None
-
-# ✅ CURRENT - explicit async handling
-@pytest.fixture
-def async_data():
-    import asyncio
-    return asyncio.run(fetch_data())
-
-def test_sync(async_data):
-    assert async_data is not None
-
-# OR use async test
-@pytest.fixture
-async def async_data():
-    return await fetch_data()
-
-@pytest.mark.asyncio
-async def test_async(async_data):
-    assert async_data is not None
-```
-
-#### yield_fixture Decorator (Removed)
-
-```python
-# ❌ REMOVED
-@pytest.yield_fixture
-def resource():
-    r = acquire()
-    yield r
-    release(r)
-
-# ✅ CURRENT - use regular fixture with yield
-@pytest.fixture
-def resource():
-    r = acquire()
-    yield r
-    release(r)
-```
-
-### Migration Checklist
-
-When upgrading pytest versions, check for:
-
-- [ ] Replace `item.get_marker()` with `item.get_closest_marker()`
-- [ ] Replace `@pytest.mark.tryfirst/trylast` with `@pytest.hookimpl(tryfirst=True/trylast=True)`
-- [ ] Remove any `pytest_namespace` hooks
-- [ ] Update async fixtures to use explicit handling
-- [ ] Replace `@pytest.yield_fixture` with `@pytest.fixture`
-- [ ] Check `--strict-config` passes with your configuration
-- [ ] Review `filterwarnings` for any pytest deprecation warnings
-
-### Version Compatibility Matrix
-
-| Feature | Minimum Version | Notes |
-|---------|-----------------|-------|
-| `pyproject.toml` support | pytest 6.0 | `[tool.pytest.ini_options]` |
-| Native TOML `[tool.pytest]` | pytest 9.0 | Cleaner syntax |
-| `--import-mode=importlib` | pytest 6.0 | Recommended default |
-| `@pytest.hookimpl` | pytest 7.0 | Replaces mark decorators |
-| `item.iter_markers()` | pytest 4.0 | Replaces `get_marker()` |
-| `required_plugins` | pytest 7.0 | With `--strict-config` |
-
-## Best Practices Checklist
-
-### ✅ DO's
-
-1. **Write descriptive test names** that explain what is being tested
-2. **Use fixtures** for setup and teardown
-3. **Keep tests focused** - one concept per test
-4. **Use parametrize** for data-driven tests
-5. **Organize tests** to mirror source code structure
-6. **Register custom markers** in pytest.ini
-7. **Use appropriate scopes** for fixtures
-8. **Mock external dependencies** in unit tests
-9. **Run fastest tests first** in CI/CD
-10. **Use pytest.raises** for exception testing
-11. **Document complex test scenarios**
-12. **Use tmp_path fixture** for file operations
-13. **Configure pytest** in pyproject.toml or pytest.ini
-14. **Use pytest plugins** to extend functionality
-15. **Profile slow tests** and optimize
-16. **Start without `__init__.py`** in test directories - add only when needed
-17. **Use `--import-mode=importlib`** for modern import handling
-18. **Declare `required_plugins`** for team/CI consistency
-19. **Use `--strict-config`** to catch configuration errors early
-20. **Handle async fixtures properly** with `@pytest.mark.asyncio`
-21. **Use file locking** for session fixtures with parallel execution
-
-### ❌ DON'Ts
-
-1. **Don't write tests that depend on execution order**
-2. **Don't use global state** that affects other tests
-3. **Don't catch broad exceptions** without re-raising
-4. **Don't hardcode paths** - use fixtures and tmp_path
-5. **Don't skip writing tests** for "simple" functions
-6. **Don't mix test types** in the same file
-7. **Don't use production credentials** in tests
-8. **Don't ignore flaky tests** - fix or mark them
-9. **Don't write tests without assertions**
-10. **Don't duplicate test logic** - use fixtures
-11. **Don't test implementation details** - test behavior
-12. **Don't use time.sleep** - use proper synchronization
-13. **Don't modify source code** for testing - use mocks
-14. **Don't run all tests locally** for every change
-15. **Don't ignore test warnings** - fix or suppress explicitly
-16. **Don't add `__init__.py` to tests by default** - pytest works without it
-17. **Don't use deprecated marker access** - use `get_closest_marker()` not `get_marker()`
-18. **Don't mix sync tests with async fixtures** - will warn/error in pytest 8+
-19. **Don't ignore configuration file priority** - empty `pytest.ini` blocks `pyproject.toml`
-20. **Don't use `@pytest.yield_fixture`** - use `@pytest.fixture` with yield
-21. **Don't forget `xdist_group`** when tests must share state in parallel execution
-
-### Final Recommendations
-
-1. **Start Simple**: Begin with basic tests and add complexity as needed
-2. **Test First**: Consider TDD for complex logic
-3. **Continuous Integration**: Run tests automatically on every commit
-4. **Code Coverage**: Aim for high coverage but focus on critical paths
-5. **Performance**: Monitor and optimize test suite performance
-6. **Documentation**: Document complex test scenarios and fixtures
-7. **Maintenance**: Regularly update and refactor tests
-8. **Team Standards**: Establish and follow team testing conventions
-
-Remember: Good tests are as important as good code. They provide confidence, documentation, and safety for refactoring.
diff --git a/docs/superpowers/2026-05-26-v3-known-issues.md b/docs/superpowers/2026-05-26-v3-known-issues.md
deleted file mode 100644
index 96a175178..000000000
--- a/docs/superpowers/2026-05-26-v3-known-issues.md
+++ /dev/null
@@ -1,102 +0,0 @@
-# winml-cli docs v3 — Known issues
-
-> **Date:** 2026-05-26
-> **Branch:** `docs/v3` (squashed as `gim-doc` tag)
-> **Status:** Fact-checked findings from a 3-agent critical review pass. Each issue verified against the actual source/files.
-
-Issues identified after the v3 doc set was assembled, fact-checked against `src/winml/modelkit/` and the actual doc files. Five issues are real and pending fix; three were claimed by reviewers but dismissed on second pass.
-
----
-
-## Confirmed issues — pending fix
-
-### 1. Stale link display text across 7 files (10+ occurrences)
-
-Several pages were renamed during the Concepts restructure but their inbound link **display text** still uses the old titles. The link URLs themselves all resolve correctly (strict build passes); the issue is the visible label readers see.
-
-| Stale text | Should be | Locations |
-|---|---|---|
-| `Quantization & QDQ` | `Datatype and Quantization` | `commands/eval.md:95`, `commands/hub.md:112`, `samples/convnext-primitives.md:83`, `samples/convnext-primitives.md:175`, `tutorials/npu-convnext.md:278` |
-| `Quantization concepts` | `Datatype and Quantization` | `commands/quantize.md:115` |
-| `Concepts → Quantization and QDQ` | `Concepts → Datatype and Quantization` | `tutorials/npu-convnext.md:137` |
-| `ONNX & Execution Providers` / `ONNX and execution providers` | `EP and Device` | `commands/compile.md:110`, `commands/eval.md:96`, `commands/inspect.md:104`, `commands/overview.md:69`, `commands/perf.md:102`, `commands/sys.md:114`, `samples/convnext-primitives.md:108`, `samples/convnext-primitives.md:176` |
-| `Load and export concept` | `Load and export` | `commands/export.md:105`, `commands/inspect.md:100`, `commands/perf.md:101` |
-
-**Fix:** sed-sweep all five label patterns to the new titles.
-
-### 2. WinML CLI concept sub-group ordering misaligned with workflow
-
-`mkdocs.yml` lists the WinML CLI Concepts sub-group in this order:
-
-```
-Primitives and pipeline
-Load and export
-Analyze and optimize
-Compile and EPContext
-Perf and monitoring
-Eval and datasets
-Config and build      ← last
-```
-
-But `winml config` is **Step 1** of the End-to-End Tour (`getting-started/end-to-end.md`), so a reader who finishes the Tour and turns to Concepts to go deeper has to walk past 5 other pages before reaching `config-and-build.md`, which documents what they just did.
-
-**Fix:** reorder so `Config and build` follows `Primitives and pipeline`:
-
-```
-Primitives and pipeline
-Config and build
-Load and export
-Analyze and optimize
-Compile and EPContext
-Perf and monitoring
-Eval and datasets
-```
-
-### 3. `graphs-and-ir.md:29` opset 17 / GroupNorm factual error
-
-Current text:
-
-> "Opset 17 introduced layer-normalisation and group-normalisation operators in native form, eliminating the multi-node decompositions required by earlier opsets…"
-
-Per the ONNX changelog, **`LayerNormalization` was added in opset 17** but **`GroupNormalization` was added in opset 18**. The compound claim is wrong.
-
-**Fix:** rewrite to "Opset 17 introduced LayerNormalization in native form; GroupNormalization arrived in opset 18." Or drop the GroupNorm mention entirely.
-
-### 4. ConvNeXt "Pick the right page" admonition missing from `end-to-end.md`
-
-The admonition appears at the top of `samples/convnext-primitives.md:3` and `tutorials/npu-convnext.md:3` but is **absent** from `getting-started/end-to-end.md`. The three pages all use `facebook/convnext-tiny-224` and a reader coming from the End-to-End Tour has no signpost telling them about the other two pages.
-
-**Fix:** add a matching `!!! info "Pick the right ConvNeXt page"` admonition near the top of `getting-started/end-to-end.md`.
-
-### 5. `end-to-end.md:108` capital-B inconsistency
-
-Line 108 reads `[Config and Build](../concepts/config-and-build.md)` (capital B). The nav label and line 88 of the same file use lowercase `Config and build`.
-
-**Fix:** change to lowercase `b` to match.
-
----
-
-## Issues claimed by reviewers but rejected on fact-check
-
-### #2 (rejected) — Quickstart link description
-
-A UX reviewer claimed `quickstart.md:63` says "full pipeline against a Qualcomm NPU". Actual text is "full pipeline from Hugging Face to NPU". The exact phrasing the reviewer quoted is not present. The link wording is mildly NPU-leaning but not the misrepresentation claimed. Optional minor wording tweak; not pursued here.
-
-### #5 (rejected) — `<artifact>.onnx` placeholder ambiguity
-
-A UX reviewer claimed Step 3 leaves the reader guessing the per-device filename. The actual prose at `end-to-end.md:121-125` explicitly lists all three filenames (`convnext_tiny_qnn_ctx.onnx`, `convnext_tiny_dml_ctx.onnx`, `convnext_tiny.onnx`) and tells readers where to find them. Reviewer missed reading the next sentence.
-
-### #7 (rejected) — `weight-and-activation.md` forward-reference to `w8a16`
-
-A UX reviewer claimed the page mentions `w8a16` before defining it. Actual text at line 25 defines it inline: "The compound precision shorthand `w8a16` (8-bit weights, 16-bit activations)". Reviewer wrong.
-
-### #9 (partial → effectively rejected) — `optim` fields not declared on dataclass
-
-A factual reviewer flagged that `WinMLOptimizationConfig` is a free-form dict subclass with no declared fields, so the JSON example field names (`gelu_fusion`, `layer_norm_fusion`, `matmul_add_fusion`) "may not be real". Verified that the fields **are** real keys recognized by the optimizer at `src/winml/modelkit/optim/pipes/graph.py:242-243`. The example is correct. Not a defect.
-
----
-
-## Items intentionally left as-is
-
-- **"WinML CLI" sub-group naming.** The sub-group inside Concepts is named `WinML CLI`, which is recursive (the product is `winml-cli`). Suggested rename to "Workflows" was proposed and explicitly declined earlier. No change.
-- **Singular vs plural style split between Fundamentals and WinML CLI sub-groups.** Fundamentals uses singular pair-topics ("Graph and IR", "Weight and Activation", "EP and Device", "Datatype and Quantization") per the user's preference; WinML CLI still uses plurals ("Primitives and pipeline", "Eval and datasets"). The user has not asked to reconcile.
diff --git a/docs/superpowers/2026-05-27-doc-issues/analyze-and-optimize.md b/docs/superpowers/2026-05-27-doc-issues/analyze-and-optimize.md
deleted file mode 100644
index 030d88c8f..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/analyze-and-optimize.md
+++ /dev/null
@@ -1,43 +0,0 @@
-# Issues: docs/concepts/analyze-and-optimize.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical (factually wrong; user would hit error)
-
-- **`--output results.json` flag for `winml analyze`** (line 9): The doc says "add `--output results.json` to save the report as JSON". The actual flag is `--output` (source: `commands/analyze.py` line 653 `@cli_utils.output_option("Save JSON output to file")`). This is valid and correct.
-
-- **`--preset` flag on `winml optimize`** (line 21): The doc says "Use presets (`--preset transformer-optimized`, `--preset qnn-compatible`) as a starting point." No `--preset` flag exists on `winml optimize`. The command has `--config` (a config file) and capability flags, but no `--preset` option (source: `commands/optimize.py` — the full file was read and contains no `--preset` option). This is a fabricated flag that would cause `Error: No such option: --preset` if a user tries it.
-
-## Important (misleading or stale claim)
-
-- **Exit codes described as 0/1/2** (line 10): Doc says "zero is full support, one is partial support with unsupported operators, two is a configuration error." Source confirms: `commands/analyze.py` line 1212-1213 (`sys.exit(0 if overall_supported else 1)`) and lines 734-736, 1216-1222 use `sys.exit(2)` for errors. This matches the doc.
-
-- **`--save-node unsupported` or `--save-node partial`** (line 11): Doc says "Use `--save-node unsupported` or `--save-node partial`". Source shows `--save-node` with `multiple=True` and choices `["partial", "unsupported"]` (`commands/analyze.py` lines 673-676). The flag exists and the values are valid.
-
-- **`--max-optim-iterations` default described as "three"** (line 26): Doc says "default: three". Source confirms `default: 3` in the help text (`commands/build.py` line 310) and `hack_max_optim_iterations` defaults to `3` in the build pipeline (`commands/build.py` line 1112, 1234). Correct.
-
-- **`--no-analyze` on `winml build`** (line 27): The doc says "`winml build` runs analyze and optimize in an alternating loop" and "Use `--no-analyze` to skip the loop". Source confirms `--no-analyze` on `winml build` (`commands/build.py` lines 294-298) which sets `hack_max_optim_iterations = 0`. Correct.
-
-- **`--commit a specific combination to a `--config` file`** (line 21): Doc says "commit a specific combination to a `--config` file". The `winml optimize` command has `--config` / `-c` (source: `commands/optimize.py` lines 176-180). This is valid.
-
-## Minor (style, polish, low-impact)
-
-- **`--list-capabilities` and `--list-rewrites` flags** (lines 17, 19): Both exist on `winml optimize` → `commands/optimize.py` lines 153, 160. Correct.
-
-- **Pattern-rewrite flag form `--enable-<source-slug>-<target-slug>`** (line 19): Consistent with source → `commands/optimize.py` lines 217-224, which documents `--enable-gelu-singlegelu` as example. Correct.
-
-- **Cross-links** `[compile-and-epcontext.md]`, `[primitives-and-pipeline.md]`, `[../commands/analyze.md]`, `[../commands/optimize.md]` (lines 31-34): All files exist.
-
-## Verified correct (anchored claims you checked)
-
-- `winml analyze` `--ep` flag exists and takes provider name → `commands/analyze.py` lines 628-639
-- `winml analyze` `--device` flag with CPU/GPU/NPU choices → `commands/analyze.py` lines 641-650
-- `winml analyze` `--information` / `--no-information` flag (default: enabled) → `commands/analyze.py` lines 654-657
-- `winml analyze` `--output` flag for JSON → `commands/analyze.py` line 653
-- `winml analyze` exit codes 0/1/2 → `commands/analyze.py` lines 1212-1213, 1216-1222
-- `winml optimize` `--enable-<name>` / `--disable-<name>` flag pattern → `commands/optimize.py` lines 124-131
-- `winml optimize` `--list-capabilities` flag → `commands/optimize.py` lines 153-158
-- `winml optimize` `--list-rewrites` flag → `commands/optimize.py` lines 160-164
-- `winml optimize` `--config` file flag → `commands/optimize.py` lines 176-180
-- Fusions include GeLU, LayerNorm, MatMul+Add → `optim/pipes/graph.py` lines 242-243
-- No `wmk` or `ModelKit` strings in prose → verified by grep
diff --git a/docs/superpowers/2026-05-27-doc-issues/analyze.md b/docs/superpowers/2026-05-27-doc-issues/analyze.md
deleted file mode 100644
index ef5d9cf2e..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/analyze.md
+++ /dev/null
@@ -1,44 +0,0 @@
-# Issues: docs/commands/analyze.md
-
-Source verified against: `src/winml/modelkit/commands/analyze.py` @ 5e25579
-
-## Critical (flag/behavior wrong; user gets error)
-
-- **`--device` default is documented as `NPU`** (doc line 21: "Default: `NPU`") but source line 644 sets `default="auto"` with `show_default=True`. Running `winml analyze --model model.onnx` will use `device="auto"` (infer from local availability), not NPU. A user relying on the doc to know their model will be analyzed against NPU by default will be wrong.
-
-- **`--ep` default is documented as "none — all supported EPs are analyzed"** (doc line 20) but source line 633 sets `default="auto"`. The "auto" mode (source lines 759–768) infers from local availability, not "all supported EPs". Running with no `--ep` is not the same as `--ep all`. The doc's description of the default behavior is wrong.
-
-- **`--run-unknown-op` default is documented as "enabled"** (doc line 26: "flag / enabled") but source line 668 has `default=False`. The pitfall at doc line 84 even says "Disable when the local machine lacks the required libraries" — implying it is on by default — which is incorrect. The correct default is disabled; users must pass `--run-unknown-op` to enable it.
-
-- **`--optim-config` flag is missing from the flag table.** Source lines 677–681 define `@click.option("--optim-config", type=click.Path(path_type=Path), default=None, help="Save auto-discovered optimization config to JSON file")`. This is a functional flag for saving optimization settings and is not documented at all.
-
-- **`--model` has no short form `-m` in the analyze command.** The doc flag table shows no short for `--model` (doc line 19 has empty Short column), which is correct — `model_path_option` in `cli.py` line 68 uses `"--model", "-m"`. Wait — actually it does have `-m`. Let me clarify: the doc table (line 19) shows `| \`--model\` | | \`PATH\` |` with an *empty* Short column, meaning the doc claims there is no short `-m` form. But `model_path_option` (cli.py line 68) uses `click.option("--model", "-m", ...)`, so `-m` is valid. This is a documentation error — users will not know they can use `-m model.onnx`.
-
-- **`--verbose` / `-v` and `--quiet` / `-q` flags are absent from the flag table.** Source uses `@cli_utils.verbosity_options` (line 651) which adds `--verbose / -v` (count) and `--quiet / -q` (flag) — see `cli.py` lines 181–209. Neither appears in the doc.
-
-- **`--config` / `-c` (build config) flag is absent from the flag table.** Source uses `@cli_utils.build_config_option` (line 652) which adds `-c/--config` accepting a `WinMLBuildConfig` JSON file — see `cli.py` lines 212–222. The doc does not mention this.
-
-## Important (misleading or stale)
-
-- **`--ep` choice type** — doc says it accepts full names and short aliases. Source line 634 uses `type=click.Choice([*ALL_EP_NAMES, "all", "auto"], case_sensitive=False)`. The "auto" and "all" values are valid choices but are not mentioned in the doc. The doc's description "When omitted, all supported EPs are analyzed" is wrong (see Critical above); the actual valid special values are "all" and "auto".
-
-- **`--device` choice type** — source line 644 uses `type=click.Choice([*SUPPORTED_DEVICES, "all", "auto"], case_sensitive=False)`. The "all" and "auto" values are not mentioned in the doc.
-
-- **Example "Analyze against all supported EPs"** (doc line 37) runs `winml analyze --model microsoft/resnet-50.onnx` with no `--ep`. Given the actual default is `auto` (not all), the example's described output showing both QNN and OpenVINO may or may not match what runs on a given machine.
-
-## Minor (polish)
-
-- The "Common pitfalls" section says "Omitting `--ep` analyzes every EP" (line 82) — this repeats the incorrect claim from the default description.
-- Exit code documentation (codes 0, 1, 2) matches source lines 1212–1214 and is correct.
-
-## Verified correct (key claims checked)
-
-- `--model` exists (via `model_path_option`) and is required → `cli.py` line 57, `analyze.py` line 627.
-- `--information/--no-information` flag exists with `default=True` → source lines 654–658.
-- `--htp-metadata` flag exists with `type=click.Path(exists=True)`, default `None` → source lines 659–664.
-- `--run-unknown-op/--no-run-unknown-op` flag exists → source lines 665–669.
-- `--save-node` flag exists as `multiple=True, type=Choice(["partial", "unsupported"])` → source lines 670–676.
-- `--output / -o` flag exists → via `cli_utils.output_option`, `cli.py` line 98.
-- Static analysis via `ONNXStaticAnalyzer` → source line 819.
-- Exit codes 0/1/2 → source lines 1212–1218.
-- VitisAI special-cases `--run-unknown-op` to always False → source lines 537–542.
diff --git a/docs/superpowers/2026-05-27-doc-issues/bert-config-build.md b/docs/superpowers/2026-05-27-doc-issues/bert-config-build.md
deleted file mode 100644
index 4d696beef..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/bert-config-build.md
+++ /dev/null
@@ -1,96 +0,0 @@
-# Issues: docs/samples/bert-config-build.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical
-
-- **Final artifact name is wrong.** Step 2 output block says:
-  `Final artifact: bert_out/bert-base-uncased_ctx.onnx`
-  The actual build pipeline in `commands/build.py` (line 714) always writes the
-  final output as `model.onnx` inside the output directory:
-  `final_path = resolved_dir / _name("model.onnx")`
-  For a non-cached build the artifact is `bert_out/model.onnx`, not
-  `bert_out/bert-base-uncased_ctx.onnx`. The `_name()` helper only prepends a
-  cache key when `--use-cache` is active; with `-o bert_out/` it stays `model.onnx`.
-
-- **Step 3 perf command references the wrong artifact.**
-  `winml perf -m bert_out/bert-base-uncased_ctx.onnx` will fail because the file
-  does not exist (see above). Should be `winml perf -m bert_out/model.onnx`.
-
-## Important
-
-- **`build` command flag: doc uses `-o bert_out/` but the flag is `--output-dir`.**
-  In `commands/build.py` line 250-252 the short alias `-o` maps to `--output-dir`.
-  The `-o` short form is defined, so the command works — but the doc never
-  mentions `--output-dir` anywhere (the "Customizing the config" section also
-  uses `-o`), leaving readers who try `--help` unable to find it easily.
-  The step 2 command itself is syntactically valid; this is a doc clarity issue.
-
-- **JSON excerpt uses `"optim"` key.** `config/build.py` line 17 in the config
-  hierarchy comment shows `optim: WinMLOptimizationConfig`. The serialised key
-  from `WinMLBuildConfig.to_dict()` must be verified. Check that `optim` (not
-  `optimize` or `optimization`) is the actual JSON key. Based on the config
-  hierarchy definition in `config/build.py` the field is named `optim`, which
-  aligns with the doc. Verified plausible, but should be confirmed by reading the
-  `to_dict()` / `from_dict()` implementation in `config/build.py`.
-
-- **JSON excerpt `"optim"` section fields: `gelu_fusion`, `layer_norm_fusion`,
-  `matmul_add_fusion`.** These field names must match `WinMLOptimizationConfig`.
-  The optimize command uses a capability registry; the field names in the
-  serialised JSON depend on how `WinMLOptimizationConfig.to_dict()` names them.
-  The doc claims them without source verification — they may differ from the
-  actual serialised keys.
-
-- **JSON excerpt `"compile"` section.** The doc shows:
-  ```json
-  "compile": {
-    "execution_provider": "qnn",
-    "enable_ep_context": true,
-    "compiler": "ort"
-  }
-  ```
-  These map to `WinMLCompileConfig.to_dict()` in `compiler/configs.py` lines 232-247.
-  `execution_provider`, `enable_ep_context`, and `compiler` are all present in
-  `to_dict()`. Verified correct for those three keys.
-
-- **Note mentions `--max-optim-iterations` flag.** In `commands/build.py` line
-  307 the flag is `--max-optim-iterations` (not `--max-optimize-iterations`).
-  The doc spells it `--max-optim-iterations`, which matches. Verified correct.
-
-- **`--no-quant` and `--no-compile` flags on `winml build`.** Both exist in
-  `build.py` (`--no-quant` line 272, `--no-compile/--compile` line 277). Verified.
-
-- **`winml config --precision fp16`.** `config.py` has `-p`/`--precision` with
-  `type=str` accepting `fp16`. Verified valid.
-
-- **`bert-base-uncased` model ID.** The canonical HF ID is
-  `google-bert/bert-base-uncased`; `bert-base-uncased` is a redirect that still
-  works. The doc uses the short alias consistently. Acceptable but not canonical.
-
-## Minor
-
-- **Step 1: `winml config -m bert-base-uncased -t text-classification -o bert_config.json`.**
-  The `-t` flag on `config` is for `--task`. Verified in `config.py` line 78-79.
-  Valid.
-
-- **Note: `quant.weight_type` and `quant.activation_type` editing instructions.**
-  The doc suggests setting these to `"int8"` or `"uint16"`. Valid options per
-  `quantize.py` line 71: `type=click.Choice(["uint8", "int8", "uint16", "int16"])`.
-  Verified correct.
-
-## Verified correct
-
-- `winml config -m bert-base-uncased -t text-classification -o bert_config.json`
-  — all flags valid.
-- `winml build -c bert_config.json -m bert-base-uncased --output-dir bert_out/`
-  (`-o` short form) — command structure valid (see Critical note on artifact name).
-- `winml build ... --no-quant` — flag verified in `build.py`.
-- Top-level JSON keys `loader`, `export`, `optim`, `quant`, `compile` — match
-  `WinMLBuildConfig` field names.
-- `quant.mode`, `quant.weight_type`, `quant.activation_type`, `quant.samples`,
-  `quant.calibration_method`, `quant.task`, `quant.model_name` — all present as
-  fields on `WinMLQuantizationConfig` (verified in `quantize.py` config usage).
-- No `wmk` or `ModelKit` strings in user-facing prose.
-- Cross-links to `convnext-primitives.md`, `../concepts/config-and-build.md`,
-  `../commands/config.md`, `../commands/build.md`, `../commands/perf.md` are
-  consistent with repo structure.
diff --git a/docs/superpowers/2026-05-27-doc-issues/build.md b/docs/superpowers/2026-05-27-doc-issues/build.md
deleted file mode 100644
index 157a69542..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/build.md
+++ /dev/null
@@ -1,37 +0,0 @@
-# Issues: docs/commands/build.md
-
-Source verified against: `src/winml/modelkit/commands/build.py` @ 5e25579
-
-## Critical (flag/behavior wrong; user gets error)
-
-- **`--random-init` flag does not exist.** The flag table lists `--random-init` as "Skip weight download; build with random weights". A full search of `build.py` finds no `--random-init` or `random_init` option definition. The behavior (random-weight build) is supported by omitting `-m` (see `build.py:247`: "Omit for random-weight build"), but there is no `--random-init` flag. Users who pass `--random-init` will get "No such option".
-- **`--config` / `-c` listed as *(required)* but source marks it `required=False`.** `build.py:237` sets `required=False` with `default=None`. When `-c` is omitted, config is auto-generated from `-m`. The doc makes it sound mandatory.
-
-## Important (misleading or stale)
-
-- **`--qnn-sdk-root` should not appear in this page.** The flag does not exist in `build.py` (confirmed: zero hits for `qnn_sdk_root` or `qnn-sdk-root` in the option definitions). It is a `winml compile` flag only. Its appearance in the flag table is a copy-paste error.
-- **`--no-compile` is documented as a simple flag but source defines a `--no-compile/--compile` toggle pair.** `build.py:275-282` shows `--no-compile/--compile` as a boolean toggle with `default=None`. The doc only shows `--no-compile`, omitting `--compile` (which forces compilation on when the config has a compile section). The `--compile` positive form is useful and undocumented.
-- **Flag table omits `--trust-remote-code`.** `build.py:312-314` defines this via `cli_utils.trust_remote_code_option(...)`. Users building custom architecture models (e.g., Mu2) need it.
-- **`--max-optim-iterations` table shows default `3` but source default is `None`.** `build.py:309` sets `default=None`. The actual default of `3` is enforced inside the pipeline helpers (`build.py:1112, 1234`), not at the CLI layer. If the user does not pass the flag, Click resolves it as `None`, not `3`.
-
-## Minor (polish)
-
-- **Flag table omits `--verbose` / `-v`.** Defined at `build.py:315-320`.
-- **"How it works" says pipeline is "export → optimize → quantize → compile" in the intro, but the synopsis shows the full correct form.** The command map table in overview.md correctly shows "export → optimize → quantize → compile". The build.md intro paragraph at line 44 says only "export → quantize → compile" (missing optimize). Minor omission but inconsistent.
-
-## Verified correct (key claims checked)
-
-- `--config` / `-c` path, optional → `build.py:233-241`
-- `--model` / `-m` string default None → `build.py:242-248`
-- `--output-dir` / `-o` path default None → `build.py:249-256`
-- `--use-cache` flag default false → `build.py:257-262`
-- `--rebuild` flag default false → `build.py:263-268`
-- `--no-quant` flag default false → `build.py:269-274`
-- `--no-optimize` flag default false → `build.py:299-304`
-- `--no-analyze` flag default false → `build.py:293-298`
-- `--ep` defined via `cli_utils.ep_option` → `build.py:283-286`
-- `--device` defined via `cli_utils.device_option` default `auto` → `build.py:287-292`
-- Mutual exclusion: `--output-dir` and `--use-cache` → `build.py:376-379`
-- `--use-cache` not supported in module mode → `build.py:491-495`
-- ONNX input skips export stage → `build.py:691-711` (`_build_onnx_pipeline`)
-- No `wmk` or `ModelKit` strings in user-facing prose → confirmed
diff --git a/docs/superpowers/2026-05-27-doc-issues/compile-and-epcontext.md b/docs/superpowers/2026-05-27-doc-issues/compile-and-epcontext.md
deleted file mode 100644
index 9ca0efc38..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/compile-and-epcontext.md
+++ /dev/null
@@ -1,38 +0,0 @@
-# Issues: docs/concepts/compile-and-epcontext.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical (factually wrong; user would hit error)
-
-- **`--no-quant` on `winml compile`** (line 29): The doc says "`winml compile` also accepts `--no-quant` to skip the quantization pass for already-quantized (QDQ) models." There is no `--no-quant` flag on `winml compile`. The `commands/compile.py` file was fully read and contains no `--no-quant` option. This is a flag that exists on `winml build`, not `winml compile`. A user passing `--no-quant` to `winml compile` will get `Error: No such option: --no-quant`.
-
-## Important (misleading or stale claim)
-
-- **`--ep qnn` and `--ep vitisai` described as "QNN-family EPs"** (line 11): The doc lumps these together as both producing "EP context blobs". Source shows `WinMLCompileConfig.for_provider()` treats them distinctly — `vitisai` uses `VitisAIExecutionProvider` and `qnn` uses `QNNExecutionProvider` (`commands/compile.py` lines 214-221, `compiler/configs.py` lines 209-221). Both do produce EPContext, but the doc's grouping as interchangeable is a simplification that may mislead users.
-
-- **External EPContext described as "default"** (lines 17-21): Doc says "By default the blob is written as a sidecar `.bin` file alongside the `.onnx`." Source confirms `embed_context: bool = False` as default in `EPConfig` (`compiler/configs.py` line 46), so external is indeed the default. Correct.
-
-- **`--embed` flag** (line 17): Doc says "Passing `--embed` instead inlines the blob". Source confirms `--embed` is a flag on `winml compile` (`commands/compile.py` lines 96-99), which sets `embed_context=True`. Correct.
-
-- **`--compiler qairt` and `--qnn-sdk-root`** (line 13): Doc says "select `--compiler qairt` and point `--qnn-sdk-root`". Source confirms both flags on `winml compile` (`commands/compile.py` lines 83-93). Correct.
-
-- **`--no-validate` flag** (line 34): The actual flag on `winml compile` is `--validate/--no-validate` (source: `commands/compile.py` lines 72-74). The doc says "The `--no-validate` flag skips that pass." This is accurate — `--no-validate` is the negative form of the `--validate/--no-validate` pair.
-
-## Minor (style, polish, low-impact)
-
-- **Validation described as "default: enabled"** (line 33): Confirmed — `WinMLCompileConfig.validate: bool = True` (`compiler/configs.py` line 86) and `--validate/--no-validate` defaults to `True` (`commands/compile.py` line 74). Correct.
-
-- **Cross-links** `[eps-and-devices.md]`, `[analyze-and-optimize.md]`, `[../commands/compile.md]`, `[../commands/build.md]` (lines 39-43): All target files exist.
-
-## Verified correct (anchored claims you checked)
-
-- `winml compile` `--ep` flag exists → `commands/compile.py` lines 66-69
-- `winml compile` `--device` flag with auto/npu/gpu/cpu choices → `commands/compile.py` lines 58-65
-- `winml compile` `--compiler` flag with choices `["ort", "qairt"]` → `commands/compile.py` lines 83-87
-- `winml compile` `--qnn-sdk-root` flag exists → `commands/compile.py` lines 88-93
-- `winml compile` `--embed` flag exists → `commands/compile.py` lines 96-99
-- `winml compile` `--validate/--no-validate` flag exists, default enabled → `commands/compile.py` lines 72-74
-- `EPConfig.embed_context` defaults to `False` (external sidecar) → `compiler/configs.py` line 46
-- `EPConfig.enable_ep_context` defaults to `True` → `compiler/configs.py` line 45
-- Compiler backend `ort` is the default → `commands/compile.py` line 87 (`default="ort"`)
-- No `wmk` or `ModelKit` strings in prose → verified by grep
diff --git a/docs/superpowers/2026-05-27-doc-issues/compile.md b/docs/superpowers/2026-05-27-doc-issues/compile.md
deleted file mode 100644
index 418997126..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/compile.md
+++ /dev/null
@@ -1,31 +0,0 @@
-# Issues: docs/commands/compile.md
-
-Source verified against: `src/winml/modelkit/commands/compile.py` @ 5e25579
-
-## Critical (flag/behavior wrong; user gets error)
-
-- **`--device` default listed as `npu` but source default is `auto`.** The flag table and "Common pitfalls" both claim "default is `npu`" and "`--device` default is `npu`, not `auto`". Source `compile.py:59-65` defines `default="auto"`. Users relying on the doc who expect NPU targeting without passing `--device` will instead get auto-detection. This is a direct behavioral contradiction.
-
-## Important (misleading or stale)
-
-- **`--no-quant` flag does not exist in compile.py.** The flag table shows `--no-quant` with description "Flag retained for compatibility; quantization is no longer performed during compile." A search of `compile.py` finds zero occurrences of `no-quant`, `no_quant`, or `--no-quant`. The flag is documented but not defined; any user who passes it will get a "No such option" error.
-- **`--validate` / `--no-validate` is a toggle pair, not a simple `--no-validate` flag.** Source `compile.py:72-74` defines `--validate/--no-validate` as a boolean toggle with `default=True`. The table shows only `--no-validate` as an independent flag; this is accurate in effect but hides the positive form `--validate` and implies a different UI contract.
-- **`--output` (file path) is not documented in the flag table.** Source `compile.py:51` registers `cli_utils.output_option(...)`, which adds `--output` / `-o`. The table jumps straight to `--output-dir`. Users cannot discover `-o` for writing to a specific file path.
-
-## Minor (polish)
-
-- **Flag table omits `--verbose` / `-v`.** Defined at `compile.py:76-81`.
-- **"Common pitfalls" says `--no-quant` is a no-op** — this is correct in spirit (quantization is not done at compile time), but the flag does not exist, so the pitfall note is misleading. Replace with a note that the flag was removed and users should not pass it.
-
-## Verified correct (key claims checked)
-
-- `--model` / `-m` optional (required unless `--list`) → `compile.py:44-50`
-- `--output-dir` path default None → `compile.py:53-57`
-- `--device` choice `auto|npu|gpu|cpu` → `compile.py:59-65`
-- `--ep` choice of provider names → `compile.py:66-69` via `cli_utils.ep_option`
-- `--compiler` choice `ort|qairt` default `ort` → `compile.py:82-87`
-- `--qnn-sdk-root` path default None → `compile.py:88-93`
-- `--embed` flag default false → `compile.py:94-99`
-- `--list` flag default false → `compile.py:100-106`
-- `--compiler qairt` requires `--qnn-sdk-root` → `compile.py:206-208` (passes to `ep_config.qnn_sdk_root`; failure occurs in compiler layer)
-- No `wmk` or `ModelKit` strings in user-facing prose → confirmed
diff --git a/docs/superpowers/2026-05-27-doc-issues/config-and-build.md b/docs/superpowers/2026-05-27-doc-issues/config-and-build.md
deleted file mode 100644
index 4db7ccc18..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/config-and-build.md
+++ /dev/null
@@ -1,57 +0,0 @@
-# Issues: docs/concepts/config-and-build.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical (factually wrong; user would hit error)
-
-- **JSON example `compile` section uses wrong field names** (lines 85-90): The doc shows:
-  ```json
-  "compile": {
-    "ep_config": {
-      "provider": "qnn",
-      "enable_ep_context": true
-    }
-  }
-  ```
-  However, `WinMLCompileConfig.to_dict()` does NOT nest under `ep_config`; it serializes flat with keys `execution_provider`, `provider_options`, `enable_ep_context`, `embed_context`, `compiler`, `qnn_sdk_root`, `device` (source: `src/winml/modelkit/compiler/configs.py` lines 230-245). `WinMLCompileConfig.from_dict()` reads `data.get("execution_provider", "qnn")` (line 253), not `ep_config.provider`. A user who copy-pastes this JSON and passes it to `winml build` will get a config with `provider="qnn"` default (silently ignored nested key), making compilation silent failure or wrong EP.
-
-- **JSON example `optim` section uses non-canonical field names** (lines 75-80): The doc shows:
-  ```json
-  "optim": {
-    "gelu_fusion": false,
-    "layer_norm_fusion": false,
-    "matmul_add_fusion": false
-  }
-  ```
-  `WinMLOptimizationConfig` is a `dict` subclass that accepts arbitrary kwargs (source: `src/winml/modelkit/optim/config.py` lines 13-31). The field names `gelu_fusion`, `layer_norm_fusion`, `matmul_add_fusion` correspond to capability python_names, which exist in the optimizer (source: `src/winml/modelkit/optim/pipes/graph.py` lines 242-243). These are valid keys but there are no hard-coded defaults for them — the generated JSON would only include keys that were explicitly set. A freshly generated config from `winml config` would likely have `{}` for `optim` unless capabilities are explicitly configured. The presence of all-`false` values is misleading; a real generated config would omit them.
-
-## Important (misleading or stale claim)
-
-- **`WinMLBuildConfig` described as having five nested sub-configs** (lines 48-56, table): The doc lists `loader`, `export`, `optim`, `quant`, `compile`. The actual dataclass also has `eval: WinMLEvaluationConfig | None` and `auto: bool` (source: `src/winml/modelkit/config/build.py` lines 132-138). The table is incomplete; `eval` section is a valid config key that affects `winml eval` behavior when running from a build config.
-
-- **`winml config` `--no-compile` default behavior** (line 33): Doc says "sets the `compile` section to `null`". In the CLI, `--no-compile` is the default (`default=True` for `no_compile`, source: `commands/config.py` lines 162-165), meaning compilation is always excluded unless `--compile` is passed. The doc does not mention that compile is off by default from `winml config`.
-
-- **`WinMLBuildConfig` defined in `src/winml/modelkit/config/build.py`** (line 47): Correct file path. However the description says "one per pipeline stage" — there are actually 6 stages with the `eval` field, not 5 as stated.
-
-## Minor (style, polish, low-impact)
-
-- **`--output-dir` and `--use-cache` enforcement** (line 111): Doc says "enforced at runtime, not parse time". This is accurate — source `commands/build.py` line 377 shows a `click.UsageError` raised in the command body.
-
-- **Cross-links** `[../commands/config.md]` and `[../commands/build.md]` (lines 161-162): Both files exist in `docs/commands/`.
-
-- **Cross-link** `[primitives-and-pipeline.md]` (line 158): File exists in `docs/concepts/`.
-
-## Verified correct (anchored claims you checked)
-
-- `winml config -m microsoft/resnet-50 -o resnet50.json` syntax is valid → `commands/config.py` lines 66-73 (`-m`/`--model`, `-o`/`--output`)
-- `--task` flag exists on `winml config` → `commands/config.py` lines 77-80
-- `--no-quant` flag exists on `winml config` → `commands/config.py` lines 155-159
-- `--trust-remote-code` flag exists on `winml config` → `commands/config.py` line 166
-- `-o` omission prints to stdout → `commands/config.py` lines 487-490
-- `winml build -c resnet50.json -m microsoft/resnet-50 --output-dir output/` valid → `commands/build.py` lines 233-256
-- `--use-cache` writes to `~/.cache/winml/` → `commands/build.py` lines 258-262
-- `--no-quant`, `--no-compile`, `--no-optimize` CLI overrides exist on `winml build` → `commands/build.py` lines 273, 275-282, 300-304
-- `WinMLBuildConfig.from_dict()` reads `loader`, `export`, `optim`, `quant`, `compile`, `eval` sections → `config/build.py` lines 152-172
-- `WinMLLoaderConfig`, `WinMLExportConfig`, `WinMLOptimizationConfig`, `WinMLQuantizationConfig`, `WinMLCompileConfig` all exist → `config/build.py` lines 54-64
-- JSON `quant` section fields `weight_type`, `activation_type`, `samples` exist → `quant/config.py` lines 55, 65-66
-- No `wmk` or `ModelKit` strings in prose → verified by grep
diff --git a/docs/superpowers/2026-05-27-doc-issues/config.md b/docs/superpowers/2026-05-27-doc-issues/config.md
deleted file mode 100644
index 7112e9396..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/config.md
+++ /dev/null
@@ -1,43 +0,0 @@
-# Issues: docs/commands/config.md
-
-Source verified against: `src/winml/modelkit/commands/config.py` @ 5e25579
-
-## Critical (flag/behavior wrong; user gets error)
-
-- **`--no-compile` default is wrong.** The doc (line 32) states default is `off` (meaning compile *is* included by default). Source line 163 defines `--no-compile/--compile` with `"no_compile"` and `default=True`. The default is `no_compile=True`, meaning compilation is *excluded* from the generated config by default. A user reading the doc will expect compilation to be in the config and be surprised to find `"compile": null` in the output.
-
-- **`--verbose` flag is missing from the flag table.** Source lines 147–152 define `@click.option("-v", "--verbose", is_flag=True, default=False, ...)`. This is a real flag that enables `logging.DEBUG` (line 226) and is not documented in the flag table.
-
-- **`--ep` short form** — the doc flag table (line 27) shows no short form for `--ep`. The source uses `@cli_utils.ep_option(required=False, ...)` (line 126), and `ep_option` in `cli.py` line 140 registers `"--ep", "--execution-provider"` with no `-e` short. The doc correctly shows no short form, but it lists the full name without mentioning `--execution-provider` as an alias. This is a minor completeness issue but not an error.
-
-## Important (misleading or stale)
-
-- **`--no-compile` documentation**: The doc entry says default is `off` and the description reads "Omit compilation from the generated config (sets `compile` to `null`). Use this when you want to inspect the optimized ONNX before EP-specific compilation." Since `no_compile` defaults to `True`, compilation is omitted *by default* — the entire framing of `--no-compile` as an opt-in is backwards. Users do not need to pass `--no-compile` to skip compilation; they need `--compile` to include it.
-
-- **`--device` Choice values** — the doc says type is `auto|npu|gpu|cpu` (line 28). Source line 121 confirms `type=click.Choice(["auto", "npu", "gpu", "cpu"], case_sensitive=False)`. This is accurate.
-
-- **`--config / -c` help text says "JSON override file in `WinMLBuildConfig` format"** (doc line 24). Source line 103 uses `type=click.Path(exists=True)` and the flag is called `config_file`. The doc correctly describes behavior.
-
-- **`--ep` accepts aliases** — doc says values include `qnn`, `dml`, `migraphx`, `tensorrt`, `vitisai`, `openvino`, `cpu`. The actual choices come from `ALL_EP_NAMES` via `ep_option` (cli.py line 138). The list of aliases in the doc should be verified against `SUPPORTED_EPS` / `ALL_EP_NAMES` constants. The doc lists `dml` and `migraphx` which may or may not be in `ALL_EP_NAMES` — this should be confirmed.
-
-## Minor (polish)
-
-- The doc example `winml config -m facebook/convnext-tiny-224.onnx --no-quant --no-compile` (line 80) uses `--no-compile` as if it toggles something off, but since `no_compile=True` by default, `--no-compile` is a no-op here. The example is not wrong (it still works) but implies `--no-compile` is doing work when it is already the default.
-- `--trust-remote-code` is correctly listed in the flag table and matches source (via `@cli_utils.trust_remote_code_option()` at line 166).
-
-## Verified correct (key claims checked)
-
-- `-m / --model` exists with short `-m`, optional (not required), default `None` → source lines 67–74.
-- `-t / --task` exists with short `-t`, default `None` → source lines 75–79.
-- `--model-class` exists, no short form, default `None` → source lines 80–85.
-- `--model-type` exists, no short form, default `None` → source lines 86–94.
-- `--module` exists, no short form, default `None` → source lines 95–99.
-- `-c / --config` exists, type `Path(exists=True)`, default `None` → source lines 100–107.
-- `--shape-config` exists, type `Path(exists=True)`, default `None` → source lines 108–117.
-- `-d / --device` exists with Choice `["auto","npu","gpu","cpu"]`, default `"auto"` → source lines 118–125.
-- `-p / --precision` exists, type `str`, default `"auto"` → source lines 131–138.
-- `-o / --output` exists → via `cli_utils.output_option`, source line 140.
-- `--library` exists, default `"transformers"` → source lines 141–145.
-- `--no-quant` exists as `is_flag=True, default=False` → source lines 153–158.
-- At least one of `-m`, `--model-type`, `--model-class` required → source lines 229–241.
-- ONNX file input path sets `export=None` → source lines 297–311.
diff --git a/docs/superpowers/2026-05-27-doc-issues/convnext-primitives.md b/docs/superpowers/2026-05-27-doc-issues/convnext-primitives.md
deleted file mode 100644
index 87fa70150..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/convnext-primitives.md
+++ /dev/null
@@ -1,82 +0,0 @@
-# Issues: docs/samples/convnext-primitives.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical
-
-- **Compiled artifact filenames are wrong for CPU and GPU (Step 5 + Step 6).**
-  The doc claims `winml compile --device cpu` writes `convnext_int8_cpu_ctx.onnx`
-  and `--device gpu` writes `convnext_int8_dml_ctx.onnx`. Both claims are false.
-  - `WinMLCompileConfig.for_cpu()` sets `enable_ep_context=False`
-    (`compiler/configs.py` line 165). CPUExecutionProvider does not generate an
-    EPContext file, so no `_cpu_ctx.onnx` is written at all.
-  - `WinMLCompileConfig.for_dml()` also sets `enable_ep_context=False`
-    (`compiler/configs.py` line 175). DML does not produce an EPContext either.
-  - Additionally, the session filename convention uses the resolved device string,
-    so if an EPContext were produced it would be `convnext_int8_gpu_ctx.onnx`
-    (device="gpu"), not `convnext_int8_dml_ctx.onnx`.
-  - The paragraph at the end of Step 5 restates the incorrect filenames and must
-    be corrected alongside the tab blocks.
-
-- **`winml perf --device gpu` line uses the non-existent artifact
-  `convnext_int8_dml_ctx.onnx`.** Because DML compile does not produce a ctx file
-  (see above), the benchmark command as written will fail with a file-not-found
-  error. The entire GPU tab in Step 6 is based on a false premise.
-
-## Important
-
-- **`--output` flag on `winml perf` is described as writing a JSON file.**
-  The doc says "Use the JSON output written by `--output`". The actual flag name
-  in `perf.py` is `-o` / `--output`, output defaults to a timestamped path under
-  `~/.cache/winml/perf/`. This description is essentially correct, but the page
-  never shows what the flag looks like in a command, which may confuse readers.
-  Minor wording issue only.
-
-- **Step 7 `winml eval` uses `--dataset imagenet-1k`.** HuggingFace's canonical
-  dataset ID for ImageNet-1k gated access is `imagenet-1k`, which matches. This
-  cannot be independently verified without HF credentials, but the ID is standard
-  and consistent with other pages.
-
-- **Note claims `--device auto` is not valid on `winml eval`.**
-  `eval.py` line 69: `type=click.Choice(["auto", "cpu", "gpu", "npu"])` — `auto`
-  IS listed as a valid choice. The doc's note "Note that `--device` accepts only
-  `cpu`, `gpu`, or `npu` — it does not accept `auto`" is incorrect.
-
-## Minor
-
-- **Cross-link to `../getting-started/end-to-end.md` in the admonition.**
-  Not verifiable without checking that file, but the link pattern is consistent
-  with other pages.
-
-- **Step 2: `winml config -m ... -o convnext_config.json`** — the `-o` flag is
-  correct for `config.py` (`cli_utils.output_option`). Verified correct.
-
-- **Step 3 export output text shows `Starting HTP export...` and
-  `Success! Model exported to: convnext.onnx`** — matches actual console output
-  strings in `export.py` lines 388 and 417. Verified correct.
-
-- **`--method entropy` mentioned in Step 4 note.** `quantize.py` line 65:
-  `type=click.Choice(["minmax", "entropy", "percentile"])`. `entropy` is valid.
-
-## Verified correct
-
-- `winml inspect -m facebook/convnext-tiny-224` — `-m` flag exists, model ID is
-  a real HF repo.
-- `winml config -m facebook/convnext-tiny-224 -o convnext_config.json` — flags
-  all exist in `config.py`.
-- `winml export -m facebook/convnext-tiny-224 -o convnext.onnx` — `-m` and `-o`
-  exist in `export.py`, `-o` is required for export.
-- `winml quantize -m convnext.onnx -o convnext_int8.onnx --precision int8 --samples 32`
-  — all flags verified in `quantize.py`.
-- `winml compile -m convnext_int8.onnx --output-dir . --device npu --qnn-sdk-root`
-  — `--output-dir`, `--device`, `--qnn-sdk-root` all exist in `compile.py`.
-- `winml compile --device npu` requiring `--qnn-sdk-root` or `QNN_SDK_ROOT` —
-  consistent with `compile.py` and source notes.
-- `winml perf` flags `--device`, `--iterations` — verified in `perf.py`.
-- `winml eval` flags `-m`, `--model-id`, `--dataset`, `--split`, `--samples`,
-  `--device` — verified in `eval.py`.
-- NPU artifact `convnext_int8_qnn_ctx.onnx` — consistent with session.py naming
-  (`{stem}_{device}_ctx.onnx` with device="npu"). Verified plausible.
-- "Pick the right ConvNeXt page" admonition links to `../tutorials/npu-convnext.md`
-  — resolves correctly; counterpart admonition in npu-convnext.md links back here.
-- No `wmk` or `ModelKit` strings found in user-facing prose.
diff --git a/docs/superpowers/2026-05-27-doc-issues/end-to-end.md b/docs/superpowers/2026-05-27-doc-issues/end-to-end.md
deleted file mode 100644
index 8a1e8c4af..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/end-to-end.md
+++ /dev/null
@@ -1,15 +0,0 @@
-# Issues: docs/getting-started/end-to-end.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical
-- Artifact filename pattern is wrong for DML and CPU (end-to-end.md:123–125). The doc claims the GPU artifact is named `convnext_tiny_dml_ctx.onnx` and the CPU artifact is `convnext_tiny.onnx`. Source: `compiler/configs.py` `for_dml()` sets `enable_ep_context=False` and `for_cpu()` also sets `enable_ep_context=False`. When `enable_ep_context=False`, `compile.py` `_finalize_output` is never called (the `if ep_config.enable_ep_context:` guard in `CompileStage.process`), meaning no `_ctx.onnx` is produced and `winml build --no-compile` leaves only the quantized ONNX. Neither `convnext_tiny_dml_ctx.onnx` nor a special CPU variant filename is produced; the DML and CPU "compile" steps are no-ops that return `None` from `for_provider`. The correct behavior is that only QNN (and OpenVINO, VitisAI, NvTensorRTRTX) produce `_ctx.onnx` artifacts; DML/CPU compile is skipped entirely.
-- `winml build` `--no-quant` / `--no-compile` flags exist in source (build.py:270, 276), but the doc also mentions `--no-optimize` (end-to-end.md:106) — this flag exists (`build.py:300`), so that claim is correct. However, the doc omits any mention that `--no-compile/--compile` is actually a toggle pair and `--compile` can be used to force enable compilation (build.py:277–280). Minor gap but not a factual error.
-
-## Important
-- `winml build` warning box (end-to-end.md:111–113): states the build reads `QNN_SDK_ROOT` from the environment. This is correct for the `winml build` wrapper, which does NOT expose `--qnn-sdk-root` (build.py has no such option). The doc is consistent with the source. No error here.
-- `--device auto` priority order claimed as "NPU first, then GPU, then CPU" (end-to-end.md:7–8): confirmed correct by `sysinfo/device.py` `_DEVICE_PRIORITY: tuple[str, ...] = ("npu", "gpu", "cpu")`.
-- Tabbed `sys` output EP names (end-to-end.md:54–57): `QNNExecutionProvider -> NPU`, `DmlExecutionProvider -> GPU`, `CPUExecutionProvider -> CPU`. Cross-referencing `EP_SUPPORTED_DEVICES` in `constants.py`: `QNNExecutionProvider` maps to `("npu", "gpu")` not just `"npu"`. The display in `_output_ep_text` shows the first device from `get_ep_device_map()` which joins with `/`, so it would render `QNNExecutionProvider -> NPU/GPU`, not just `NPU`. The sample output in the doc shows only `-> NPU`, which is inaccurate.
-
-## Minor
-- Step 3 perf command uses placeholder `<artifact>.onnx` (end-to-end.md:119). Given the critical artifact naming issue above, the example filenames shown in the tabbed blocks (`convnext_tiny_qnn_ctx.onnx`, `convnext_tiny_dml_ctx.onnx`, `convnext_tiny.onnx`) are not the actual file stems that `winml build` produces for a model named `convnext-tiny-224`. The actual stem would depend on the slug generated from the model ID (not verified here), but the `_dml_ctx` and plain `.onnx` names are definitely wrong per the critical issue above.
diff --git a/docs/superpowers/2026-05-27-doc-issues/eps-and-devices.md b/docs/superpowers/2026-05-27-doc-issues/eps-and-devices.md
deleted file mode 100644
index 495c81cdb..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/eps-and-devices.md
+++ /dev/null
@@ -1,26 +0,0 @@
-# Issues: docs/concepts/eps-and-devices.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical (factually wrong; user would hit error)
-- (none)
-
-## Important (misleading or stale claim)
-- Line 13 (table row for `QNNExecutionProvider`): The table lists QNN's device as `npu` only. However, `src/winml/modelkit/utils/constants.py:184` declares `"QNNExecutionProvider": ("npu", "gpu")` — QNN also supports `gpu` as a secondary device. The table is therefore incomplete and will mislead users who want to run QNN on a GPU target.
-
-- Lines 35-38: The `--device` description says the default is `auto` and it picks "NPU > GPU > CPU". The source at `src/winml/modelkit/commands/build.py:289-290` sets `default="auto"` for `--device` in the build command, and `src/winml/modelkit/commands/analyze.py:645` also defaults to `"auto"`. Priority logic `NPU > GPU > CPU` is consistent with `EP_SUPPORTED_DEVICES` key order in `src/winml/modelkit/utils/constants.py:178-187`. So far accurate. However, `--device` on `winml analyze` accepts `CPU/GPU/NPU/all/auto` (uppercase; `src/winml/modelkit/commands/analyze.py:644-648`), not lowercase as shown in the doc examples on lines 37-40. The CLI itself normalizes case, so commands work, but showing `--device npu` (lowercase) in examples while the `type=click.Choice([*SUPPORTED_DEVICES, ...])` enumerates uppercase `"CPU"`, `"GPU"`, `"NPU"` (`src/winml/modelkit/utils/constants.py:163-167`) could be confusing. Since Click's `case_sensitive=False` is set on the analyze command, the examples aren't wrong, but readers inspecting help output will see uppercase choices.
-
-- Lines 48-53: Example shows `winml analyze --model model.onnx --ep QNNExecutionProvider --device npu`. The `analyze` command uses `--model` (confirmed at `src/winml/modelkit/utils/cli.py:69`), not `--model-path` or another variant. The example is correct in flag name.
-
-## Minor (style, polish, low-impact)
-- Lines 57-63: All cross-links (`graphs-and-ir.md`, `weight-and-activation.md`, `../commands/sys.md`, `../commands/analyze.md`) resolve to files on disk.
-- Line 22: `winml sys --list-ep` — flag `--list-ep` confirmed at `src/winml/modelkit/commands/sys.py:668-671`.
-
-## Verified correct (anchored claims you checked)
-- Lines 11-19 (EP table): `CPUExecutionProvider`, `DmlExecutionProvider`, `MIGraphXExecutionProvider`, `NvTensorRTRTXExecutionProvider`, `OpenVINOExecutionProvider`, `QNNExecutionProvider`, `VitisAIExecutionProvider` — all seven are in `EPName` Literal at `src/winml/modelkit/utils/constants.py:24-33`.
-- Table: `OpenVINOExecutionProvider` listed as supporting `npu / gpu / cpu` — confirmed by `"OpenVINOExecutionProvider": ("npu", "gpu", "cpu")` at `src/winml/modelkit/utils/constants.py:185`.
-- Table: `VitisAIExecutionProvider` listed as `npu` only — confirmed by `"VitisAIExecutionProvider": ("npu",)` at `src/winml/modelkit/utils/constants.py:183`.
-- Table: `DmlExecutionProvider` listed as `gpu` only — confirmed by `"DmlExecutionProvider": ("gpu",)` at `src/winml/modelkit/utils/constants.py:186`.
-- Table: `MIGraphXExecutionProvider` listed as `gpu` only — confirmed by `"MIGraphXExecutionProvider": ("gpu",)` at `src/winml/modelkit/utils/constants.py:182`.
-- Table: `NvTensorRTRTXExecutionProvider` listed as `gpu` only — confirmed by `"NvTensorRTRTXExecutionProvider": ("gpu",)` at `src/winml/modelkit/utils/constants.py:179`.
-- Lines 44-45: `--ep` accepts aliases `qnn`, `vitisai`, `dml`, `openvino` — confirmed in `EP_ALIASES` at `src/winml/modelkit/utils/constants.py:59-69`.
diff --git a/docs/superpowers/2026-05-27-doc-issues/eval-and-datasets.md b/docs/superpowers/2026-05-27-doc-issues/eval-and-datasets.md
deleted file mode 100644
index 84f5031b2..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/eval-and-datasets.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# Issues: docs/concepts/eval-and-datasets.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical
-
-- (none)
-
-## Important
-
-- Lines 1–7: The concept doc lists no `--ep`, `--precision`, `--dataset-script`, or `--trust-remote-code` flags, all of which exist in eval.py (lines with `@cli_utils.ep_option`, `--precision`, `--dataset-script`, `--trust-remote-code`). While a concept page need not enumerate every flag, omitting `--precision` is notable because the page is about post-quantization accuracy checks and `--precision` directly affects which model artifact is built.
-- Line 25 / `--samples` default: The concept doc does not state a default for `--samples`, but line 34 of docs/commands/eval.md lists the default as `100`. Source confirms `default=100` (eval.py). This is consistent, but the concept page example at line 35 uses `--samples 200` without noting the default, which is fine — no defect here on its own.
-
-## Minor
-
-- Line 22: States `--output` "accepts any `.json` path; if omitted, results are printed but not persisted." Source confirms this (no default for `output_path`). Accurate.
-- Line 35: `--streaming` flag description says it "fetches rows on demand instead of materialising the whole dataset locally." Source confirms `is_flag=True, default=False`. Accurate.
-- Line 38: `--column key=value` usage is consistent with source (`multiple=True`, key=value parsing in eval.py). Accurate.
diff --git a/docs/superpowers/2026-05-27-doc-issues/eval.md b/docs/superpowers/2026-05-27-doc-issues/eval.md
deleted file mode 100644
index eff4653ba..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/eval.md
+++ /dev/null
@@ -1,23 +0,0 @@
-# Issues: docs/commands/eval.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical
-
-- Line 24: `--device` type column shows `cpu|gpu|npu` with default `cpu`. Source defines `type=click.Choice(["auto", "cpu", "gpu", "npu"])` with `default="auto"` (eval.py). The `auto` choice is missing and the default is wrong.
-- Line 25: `-n` is listed as a short alias for `--samples`. Source defines `--samples` with no short flag (eval.py `@click.option("--samples", type=int, default=100, ...)`). The `-n` alias does not exist.
-
-## Important
-
-- Flags table is missing the following options that exist in source (eval.py):
-  - `--ep` — execution provider override (`@cli_utils.ep_option`)
-  - `--precision` — precision mode (`--precision`, default `auto`)
-  - `--dataset-script` — path to a dataset-building script
-  - `--trust-remote-code` — required flag when `--dataset-script` is used
-  - `--verbose` / `-v` — verbose output flag
-- Line 36: "How it works" section says `winml eval` loads the model via `WinMLAutoModel`. Source uses `WinMLEvaluationConfig` and calls `evaluate(cfg)` from the `eval` subpackage (eval.py). The class name `WinMLAutoModel` does not appear in eval.py; the description misrepresents the implementation.
-
-## Minor
-
-- Line 19: `--model` description says "Required (unless `--model-id` is provided directly)." Source actually raises `UsageError` if neither `-m` nor `--model-id` resolves a model, and `--model-id` alone (without `-m`) is accepted only to supply a HuggingFace ID. This nuance is slightly misleading but not a breaking inaccuracy.
-- Line 88: Pitfall note "`--streaming` skips the local cache." Source confirms this behaviour. Accurate.
diff --git a/docs/superpowers/2026-05-27-doc-issues/export.md b/docs/superpowers/2026-05-27-doc-issues/export.md
deleted file mode 100644
index 387b4fe5c..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/export.md
+++ /dev/null
@@ -1,32 +0,0 @@
-# Issues: docs/commands/export.md
-
-Source verified against: `src/winml/modelkit/commands/export.py` @ 5e25579
-
-## Critical (flag/behavior wrong; user gets error)
-
-- (none)
-
-## Important (misleading or stale)
-
-- **`--dynamo` description says "PyTorch 2.9+" but that version string is invented.** The source (`export.py:376-384`) only warns the flag is unsupported; no PyTorch version requirement is stated. Remove the version number claim to avoid confusion.
-- **`--torch-module` description says "Experimental — currently logs a warning"** — this is accurate, but the phrase "currently logs a warning" hides the fact that the flag is **completely ignored** (the option value is never forwarded to `export_onnx()`). Source at `export.py:364-373` explicitly states `TODO: Add torch_module support`. Use "has no effect" rather than "currently logs a warning".
-- **`--dynamo` same problem.** Source `export.py:376-384`: "dynamo=True is not supported by export_onnx(). TODO: Add dynamo support". The flag has zero effect; the table note says only "currently logs a warning".
-
-## Minor (polish)
-
-- **Flag table missing `--verbose` / `-v`.** `export.py:73-78` defines `--verbose / -v` as an explicit option with a `help` string. Every other command page includes `--verbose` in their tables; its absence on this page is inconsistent.
-- **`--clean-onnx` / `--no-hierarchy` are presented as two separate flags in the table but they are one option.** The source defines them as aliases of a single `--clean-onnx / --no-hierarchy` option with `"no_hierarchy"` as the internal parameter name (`export.py:85-92`). The table formatting (`--clean-onnx` / `--no-hierarchy` in one cell) is technically correct but the slash notation could mislead readers into thinking these are independent toggles.
-
-## Verified correct (key claims checked)
-
-- `--model` / `-m` required string → `export.py:65-70`
-- `--output` / `-o` required path → `export.py:71` via `cli_utils.output_option(required=True)`
-- `--with-report` is_flag default false → `export.py:79-84`
-- `--input-specs` path default None → `export.py:107-111`
-- `--task` / `-t` string default None → `export.py:113-118`
-- `--export-config` path default None → `export.py:119-124`
-- `--shape-config` path default None → `export.py:125-130`
-- `--shape-config` silently ignored when `--input-specs` is provided → `export.py:307-331` (input-specs overrides/patches auto-resolved tensors; shape_config is loaded only before auto-resolution, so if both are present the shape_config still applies to auto-resolution and input-specs then overrides it — the doc's "Ignored when `--input-specs` is provided" is a slight overstatement but matches the spirit)
-- Eight-step HTP export description → `export.py:153-161` (docstring)
-- `--dynamo` and `--torch-module` emit warnings and have no effect → `export.py:364-384`
-- No `wmk` or `ModelKit` strings in user-facing prose → confirmed
diff --git a/docs/superpowers/2026-05-27-doc-issues/graphs-and-ir.md b/docs/superpowers/2026-05-27-doc-issues/graphs-and-ir.md
deleted file mode 100644
index 44fd51061..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/graphs-and-ir.md
+++ /dev/null
@@ -1,21 +0,0 @@
-# Issues: docs/concepts/graphs-and-ir.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical (factually wrong; user would hit error)
-- (none)
-
-## Important (misleading or stale claim)
-- Line 29: Citation "(`src/winml/modelkit/export/config.py`, line 75)". The file exists and `opset_version: int = 17` is indeed at line 75 (`src/winml/modelkit/export/config.py:75`). However, the doc says this value lives in `WinMLExportConfig` — correct — but the enclosing class declaration begins at line 33. The citation is precise enough to be useful but readers should be aware `line 75` is inside a `@dataclass`. No factual error, but the explanation "This is the value of `opset_version: int = 17` in `WinMLExportConfig` (`src/winml/modelkit/export/config.py`, line 75)" is accurate and verified.
-
-- Line 38: The export CLI example uses `--export-config export_cfg.json`. Verification of `winml export` is needed. The analyze command uses `--model`; the export command is at `src/winml/modelkit/commands/export.py`. The flag `--export-config` is not confirmed verified here, but is not the focus of this page's claims.
-
-## Minor (style, polish, low-impact)
-- Line 15: Claims metadata includes `winml.io.inputs` and `winml.hierarchy.tag`. Both strings are confirmed to exist in the source (`src/winml/modelkit/onnx/metadata.py` and `src/winml/modelkit/core/node_metadata.py`). The attribution "on individual nodes" for `winml.hierarchy.tag` is correct — it is a node-level attribute. The attribution of `winml.io.inputs` to "model level" is consistent with the metadata module. These are accurate.
-
-- Lines 53-60: All cross-links (`eps-and-devices.md`, `weight-and-activation.md`, `quantization.md`, `../commands/inspect.md`, `../commands/export.md`) resolve to files that exist on disk.
-
-## Verified correct (anchored claims you checked)
-- Line 29: `opset_version: int = 17` at `src/winml/modelkit/export/config.py:75` — confirmed exactly.
-- Line 15: `winml.hierarchy.tag` found in `src/winml/modelkit/export/htp/exporter.py` and `src/winml/modelkit/core/node_metadata.py`; `winml.io.inputs` found in `src/winml/modelkit/onnx/metadata.py` and `src/winml/modelkit/onnx/io.py`.
-- Lines 9-15: ONNX `ModelProto` / `GraphProto` structure description (inputs, outputs, nodes, initializers, metadata) matches standard ONNX format and how winml-cli uses it.
diff --git a/docs/superpowers/2026-05-27-doc-issues/how-it-works.md b/docs/superpowers/2026-05-27-doc-issues/how-it-works.md
deleted file mode 100644
index c33b92b98..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/how-it-works.md
+++ /dev/null
@@ -1,22 +0,0 @@
-# Issues: docs/concepts/how-it-works.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical (factually wrong; user would hit error)
-- (none)
-
-## Important (misleading or stale claim)
-- Line 80: Doc says `winml build` auto-detects ONNX vs HF and calls "`build_hf_model` or `build_onnx_model`". This is inaccurate at the CLI layer. The build command (`src/winml/modelkit/commands/build.py`) orchestrates stages directly via `_build_hf_pipeline()` / `_build_onnx_pipeline()` inline functions. The named public API functions `build_hf_model` / `build_onnx_model` (from `src/winml/modelkit/build/hf.py` and `build/onnx.py`) are only called in module-mode (`_build_modules()`), not in the single-model code path. Telling readers "calls `build_hf_model` or `build_onnx_model`" misrepresents the actual dispatch.
-
-- Line 88: Example flag `--no-optimize` is valid (`src/winml/modelkit/commands/build.py:300`), but the comment "Skip optimization (for pre-quantized input)" is misleading. The source docstring says "Skip optimization (for pre-quantized ONNX models)" (`build.py:303`), and the flag is general-purpose (not limited to pre-quantized inputs). The doc's narrower framing could confuse users with other reasons to skip optimization.
-
-## Minor (style, polish, low-impact)
-- Line 12: Claims the pipeline API "powers `WinMLAutoModel.from_pretrained()`". `WinMLAutoModel` exists (`src/winml/modelkit/models/auto.py`) but the connection to the pipeline described here is not verifiable from the source at the cited commit; may be aspirational or referring to an internal API not exposed in this path.
-
-- Lines 116–122: Cross-links — `../commands/build.md`, `../commands/export.md`, `eps-and-devices.md`, and `config-and-build.md` all resolve to files that exist on disk. No broken links.
-
-## Verified correct (anchored claims you checked)
-- Lines 88-91: `--no-quant` and `--no-compile` flags exist in `src/winml/modelkit/commands/build.py:274` and `279-282` respectively. `--no-optimize` exists at line 300.
-- Lines 99-105: `WinMLBuildConfig` structure (loader/export/optim/quant/compile) matches `src/winml/modelkit/config/build.py:97-138`.
-- Lines 109-110: Setting `quant` or `compile` to null skips that stage; confirmed by `src/winml/modelkit/commands/build.py:948-949` (quant) and `src/winml/modelkit/commands/build.py:1038-1039` (compile).
-- Line 113: Config file written after optimize stage; confirmed by `src/winml/modelkit/commands/build.py:1192` (`config_path.write_text(...)`).
diff --git a/docs/superpowers/2026-05-27-doc-issues/hub.md b/docs/superpowers/2026-05-27-doc-issues/hub.md
deleted file mode 100644
index e69d31e3a..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/hub.md
+++ /dev/null
@@ -1,37 +0,0 @@
-# Issues: docs/commands/hub.md
-
-Source verified against: `src/winml/modelkit/commands/catalog.py` @ 5e25579
-
-## Critical (flag/behavior wrong; user gets error)
-
-- **The command documented is `winml hub` but the source registers it as `winml catalog`.** Source line 362 is `@click.command()` with no `name=` argument; the function is named `catalog` (line 387), and the CLI is wired to `winml catalog` per the docstring (lines 6–17). Every invocation example in the doc uses `winml hub` (e.g. `$ winml hub`, `$ winml hub --model-type bert`) — these will all fail unless there is an alias registered elsewhere. The doc must either be renamed to `catalog.md` and updated throughout, or the alias must be verified.
-
-- **`--model` / `-m` flag for detail view does not exist in source.** The doc table lists `--model / -m` as "Show detailed latency and accuracy benchmarks for a specific model ID" (doc line 23). The source `catalog` command (lines 362–429) has no `--model` option. The source accepts only `--model-type / -t`, `--task / -k`, `--ep`, `--device`, and `--output`. There is no per-model detail view in the source at all. Any user running `winml hub --model ProsusAI/finbert` will get an "unrecognized option" error.
-
-- **`--ep` and `--device` flags are absent from the doc flag table entirely.** Source lines 377–385 add `ep_option(required=False)` and `device_option(required=False, default=None)`. The doc only lists four flags and makes no mention of `--ep` or `--device`. These are functional filters that change output — omitting them is a content gap that will confuse users trying to filter by EP or device.
-
-## Important (misleading or stale)
-
-- **"How it works" describes per-EP latency stats (avg, P50, P90, P95, P99, min, max, QPS) and accuracy verdicts (PASS/AT_RISK/REGRESSION)** — the source `catalog.py` makes no reference to these fields. The catalog data source is `hub_models.json` (line 64) and the rendering code (lines 276–306) shows columns: Model, Task, Size, Model Type, and optionally Devices or EPs. No latency stats or accuracy verdict columns appear in the rendered output. The "How it works" section describes functionality that either does not exist in this command or belongs to a different one (e.g., `winml perf`).
-
-- **Accuracy verdict description (`drop_pct`) in "How it works"** is not supported by any code in `catalog.py`. The `See also` section points to `quantization.md` to explain `drop_pct`, but this doc is describing `winml catalog` which has no such output.
-
-- **Example output shows "winml-cli Catalog"** (doc line 50) but source line 301 renders `"WinML CLI Catalog"`. Minor discrepancy.
-
-- **Pitfall says `--model` performs substring matching** (doc line 90–92) — this flag does not exist in source. The entire pitfall is based on a non-existent feature.
-
-- **Pitfall "no flag to dump entire catalog"** (doc line 97–99) says "omit all filters and add `--output`" — the source does support `--output` with no filters (lines 428–429), so this pitfall hint is correct, but the surrounding text refers to `--model` which does not exist.
-
-## Minor (polish)
-
-- The synopsis `$ winml hub [options]` uses the wrong command name; should be `$ winml catalog [options]`.
-- Cross-reference at doc line 108 reads `hub.md` in `sys.md` which will be a broken link if this doc is renamed.
-- The `--task` short flag warning pitfall ("use `-k`, not `-t`") is correct → source line 373 confirms `-k`.
-
-## Verified correct (key claims checked)
-
-- `--model-type / -t` filter exists, case-insensitive → source lines 363–369.
-- `--task / -k` filter exists, case-insensitive → source lines 370–376.
-- `--output / -o` saves JSON → source lines 428–429 via `cli_utils.output_option`.
-- Catalog loaded from local package data (no network) → source lines 53–65.
-- `_filter_models` applies exact case-insensitive equality on `model_type` and `task` → source lines 68–88.
diff --git a/docs/superpowers/2026-05-27-doc-issues/index.md b/docs/superpowers/2026-05-27-doc-issues/index.md
deleted file mode 100644
index 535c889f4..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/index.md
+++ /dev/null
@@ -1,24 +0,0 @@
-# Issues: docs/index.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical
-
-(none)
-
-## Important
-
-- **Anchor `#eps-winml-cli-supports` may not resolve.** The link `concepts/eps-and-devices.md#eps-winml-cli-supports` targets a heading "EPs winml-cli supports" (line 7 of that file). MkDocs lowercases and hyphenates heading text, so "EPs winml-cli supports" becomes `#eps-winml-cli-supports`. The "EPs" acronym normalizes correctly here — the anchor is valid as written, but this depends on MkDocs slug behaviour for acronyms (capitals are lowercased). Treat as worth verifying in the rendered site.
-
-## Minor
-
-- **"12 `winml` subcommands"** — the `docs/commands/` directory contains 12 `.md` files (analyze, build, compile, config, eval, export, hub, inspect, optimize, overview, perf, quantize, sys). `overview.md` is a landing page, not a subcommand. The actual executable subcommands registered in the CLI should be counted and verified; if hub or overview are not registered commands the "12" claim would be wrong.
-
-## Verified correct
-
-- No `wmk` or `ModelKit` strings in user-facing prose.
-- GitHub URL `https://github.com/microsoft/winml-cli` matches `pyproject.toml` URLs.
-- Links to `getting-started/installation.md`, `getting-started/quickstart.md`, `getting-started/end-to-end.md`, `concepts/how-it-works.md`, `commands/overview.md` all resolve to files that exist.
-- Link to `samples/convnext-primitives.md` resolves.
-- MIT licence link points to `https://github.com/microsoft/winml-cli/blob/main/LICENSE.txt`.
-- Tagline and bullets read naturally with no leftover `wmk`/`ModelKit` names.
diff --git a/docs/superpowers/2026-05-27-doc-issues/inspect.md b/docs/superpowers/2026-05-27-doc-issues/inspect.md
deleted file mode 100644
index bcd42e2ea..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/inspect.md
+++ /dev/null
@@ -1,35 +0,0 @@
-# Issues: docs/commands/inspect.md
-
-Source verified against: `src/winml/modelkit/commands/inspect.py` @ 5e25579
-
-## Critical (flag/behavior wrong; user gets error)
-
-- **`--model` is listed as "required" in the flag table** (doc line 22: "Required unless `--help` is used") but the source marks it `required=False` (line 63). The command accepts `--model-type` or `--model-class` as alternatives; source line 165 raises `UsageError` only when all three (`model_id`, `model_type`, `model_class`) are None. Users who read the doc and omit `-m` expecting a usage error will instead succeed with `--model-type`.
-
-- **`--list-tasks` flag is not documented at all.** Source lines 98–103 define `@click.option("--list-tasks", "list_tasks", is_flag=True, ...)`. Omitting it from the flags table means users cannot discover this flag. Running `winml inspect --list-tasks` exits early printing all known tasks (lines 157–161) — a useful shortcut completely hidden from the doc.
-
-- **`--model-type` and `--model-class` flags are not documented.** Source lines 104–116 define `--model-type` (can replace `-m`) and `--model-class` (can replace `-m`). The doc synopsis says `-m <model_id>` is the only input path. Users have no way to discover the type-only or class-only inspection paths shown in the source docstring examples.
-
-## Important (misleading or stale)
-
-- **`-v` / `--verbose` flag is absent from the flag table.** Source lines 78–83 define `@click.option("-v", "--verbose", is_flag=True, ...)`. Verbose mode changes JSON/table output to include full configuration details (passed as `verbose=verbose` to `output_json` and `output_table` at lines 229–231).
-
-- **"How it works" says `--hierarchy` uses `AutoModel.from_config()` and records a "forward-pass trace"** — source lines 449–458 show `extract_hierarchy(model_id)` is called, but this is `from ..inspect.hierarchy import extract_hierarchy` which is a separate module. The source comment at line 451 says "requires model_id" (line 452: `if include_hierarchy and model_id:`), not just a config fetch. The claim that "no real weights are downloaded" should be verified against `extract_hierarchy`.
-
-- **`--format` choices are documented as `table | json`** — source line 74 confirms `click.Choice(["table", "json"])`, so this is correct. However the doc uses backtick-escaped `table` and `json` which is fine.
-
-## Minor (polish)
-
-- The `--help / -h` row in the flag table is auto-added by Click and does not need to be listed explicitly.
-- The synopsis shows `$ winml inspect -m <model_id> [options]` but since `-m` is not required, the synopsis should read `$ winml inspect [options]` or include alternates.
-- The example `winml inspect -m facebook/convnext-tiny-224 -v -H` uses `-v` which is a real and functional flag, but since `-v` is not in the flag table the user has no context for it. Consistent with the missing `--verbose` entry.
-
-## Verified correct (key claims checked)
-
-- `-m` / `--model` short form exists → source line 62.
-- `-f` / `--format` with `Choice(["table", "json"])`, default `"table"` → source lines 70–76.
-- `-t` / `--task` with no required constraint, default `None` → source lines 85–90.
-- `-H` / `--hierarchy` as `is_flag=True, default=False` → source lines 91–97.
-- Command does not accept `--device`, `--ep`, `--precision`, `--output` → confirmed absent.
-- `--format json` output goes to stdout, banners go to stderr → source lines 33–35.
-- `--list-tasks` requires no model and lists `KNOWN_TASKS` → source lines 157–161.
diff --git a/docs/superpowers/2026-05-27-doc-issues/installation.md b/docs/superpowers/2026-05-27-doc-issues/installation.md
deleted file mode 100644
index cb6f0a0bb..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/installation.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Issues: docs/getting-started/installation.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical
-- Python version wrong: doc states `3.10` and claims `requires-python = ">=3.10,<3.11"` (installation.md:3, 11), but `pyproject.toml` at 5e25579 declares `requires-python = ">=3.11,<3.12"`. The install step (`uv python install 3.10`) and the "Verify" expected output (`Python Version 3.10.x`) are also wrong as a result.
-
-## Important
-- "No NPU?" callout claims `winml eval` accepts only `cpu|gpu|npu` (no `auto`) (installation.md:16). This is **incorrect**: `eval.py` defines `--device` as `click.Choice(["auto", "cpu", "gpu", "npu"])` with `default="auto"` — `auto` is a valid value for `winml eval`.
-- `winml sys --list-device --list-ep` flags: both `--list-device` and `--list-ep` exist in `sys.py` (lines with `@click.option("--list-device", ...)` and `@click.option("--list-ep", ...)`), so this is not an error, but the quickstart.md description (quoted here as context) says these flags "skip SDK versions and Python environment details" — that is not the behavior when both are passed; the full sysinfo is **not** run, only the device/EP lists are printed. Not an issue in installation.md itself.
-
-## Minor
-- The `--extra qnn` footnote claims `onnxruntime-qnn` requires Python 3.11+ and is "reserved for future use" (installation.md:70). `pyproject.toml` at 5e25579 already gates the dep on `python_version>='3.11'` and the project itself requires 3.11+, so the "reserved for future use" framing is inaccurate — it is already effective on the required Python version.
diff --git a/docs/superpowers/2026-05-27-doc-issues/load-and-export.md b/docs/superpowers/2026-05-27-doc-issues/load-and-export.md
deleted file mode 100644
index c126a8c82..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/load-and-export.md
+++ /dev/null
@@ -1,43 +0,0 @@
-# Issues: docs/concepts/load-and-export.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical (factually wrong; user would hit error)
-
-- (none)
-
-## Important (misleading or stale claim)
-
-- **`--dynamo` described as "reserved but not yet functional"** (line 19): The doc says "the `--dynamo` flag is reserved for the PyTorch 2.x dynamo exporter but is **not yet functional** in the current release — passing it logs a warning and the flag is ignored." The source confirms this: `commands/export.py` lines 376-384 show that when `dynamo=True`, a warning is printed and the flag is ignored. The note itself is accurate, but the doc still mentions "PyTorch 2.x" while the CLI help text says "PyTorch 2.9+" (`commands/export.py` line 98: `"Enable PyTorch 2.9+ dynamo export for rich node metadata"`). The version reference in the doc is stale/imprecise.
-
-- **`--torch-module` described as "reserved but not yet functional"** (line 35): Similarly, the source confirms (`commands/export.py` lines 362-373) it logs a warning and is ignored. The doc note is accurate. However, the doc says it is "intended to include them as distinct hierarchy nodes" while the CLI help says "Include torch.nn modules in hierarchy (comma-separated)" — consistent.
-
-- **`winml inspect` described as working "without downloading weights"** (line 13): The doc says `winml inspect` "prints the detected task, the HuggingFace model class, the export configuration, and the WinML inference class — all without downloading weights. Add `--hierarchy` to reconstruct the PyTorch module tree from random-weight tracing." The `commands/inspect.py` file was not read, so this specific claim about not downloading weights cannot be confirmed or denied from available sources. This warrants scrutiny.
-
-- **`--shape-config` vs `--input-specs`** (line 33): The doc says "Provide a `--shape-config` JSON file with explicit overrides, or use `--input-specs` to supply a fully specified input manifest." The `winml export` command has both flags: `--shape-config` (line 126 in `commands/export.py`) and `--input-specs` (line 106-111). This is correct. However, the doc describes them as equivalent alternatives — in the source, `--shape-config` passes shape overrides to auto-resolution while `--input-specs` overrides individual tensor specs after auto-resolution. They work differently, not interchangeably.
-
-## Minor (style, polish, low-impact)
-
-- **`winml.hierarchy.tag` metadata key name** (line 21): Doc says nodes carry `winml.hierarchy.tag` and `winml.hierarchy.depth`. Both keys confirmed at `src/winml/modelkit/export/htp/exporter.py` lines 594-595 and `src/winml/modelkit/core/node_metadata.py` lines 71, 74.
-
-- **`winml.io.inputs` and `winml.io.outputs` described as model-level** (line 21): Confirmed at `src/winml/modelkit/export/htp/exporter.py` lines 556, 564.
-
-- **`--no-hierarchy` alias `--clean-onnx`** (line 23): Source confirms both flags exist as aliases: `commands/export.py` lines 87-92 (`--clean-onnx` / `--no-hierarchy`).
-
-- **`--with-report` flag** (line 25): Exists at `commands/export.py` line 80-83.
-
-- **Cross-links** `[graphs-and-ir.md]`, `[../commands/inspect.md]`, `[../commands/export.md]` (lines 39-41): All files exist.
-
-## Verified correct (anchored claims you checked)
-
-- `winml export` uses TorchScript tracing by default → `commands/export.py` line 157 (docstring: "ONNX Export — Convert to ONNX format (TorchScript by default)")
-- `--dynamo` flag exists on `winml export` → `commands/export.py` lines 94-98
-- `--torch-module` flag exists on `winml export` → `commands/export.py` lines 100-105
-- `--task` flag exists on `winml export` → `commands/export.py` lines 112-117
-- `--input-specs` flag exists on `winml export` → `commands/export.py` lines 106-111
-- `--shape-config` flag exists on `winml export` → `commands/export.py` lines 125-130
-- `winml.hierarchy.tag` is a real metadata key → `core/node_metadata.py` line 71
-- `winml.hierarchy.depth` is a real metadata key → `core/node_metadata.py` line 74
-- `winml.io.inputs` / `winml.io.outputs` are model-level metadata props → `export/htp/exporter.py` lines 556, 564
-- `--trust-remote-code` applies to `winml config` (not `winml export` directly) → `commands/config.py` line 166
-- No `wmk` or `ModelKit` strings in prose → verified by grep
diff --git a/docs/superpowers/2026-05-27-doc-issues/npu-convnext.md b/docs/superpowers/2026-05-27-doc-issues/npu-convnext.md
deleted file mode 100644
index 34b02bf57..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/npu-convnext.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# Issues: docs/tutorials/npu-convnext.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical
-- Step 7 CPU compile artifact named `convnext_int8_cpu_ctx.onnx` (npu-convnext.md:164): `compiler/configs.py` `for_cpu()` sets `enable_ep_context=False`, so `CompileStage._finalize_output` is never invoked and no `_cpu_ctx.onnx` file is written. The CPU compile step is silently skipped by `for_provider()` returning `None` when `enable_ep_context=False`. The CPU tab in Step 7 describes a compile command that produces no artifact, and the named output file does not exist.
-- Step 8 CPU perf command references `convnext_int8_cpu_ctx.onnx` (npu-convnext.md:190): this file is never produced (same root cause as above). The CPU benchmark tab would fail to find the input model.
-- Step 9 eval uses `--device npu` (npu-convnext.md:224): `eval.py` declares `--device` as `click.Choice(["auto", "cpu", "gpu", "npu"])` — `npu` is a valid value. However, the tutorial is evaluating `convnext_int8.onnx` (the quantized float ONNX before compilation) on the NPU. This will attempt to run the uncompiled model through QNN EP, which requires JIT compilation at load time and may fail or be extremely slow. This is a usage problem but `npu` is a legal value, so it is not a flag-existence error.
-
-## Important
-- Step 7 OpenVINO compile (npu-convnext.md:155): `winml compile -m convnext_int8.onnx --device npu --ep openvino`. In `compile.py`, `--device` accepts `["auto", "npu", "gpu", "cpu"]` and `--ep` accepts EP aliases. `OpenVINOExecutionProvider` maps to `("npu", "gpu", "cpu")` in `EP_SUPPORTED_DEVICES`, so `--device npu --ep openvino` is a valid combination. No error here.
-- Step 7 claims OpenVINO produces `convnext_int8_openvino_ctx.onnx` (npu-convnext.md:164): `for_openvino()` sets `enable_ep_context=True`, so an EPContext file is produced. The filename pattern `{stem}_{device}_ctx.onnx` is used in `CompileStage._finalize_output` where `device` comes from the resolved device string. With `--device npu`, `device="npu"`, so the file would be `convnext_int8_npu_ctx.onnx`, not `convnext_int8_openvino_ctx.onnx`. The EP name is not used in the filename; the device name is.
-- Section B `winml build` command (npu-convnext.md:239): `uv run winml build -c convnext_config.json -m facebook/convnext-tiny-224 -o convnext_out/`. Source `build.py` uses `-c` (config), `-m` (model), `-o` (output-dir). The flag signatures match. No error.
-- Section B states "The QNN SDK path is read from the `QNN_SDK_ROOT` environment variable, not from the config or CLI flags." (npu-convnext.md:257): correct for `winml build` — `build.py` has no `--qnn-sdk-root` option. But note: `winml compile` *does* expose `--qnn-sdk-root` (compile.py:89–93). The tutorial does not use `winml compile --qnn-sdk-root` so this nuance is not wrong in context, but it may confuse users who read both pages.
-- Prerequisites list Python 3.10 (npu-convnext.md:22): `pyproject.toml` requires `>=3.11,<3.12`. This propagates the same Python version error found in installation.md.
-
-## Minor
-- Section B perf command at the end uses `convnext_out/model.onnx` (npu-convnext.md:262): `winml build` does not write a file named `model.onnx`; it writes the compiled artifact under its EP-derived name (e.g., `convnext_int8_npu_ctx.onnx`). The placeholder path is misleading — users must look up the actual output filename from the build log.
diff --git a/docs/superpowers/2026-05-27-doc-issues/optimize.md b/docs/superpowers/2026-05-27-doc-issues/optimize.md
deleted file mode 100644
index 6dbe2e102..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/optimize.md
+++ /dev/null
@@ -1,48 +0,0 @@
-# Issues: docs/commands/optimize.md
-
-Source verified against: `src/winml/modelkit/commands/optimize.py` @ 5e25579
-
-## Critical (flag/behavior wrong; user gets error)
-
-- **`--preset` flag does not exist in source.** The doc (lines 21, 29–35) documents a `--preset / -p` flag accepting `qnn-compatible|transformer-optimized|full|minimal`. There is no such option anywhere in `optimize.py`. The source `@click.command()` definition (lines 151–187) has `--list-capabilities`, `--list-rewrites`, `--model`, `--output`, `--config`, `--verbose`, and the dynamically-generated capability flags. No `--preset` option is defined. Any user running `winml optimize -m model.onnx --preset qnn-compatible` will get "Error: no such option: --preset". The entire "Built-in presets" table (doc lines 29–35) and every preset-based example in the doc are invalid.
-
-- **`-p` short form is documented for `--preset`** (doc line 21) but in source, no `-p` exists. The `--model` flag does have `-m` and `--output` has `-o`, but there is no `-p` anywhere in the command definition.
-
-- **"Configuration precedence" claims preset is step 3** (doc lines 38–43) with order: CLI flags > config file > preset > capability defaults. The actual source precedence (lines 363–383) is: capability defaults, then config file, then CLI options. There is no preset layer. The precedence documented is for a different version or planned feature.
-
-## Important (misleading or stale)
-
-- **`--verbose / -v` flag is absent from the doc flag table.** Source lines 180–185 define `@click.option("--verbose", "-v", is_flag=True, default=False, ...)`. The doc table lists only `--model`, `--output`, `--preset`, `--config`, `--list-capabilities`, `--list-rewrites`, and dynamic flags — `--verbose` is missing entirely.
-
-- **`--model` short form `-m`** is not shown in the doc's flag table (the Short column is empty for `--model` at doc line 19). Source line 167 defines `"--model", "-m"`. Users will not know `-m` works.
-
-- **"Configuration precedence" in source is 3-level, not 4-level.** Source lines 363–383 implement: (1) capability defaults, (2) config file, (3) CLI options. The doc describes 4 levels including "preset". Without the preset, the doc's precedence section incorrectly numbers and describes the chain.
-
-- **Examples use `--preset`** (doc lines 71–85) — all preset-based examples produce errors with the current source. The only valid examples are:
-  - `winml optimize -m model.onnx` (default caps)
-  - `winml optimize --list-capabilities`
-  - `winml optimize --list-rewrites`
-  - `winml optimize -m model.onnx --enable-<cap>` / `--disable-<cap>`
-  - `winml optimize -m model.onnx -c config.json`
-
-- **`--config` type described as `PATH`** — the doc says "YAML or JSON configuration file" (doc line 23). Source line 175 uses `type=click.Path(exists=True, path_type=Path)` and `load_config()` (lines 48–70) supports `.yaml/.yml` and `.json`. This is correct.
-
-## Minor (polish)
-
-- The doc's dynamic flags section (line 25) correctly describes `--enable-<name>/--disable-<name>` pairs from the capability registry and `--list-capabilities` to discover them. This matches source lines 109–148.
-- The claim that "adding a new optimization to the registry automatically makes it available as a CLI flag" matches source — `capability_options` decorator (lines 109–148) auto-generates flags at import time.
-- `--list-capabilities` with `-l` short form → source lines 153–157 confirm `-l` is the short form. Correctly documented.
-- `--list-rewrites` (no short form) → source lines 159–163 confirm. Correctly documented.
-- Output path default `{input}_opt.onnx` → source lines 352–353 confirm.
-- Before/after node count reduction report → source lines 419–423 confirm.
-
-## Verified correct (key claims checked)
-
-- `--model / -m` exists, `required=False` (only required when not listing) → source lines 165–171.
-- `--output / -o` exists via `cli_utils.output_option` → source line 172.
-- `--config / -c` exists, type `Path(exists=True)` → source lines 173–179.
-- `--list-capabilities / -l` exists as flag → source lines 151–157.
-- `--list-rewrites` exists as flag (no short form) → source lines 159–163.
-- Dynamic `--enable-X/--disable-X` flags from capability registry → source lines 109–148.
-- Missing `--model` when not listing raises `UsageError` → source lines 336–338.
-- Config file supports YAML and JSON → source lines 48–70.
diff --git a/docs/superpowers/2026-05-27-doc-issues/overview.md b/docs/superpowers/2026-05-27-doc-issues/overview.md
deleted file mode 100644
index 38f065449..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/overview.md
+++ /dev/null
@@ -1,17 +0,0 @@
-# Issues: docs/commands/overview.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical
-
-- Line 2: States "12 subcommands". Source has 14 command modules (`analyze`, `build`, `catalog`, `compile`, `config`, `eval`, `export`, `inspect`, `optimize`, `perf`, `quantize`, `run`, `serve`, `sys`). `run` and `serve` are disabled at runtime via `_DISABLED_COMMANDS` (cli.py) but the command count is still wrong at 12 — the actual exposed count is 12 only if the two disabled commands are excluded AND `catalog` is counted as `hub`. The command map (line 29) lists `hub` which does not exist; the actual command is `catalog` (catalog.py, `@click.command()` function named `catalog`). There is no `hub` command in the codebase at this commit.
-- Line 29 (table row): `hub` command listed as "Browse the curated winml-cli catalog of validated models and benchmarks." The command is named `catalog`, not `hub` (catalog.py). `winml hub` would fail at the CLI.
-
-## Important
-
-- Line 55: References `src/winml/modelkit/commands/_options.py` as the "canonical contract" for global flags. This file does not exist at commit 5e25579 (verified via `git ls-tree`). Global flags are defined in `src/winml/modelkit/cli.py` directly.
-- Lines 41–48 ("Choosing a command"): The entry "I want to know if my model is supported → `winml inspect`" is reasonable, but `winml analyze` (Verify EP operator compatibility) is a closer match for pre-deployment compatibility checks. The distinction between `inspect` and `analyze` is not reflected in the choosing-a-command list, making `analyze` effectively undiscoverable from this guide.
-
-## Minor
-
-- Line 63: Shared flags claim "`-p` / `--precision`" is shared. `perf` and `eval` both have `--precision` but `inspect`, `sys`, `hub`/`catalog`, and `analyze` do not. The claim "Defaults and accepted values can differ per command" partially covers this, but listing `-p` as a shared flag implies it exists on most commands, which overstates its reach.
diff --git a/docs/superpowers/2026-05-27-doc-issues/perf-and-monitoring.md b/docs/superpowers/2026-05-27-doc-issues/perf-and-monitoring.md
deleted file mode 100644
index aa5f3fe5c..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/perf-and-monitoring.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# Issues: docs/concepts/perf-and-monitoring.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical
-
-- Line 11: `--device` is described as accepting `cpu`, `gpu`, or `npu` only, but `perf` calls `cli_utils.device_option(include_auto=True, default="auto")` (perf.py:1113), so `auto` is also a valid choice and is the actual default. The sentence "The `--device` flag selects the target EP — `cpu`, `gpu`, or `npu`" omits `auto` and misstates the default.
-- Line 13: Output path default stated as `{model_slug}_perf.json` (implying the current directory). Source writes to `~/.cache/winml/perf/<slug>/<timestamp>.json` (perf.py:871–876). The default location is wrong and the timestamp-per-run filename structure is omitted entirely.
-
-## Important
-
-- Lines 25–31: `--op-tracing` is documented as a user-facing feature with two levels. In source the option is decorated `hidden=True` (perf.py:1183), meaning it is intentionally hidden from `--help`. Documenting a hidden flag as a supported feature is misleading.
-- Lines 17–21: `--monitor` is described as streaming "NPU utilisation". Source tracks whichever device is being benchmarked: NPU, GPU, or CPU (`monitor_device = self._model.device or self.config.device or "auto"`, perf.py:409). Calling it NPU-specific is inaccurate.
-
-## Minor
-
-- Line 19: States the chart "updates in place during the iteration loop". The live chart is managed by `LiveMonitorDisplay` (perf.py:943), but this detail is accurate. No issue.
-- Line 37: `--module` docstring in source says the argument is a "PyTorch module class name (NOT a dotted module path)" (perf.py:1166–1169). The concept doc example `winml perf -m bert-base-uncased --module BertAttention` is correct, but the doc does not warn users that a dotted path will silently not match, which is the primary pitfall documented in the source help text.
diff --git a/docs/superpowers/2026-05-27-doc-issues/perf.md b/docs/superpowers/2026-05-27-doc-issues/perf.md
deleted file mode 100644
index ebd669440..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/perf.md
+++ /dev/null
@@ -1,41 +0,0 @@
-# Issues: docs/commands/perf.md
-
-Source verified against: `src/winml/modelkit/commands/perf.py` @ 5e25579
-
-## Critical (flag/behavior wrong; user gets error)
-
-- (none)
-
-## Important (misleading or stale)
-
-- **`--compare-devices` flag does not exist in source.** The flag table lists `--compare-devices | TEXT | — | Not yet implemented`. A full search of `perf.py` shows zero occurrences of `compare_devices` or `compare-devices` as a defined click option. The flag is documented but never registered; passing it will produce a "No such option" error. The note "Not yet implemented" is insufficient — the flag should either be removed from the table entirely or marked explicitly as "not defined, will error if passed".
-- **`--op-tracing` is hidden in source.** `perf.py:1184`: `hidden=True` — the flag is intentionally hidden from `--help` output. The doc exposes it in the flag table without any note that it does not appear in `--help`. Consider adding "(hidden from --help output; not ready for general use)" to the description.
-- **Default output path documented as "`{model_slug}_perf.json`" is wrong.** Source `perf.py:871` generates the path as `~/.cache/winml/perf/<slug>[/<module_class>]/<timestamp>.json`, not a file in the current working directory. Users expecting a local file will be confused.
-
-## Minor (polish)
-
-- **Flag table omits `--verbose` / `-v`.** Defined at `perf.py:1183-1191`.
-- **Flag table omits `--build-config` / `-c` (the shared build config option).** `perf.py:1192` registers `@cli_utils.build_config_option`.
-- **`--shape-config` description says "Ignored for pre-exported ONNX files and in `--module` mode"** — correct; both branches issue warnings at `perf.py:1280-1284` and `perf.py:1351-1356`. The doc accurately describes this behavior.
-
-## Verified correct (key claims checked)
-
-- `--model` / `-m` required (enforced in body, not `required=True`) → `perf.py:1092`, `perf.py:1243`
-- `--task` string default None → `perf.py:1093-1098`
-- `--iterations` IntRange min=1 default 100 → `perf.py:1099-1105`
-- `--warmup` IntRange min=0 default 10 → `perf.py:1106-1111`
-- `--device` choice `auto|cpu|gpu|npu` default `auto` → `perf.py:1113` via `cli_utils.device_option`
-- `--precision` string default `auto` → `perf.py:1114-1120`
-- `--ep` via `cli_utils.ep_option` → `perf.py:1121-1124`
-- `--batch-size` int default 1 → `perf.py:1129-1135`
-- `--shape-config` path default None → `perf.py:1136-1142`
-- `--no-quantize` flag default false → `perf.py:1143-1148`
-- `--rebuild` flag default false → `perf.py:1149-1153`
-- `--ignore-cache` flag default false → `perf.py:1154-1159`
-- `--module` string default None → `perf.py:1160-1170`
-- `--monitor` flag default false → `perf.py:1171-1176`
-- `--op-tracing` choice `basic|detail` default None → `perf.py:1177-1184`
-- `--compare-devices` marked "not yet implemented" → confirmed not implemented (flag absent from source)
-- Statistics include mean, min, max, P50, P90, P95, P99, std → `perf.py:104-109` (BenchmarkResult fields)
-- `--monitor` includes hw metrics in JSON → `perf.py:127`, `perf.py:167-168`
-- No `wmk` or `ModelKit` strings in user-facing prose → confirmed
diff --git a/docs/superpowers/2026-05-27-doc-issues/primitives-and-pipeline.md b/docs/superpowers/2026-05-27-doc-issues/primitives-and-pipeline.md
deleted file mode 100644
index a9115cbb1..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/primitives-and-pipeline.md
+++ /dev/null
@@ -1,32 +0,0 @@
-# Issues: docs/concepts/primitives-and-pipeline.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical (factually wrong; user would hit error)
-
-- (none)
-
-## Important (misleading or stale claim)
-
-- **`--use-cache` described as alternative to `-o`/`--output-dir`** (line 62-65): Doc says "accepts `--use-cache` in place of `-o`/`--output-dir`". The short flag for output directory in `winml build` is `-o` but the parameter is named `--output-dir`, not `--output`. The doc uses `-o`/`--output-dir` inconsistently: line 62 says "in place of `-o`/`--output-dir`" but elsewhere uses `--output-dir`. Source: `src/winml/modelkit/commands/build.py` lines 249-262 — the option is `--output-dir` / `-o`. This is technically fine but the description shorthand could confuse users.
-
-- **`winml build -c config.json -m microsoft/resnet-50 -o output/`** (line 49): The short flag `-o` maps to `--output-dir` in the build command (source: `commands/build.py` line 250-256). This is valid but worth noting: `-o` is the shorthand for `--output-dir`, not `--output`. The doc uses `-o` which is correct.
-
-- **`WinMLBuildConfig` has six nested sub-configs, not five** (line 51 in config-and-build.md references five — this doc only lists five): The `WinMLBuildConfig` dataclass also has an `eval: WinMLEvaluationConfig | None` field and an `auto: bool` field (source: `src/winml/modelkit/config/build.py` lines 132-138). The doc does not mention these — omission rather than error, but the `eval` field could be relevant to users combining `winml build` and `winml eval`.
-
-## Minor (style, polish, low-impact)
-
-- **Cross-link `[ConvNeXT primitives sample](../samples/convnext-primitives.md)`** (line 104): The file `docs/samples/convnext-primitives.md` exists and the link is valid.
-
-- **`winml build` without `-c`** (lines 49, 62): Doc implies `-c` is required for `winml build`. Source shows `-c` is `required=False` (`commands/build.py` line 236-241) — if omitted, config is auto-generated from `-m`. The doc's initial description of the command is accurate but does not mention the `-c`-less shorthand.
-
-## Verified correct (anchored claims you checked)
-
-- `WinMLBuildConfig` exists as a dataclass → `src/winml/modelkit/config/build.py` line 97
-- `winml build` flags `--no-quant`, `--no-compile`, `--no-optimize` all exist → `commands/build.py` lines 273, 275-282, 300-304
-- `--use-cache` flag exists and is mutually exclusive with `--output-dir` → `commands/build.py` lines 258-262, 376-379
-- `--rebuild` flag exists → `commands/build.py` lines 263-268
-- Setting `quant` or `compile` to `null` skips those stages → `config/build.py` lines 133-136 (both are `| None`)
-- `~/.cache/winml/` as global cache path → `commands/build.py` line 261 (`~/.cache/winml/`)
-- Six primitive commands listed are all real CLI commands → `commands/` directory contains `export.py`, `optimize.py`, `quantize.py`, `compile.py`, `perf.py`, `eval.py`
-- No `wmk` or `ModelKit` strings in prose → verified by grep
diff --git a/docs/superpowers/2026-05-27-doc-issues/quantization.md b/docs/superpowers/2026-05-27-doc-issues/quantization.md
deleted file mode 100644
index 9c2ffd2e7..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/quantization.md
+++ /dev/null
@@ -1,29 +0,0 @@
-# Issues: docs/concepts/quantization.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical (factually wrong; user would hit error)
-- Line 9: "every precision from `_KNOWN_PRECISIONS` in `_options.py`". Neither `_KNOWN_PRECISIONS` nor `_options.py` exist anywhere in the source tree. The actual symbol is `_NAMED_PRECISIONS` (a `frozenset` at `src/winml/modelkit/config/precision.py:71`) and there is no file named `_options.py`. This is a fabricated source citation. A reader trying to cross-reference the table against source code will find nothing.
-
-- Line 9: "the resolved quantization types from `config/precision.py`". The file path should be `src/winml/modelkit/config/precision.py`. The abbreviated form `config/precision.py` is navigable by context, but the companion citation `_options.py` is entirely wrong (see above). The combined sentence creates a misleading impression of where the table data lives.
-
-- Line 18 (table row `int8`): "default for NPU via QNN EP". The actual NPU auto-precision default is `w8a16`, not `int8`. `_AUTO_PRECISION = {"npu": "w8a16", ...}` at `src/winml/modelkit/config/precision.py:32-36`. Using `--precision int8` (or the `int8` named preset) resolves to `uint8/uint8` and is _valid_ for QNN, but it is not the auto-selected default. The annotation "default for NPU via QNN EP" is wrong.
-
-- Line 20 (table row `w4a16`): "Recognized as a precision string but raises an error at quantization time; no 4-bit weight dtype mapping exists in `precision.py` yet." This overstates what the code does. `w4a16` is NOT recognized at all. `is_quantized_precision("w4a16")` returns `False` (because `4` is not in `_BITS_TO_WEIGHT_TYPE`), and `_resolve_quant_types()` in `src/winml/modelkit/commands/quantize.py:260-269` raises `click.BadParameter` for any non-quantized, non-auto precision string — including `w4a16`. The doc's claim that it is "recognized as a precision string" is incorrect; it is rejected before reaching quantization time.
-
-## Important (misleading or stale claim)
-- Line 17 (table row `auto`): "Resolves to `int8` (NPU), `fp16` (GPU/CPU) at runtime". Partially wrong. For NPU the auto-precision resolves to `w8a16` (not `int8`), per `_AUTO_PRECISION["npu"] = "w8a16"` at `src/winml/modelkit/config/precision.py:33`. For GPU and CPU the `fp16` claim is correct (`_AUTO_PRECISION["gpu"] = "fp16"`, `_AUTO_PRECISION["cpu"] = "fp16"`, lines 34-35).
-
-- Line 16 (table row `int16`): Weight dtype listed as `int16`, activation dtype as `uint16`. Source at `src/winml/modelkit/config/precision.py:43-50` shows `_WEIGHT_TYPE["int16"] = "int16"` and `_ACTIVATION_TYPE["int16"] = "uint16"`. The weight type `int16` is correct. However, the resolution goes through `_BITS_TO_WEIGHT_TYPE[16] = "int16"` when using the `w{x}a{y}` form. The named-preset path matches the table. This row is correct.
-
-## Minor (style, polish, low-impact)
-- Lines 63-65: Cross-links (`weight-and-activation.md`, `eps-and-devices.md`, `../commands/quantize.md`, `../commands/eval.md`) all resolve to files that exist on disk.
-- Lines 32-35: `--samples` default `10` and `--method` choices `minmax`, `entropy`, `percentile` — all confirmed at `src/winml/modelkit/commands/quantize.py:57-65`.
-- Line 22: "`--weight-type` and `--activation-type` flags on `winml quantize` accept `uint8`, `int8`, `uint16`, or `int16`" — confirmed at `src/winml/modelkit/commands/quantize.py:67-76`.
-
-## Verified correct (anchored claims you checked)
-- Line 16 (table row `fp16`): No QDQ nodes, float16 throughout — matches `_WEIGHT_TYPE["fp16"] = None` at `src/winml/modelkit/config/precision.py:41`.
-- Line 15 (table row `fp32`): No quantization, baseline — matches `_WEIGHT_TYPE["fp32"] = None` at `src/winml/modelkit/config/precision.py:40`.
-- Line 19 (table row `w8a8`): `uint8/uint8`, equivalent to `int8` — matches `_MIXED_RE` path resolving `w8a8` -> `_BITS_TO_WEIGHT_TYPE[8]="uint8"`, `_BITS_TO_ACTIVATION_TYPE[8]="uint8"` at `src/winml/modelkit/config/precision.py:57-65`.
-- Line 19 (table row `w8a16`): `uint8` weights, `uint16` activations — matches `_BITS_TO_WEIGHT_TYPE[8]="uint8"`, `_BITS_TO_ACTIVATION_TYPE[16]="uint16"` at `src/winml/modelkit/config/precision.py:57-65`.
-- Lines 40-41: `--samples` default `10`, `--method` default `minmax` — confirmed at `src/winml/modelkit/commands/quantize.py:57-65`.
diff --git a/docs/superpowers/2026-05-27-doc-issues/quantize.md b/docs/superpowers/2026-05-27-doc-issues/quantize.md
deleted file mode 100644
index ce74fbea0..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/quantize.md
+++ /dev/null
@@ -1,33 +0,0 @@
-# Issues: docs/commands/quantize.md
-
-Source verified against: `src/winml/modelkit/commands/quantize.py` @ 5e25579
-
-## Critical (flag/behavior wrong; user gets error)
-
-- (none)
-
-## Important (misleading or stale)
-
-- **`--precision` accepted values listed as `int8`, `int16`, or `w8a16` but source also accepts `auto` and the full `w{x}a{y}` family.** The doc's flag table says only "`int8`, `int16`, or mixed-precision like `w8a16`". Source `quantize.py:50-53` documents: "Accepted: auto, int8, int16, or w{x}a{y} where x,y in {8,16} (e.g., w8a8, w8a16, w16a16)." The `auto` value and `w8a8` / `w16a16` forms are silently omitted from the table.
-- **Flag table omits `--task` and `--model-name`.** Both are real options defined in source (`quantize.py:92-109`). `--task` selects a calibration dataset; `--model-name` enables task-aware calibration with the model's preprocessor. Users who need task-aware calibration have no documentation to guide them.
-- **Flag table omits `--verbose` / `-v`.** Defined at `quantize.py:104-109`.
-
-## Minor (polish)
-
-- **Default output path description says "`{input}_qdq.onnx`" but should clarify stem only.** Source uses `model.stem + "_qdq.onnx"` in the same directory as the input (`quantize.py:189`), which matches, but "`{input}`" is ambiguous about whether the full path or just the stem is used.
-- **"Quantizing an already-quantized model is unsupported" pitfall** mentions `winml compile --no-quant` as the alternative. As noted in compile.md, `--no-quant` is a no-op in compile. The pitfall advice is therefore unhelpful and should be updated to reflect actual behavior.
-
-## Verified correct (key claims checked)
-
-- `--model` / `-m` required path → `quantize.py:37-43`
-- `--output` / `-o` optional path, default `{stem}_qdq.onnx` → `quantize.py:44` + `quantize.py:189`
-- `--precision` / `-p` string default None → `quantize.py:45-53`
-- `--samples` integer default 10 → `quantize.py:54-58`
-- `--method` choice `minmax|entropy|percentile` default `minmax` → `quantize.py:59-65`
-- `--weight-type` choice `uint8|int8|uint16|int16` default None → `quantize.py:66-71`
-- `--activation-type` choice `uint8|int8|uint16|int16` default None → `quantize.py:72-77`
-- `--per-channel` flag default false → `quantize.py:78-83`
-- `--symmetric` flag default false → `quantize.py:84-89`
-- Explicit type flags override `--precision` → `quantize.py:271-276`
-- Default types when no precision specified: uint8/uint8 → `quantize.py:263` (precision=None or "auto" → default_w/a = "uint8")
-- No `wmk` or `ModelKit` strings in user-facing prose → confirmed
diff --git a/docs/superpowers/2026-05-27-doc-issues/quickstart.md b/docs/superpowers/2026-05-27-doc-issues/quickstart.md
deleted file mode 100644
index a446a26dd..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/quickstart.md
+++ /dev/null
@@ -1,13 +0,0 @@
-# Issues: docs/getting-started/quickstart.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical
-- (none)
-
-## Important
-- `winml sys --list-device --list-ep` (quickstart.md:14): the doc claims these flags "skip SDK versions and Python environment details that plain `winml sys` would include." This is misleading. In `sys.py`, when `list_device` or `list_ep` is set, the command takes a separate branch that runs *only* the device/EP listing and returns early — it does not run `_gather_system_info()` at all, so "skipping" implies it runs a subset of the normal command, when in fact it is a separate code path. This is a documentation accuracy issue but not a flag-existence issue.
-- `winml inspect -m resnet50.onnx` (quickstart.md:40): `inspect.py` explicitly raises `click.ClickException("ONNX file inspection is not yet supported. Use 'winml config -m model.onnx' for ONNX build config.")` when passed a `.onnx` local file. The command as documented will fail rather than produce the shown output.
-
-## Minor
-- (none)
diff --git a/docs/superpowers/2026-05-27-doc-issues/qwen3-composite.md b/docs/superpowers/2026-05-27-doc-issues/qwen3-composite.md
deleted file mode 100644
index c3db1d35c..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/qwen3-composite.md
+++ /dev/null
@@ -1,43 +0,0 @@
-# Issues: docs/samples/qwen3-composite.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical
-
-- (none)
-
-## Important
-
-- **GitHub URL points to `https://github.com/microsoft/winml-cli` (plain repo root).**
-  The "Track progress" section says "Follow development and check current status
-  at https://github.com/microsoft/winml-cli". There is no issues link, milestone
-  link, or branch link. For a placeholder page whose purpose is to track an
-  in-progress feature, the URL is minimal but not wrong. However, if the
-  feature is tracked on a specific branch or issue, the link should be more
-  precise. Acceptable as-is for a placeholder.
-
-- **Forward-looking sketch references `BuildConfig`** (capitalised as proper
-  noun) without tying it to `WinMLBuildConfig`. Readers coming from the BERT
-  sample know the class name; first-time readers may not. Minor wording issue.
-
-## Minor
-
-- **`!!! info "Coming soon"` admonition** — correctly identifies the page as a
-  placeholder. Format is valid MkDocs Material admonition syntax.
-
-- **Cross-link `../samples/bert-config-build.md`** — from inside `docs/samples/`,
-  a self-referential path resolves to `docs/samples/bert-config-build.md`.
-  The `../samples/` prefix from within `samples/` is redundant (resolves one
-  directory up and back in) but should still resolve correctly in MkDocs.
-  Could be simplified to `bert-config-build.md`.
-
-## Verified correct
-
-- Page correctly identifies itself as a placeholder and defers all content to
-  after the composite-model feature branch lands.
-- No commands, flags, or artifact names are asserted (none to verify wrong).
-- GitHub URL `https://github.com/microsoft/winml-cli` is the correct upstream
-  repository URL.
-- No `wmk` or `ModelKit` strings found in user-facing prose.
-- "What composite models are" section contains only conceptual prose — no
-  verifiable command syntax.
diff --git a/docs/superpowers/2026-05-27-doc-issues/sys.md b/docs/superpowers/2026-05-27-doc-issues/sys.md
deleted file mode 100644
index 68e52ec54..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/sys.md
+++ /dev/null
@@ -1,33 +0,0 @@
-# Issues: docs/commands/sys.md
-
-Source verified against: `src/winml/modelkit/commands/sys.py` @ 5e25579
-
-## Critical (flag/behavior wrong; user gets error)
-
-- **`--verbose` / `-v` flag is missing from the flag table entirely.** Source lines 653–659 define `@click.option("--verbose", "-v", is_flag=True, default=False, ...)`. The doc table lists only `--format`, `--list-device`, `--list-ep`, and `--help` — omitting `--verbose` means users reading the doc have no way to discover a functional, documented flag. The example `winml inspect -m facebook/convnext-tiny-224 -v -H` on `inspect.md` uses the same flag pattern, and `sys.py` line 699 shows `-v`/`--verbose` passed via `docstring`. Using `--verbose` surfaces Backend SDKs and Export Readiness sections (lines 392–433) that are hidden otherwise; presenting the command as having only 3 flags is actively wrong.
-
-## Important (misleading or stale)
-
-- **"How it works" says CUDA details are always probed via PyTorch** — source lines 218–251 show `_get_torch_info(verbose=False)` is the default, which explicitly skips `import torch` and CUDA probing (lines 235–251 are gated on `if not verbose: return info`). CUDA availability (`cuda_available`) only appears in the output when `--verbose` is passed. The doc's "How it works" says "probes PyTorch for CUDA availability and GPU device names" unconditionally, which is misleading — this only happens under `--verbose`.
-
-- **`--format compact` pitfall says it "omits device and EP tables"** (line 106) — but the source (lines 757–774) shows compact *does* support `--list-device` and `--list-ep` and prints device/EP information in a single-line form. The pitfall is only correct for the full default report path (line 812 `elif output_format.lower() == "compact": _output_compact(info)` which skips devices/EPs), but combination of `--format compact --list-device` works and produces output. The pitfall is partially misleading.
-
-- **"Backend SDK detection" described as part of default output** — source lines 392–433 show Backend SDKs and Export Readiness sections are only rendered under `verbose=True` (`if verbose:` guard at line 392). The "How it works" section implies these are always shown.
-
-- **Example output shows "winml-cli System Information"** (line 49) but source line 342 renders `"WinML CLI System Information"`. Minor inconsistency in the example panel title.
-
-## Minor (polish)
-
-- **`--help` short form `-h`** — Click auto-adds `--help` / `-h` for all commands; listing it explicitly in the table is harmless but adds noise.
-- **`sys.md` cross-links to `hub.md`** (line 117), but the actual CLI command is `winml catalog` (source: `catalog.py`), not `winml hub`. If `hub.md` documents a `winml hub` alias, verify it exists in `__init__.py`; otherwise the cross-link is confusing.
-
-## Verified correct (key claims checked)
-
-- `--format` flag exists with short `-f`, type `Choice(["text", "json", "compact"])`, default `"text"` → source lines 645–652.
-- `--list-device` flag exists as `is_flag=True, default=False`, no short form → source lines 653–658.
-- `--list-ep` flag exists as `is_flag=True, default=False`, no short form → source lines 659–664.
-- QNN detection uses `QNN_SDK_ROOT` / `QAIRT_SDK_ROOT` env vars → source lines 261–272.
-- OpenVINO detection via `import openvino` → source lines 283–290.
-- `--format json` emits devices and EPs → source lines 801–812.
-- Device enumeration in NPU > GPU > CPU priority order → source lines 495–500.
-- EP enumeration merges WinML registry with ORT `get_available_providers()` → source lines 592–623.
diff --git a/docs/superpowers/2026-05-27-doc-issues/tutorials-index.md b/docs/superpowers/2026-05-27-doc-issues/tutorials-index.md
deleted file mode 100644
index ce9238027..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/tutorials-index.md
+++ /dev/null
@@ -1,27 +0,0 @@
-# Issues: docs/tutorials/index.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical
-
-- (none)
-
-## Important
-
-- (none)
-
-## Minor
-
-- **Backtick usage inconsistency.** The prose uses `` `winml-cli` `` (with
-  backtick) in one sentence but refers to `winml` (the CLI binary name) without
-  backtick elsewhere. This is cosmetic only.
-
-## Verified correct
-
-- Table entry `[ConvNeXt on NPU](npu-convnext.md)` — file exists at
-  `docs/tutorials/npu-convnext.md`. Link resolves.
-- "Hardware" column entry "Copilot+PC NPU primary; CPU works as fallback" —
-  consistent with npu-convnext.md content.
-- No command invocations to verify.
-- No `wmk` or `ModelKit` strings in user-facing prose.
-- Page correctly describes tutorials vs samples vs concepts distinctions.
diff --git a/docs/superpowers/2026-05-27-doc-issues/weight-and-activation.md b/docs/superpowers/2026-05-27-doc-issues/weight-and-activation.md
deleted file mode 100644
index 0cb7e2c6b..000000000
--- a/docs/superpowers/2026-05-27-doc-issues/weight-and-activation.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# Issues: docs/concepts/weight-and-activation.md
-
-Source verified against: microsoft/winml-cli @ 5e25579
-
-## Critical (factually wrong; user would hit error)
-- (none)
-
-## Important (misleading or stale claim)
-- Line 23: States "QNN on NPU pairs uint8 weights with uint8 or uint16 activations." According to `src/winml/modelkit/config/precision.py`, the NPU auto-precision resolves to `w8a16` (`_AUTO_PRECISION = {"npu": "w8a16", ...}`, line 33), which maps to `uint8` weights + `uint16` activations (lines 57-65). The `int8` preset maps to `uint8/uint8` (lines 39-51). So the claim "uint8 or uint16" is technically accurate for the full range of QNN-targeted precisions, but the default (and most prominently documented) NPU precision is `w8a16` (uint8 weight + uint16 activation), not `uint8/uint8`. The framing may lead readers to underweight the `w8a16` default.
-
-## Minor (style, polish, low-impact)
-- Line 19: "The `--weight-type` and `--activation-type` flags on `winml quantize` exist..." — both flags are confirmed at `src/winml/modelkit/commands/quantize.py:67` and `73`.
-- Lines 28-33: Cross-links (`quantization.md`, `eps-and-devices.md`, `../commands/quantize.md`, `graphs-and-ir.md`) all resolve to files that exist on disk.
-
-## Verified correct (anchored claims you checked)
-- Line 7: "winml quantize ... observes the weight distributions in your exported ONNX and bakes the per-tensor scale/zero-point into the QDQ nodes" — matches `src/winml/modelkit/commands/quantize.py` workflow and `src/winml/modelkit/config/precision.py` precision resolution.
-- Lines 19-24: `--weight-type` accepts `uint8, int8, uint16, int16`; `--activation-type` accepts the same — confirmed at `src/winml/modelkit/commands/quantize.py:67-76`.
-- Line 25: `w8a16` described as "8-bit weights, 16-bit activations" — confirmed; resolves to `uint8` weight + `uint16` activation via `_BITS_TO_WEIGHT_TYPE[8]="uint8"` and `_BITS_TO_ACTIVATION_TYPE[16]="uint16"` at `src/winml/modelkit/config/precision.py:57-65`.
diff --git a/docs/superpowers/2026-05-27-validated-issues.md b/docs/superpowers/2026-05-27-validated-issues.md
deleted file mode 100644
index 4a2bbfcb3..000000000
--- a/docs/superpowers/2026-05-27-validated-issues.md
+++ /dev/null
@@ -1,151 +0,0 @@
-# v3 docs validated issues
-
-Validated against microsoft/winml-cli @ 5e25579 on docs/draft.
-
-## Critical (factually wrong; user would hit error)
-
-### docs/getting-started/installation.md
-- Python version wrong: doc states `3.10` / `requires-python = ">=3.10,<3.11"` but `pyproject.toml:13` reads `requires-python = ">=3.11,<3.12"`. Install step (`uv python install 3.10`) and verify output (`Python Version 3.10.x`) are both wrong.
-
-### docs/getting-started/end-to-end.md
-- DML and CPU artifact filenames are wrong: doc claims GPU produces `convnext_tiny_dml_ctx.onnx` and CPU produces `convnext_tiny.onnx`. `compiler/configs.py:175` (`for_dml`) and `:165` (`for_cpu`) both set `enable_ep_context=False`; `CompileStage.process` only calls `_finalize_output` when `ep_config.enable_ep_context` is True (`compile.py:102`). Neither DML nor CPU produces a `_ctx.onnx` file — the compile step is a no-op for both.
-
-### docs/commands/overview.md
-- `winml hub` command does not exist. Source registers the function as `catalog` (`catalog.py:387`). Every `winml hub` invocation in the doc will fail at the CLI.
-
-### docs/commands/build.md
-- `--random-init` flag does not exist. Source `build.py` has no such option (grep returns no hits). Passing it will produce "No such option".
-- `--config / -c` documented as *(required)* but `build.py:237` sets `required=False`. When omitted, config is auto-generated from `-m`.
-- `--qnn-sdk-root` listed in the flag table but does not exist on `winml build` (zero hits in `build.py`). It is a `winml compile`-only flag. Users will get "No such option".
-
-### docs/commands/compile.md
-- `--device` default documented as `npu` but `compile.py:62` sets `default="auto"`. Users expecting NPU-only targeting without `--device npu` will get auto-detection instead.
-- `--no-quant` flag does not exist in `compile.py` (zero occurrences). Users who pass it get "No such option".
-
-### docs/commands/config.md
-- `--no-compile` default documented as `off` (compile included by default). `config.py:163` shows `default=True` for `no_compile`, meaning compile is *excluded* by default. The framing is entirely backwards — users need `--compile` to include compilation, not `--no-compile` to exclude it.
-
-### docs/commands/eval.md
-- `--device` type column shows `cpu|gpu|npu`, default `cpu`. Source `eval.py:69` defines `click.Choice(["auto", "cpu", "gpu", "npu"])` with `default="auto"`. `auto` is missing and the default is wrong.
-- `-n` listed as a short alias for `--samples`. Source defines `--samples` with no short flag. The `-n` alias does not exist.
-
-### docs/commands/hub.md
-- All `winml hub` invocations fail: source registers the command as `winml catalog` (`catalog.py:387`).
-- `--model / -m` flag does not exist in `catalog.py` (confirmed full read). Users who run `winml hub --model <id>` get "No such option".
-
-### docs/commands/analyze.md
-- `--device` default documented as `NPU`; `analyze.py:644` sets `default="auto"`. Users will not get NPU-specific analysis by default.
-- `--ep` default documented as "all supported EPs analyzed"; `analyze.py:633` sets `default="auto"` (infers from local availability, not "all").
-- `--run-unknown-op` default documented as "enabled"; `analyze.py:668` sets `default=False`. The pitfall note that says "disable when libraries are missing" compounds this by implying it is on.
-
-### docs/commands/optimize.md
-- `--preset / -p` flag does not exist. `optimize.py` command definition (lines 151–187) has no `--preset` option. The entire "Built-in presets" table and all preset-based examples are invalid; users get "Error: no such option: --preset".
-
-### docs/commands/inspect.md
-- `--list-tasks`, `--model-type`, and `--model-class` flags are not documented. All three are defined in source (`inspect.py:98–116`) and functional.
-
-### docs/concepts/quantization.md
-- `int8` row annotated "default for NPU via QNN EP". Actual NPU auto-precision default is `w8a16` (`precision.py:33`: `_AUTO_PRECISION = {"npu": "w8a16", ...}`). `int8` is valid for QNN but is not the default.
-- `auto` row: "Resolves to `int8` (NPU)…" is wrong for NPU; resolves to `w8a16` per `_AUTO_PRECISION["npu"]` (`precision.py:33`).
-- `w4a16` row: "Recognized as a precision string but raises error at quantization time" is wrong. `is_quantized_precision("w4a16")` returns `False` (4 not in `_BITS_TO_WEIGHT_TYPE`, `precision.py:57`) so it is rejected before quantization, not recognized at all.
-
-### docs/concepts/compile-and-epcontext.md
-- `--no-quant` on `winml compile` (line 29): flag does not exist in `compile.py`. Users get "Error: No such option".
-
-### docs/concepts/config-and-build.md
-- JSON `compile` section uses nested `ep_config.provider` structure. `WinMLCompileConfig.to_dict()` (`configs.py:230–245`) serializes flat with `execution_provider`, not nested under `ep_config`. Copy-pasting the example silently uses defaults instead of the specified values.
-
-### docs/samples/bert-config-build.md
-- Final artifact documented as `bert_out/bert-base-uncased_ctx.onnx`. `build.py:714` writes `final_path = resolved_dir / "model.onnx"` for non-cached builds. The file is `bert_out/model.onnx`; the ctx-name variant does not exist. Step 3 `winml perf` reference to `bert-base-uncased_ctx.onnx` will also fail.
-
-### docs/samples/convnext-primitives.md
-- CPU (`--device cpu`) and GPU (`--device gpu`) compile steps documented as producing `_cpu_ctx.onnx` and `_dml_ctx.onnx`. Both are wrong: `for_cpu()` and `for_dml()` set `enable_ep_context=False` (`configs.py:165,175`); no `_ctx.onnx` is written. The GPU perf tab referencing `convnext_int8_dml_ctx.onnx` will fail.
-- Note claims "`--device` does not accept `auto` on `winml eval`". `eval.py:69` lists `auto` as a valid choice with `default="auto"`.
-
-### docs/tutorials/npu-convnext.md
-- CPU artifact named `convnext_int8_cpu_ctx.onnx` (steps 7–8): `for_cpu()` sets `enable_ep_context=False` (`configs.py:165`); no such file is produced. The Step 8 perf command referencing it will fail.
-- Python 3.10 listed in prerequisites; `pyproject.toml:13` requires `>=3.11,<3.12`.
-
-### docs/concepts/perf-and-monitoring.md
-- `--device` described as accepting only `cpu`, `gpu`, `npu`. `perf.py:1113` uses `device_option(include_auto=True, default="auto")`; `auto` is valid and is the default.
-- Default output path stated as `{model_slug}_perf.json` (current directory). Source `perf.py:871` writes to `~/.cache/winml/perf/<slug>/<timestamp>.json`.
-
----
-
-## Important (misleading or stale)
-
-### docs/getting-started/installation.md
-- "No NPU?" callout claims `winml eval` accepts only `cpu|gpu|npu` (no `auto`). `eval.py:69` defines `click.Choice(["auto", "cpu", "gpu", "npu"])` — `auto` is valid.
-
-### docs/getting-started/end-to-end.md
-- Sample `sys` output shows `QNNExecutionProvider -> NPU`. `get_ep_device_map()` returns `"npu/gpu"` for QNN (`device.py:49`, `constants.py:183`); actual rendered output would be `QNNExecutionProvider -> NPU/GPU`.
-
-### docs/commands/build.md
-- `--no-compile/--compile` documented as a simple `--no-compile` flag; source `build.py:275–282` is a boolean toggle pair; `--compile` (force enable) is undocumented.
-- `--trust-remote-code` absent from flag table; `build.py:312–314` defines it.
-- `--max-optim-iterations` table default shown as `3`; `build.py:309` sets `default=None` (3 is enforced inside pipeline helpers, not at Click layer).
-
-### docs/commands/config.md
-- `--no-compile` framing is backwards (see Critical). The entire usage example `winml config ... --no-compile` implies the flag does work when it is a no-op (already the default).
-
-### docs/commands/hub.md
-- "How it works" describes per-EP latency stats and accuracy verdicts (PASS/AT_RISK/REGRESSION) that do not appear anywhere in `catalog.py`. The rendered catalog shows only Model, Task, Size, Model Type columns.
-- `--ep` and `--device` filter flags (`catalog.py:377–385`) absent from the flag table entirely.
-
-### docs/commands/analyze.md
-- `--ep` valid special values `"all"` and `"auto"` not mentioned; `analyze.py:634` includes both in `Choice`. Related: "Omitting `--ep` analyzes every EP" (pitfall line 82) repeats the incorrect default claim.
-- `--model` short form `-m` shown with empty Short column; `cli.py:68` defines `"--model", "-m"`.
-- `--verbose/-v`, `--quiet/-q`, and `--config/-c` absent from flag table; all defined via decorators (`analyze.py:651–652`).
-
-### docs/commands/optimize.md
-- `--verbose/-v` absent from flag table; `optimize.py:180–185` defines it.
-- `--model` Short column is empty; `optimize.py:167` defines `-m`.
-- "Configuration precedence" describes 4 levels (including "preset"); source has 3 levels (`optimize.py:363–383`). The preset level does not exist.
-
-### docs/commands/inspect.md
-- `-v`/`--verbose` absent from flag table; `inspect.py:78–83` defines it.
-
-### docs/commands/perf.md
-- `--compare-devices` listed as "Not yet implemented" but the flag is not registered at all in `perf.py`. Passing it will error, not silently be ignored.
-- `--op-tracing` documented as a user-facing feature; `perf.py:1183` decorates it `hidden=True`.
-- Default output path documented as `{model_slug}_perf.json`; actual path is `~/.cache/winml/perf/<slug>/<timestamp>.json` (`perf.py:871`).
-
-### docs/commands/sys.md
-- `--verbose/-v` absent from flag table; `sys.py:653–659` defines it. Verbose mode surfaces Backend SDKs and Export Readiness sections (`sys.py:392`).
-
-### docs/concepts/config-and-build.md
-- `WinMLBuildConfig` described as having five sub-configs; `config/build.py:132–138` also has `eval: WinMLEvaluationConfig | None` and `auto: bool`.
-
-### docs/samples/npu-convnext.md
-- Step 7 OpenVINO artifact named `convnext_int8_openvino_ctx.onnx`; `compile.py:230` uses `{stem}_{device}_ctx.onnx` where device is the resolved device string (`"npu"`), not the EP name. Actual filename would be `convnext_int8_npu_ctx.onnx`.
-
-### docs/concepts/perf-and-monitoring.md
-- `--monitor` described as streaming "NPU utilisation". Source resolves from model device at runtime (`perf.py:409`); it monitors whichever device is being benchmarked, not NPU specifically.
-- `--op-tracing` documented as a supported feature; it is `hidden=True` (`perf.py:1183`).
-
-### docs/commands/overview.md
-- `src/winml/modelkit/commands/_options.py` cited as "canonical contract" for global flags. This file does not exist (`_options.py` absent from `commands/` directory). Global flags are in `cli.py`.
-
----
-
-## Rejected (claimed by an agent but not a real defect)
-
-### docs/concepts/quantization.md
-- ["`_KNOWN_PRECISIONS` from `_options.py`" is fabricated] — REJECTED: The claim itself is being kept as Critical because both `_KNOWN_PRECISIONS` and `_options.py` are absent from the codebase (confirmed `_options.py` not in commands/, and `grep` for `_KNOWN_PRECISIONS` returns nothing). The actual symbol is `_NAMED_PRECISIONS` at `precision.py:71`. The finding is genuine, not a false positive.
-
-### docs/concepts/compile-and-epcontext.md
-- [External EPContext described as "default"] — REJECTED as false positive: `EPConfig.embed_context: bool = False` at `configs.py:46` confirms external is the default. Doc is correct.
-- [`--no-validate` flag] — REJECTED as false positive: `compile.py:72–74` defines `--validate/--no-validate`; the doc's use of `--no-validate` correctly names the negative form of the toggle.
-
-### docs/getting-started/end-to-end.md
-- [`QNN_SDK_ROOT` from environment] — REJECTED: `build.py` has no `--qnn-sdk-root` flag (confirmed zero hits). Reading from environment is the correct description.
-- [`--device auto` priority order "NPU first, then GPU, then CPU"] — REJECTED: `device.py:62` confirms `_DEVICE_PRIORITY = ("npu", "gpu", "cpu")`. Claim is correct.
-
-### docs/commands/compile-and-epcontext.md
-- [`for_vitisai` and `for_qnn` described as interchangeable "QNN-family EPs"] — REJECTED as below threshold: both produce EPContext, the distinction noted by the agent is a simplification, not a user-facing error.
-
-### docs/concepts/quantization.md
-- [`int16` weight dtype listed as `int16`] — REJECTED: `_WEIGHT_TYPE["int16"] = "int16"` at `precision.py:43`. Doc row is correct.
-
-### docs/commands/eval.md
-- [`winml eval` loads via `WinMLAutoModel`] — this claim is kept as Important (class name misrepresents implementation) but the agent's flag about missing flags is correct and retained above.
diff --git a/docs/superpowers/plans/2026-05-20-modelkit-docs-site.md b/docs/superpowers/plans/2026-05-20-modelkit-docs-site.md
deleted file mode 100644
index 6321257a9..000000000
--- a/docs/superpowers/plans/2026-05-20-modelkit-docs-site.md
+++ /dev/null
@@ -1,1031 +0,0 @@
-# ModelKit User-Facing Documentation Site — Implementation Plan
-
-> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
-
-**Goal:** Build an MVP user-facing documentation site for ModelKit using MkDocs Material, served from `docs/`, with full content for chapters 1-4 (Getting Started, Concepts, Commands, Samples) and stubs for chapters 5-7.
-
-**Architecture:** Static site generated by MkDocs Material from in-repo markdown. Build verified locally; CI workflow exists but requires `workflow_dispatch` (no automatic publish during MVP). Authoring parallelized via subagents per page group; each batch commits as one logical unit.
-
-**Tech Stack:** Python 3.10 + uv, MkDocs 1.6+, mkdocs-material 9+, pymdown-extensions, mkdocs-jupyter (optional notebook plugin), GitHub Actions.
-
-**Spec:** `docs/superpowers/specs/2026-05-20-modelkit-docs-site-design.md`
-
-**Branch:** `docs/init` (based on `feat/mvp`). No pushes to remote.
-
----
-
-## Conventions used in this plan
-
-- **Source CLI name:** `winml` (entry point at `modelkit/cli.py`, surfaced through `pyproject.toml` `scripts.winml`).
-- **Command source files:** `src/winml/modelkit/commands/<name>.py`.
-- **Existing internal docs that must be untouched:** `docs/design/`, `docs/naming-convention.md`, `docs/pytest-best-practices.md`. Excluded from MkDocs build via `exclude_docs` in `mkdocs.yml`.
-- **Verification step at the end of every task:** `uv run mkdocs build --strict` must succeed with exit code 0 and zero warnings.
-- **Commit message style:** Conventional Commits (`docs: ...`, `chore: ...`). No `Co-Authored-By` (project rule from CLAUDE.md). No "Test plan" section.
-
----
-
-## Task 1: Scaffold MkDocs config, dependencies, and stub tree
-
-**Files:**
-- Modify: `pyproject.toml`
-- Create: `mkdocs.yml`
-- Create: `docs/index.md`
-- Create: `docs/getting-started/installation.md`
-- Create: `docs/getting-started/quickstart.md`
-- Create: `docs/getting-started/end-to-end.md`
-- Create: `docs/concepts/how-it-works.md`
-- Create: `docs/concepts/onnx-and-eps.md`
-- Create: `docs/concepts/quantization.md`
-- Create: `docs/concepts/hierarchy.md`
-- Create: `docs/concepts/buildconfig.md`
-- Create: `docs/commands/overview.md`
-- Create: `docs/commands/sys.md`, `inspect.md`, `hub.md`, `analyze.md`, `config.md`, `optimize.md`, `export.md`, `quantize.md`, `compile.md`, `build.md`, `perf.md`, `eval.md`
-- Create: `docs/samples/convnext-primitives.md`
-- Create: `docs/samples/bert-config-build.md`
-- Create: `docs/samples/qwen3-composite.md`
-- Create: `docs/reference/index.md`
-- Create: `docs/troubleshooting.md`
-- Create: `docs/contributing.md`
-
-- [ ] **Step 1: Add MkDocs dev dependencies to pyproject.toml**
-
-In `pyproject.toml`, locate the `optional-dependencies.dev = [` block (around line 62) and add three new entries (alphabetical insertion):
-
-```toml
-  "mkdocs-jupyter>=0.25",
-  "mkdocs-material>=9.5",
-  "pymdown-extensions>=10.7",
-```
-
-Run:
-```bash
-uv sync --extra dev
-```
-
-Expected: `uv` resolves and installs the three new packages without error. `uv.lock` is regenerated.
-
-- [ ] **Step 2: Create `mkdocs.yml`**
-
-Create `mkdocs.yml` at the repo root with the following exact content:
-
-```yaml
-site_name: ModelKit
-site_description: Accelerate Model Deployment on WinML
-site_url: https://gim-home.github.io/ModelKit/
-repo_url: https://github.com/gim-home/ModelKit
-repo_name: gim-home/ModelKit
-edit_uri: edit/main/docs/
-
-docs_dir: docs
-
-# Internal docs and brainstorming artifacts are kept in the repo but excluded
-# from the user-facing site.
-exclude_docs: |
-  /design/
-  /superpowers/
-  /naming-convention.md
-  /pytest-best-practices.md
-
-theme:
-  name: material
-  features:
-    - navigation.instant
-    - navigation.tracking
-    - navigation.tabs
-    - navigation.sections
-    - navigation.top
-    - content.code.copy
-    - content.action.edit
-    - toc.follow
-    - search.suggest
-    - search.highlight
-  palette:
-    - media: "(prefers-color-scheme: light)"
-      scheme: default
-      primary: indigo
-      accent: indigo
-      toggle:
-        icon: material/brightness-7
-        name: Switch to dark mode
-    - media: "(prefers-color-scheme: dark)"
-      scheme: slate
-      primary: indigo
-      accent: indigo
-      toggle:
-        icon: material/brightness-4
-        name: Switch to light mode
-
-plugins:
-  - search
-
-markdown_extensions:
-  - admonition
-  - attr_list
-  - md_in_html
-  - tables
-  - toc:
-      permalink: true
-  - pymdownx.details
-  - pymdownx.highlight:
-      anchor_linenums: true
-      line_spans: __span
-      pygments_lang_class: true
-  - pymdownx.inlinehilite
-  - pymdownx.snippets
-  - pymdownx.superfences:
-      custom_fences:
-        - name: mermaid
-          class: mermaid
-          format: !!python/name:pymdownx.superfences.fence_code_format
-  - pymdownx.tabbed:
-      alternate_style: true
-  - pymdownx.tasklist:
-      custom_checkbox: true
-
-nav:
-  - Home: index.md
-  - Getting Started:
-      - Installation: getting-started/installation.md
-      - Quickstart: getting-started/quickstart.md
-      - End-to-End — HF → NPU: getting-started/end-to-end.md
-  - Concepts:
-      - How ModelKit Works: concepts/how-it-works.md
-      - ONNX & Execution Providers: concepts/onnx-and-eps.md
-      - Quantization & QDQ: concepts/quantization.md
-      - Hierarchy Preservation: concepts/hierarchy.md
-      - BuildConfig & Kits: concepts/buildconfig.md
-  - Commands:
-      - Overview: commands/overview.md
-      - Discover:
-          - sys: commands/sys.md
-          - inspect: commands/inspect.md
-          - hub: commands/hub.md
-          - analyze: commands/analyze.md
-      - Configure:
-          - config: commands/config.md
-          - optimize: commands/optimize.md
-      - Build:
-          - export: commands/export.md
-          - quantize: commands/quantize.md
-          - compile: commands/compile.md
-          - build: commands/build.md
-      - Measure:
-          - perf: commands/perf.md
-          - eval: commands/eval.md
-  - Samples:
-      - ConvNeXt — Primitives Walkthrough: samples/convnext-primitives.md
-      - BERT — Config + Build + Perf: samples/bert-config-build.md
-      - Qwen3 — Composite Models: samples/qwen3-composite.md
-  - Reference: reference/index.md
-  - Troubleshooting: troubleshooting.md
-  - Contributing: contributing.md
-```
-
-- [ ] **Step 3: Create the landing page**
-
-Create `docs/index.md`:
-
-```markdown
-# ModelKit
-
-> Accelerate model deployment on Windows ML.
-
-ModelKit is a Python toolkit that converts and optimizes PyTorch and Hugging Face models to ONNX for deployment on the [Windows ML](https://learn.microsoft.com/en-us/windows/ai/windows-ml/) runtime. It supports multiple hardware backends including QNN (Qualcomm Neural Processing SDK), OpenVINO, DirectML, and ONNX Runtime CPU/GPU.
-
-## Where to start
-
-- **[Installation](getting-started/installation.md)** — get the `winml` CLI running locally.
-- **[Quickstart](getting-started/quickstart.md)** — export a Hugging Face model in five minutes.
-- **[End-to-End: HF → NPU](getting-started/end-to-end.md)** — full pipeline against a Qualcomm NPU.
-
-## Learn the model
-
-- **[How ModelKit Works](concepts/how-it-works.md)** — the pipeline from a PyTorch model to an EP-compiled artifact.
-- **[Commands](commands/overview.md)** — reference for all 12 `winml` subcommands.
-- **[Samples](samples/convnext-primitives.md)** — end-to-end walkthroughs for ConvNeXt, BERT, and Qwen3.
-
-## License
-
-MIT. See [LICENSE](https://github.com/gim-home/ModelKit/blob/main/LICENSE.txt).
-```
-
-- [ ] **Step 4: Create all stub pages**
-
-For every page listed in the `Files` section above (other than `index.md`), create a stub with this structure:
-
-```markdown
-# <Page Title>
-
-!!! note "Coming soon"
-    This page is part of the documentation MVP and will be authored shortly.
-```
-
-Use the page title from the `nav:` block in `mkdocs.yml`. For sub-pages like `commands/sys.md`, use `winml sys` as the title.
-
-- [ ] **Step 5: Verify the strict build passes**
-
-Run:
-```bash
-uv run mkdocs build --strict
-```
-
-Expected: exit code 0; the message `INFO - Documentation built in <N>.<NN> seconds`; no `WARNING` lines. A `site/` directory is produced and is already gitignored (line 99 of `.gitignore`).
-
-If warnings appear about excluded internal docs, re-check the `exclude_docs:` block in `mkdocs.yml`.
-
-- [ ] **Step 6: Commit**
-
-```bash
-git add pyproject.toml uv.lock mkdocs.yml docs/index.md docs/getting-started/ docs/concepts/ docs/commands/ docs/samples/ docs/reference/ docs/troubleshooting.md docs/contributing.md
-git commit -m "docs: scaffold MkDocs Material site with stub pages
-
-Adds mkdocs.yml (Material theme, mermaid superfences, search),
-landing page, and stub pages for chapters 1-7. Internal docs
-(design/, naming-convention.md, pytest-best-practices.md) are
-excluded from the user-facing build via exclude_docs."
-```
-
----
-
-## Task 2: CI workflow (manual dispatch only)
-
-**Files:**
-- Create: `.github/workflows/docs.yml`
-
-- [ ] **Step 1: Create the workflow file**
-
-Create `.github/workflows/docs.yml`:
-
-```yaml
-name: Build & Publish Docs
-
-on:
-  workflow_dispatch:
-
-permissions:
-  contents: write
-
-jobs:
-  build:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-      - uses: astral-sh/setup-uv@v3
-        with:
-          python-version: "3.10"
-      - run: uv sync --extra dev
-      - run: uv run mkdocs build --strict
-      - uses: peaceiris/actions-gh-pages@v4
-        with:
-          github_token: ${{ secrets.GITHUB_TOKEN }}
-          publish_dir: ./site
-```
-
-- [ ] **Step 2: Verify the workflow is syntactically valid**
-
-Run:
-```bash
-python -c "import yaml; yaml.safe_load(open('.github/workflows/docs.yml'))"
-```
-
-Expected: no output (valid YAML, exit code 0).
-
-- [ ] **Step 3: Commit**
-
-```bash
-git add .github/workflows/docs.yml
-git commit -m "ci: add docs build/publish workflow (manual dispatch only)
-
-Workflow runs mkdocs build --strict and deploys to gh-pages.
-Triggered manually via workflow_dispatch — no automatic publish
-during MVP authoring."
-```
-
----
-
-## Task 3: Author Concepts pages (5 parallel agents)
-
-**Files:**
-- Modify: `docs/concepts/how-it-works.md`
-- Modify: `docs/concepts/onnx-and-eps.md`
-- Modify: `docs/concepts/quantization.md`
-- Modify: `docs/concepts/hierarchy.md`
-- Modify: `docs/concepts/buildconfig.md`
-
-- [ ] **Step 1: Dispatch 5 parallel subagents — one per concept page**
-
-Send all five `Agent` tool calls in a single message so they run concurrently. Use `subagent_type: general-purpose`. Use the prompt template below, substituting the bracketed values per page.
-
-### Reusable agent prompt template
-
-```
-You are authoring one page of the ModelKit user-facing documentation. Output: a single markdown file at the path I give you. Audience: external open-source developers, no insider jargon. Length: 400-700 words. Tone: clear, direct, second person where useful, no marketing fluff.
-
-Page: [PAGE TITLE]
-File to write: [ABSOLUTE PATH]
-Source files to read first (for accuracy — do not copy verbatim):
-[SOURCE PATHS, one per line]
-
-Required structure:
-1. H1 heading matching the page title.
-2. One- or two-paragraph lead that defines the concept and why a ModelKit user encounters it.
-3. Body sections (use H2/H3) covering the points listed below.
-4. A "See also" section at the bottom linking to 2-4 related pages within the docs (use relative paths like `../commands/build.md`).
-
-Body points to cover for this page:
-[BULLET LIST OF SUBSTANTIVE POINTS]
-
-Diagram: [yes/no — if yes, embed one mermaid block as instructed]
-
-Rules:
-- Write actual prose, not bullet lists, except where bullets clarify (lists of EPs, lists of precision options).
-- Use real `winml` command names and flag names where relevant — never the old `wmk` name. CLI source of truth is `src/winml/modelkit/commands/`.
-- Code blocks use triple backticks with language tags (```python, ```bash, ```text).
-- Do NOT speculate beyond what the source code supports. If unsure, omit.
-- Do NOT use placeholder phrases like "TBD" or "more details to come".
-
-Return only the file path you wrote.
-```
-
-### Per-page substitutions
-
-**Agent 1 — How ModelKit Works**
-- File: `docs/concepts/how-it-works.md`
-- Sources: `src/winml/modelkit/build/`, `src/winml/modelkit/commands/build.py`, `src/winml/modelkit/cli.py`
-- Body points:
-  - The pipeline at a glance: PyTorch/HF model → ONNX export → optional optimization → optional quantization → EP-specific compilation → inference session.
-  - What each stage does and which `winml` command owns it.
-  - Where `build` fits as the one-shot wrapper for the staged commands.
-  - How configuration flows (BuildConfig) versus ad-hoc CLI flags.
-- Diagram: yes — a top-to-bottom mermaid flowchart of the five stages. Use the literal syntax:
-  ```
-  ```mermaid
-  flowchart TD
-      A[PyTorch / HF model] --> B[winml export]
-      B --> C[winml optimize]
-      C --> D[winml quantize]
-      D --> E[winml compile]
-      E --> F[EP-ready ONNX]
-      F --> G[winml perf / eval]
-  ```
-  ```
-
-**Agent 2 — ONNX & Execution Providers**
-- File: `docs/concepts/onnx-and-eps.md`
-- Sources: `src/winml/modelkit/sysinfo/`, `src/winml/modelkit/commands/sys.py`, `src/winml/modelkit/analyze/`
-- Body points:
-  - What ONNX is (one paragraph; link to onnx.ai).
-  - What an Execution Provider is in ONNX Runtime terms.
-  - The EPs ModelKit supports today (read sysinfo/analyze source to enumerate — at least QNN, OpenVINO, DirectML, CPU, CUDA).
-  - Hardware/device mapping table (Device × EP) showing which combinations are valid.
-  - How `--device` (auto/cpu/gpu/npu) versus `--ep` interact.
-- Diagram: no.
-
-**Agent 3 — Quantization & QDQ**
-- File: `docs/concepts/quantization.md`
-- Sources: `src/winml/modelkit/quant/`, `src/winml/modelkit/commands/quantize.py`
-- Body points:
-  - Why quantize: smaller artifacts, faster inference on integer-capable hardware (NPUs), trade-off is accuracy loss.
-  - Precision options ModelKit supports today (read quant source to enumerate — fp32, fp16, int8, int16, w8a8, w8a16, w4a16, auto).
-  - Calibration: what calibration samples do, the `--samples` and `--method` flags (minmax/entropy/percentile).
-  - QDQ pattern explained: insert Quantize and Dequantize nodes around weights/activations so the runtime can fuse them on the target device.
-  - When quantization is lossy and how to tell.
-- Diagram: no.
-
-**Agent 4 — Hierarchy Preservation**
-- File: `docs/concepts/hierarchy.md`
-- Sources: `src/winml/modelkit/export/`, `src/winml/modelkit/onnx/`, `src/winml/modelkit/commands/export.py`, `src/winml/modelkit/commands/inspect.py`
-- Body points:
-  - Standard ONNX export flattens the PyTorch module tree — you lose which ops belonged to which layer.
-  - ModelKit embeds PyTorch hierarchy as metadata in the exported ONNX (`hierarchy_tag`).
-  - What this unlocks: per-module benchmarking (`winml perf --module`), targeted optimization, hierarchy view in `winml inspect --hierarchy`.
-  - How to turn it off: `winml export --no-hierarchy`.
-- Diagram: no.
-
-**Agent 5 — BuildConfig & Kits**
-- File: `docs/concepts/buildconfig.md`
-- Sources: `src/winml/modelkit/config/`, `src/winml/modelkit/commands/config.py`, `src/winml/modelkit/commands/build.py`
-- Body points:
-  - What a `WinMLBuildConfig` represents: the full set of decisions for one model's pipeline (model id, task, precision, EP, quantization options, compilation options).
-  - How `winml config` generates one for a given Hugging Face model or local ONNX.
-  - How `winml build -c <config.json>` consumes one.
-  - Per-task templates — where they live in `MODEL_BUILD_CONFIGS`.
-  - Why a config file is useful: reproducibility, sharing, CI.
-- Diagram: no.
-
-- [ ] **Step 2: Verify strict build passes**
-
-Run:
-```bash
-uv run mkdocs build --strict
-```
-
-Expected: exit code 0, no warnings. If any link the agents wrote is broken (e.g. `../commands/sys.md` before sys.md has real content — it's fine, the stub still exists), `--strict` will pass; only missing files fail.
-
-- [ ] **Step 3: Commit**
-
-```bash
-git add docs/concepts/
-git commit -m "docs: author Concepts chapter (5 pages)
-
-Covers the ModelKit mental model: pipeline overview, ONNX/EPs,
-quantization & QDQ, hierarchy preservation, BuildConfig. Each
-page authored from the corresponding source modules; mermaid
-diagram in How ModelKit Works."
-```
-
----
-
-## Task 4: Author 12 Command pages (4 parallel agents, 3 commands each)
-
-**Files (modified):**
-- `docs/commands/sys.md`, `inspect.md`, `hub.md`, `analyze.md`
-- `docs/commands/config.md`, `optimize.md`
-- `docs/commands/export.md`, `quantize.md`, `compile.md`, `build.md`
-- `docs/commands/perf.md`, `eval.md`
-
-- [ ] **Step 1: Dispatch 4 parallel subagents**
-
-Send all four `Agent` tool calls in a single message. Each agent owns one group:
-
-| Agent | Group | Commands |
-|---|---|---|
-| 1 | Discover (part 1) | sys, inspect, hub |
-| 2 | Discover (part 2) + Configure | analyze, config, optimize |
-| 3 | Build | export, quantize, compile, build |
-| 4 | Measure | perf, eval |
-
-Note Agent 3 owns four commands (the Build group naturally has 4); the rebalance keeps the other agents at 3.
-
-### Reusable agent prompt template
-
-```
-You are authoring user-facing reference pages for the ModelKit `winml` CLI. Output: one markdown file per command listed below. Audience: external open-source developers. Length: 300-600 words per page.
-
-Commands to author:
-[LIST: command name → file path]
-
-For each command, read its source file at `src/winml/modelkit/commands/<command>.py` and the shared options at `src/winml/modelkit/commands/_options.py`. Confirm the actual flags by running `uv run winml <command> --help` and capturing the output.
-
-Required structure for every page:
-
-# winml <command>
-
-> [One-line tagline describing the command's job — 8-15 words.]
-
-## When to use this
-
-[One or two sentences. State the user intent and where it falls in the pipeline.]
-
-## Synopsis
-
-```bash
-$ winml <command> [options]
-```
-
-## Flags
-
-[A markdown table with columns: Flag | Short | Type | Default | Description. Include EVERY flag from --help output. For shared flags (model, device, ep, output, task, precision), list them with their actual semantics for this command — do not skip them.]
-
-## How it works
-
-[2-4 sentences explaining what the command does internally — keep it user-relevant, not implementation trivia.]
-
-## Examples
-
-[3-5 examples. Each is a fenced bash block. Order from simplest to richest. Include expected output snippet (use a separate ```text block) for at least the first example. Use realistic model ids — microsoft/resnet-50, bert-base-uncased, microsoft/Phi-3-mini-4k-instruct as appropriate.]
-
-## Common pitfalls
-
-[Bulleted list of 2-5 gotchas. Be specific: missing flags, environment requirements, common error messages.]
-
-## See also
-
-[2-4 relative links to related command pages or concept pages.]
-
-Rules:
-- Use the actual `winml` CLI name everywhere. Never `wmk`.
-- Use the actual flag names from the source. Never invent flags.
-- Code blocks: ```bash for invocations, ```text for output, ```python only if a real Python snippet is needed.
-- No placeholder phrases. If a section legitimately has nothing to say (e.g. no pitfalls), omit the section.
-- Cross-link to concept pages where helpful: `../concepts/quantization.md`, `../concepts/buildconfig.md`, etc.
-
-Return the list of file paths you wrote.
-```
-
-- [ ] **Step 2: Verify strict build passes**
-
-Run:
-```bash
-uv run mkdocs build --strict
-```
-
-Expected: exit code 0, no warnings.
-
-- [ ] **Step 3: Commit**
-
-```bash
-git add docs/commands/
-git commit -m "docs: author all 12 command reference pages
-
-Each page follows the standard template: tagline, when-to-use,
-synopsis, flags table, how-it-works, examples, pitfalls, see-also.
-Pages drafted from src/winml/modelkit/commands/ sources and the
-real --help output. Commands → Overview page authored in next task."
-```
-
----
-
-## Task 5: Author Commands → Overview page
-
-**Files:**
-- Modify: `docs/commands/overview.md`
-
-- [ ] **Step 1: Dispatch one subagent**
-
-Prompt:
-
-```
-Author the user-facing Commands Overview page for ModelKit. Output: `docs/commands/overview.md`. Length: 400-600 words.
-
-Read all 12 command pages under `docs/commands/` (already authored). Read the shared CLI argument spec at `docs/design/cli/3_cli_args_spec.md` for context (do not copy from it — it is internal).
-
-Required structure:
-
-# Commands
-
-[2-3 paragraph lead: ModelKit exposes a CLI named `winml` with 12 subcommands organized by user intent. Show the four groups (Discover / Configure / Build / Measure) and explain when a user would reach for each group.]
-
-## Command map
-
-[Markdown table with columns: Command | Group | Purpose. One row per command. Link the command name to its page, e.g. [sys](sys.md).]
-
-## Choosing a command
-
-[A decision-style section. Pose 5-8 common questions a user might have ("I want to see what hardware I have", "I want to convert a HF model to ONNX", "I want to benchmark a compiled model on NPU", ...) and answer each with a single command + a one-line reason.]
-
-## Global flags
-
-[Brief mention of -v / -q / --debug / --version / -h. Note they live on the root `winml` group only and are inherited by all subcommands via ctx.obj.]
-
-## Shared flags
-
-[Brief mention of -m / -d / -o / -t / -p / --ep. State they have the same meaning on every command that accepts them.]
-
-Rules:
-- Use `winml` everywhere.
-- Link every command name to its page.
-- Do not duplicate full flag tables — that's what each command page is for.
-
-Return the file path you wrote.
-```
-
-- [ ] **Step 2: Verify strict build passes**
-
-Run:
-```bash
-uv run mkdocs build --strict
-```
-
-Expected: exit code 0, no warnings.
-
-- [ ] **Step 3: Commit**
-
-```bash
-git add docs/commands/overview.md
-git commit -m "docs: author Commands Overview page
-
-Adds the 12-command map (grouped Discover/Configure/Build/Measure),
-a 'choosing a command' decision section, and references to global
-and shared flags. Drafted from the now-settled per-command pages."
-```
-
----
-
-## Task 6: Author Sample pages (3 parallel agents)
-
-**Files:**
-- Modify: `docs/samples/convnext-primitives.md`
-- Modify: `docs/samples/bert-config-build.md`
-- Modify: `docs/samples/qwen3-composite.md`
-
-- [ ] **Step 1: Dispatch 3 parallel subagents**
-
-### Agent 1 — ConvNeXt primitives walkthrough
-
-```
-Author the ConvNeXt primitives sample page. Output: `docs/samples/convnext-primitives.md`. Length: 700-1100 words.
-
-The goal: teach the user the ModelKit pipeline by invoking each command directly (no `build` wrapper). Reader walks away understanding what each command does and how outputs chain.
-
-Source materials:
-- `src/winml/modelkit/commands/` (all command files — use real flags)
-- The command pages under `docs/commands/`
-- The concept pages under `docs/concepts/`
-
-Required structure:
-
-# ConvNeXt — Primitives Walkthrough
-
-[Lead: 2-paragraph intro. Why ConvNeXt (small, fast, ImageNet classifier — good first model). State this sample uses the primitive commands rather than `winml build`, to show how the pieces compose. State the target EPs covered (CPU, GPU, NPU).]
-
-## Prerequisites
-
-[Bulleted list: ModelKit installed, internet access for HF download, optional QNN SDK for NPU section.]
-
-## Step 1: Inspect the model
-
-[Brief bash block running `winml inspect -m facebook/convnext-tiny-224`. Show abbreviated expected output. One-paragraph callout: "What we just did — checked task detection, model class, exporter compatibility."]
-
-## Step 2: Generate a config (optional)
-
-[Show `winml config -m facebook/convnext-tiny-224 -o convnext_config.json`. Callout: this is optional in the primitives flow but useful for reproducibility.]
-
-## Step 3: Export to ONNX
-
-[`winml export -m facebook/convnext-tiny-224 -o convnext.onnx`. Callout on hierarchy preservation.]
-
-## Step 4: Quantize
-
-[`winml quantize -m convnext.onnx -o convnext_int8.onnx --precision int8 --samples 32`. Callout on calibration.]
-
-## Step 5: Compile for each EP
-
-[Three tabbed bash blocks using pymdownx.tabbed syntax:
-
-=== "CPU"
-    ```bash
-    winml compile -m convnext_int8.onnx -o convnext_cpu.onnx --device cpu
-    ```
-
-=== "GPU"
-    ```bash
-    winml compile -m convnext_int8.onnx -o convnext_gpu.onnx --device gpu
-    ```
-
-=== "NPU"
-    ```bash
-    winml compile -m convnext_int8.onnx -o convnext_npu.onnx --device npu --qnn-sdk-root <path>
-    ```
-
-Callout: NPU compilation requires the QNN SDK; cross-link to concepts/onnx-and-eps.md.]
-
-## Step 6: Benchmark
-
-[`winml perf` invocations with --device flags, one per EP.]
-
-## Step 7: Evaluate
-
-[`winml eval` on an ImageNet validation slice. Note the dataset flag.]
-
-## What you learned
-
-[Short bulleted summary: which command does what, how the artifacts chain.]
-
-## See also
-
-[Links to convnext command pages, BERT sample, Concepts/Quantization.]
-
-Rules:
-- Use real flag names. Verify against source.
-- Use realistic but small sample counts.
-- Where an output is shown, use ```text and keep it short (5-10 lines).
-
-Return the file path.
-```
-
-### Agent 2 — BERT config + build sample
-
-```
-Author the BERT config + build sample. Output: `docs/samples/bert-config-build.md`. Length: 500-800 words.
-
-The goal: teach the production-style workflow where the user generates a BuildConfig, runs `winml build` end-to-end, then measures with `winml perf`. EP coverage is NOT the focus — the workflow is.
-
-Source materials: same as ConvNeXt sample agent.
-
-Required structure:
-
-# BERT — Config + Build + Perf
-
-[Lead: 2-paragraph intro. Why BERT (canonical text classifier). State this sample uses `winml config` to generate a config file, then `winml build` to run the whole pipeline in one shot. Contrast briefly with the ConvNeXt primitives sample.]
-
-## Prerequisites
-
-[Bulleted list — short.]
-
-## Step 1: Generate a build config
-
-[`winml config -m bert-base-uncased -t text-classification -o bert_config.json`. Show truncated example of the JSON content (5-10 lines, illustrative). One-paragraph callout on what's in the file — link to concepts/buildconfig.md.]
-
-## Step 2: Run the build
-
-[`winml build -c bert_config.json --output-dir bert_out/`. Show short text-style output of stage progress (export → quantize → compile). Callout on the analyzer loop / --no-analyze if relevant.]
-
-## Step 3: Benchmark
-
-[`winml perf -m bert_out/<artifact>.onnx --iterations 50`. Show expected output snippet (1-2 numeric latency/throughput lines).]
-
-## Customizing the config
-
-[A short section: how to override precision in the config, how to disable a stage (--no-quant). Refer to concepts/buildconfig.md.]
-
-## What you learned
-
-[Bulleted summary.]
-
-## See also
-
-[Links to commands/config.md, commands/build.md, commands/perf.md, concepts/buildconfig.md, samples/convnext-primitives.md.]
-
-Rules:
-- Use real flag names. Verify.
-- Keep EP detail minimal — workflow is the focus.
-
-Return the file path.
-```
-
-### Agent 3 — Qwen3 placeholder
-
-```
-Author the Qwen3 sample placeholder. Output: `docs/samples/qwen3-composite.md`. Length: 150-300 words.
-
-This is a placeholder because composite-model support is on a feature branch and not yet in `feat/mvp` or `main`. We reserve the nav slot now.
-
-Required structure:
-
-# Qwen3 — Composite Models
-
-!!! info "Coming soon"
-    Composite-model support — running models with multiple components like text encoder + decoder or vision + LLM through a single ModelKit pipeline — is on an in-progress feature branch. This page will be authored once that work merges.
-
-## What composite models are
-
-[One short paragraph: explain at a conceptual level. Examples: an LLM with a separate vision encoder, a text encoder + decoder pair, multi-stage pipelines.]
-
-## What Qwen3 will demonstrate
-
-[Bulleted preview: 3-5 bullets of what the sample will cover when it ships. Be honest that this is forward-looking.]
-
-## Track progress
-
-[One line pointing the reader to GitHub issues / the project README for the current status.]
-
-Rules:
-- Be honest that this is a placeholder. Do not invent details.
-- Do not promise dates.
-- Use the actual project URL: https://github.com/gim-home/ModelKit
-```
-
-- [ ] **Step 2: Verify strict build passes**
-
-Run:
-```bash
-uv run mkdocs build --strict
-```
-
-Expected: exit code 0, no warnings.
-
-- [ ] **Step 3: Commit**
-
-```bash
-git add docs/samples/
-git commit -m "docs: author Sample pages (ConvNeXt, BERT, Qwen3 placeholder)
-
-ConvNeXt walks through the primitive commands (export → quantize →
-compile → perf → eval) across CPU/GPU/NPU. BERT shows the
-config + build + perf workflow. Qwen3 is a placeholder for the
-upcoming composite-model feature."
-```
-
----
-
-## Task 7: Author Getting Started pages (3 parallel agents)
-
-**Files:**
-- Modify: `docs/getting-started/installation.md`
-- Modify: `docs/getting-started/quickstart.md`
-- Modify: `docs/getting-started/end-to-end.md`
-
-Sequenced after Concepts, Commands, and Samples so it can cross-link to settled pages.
-
-- [ ] **Step 1: Dispatch 3 parallel subagents**
-
-### Agent 1 — Installation
-
-```
-Author `docs/getting-started/installation.md`. Length: 300-500 words.
-
-Audience: external developers who have just landed on the repo.
-
-Required structure:
-
-# Installation
-
-[Lead: one paragraph explaining what ModelKit is and what you need to install it.]
-
-## Prerequisites
-
-[Bulleted list: Windows 10/11, Python 3.10 (not 3.11+), `uv` package manager with link to https://github.com/astral-sh/uv. Mention git.]
-
-## Install
-
-[Bash block:
-```bash
-git clone https://github.com/gim-home/ModelKit.git
-cd ModelKit
-uv python install 3.10
-uv sync
-```
-Briefly explain each line.]
-
-## Verify
-
-[Bash block: `uv run winml sys` and show a snippet of expected output (5-8 lines, abbreviated). State that this enumerates available devices and execution providers.]
-
-## Optional extras
-
-[Brief mention of `--extra openvino` and `--extra qnn` (or whatever extras pyproject.toml actually defines — read `pyproject.toml` lines 79-82 to confirm). Note these are needed for OpenVINO and Qualcomm NPU respectively.]
-
-## Next steps
-
-[Link to quickstart.md.]
-
-Rules:
-- Use realistic terminal commands. No placeholders.
-- Verify the extras names against pyproject.toml.
-
-Return the file path.
-```
-
-### Agent 2 — Quickstart
-
-```
-Author `docs/getting-started/quickstart.md`. Length: 400-600 words.
-
-Audience: someone with ModelKit installed who wants a first success in ~5 minutes.
-
-Required structure:
-
-# Quickstart
-
-[Lead: one paragraph. Goal of this page: prove your install works by exporting a Hugging Face image classifier and inspecting the result.]
-
-## Export your first model
-
-[Bash block:
-```bash
-uv run winml export -m microsoft/resnet-50 -o resnet50.onnx
-```
-One-paragraph callout: what just happened. Cross-link to commands/export.md.]
-
-## Inspect the artifact
-
-[Bash block:
-```bash
-uv run winml inspect -m resnet50.onnx
-```
-Show a short truncated table-style output (5-8 lines). Cross-link to commands/inspect.md.]
-
-## What's next
-
-[Three short bullet links:
-- End-to-End walkthrough → ../getting-started/end-to-end.md
-- Concept of how ModelKit works → ../concepts/how-it-works.md
-- Full ConvNeXt sample → ../samples/convnext-primitives.md ]
-
-Rules:
-- Keep it under 5 minutes of reading + running.
-- No quantization, no EP selection — that's the next page's job.
-- Use `uv run winml` prefix consistently.
-
-Return the file path.
-```
-
-### Agent 3 — End-to-End: HF → NPU
-
-```
-Author `docs/getting-started/end-to-end.md`. Length: 700-1000 words.
-
-Audience: someone past the quickstart who wants to see the full pipeline land on a real NPU.
-
-Required structure:
-
-# End-to-End: Hugging Face → NPU
-
-[Lead: 2 paragraphs. Goal of this page: run a ConvNeXt classifier through the full ModelKit pipeline targeting a Qualcomm NPU via `winml build`. Estimated time, hardware requirement (Qualcomm device + QNN SDK).]
-
-## Prerequisites
-
-[Bulleted list: Quickstart done, Qualcomm device, QNN SDK installed (link out), `--extra qnn` installed.]
-
-## Step 1: Generate the build config
-
-[Bash block: `uv run winml config -m facebook/convnext-tiny-224 --device npu -o convnext_npu.json`. Truncated JSON snippet. Callout linking to concepts/buildconfig.md.]
-
-## Step 2: Run the build
-
-[Bash block: `uv run winml build -c convnext_npu.json --output-dir convnext_npu_out/ --qnn-sdk-root <path>`. Stage-progress output snippet. One-paragraph callout on what each stage does (link to concepts/how-it-works.md).]
-
-## Step 3: Benchmark on the NPU
-
-[Bash block: `uv run winml perf -m convnext_npu_out/<artifact>.onnx --device npu --iterations 50`. Expected latency line snippet.]
-
-## Step 4: (Optional) Compare against CPU
-
-[Same model, --device cpu, show the relative latency.]
-
-## Where to go next
-
-[Bulleted list of links:
-- Samples → ../samples/convnext-primitives.md
-- Command reference → ../commands/overview.md
-- BuildConfig → ../concepts/buildconfig.md]
-
-Rules:
-- All commands must use real flags from src/winml/modelkit/commands/.
-- Place QNN-specific notes in admonitions.
-- This is the showcase page — write it well.
-
-Return the file path.
-```
-
-- [ ] **Step 2: Verify strict build passes**
-
-Run:
-```bash
-uv run mkdocs build --strict
-```
-
-Expected: exit code 0, no warnings.
-
-- [ ] **Step 3: Commit**
-
-```bash
-git add docs/getting-started/
-git commit -m "docs: author Getting Started chapter (3 pages)
-
-Installation covers Windows + Python 3.10 + uv setup and the
-optional EP extras. Quickstart proves the install with a 5-minute
-export + inspect. End-to-End walks ConvNeXt through the full
-pipeline targeting a Qualcomm NPU."
-```
-
----
-
-## Task 8: Final verification and cleanup
-
-**Files:** none modified.
-
-- [ ] **Step 1: Run full strict build**
-
-```bash
-uv run mkdocs build --strict
-```
-
-Expected: exit code 0, no warnings, `site/` produced.
-
-- [ ] **Step 2: Local-serve smoke test**
-
-```bash
-uv run mkdocs serve
-```
-
-Expected: server starts on http://127.0.0.1:8000, banner shows "Documentation built in <N> seconds", no warnings in the log. Open the URL in a browser, click through:
-- Landing → Getting Started → Quickstart
-- Concepts → How ModelKit Works (verify mermaid renders)
-- Commands → Overview → click into 2-3 command pages
-- Samples → ConvNeXt
-
-Verify dark/light toggle works.
-
-Stop the server (Ctrl+C).
-
-- [ ] **Step 3: Confirm no remote pushes**
-
-```bash
-git log origin/feat/mvp..HEAD --oneline
-git status
-```
-
-Expected: lists every commit from Task 1 through Task 7 (about 7 commits) as ahead of `origin/feat/mvp`. Working tree clean.
-
-- [ ] **Step 4: Confirm internal docs untouched**
-
-```bash
-git diff origin/feat/mvp..HEAD -- docs/design/ docs/naming-convention.md docs/pytest-best-practices.md
-```
-
-Expected: no output (no changes to those paths).
-
-- [ ] **Step 5: No commit needed**
-
-If smoke test passes and no internal docs were touched, the MVP is complete. Report the commit log to the user.
-
----
-
-## Self-review notes
-
-- **Spec coverage:** every section in the spec maps to at least one task. Section 5 IA → Tasks 1, 3, 4, 5, 6, 7. Section 8 layout → Task 1. Section 9 MkDocs config → Task 1 Step 2. Section 10 CI → Task 2. Section 11 batches → Tasks 3, 4, 5, 6, 7 (one task per batch). Section 13 acceptance criteria → Task 8.
-- **Type/name consistency:** `winml` (not `wmk`) used throughout. Flag names referenced are present in `src/winml/modelkit/commands/_options.py` (verified). Source paths use `src/winml/modelkit/commands/` consistently.
-- **Placeholder scan:** no TBD/TODO. All agent prompts are concrete; per-page substitutions are listed. The Qwen3 sample is intentionally a placeholder *page*, not a placeholder *task*.
-- **Parallelism is explicit:** Tasks 3, 4, 6, 7 dispatch multiple agents in a single message — the executor (subagent-driven-development) reads this and parallelizes accordingly.
diff --git a/docs/superpowers/plans/2026-05-24-docs-expansion-v2.md b/docs/superpowers/plans/2026-05-24-docs-expansion-v2.md
deleted file mode 100644
index 56b4ef49f..000000000
--- a/docs/superpowers/plans/2026-05-24-docs-expansion-v2.md
+++ /dev/null
@@ -1,996 +0,0 @@
-# Docs Expansion v2 — Implementation Plan
-
-> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
-
-**Goal:** Author 11 new doc pages, rename 2 existing pages with content edits, modify 5 pages, and restructure the MkDocs nav — delivering a Tutorials chapter, a sub-grouped Concepts chapter (Fundamentals + WinML CLI), and polish to Getting Started.
-
-**Architecture:** Six-batch plan executed on `docs/v2`. Foundation first (scaffold + nav + renames), then authoring batches in parallel where pages don't share state, then a cross-link sweep that catches any reference to the renamed files. Verification at every batch is `uv run mkdocs build --strict`.
-
-**Tech Stack:** Python 3.10 + uv, MkDocs Material 9.5+, pymdown-extensions, Bash via Git for Windows for `sed`/`grep` operations.
-
-**Spec:** `docs/superpowers/specs/2026-05-24-docs-expansion-v2-design.md`
-
-**Branch:** `docs/v2` (off `docs/v1`). No remote pushes during execution.
-
----
-
-## Conventions used in this plan
-
-- **CLI source of truth:** `src/winml/modelkit/commands/<name>.py` and `src/winml/modelkit/commands/_options.py`. Every flag mentioned in a doc must exist in source.
-- **Product name in prose:** `winml-cli` (never `wmk` or `ModelKit`).
-- **Existing internal docs that must NOT be modified:** `docs/design/`, `docs/naming-convention.md`, `docs/pytest-best-practices.md`, `docs/superpowers/` (other than this plan and its spec).
-- **Verification at task end:** `uv run mkdocs build --strict` — must exit 0 with no WARNING lines from MkDocs (the Material upstream advisory banner is not a MkDocs WARNING).
-- **Commit style:** Conventional Commits (`docs: ...`). No `Co-Authored-By`. No "Test plan" section.
-- **Parallel agent dispatches** within a task = single message with multiple Agent tool calls, agents do NOT commit (orchestrator batch-commits).
-
----
-
-## Task 1: Scaffold — stubs, renames, nav restructure (Batch A)
-
-**Files (modify):**
-- `mkdocs.yml`
-
-**Files (rename):**
-- `docs/concepts/onnx-and-eps.md` → `docs/concepts/eps-and-devices.md`
-- `docs/concepts/hierarchy.md` → `docs/concepts/hierarchy-and-metadata.md`
-
-**Files (create as stubs — full content authored in later batches):**
-- `docs/tutorials/index.md`
-- `docs/tutorials/npu-convnext.md`
-- `docs/concepts/graphs-and-ir.md`
-- `docs/concepts/tensors-and-dtypes.md`
-- `docs/concepts/primitives-and-pipeline.md`
-- `docs/concepts/config-and-build.md`
-- `docs/concepts/load-and-export.md`
-- `docs/concepts/analyze-and-optimize.md`
-- `docs/concepts/compile-and-epcontext.md`
-- `docs/concepts/perf-and-monitoring.md`
-- `docs/concepts/eval-and-datasets.md`
-
-- [ ] **Step 1: Rename the 2 existing concept files**
-
-Use `git mv` so history is preserved:
-
-```bash
-git mv docs/concepts/onnx-and-eps.md docs/concepts/eps-and-devices.md
-git mv docs/concepts/hierarchy.md docs/concepts/hierarchy-and-metadata.md
-```
-
-The file contents are unchanged at this step; content edits happen in Batch B.
-
-- [ ] **Step 2: Create 11 stub pages**
-
-Each stub has this exact body shape, with `<Page Title>` filled per the table below:
-
-```markdown
-# <Page Title>
-
-!!! note "Coming soon"
-    This page is part of the v2 docs expansion and will be authored next.
-```
-
-| File path | Page Title |
-|---|---|
-| `docs/tutorials/index.md` | `Tutorials` |
-| `docs/tutorials/npu-convnext.md` | `ConvNeXt on NPU` |
-| `docs/concepts/graphs-and-ir.md` | `Models, graphs, and the ONNX IR` |
-| `docs/concepts/tensors-and-dtypes.md` | `Tensors and dtypes` |
-| `docs/concepts/primitives-and-pipeline.md` | `Primitives and pipeline` |
-| `docs/concepts/config-and-build.md` | `Config and build` |
-| `docs/concepts/load-and-export.md` | `Load and export` |
-| `docs/concepts/analyze-and-optimize.md` | `Analyze and optimize` |
-| `docs/concepts/compile-and-epcontext.md` | `Compile and EPContext` |
-| `docs/concepts/perf-and-monitoring.md` | `Perf and monitoring` |
-| `docs/concepts/eval-and-datasets.md` | `Eval and datasets` |
-
-- [ ] **Step 3: Update `mkdocs.yml` nav**
-
-Replace the existing `nav:` block with this exact block (the rest of the file — `site_name`, `theme`, `plugins`, `markdown_extensions`, `exclude_docs` — stays untouched):
-
-```yaml
-nav:
-  - Home: index.md
-  - Getting Started:
-      - Installation: getting-started/installation.md
-      - Quickstart: getting-started/quickstart.md
-      - End-to-End — HF → NPU: getting-started/end-to-end.md
-  - Concepts:
-      - Fundamentals:
-          - How winml-cli works: concepts/how-it-works.md
-          - Models, graphs, and the ONNX IR: concepts/graphs-and-ir.md
-          - Tensors and dtypes: concepts/tensors-and-dtypes.md
-          - Execution Providers and devices: concepts/eps-and-devices.md
-          - Quantization and QDQ: concepts/quantization.md
-          - Hierarchy and ONNX metadata: concepts/hierarchy-and-metadata.md
-          - BuildConfig and kits: concepts/buildconfig.md
-      - WinML CLI:
-          - Primitives and pipeline: concepts/primitives-and-pipeline.md
-          - Config and build: concepts/config-and-build.md
-          - Load and export: concepts/load-and-export.md
-          - Analyze and optimize: concepts/analyze-and-optimize.md
-          - Compile and EPContext: concepts/compile-and-epcontext.md
-          - Perf and monitoring: concepts/perf-and-monitoring.md
-          - Eval and datasets: concepts/eval-and-datasets.md
-  - Commands:
-      - Overview: commands/overview.md
-      - Discover:
-          - sys: commands/sys.md
-          - inspect: commands/inspect.md
-          - hub: commands/hub.md
-          - analyze: commands/analyze.md
-      - Configure:
-          - config: commands/config.md
-          - optimize: commands/optimize.md
-      - Build:
-          - export: commands/export.md
-          - quantize: commands/quantize.md
-          - compile: commands/compile.md
-          - build: commands/build.md
-      - Measure:
-          - perf: commands/perf.md
-          - eval: commands/eval.md
-  - Samples:
-      - ConvNeXt — Primitives Walkthrough: samples/convnext-primitives.md
-      - BERT — Config + Build + Perf: samples/bert-config-build.md
-      - Qwen3 — Composite Models: samples/qwen3-composite.md
-  - Tutorials:
-      - Overview: tutorials/index.md
-      - ConvNeXt on NPU: tutorials/npu-convnext.md
-  - Reference: reference/index.md
-  - Troubleshooting: troubleshooting.md
-  - Contributing: contributing.md
-```
-
-- [ ] **Step 4: Verify strict build**
-
-```bash
-uv run mkdocs build --strict
-```
-
-Expected: exit 0, message `Documentation built in <N> seconds`, no WARNING lines.
-
-If `--strict` errors with "doc file not found" for any of the 11 new files or the 2 renamed files, fix the path before continuing.
-
-- [ ] **Step 5: Commit**
-
-```bash
-git add mkdocs.yml docs/concepts/ docs/tutorials/
-git commit -m "docs: scaffold v2 expansion (stubs, renames, nav restructure)
-
-Renames:
-- concepts/onnx-and-eps.md -> concepts/eps-and-devices.md
-- concepts/hierarchy.md -> concepts/hierarchy-and-metadata.md
-
-Stubs created (content authored in next batches):
-- tutorials/index.md, tutorials/npu-convnext.md
-- concepts/graphs-and-ir.md, concepts/tensors-and-dtypes.md
-- concepts/{primitives-and-pipeline,config-and-build,load-and-export,
-  analyze-and-optimize,compile-and-epcontext,perf-and-monitoring,
-  eval-and-datasets}.md
-
-Nav restructured: Concepts sub-grouped into Fundamentals + WinML CLI;
-Tutorials chapter inserted between Samples and Reference."
-```
-
----
-
-## Task 2: Concepts — Fundamentals authoring (Batch B)
-
-**Files (full content authoring or content editing):**
-- Modify: `docs/concepts/how-it-works.md` (rename-in-nav only — content kept; included here so the reviewer notices it)
-- Modify: `docs/concepts/eps-and-devices.md` (already renamed in Task 1; small content reframe to match the new pair-topic title)
-- Modify: `docs/concepts/hierarchy-and-metadata.md` (already renamed; broaden content to cover other metadata, not just `winml.hierarchy.tag`)
-- Modify: `docs/concepts/buildconfig.md` (rename-in-nav only — content kept)
-- Modify: `docs/concepts/quantization.md` (tighten — dtype family content moves out to Tensors page)
-- Author: `docs/concepts/graphs-and-ir.md`
-- Author: `docs/concepts/tensors-and-dtypes.md`
-
-In total: **2 new pages authored, 3 pages content-edited, 2 pages untouched-but-renamed-in-nav**. The 2 untouched-in-nav pages need no editing in this batch.
-
-### Voice anchor (read before dispatching agents)
-
-The 5 existing Fundamentals pages (now renamed/touched) set the voice: clear, direct, 400–700 words, opens with a 1–2 paragraph lead, uses H2 sections, ends with a `## See also` block of 2–4 relative links. Every flag and symbol cited is verified in `src/winml/modelkit/`. No marketing language.
-
-- [ ] **Step 1: Dispatch parallel author agents (wave 1)**
-
-Send all 4 `Agent` tool calls in a single message; `subagent_type: general-purpose`, `model: sonnet`. Agents write only; the orchestrator commits.
-
-#### Agent B1 — Author `concepts/graphs-and-ir.md` (new)
-
-```
-You are authoring ONE Concepts page for the winml-cli user-facing docs. Output: overwrite the stub at C:\Users\zhengte\BYOM\ModelKits\mvp\docs\concepts\graphs-and-ir.md. DO NOT commit.
-
-Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
-
-Title: # Models, graphs, and the ONNX IR
-Length: 400–700 words of prose.
-
-Sources to read first (for accuracy — do not copy verbatim):
-- src/winml/modelkit/onnx/ (directory; look at metadata.py and model detection helpers)
-- src/winml/modelkit/export/ (directory; opset version selection)
-- An external reference: https://github.com/onnx/onnx/blob/main/docs/IR.md (treat as background, do not link)
-
-Body requirements:
-1. Lead (1–2 paragraphs): what a model file is at rest; the model is a graph; graphs are described in the ONNX IR; opsets version the operator set.
-2. H2 — "What is in a .onnx file": inputs, outputs, nodes (operators), initializers (weights), metadata. Use one short bulleted list.
-3. H2 — "Graphs as IR": brief explanation that ONNX is an Intermediate Representation — a static computation graph that's portable across runtimes. Mention nodes have inputs/outputs that wire into the graph; this enables shape inference and EP-targeted compilation.
-4. H2 — "Opsets and versioning": opset is a snapshot of the operator catalog at a specific version. winml-cli's `winml export` defaults to opset 17 (verify in src/winml/modelkit/export/ or commands/export.py). New opsets unlock new ops; EPs may not support the latest opset.
-5. H2 — "See also": 2–4 relative links. Valid targets (relative to docs/concepts/):
-   - eps-and-devices.md
-   - tensors-and-dtypes.md
-   - hierarchy-and-metadata.md
-   - ../commands/inspect.md
-   - ../commands/export.md
-
-Rules:
-- Use winml-cli (never ModelKit, never wmk).
-- Verify opset default by reading the source. If you cannot confirm 17, state the actual default you found.
-- No "TBD", no placeholders.
-- Code blocks: ```bash for invocations, ```text for output.
-
-Verify the strict build after writing:
-  uv run mkdocs build --strict 2>&1 | tail -3
-
-Expected: exit 0, no WARNING lines.
-
-Return: status (DONE/DONE_WITH_CONCERNS/BLOCKED), word count estimate, last 3 lines of mkdocs build output, the opset version you cited and where you confirmed it.
-```
-
-#### Agent B2 — Author `concepts/tensors-and-dtypes.md` (new)
-
-```
-You are authoring ONE Concepts page for the winml-cli user-facing docs. Output: overwrite the stub at C:\Users\zhengte\BYOM\ModelKits\mvp\docs\concepts\tensors-and-dtypes.md. DO NOT commit.
-
-Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
-
-Title: # Tensors and dtypes
-Length: 500–800 words of prose (slightly longer than typical Fundamentals page because this absorbs the dtype content from the quantization page).
-
-Sources to read first:
-- src/winml/modelkit/commands/_options.py (for _KNOWN_PRECISIONS)
-- src/winml/modelkit/onnx/ (for I/O tensor spec and shape inference helpers)
-- src/winml/modelkit/commands/quantize.py (for activation_type / weight_type flags)
-- src/winml/modelkit/commands/export.py (for --input-specs and --shape-config flags)
-
-Body requirements:
-1. Lead (1–2 paragraphs): three roles for tensors in a model — weights (static parameters), activations (intermediate values at inference), I/O tensors (inputs and outputs at the graph boundary). Each role has a dtype that may differ.
-2. H2 — "Weights and activations": one paragraph explaining the distinction and why it matters (memory footprint, quantization granularity, EP support tiers).
-3. H2 — "Dtype options in winml-cli": markdown table listing the precision strings from _KNOWN_PRECISIONS in _options.py. Columns: Precision | Weight dtype | Activation dtype | Notes. Cover at least auto, fp32, fp16, int8, int16, w8a8, w8a16, w4a16.
-4. H2 — "Static vs dynamic shapes": one paragraph. ONNX supports symbolic dimensions ("batch", "sequence") that are resolved at runtime. winml-cli's --input-specs and --shape-config flags let you constrain these at export time. Some EPs (QNN) require fully static shapes; others (DirectML) accept dynamic.
-5. H2 — "See also": 2–4 relative links. Valid targets:
-   - quantization.md
-   - eps-and-devices.md
-   - graphs-and-ir.md
-   - ../commands/export.md
-   - ../commands/quantize.md
-
-Rules:
-- Verify every precision string against _KNOWN_PRECISIONS. Do not invent precisions.
-- Verify --input-specs and --shape-config exist on winml export (read export.py).
-- Use winml-cli (never ModelKit/wmk).
-- No "TBD", no placeholders.
-
-Verify: uv run mkdocs build --strict 2>&1 | tail -3
-
-Return: status, word count, build output last 3 lines, the precision list you enumerated (verbatim).
-```
-
-#### Agent B3 — Edit `concepts/eps-and-devices.md` (rename done; content reframe)
-
-```
-You are content-editing ONE Concepts page for the winml-cli user-facing docs. The page is at C:\Users\zhengte\BYOM\ModelKits\mvp\docs\concepts\eps-and-devices.md (just renamed from onnx-and-eps.md; content is the previous version). DO NOT commit.
-
-Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
-
-Goal: reframe the page title and lead from "ONNX & Execution Providers" to "Execution Providers and devices". The ONNX intro content should be trimmed (it now lives in graphs-and-ir.md) and the EP × Device matrix should remain front-and-center.
-
-Length: 400–700 words after editing.
-
-Specific edits:
-1. Change the H1 to: # Execution Providers and devices
-2. Rewrite the lead (1–2 paragraphs): what an EP is, what a device is, how winml-cli's --device and --ep flags map to them. Drop the "what is ONNX" intro paragraph (now covered by graphs-and-ir.md). If you reference ONNX, link to ../concepts/graphs-and-ir.md.
-3. Keep the EP × Device table (and update it if you find a missed EP in src/winml/modelkit/sysinfo/).
-4. Keep the "Device vs EP on the CLI" section.
-5. Update the "## See also" block to include a link to graphs-and-ir.md and tensors-and-dtypes.md if not already present. Keep total at 2–4 links.
-
-Rules:
-- Use winml-cli (never ModelKit/wmk). Replace any "ModelKit" string you find inside the page with "winml-cli".
-- Do not invent EPs. Verify against src/winml/modelkit/sysinfo/.
-- No "TBD", no placeholders.
-
-Verify: uv run mkdocs build --strict 2>&1 | tail -3
-
-Return: status, word count after editing, last 3 lines of build, list of EP names referenced in the final table.
-```
-
-#### Agent B4 — Edit `concepts/hierarchy-and-metadata.md` (rename done; broaden content)
-
-```
-You are content-editing ONE Concepts page for the winml-cli user-facing docs. The page is at C:\Users\zhengte\BYOM\ModelKits\mvp\docs\concepts\hierarchy-and-metadata.md (just renamed from hierarchy.md; current content focuses only on hierarchy.tag). DO NOT commit.
-
-Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
-
-Goal: broaden the page from "what is hierarchy_tag" to "what metadata winml-cli writes into the ONNX model, and why each entry exists."
-
-Length target: 500–700 words after editing.
-
-Specific edits:
-1. Change the H1 to: # Hierarchy and ONNX metadata
-2. Rewrite the lead (1–2 paragraphs): ONNX files carry metadata_props key/value entries beyond the graph itself. winml-cli writes several of these. The most important is winml.hierarchy.tag (the PyTorch module-path tag), but there are others.
-3. New H2 — "Metadata winml-cli writes": markdown table. Columns: Key | Set by | Purpose. Inspect src/winml/modelkit/onnx/metadata.py and src/winml/modelkit/export/htp/exporter.py to find the canonical list. Include winml.hierarchy.tag at minimum.
-4. Existing H2 — "What hierarchy_tag enables": keep the existing content about per-module benchmarking (winml perf --module) and the --no-hierarchy / --clean-onnx flag on winml export.
-5. Existing H2 — "See also": keep but add tensors-and-dtypes.md as a link.
-
-Rules:
-- Verify every metadata key by reading the source. State the file:line where you found each key.
-- Use winml-cli (never ModelKit/wmk). Replace any "ModelKit" string with "winml-cli".
-- No "TBD", no placeholders.
-
-Verify: uv run mkdocs build --strict 2>&1 | tail -3
-
-Return: status, word count, build output last 3 lines, the list of metadata keys you documented with file:line evidence.
-```
-
-- [ ] **Step 2: Edit `concepts/quantization.md` (tighten — move dtype content to Tensors page)**
-
-Read the current file first; find any paragraph or table that primarily explains the dtype family (fp32/fp16/int8/int16/compound types). Move that content (logically — by trimming here, since the new content already lives in tensors-and-dtypes.md after Step 1).
-
-Make these specific edits to `docs/concepts/quantization.md`:
-
-- If the page has an H2 like "Precision options" that lists the dtype family, replace its body with a short sentence: "See [Tensors and dtypes](tensors-and-dtypes.md) for the full precision family. This page focuses on the quantization algorithm, calibration, and the QDQ pattern."
-- Otherwise no changes — the page can keep its calibration and QDQ content.
-
-The orchestrator does this edit directly (not via agent) since it's a one-line surgical change.
-
-- [ ] **Step 3: Edit `concepts/how-it-works.md` and `concepts/buildconfig.md` — verify they don't reference renamed files**
-
-Read each file. If they contain links like `[ONNX & Execution Providers](onnx-and-eps.md)` or `[Hierarchy](hierarchy.md)`, update them to `eps-and-devices.md` and `hierarchy-and-metadata.md` respectively. Otherwise no changes.
-
-- [ ] **Step 4: Verify strict build**
-
-```bash
-uv run mkdocs build --strict 2>&1 | tail -3
-```
-
-Expected: exit 0, no WARNING lines.
-
-If the build complains about broken links pointing to `onnx-and-eps.md` or `hierarchy.md`, fix those references in whatever file they live in (these are the inbound links flagged in spec §7).
-
-- [ ] **Step 5: Commit (Fundamentals batch)**
-
-```bash
-git add docs/concepts/
-git commit -m "docs(concepts/fundamentals): author graphs-and-ir + tensors-and-dtypes; reframe eps-and-devices + hierarchy-and-metadata after rename
-
-- New: graphs-and-ir.md (models, graphs, IR, opsets)
-- New: tensors-and-dtypes.md (weights/activations/I-O tensors, precision
-  family, static-vs-dynamic shapes)
-- Reframed: eps-and-devices.md (drops the ONNX intro, now covered by
-  graphs-and-ir.md; keeps EP × Device matrix)
-- Broadened: hierarchy-and-metadata.md (now covers all metadata
-  winml-cli writes, not only winml.hierarchy.tag)
-- Tightened: quantization.md (dtype family content moved to
-  tensors-and-dtypes.md to remove duplication)"
-```
-
----
-
-## Task 3: Concepts — WinML CLI authoring (Batch C)
-
-**Files (all new, full content authoring):**
-- `docs/concepts/primitives-and-pipeline.md`
-- `docs/concepts/config-and-build.md`
-- `docs/concepts/load-and-export.md`
-- `docs/concepts/analyze-and-optimize.md`
-- `docs/concepts/compile-and-epcontext.md`
-- `docs/concepts/perf-and-monitoring.md`
-- `docs/concepts/eval-and-datasets.md`
-
-### Voice anchor
-
-These are **workflow-concept pages**, not command-reference pages. Each explains the **why** and **when**, cross-linking to the per-command reference at `docs/commands/<name>.md` for **what**. No flag tables — that's the command-reference page's job.
-
-- [ ] **Step 1: Dispatch parallel author agents (4 agents, 7 pages)**
-
-Single message, 4 Agent tool calls, `model: sonnet`, agents write only.
-
-| Agent | Pages owned |
-|---|---|
-| C1 | `primitives-and-pipeline.md`, `config-and-build.md` |
-| C2 | `load-and-export.md`, `analyze-and-optimize.md` |
-| C3 | `compile-and-epcontext.md`, `perf-and-monitoring.md` |
-| C4 | `eval-and-datasets.md` |
-
-#### Reusable agent prompt template
-
-```
-You are authoring Concepts pages for the winml-cli user-facing docs. Output: write the markdown files listed below. DO NOT commit.
-
-Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
-
-Voice and shape per page:
-- H1 = page title (given below per page).
-- 400–700 words of prose.
-- Lead (1–2 paragraphs): what conceptual tension/pair this page covers and why it matters.
-- 2–4 H2 sections.
-- Closing "## See also" with 2–4 relative links.
-- These are workflow-concept pages: explain WHY and WHEN. The /commands/ pages cover WHAT flags.
-- No flag tables. If you need to mention a flag, do it inline in prose.
-- Use winml-cli throughout (never ModelKit/wmk).
-
-Source verification rule: every flag, file, or symbol you cite must exist in src/winml/modelkit/. Verify by reading or running uv run winml <command> --help.
-
-Pages assigned to you:
-
-[PAGE BLOCKS — see below]
-
-After all your pages are written, run:
-  uv run mkdocs build --strict 2>&1 | tail -3
-
-Expected: exit 0, no WARNING lines.
-
-Return: status (DONE/DONE_WITH_CONCERNS/BLOCKED), per-page word count, build output last 3 lines, anything surprising (a flag that doesn't exist where the prompt says it should, a source claim you couldn't verify).
-```
-
-#### Page blocks — Agent C1
-
-```
-PAGE 1 — concepts/primitives-and-pipeline.md
-Title: # Primitives and pipeline
-Theme: Two ways to use winml-cli — invoke individual primitive commands (export, optimize, quantize, compile, perf, eval) one at a time, or use `winml build` as the wrapper that runs them all from a config. Teach when to choose which: primitives for learning / debugging / one-off variations; build for production / CI / reproducibility.
-
-Required H2 sections:
-- "The primitive commands" — list the staged commands with a one-line role each. Reference the order in docs/concepts/how-it-works.md.
-- "The pipeline wrapper" — winml build orchestrates the same stages from a single WinMLBuildConfig.
-- "When to choose which" — bullets contrasting the two.
-- "See also" — 2–4 links. Valid: how-it-works.md, config-and-build.md, ../commands/build.md, ../samples/convnext-primitives.md, ../samples/bert-config-build.md.
-
-PAGE 2 — concepts/config-and-build.md
-Title: # Config and build
-Theme: Producer/consumer pair. winml config generates a WinMLBuildConfig JSON; winml build consumes it. Teach the reproducibility angle (version configs, share across CI, replay later), and the override semantics (CLI flags can override config values).
-
-Required H2 sections:
-- "Generating a config" — short prose about winml config, --task, --no-quant/--no-compile, --trust-remote-code. No full flag table.
-- "Consuming a config" — winml build -c <file>.json --output-dir or --use-cache (exactly one of them). The build runs the stages defined in the config.
-- "Overrides at run time" — flags like --no-quant, --no-compile, --no-optimize on winml build override the corresponding config sections without editing the file. Useful for ad-hoc skips.
-- "Why version a config" — three concrete reasons: reproducibility, CI, sharing.
-- "See also" — 2–4 links. Valid: buildconfig.md, primitives-and-pipeline.md, ../commands/config.md, ../commands/build.md.
-```
-
-#### Page blocks — Agent C2
-
-```
-PAGE 1 — concepts/load-and-export.md
-Title: # Load and export
-Theme: The first conceptual stage of the pipeline — bring a model into memory (from Hugging Face Hub or a local checkpoint), then transform it to ONNX. Teach the load step (the loader module in src/winml/modelkit/loader/) and the export step (the winml export command).
-
-NOTE: "load" is not a CLI verb. The loader is internal. Pair this page is "stage 1 load" + "stage 1 export"; both are part of getting a model into ONNX form.
-
-Required H2 sections:
-- "Loading a model" — winml-cli loads from HF Hub (with cache at ~/.cache/huggingface) or from a local PyTorch checkpoint. winml inspect is the user-facing way to check the loader picked it up correctly. Trust remote code with --trust-remote-code.
-- "Exporting to ONNX" — winml export converts the loaded model to ONNX. Mentions hierarchy preservation (see hierarchy-and-metadata.md), the --no-hierarchy / --clean-onnx flag, and --dynamo for an alternative export backend.
-- "Where it goes wrong" — task mismatch (use --task), shape issues (use --shape-config or --input-specs), custom modules (use --torch-module).
-- "See also" — 2–4 links. Valid: hierarchy-and-metadata.md, graphs-and-ir.md, ../commands/inspect.md, ../commands/export.md.
-
-PAGE 2 — concepts/analyze-and-optimize.md
-Title: # Analyze and optimize
-Theme: Two graph-quality commands that work together. winml analyze checks EP compatibility and reports issues; winml optimize applies fusions and rewrites. They share --optim-config and often run together via winml build's analyzer/optimizer loop.
-
-Required H2 sections:
-- "What analyze does" — runs operator coverage, shape inference, and runtime checks against a target EP; outputs a report. Reference the --format choices.
-- "What optimize does" — applies graph fusions (GELU, LayerNorm, MatMul+Add) and pattern rewrites. References --list-capabilities and the --enable-X / --disable-X dynamic flags. Briefly mention --list-rewrites for the pattern-rewrite family.
-- "The analyzer/optimizer loop" — winml build runs analyze → optimize → analyze → optimize up to --max-optim-iterations times to converge. Mention --no-analyze for deterministic single-pass builds.
-- "See also" — 2–4 links. Valid: compile-and-epcontext.md, primitives-and-pipeline.md, ../commands/analyze.md, ../commands/optimize.md.
-```
-
-#### Page blocks — Agent C3
-
-```
-PAGE 1 — concepts/compile-and-epcontext.md
-Title: # Compile and EPContext
-Theme: What winml compile actually produces. Some EPs (especially QNN) bake a binary blob — the EP context — into the ONNX file at compile time. Compiled models load faster at runtime because the EP-specific setup is pre-computed.
-
-Required H2 sections:
-- "What compilation produces" — for ORT-compatible EPs the compile step writes an ONNX file that the runtime can load directly; for QNN the file embeds a binary EPContext blob.
-- "Embedded vs external EPContext" — winml compile --embed controls whether the QNN context is inlined into the .onnx or stored as a sidecar binary. Trade-offs: inline = one file but bigger; sidecar = smaller .onnx but two files.
-- "Why pre-compile" — runtime cold-start cost. The first inference on a fresh model loads + JIT-compiles; a pre-compiled model loads ready-to-run.
-- "Skipping validation" — --no-validate exists for fast iteration; explain when not to use it (production builds).
-- "See also" — 2–4 links. Valid: eps-and-devices.md, analyze-and-optimize.md, ../commands/compile.md, ../commands/build.md.
-
-PAGE 2 — concepts/perf-and-monitoring.md
-Title: # Perf and monitoring
-Theme: winml perf measures latency/throughput. The --monitor flag adds a live hardware utilization chart (NPU primarily); --op-tracing produces per-operator timing breakdowns. Together they let you see both end-to-end numbers and where the time goes.
-
-Required H2 sections:
-- "What perf measures" — iterations, warmup, batch size; the output is latency p50/p90/mean and throughput. Mention --device for the EP target.
-- "Live monitoring" — --monitor opens a terminal chart of NPU utilization while the benchmark runs. Useful for confirming the workload actually hit the NPU.
-- "Per-operator tracing" — --op-tracing basic|detail produces breakdowns. Useful for finding hot ops.
-- "Per-module benchmarking" — --module <substring> benchmarks just one HF/PyTorch module from the hierarchy (links to hierarchy-and-metadata.md).
-- "See also" — 2–4 links. Valid: hierarchy-and-metadata.md, eval-and-datasets.md, ../commands/perf.md.
-```
-
-#### Page blocks — Agent C4
-
-```
-PAGE 1 — concepts/eval-and-datasets.md
-Title: # Eval and datasets
-Theme: winml eval measures accuracy, not speed. It needs a dataset (typically from Hugging Face) and a way to bind dataset columns to model inputs/outputs. Teach when to use eval (always after quantization), how to point it at a dataset, and the column-mapping pattern.
-
-Required H2 sections:
-- "What eval reports" — the metric depends on the task (accuracy for classification, mAP for detection, etc.). Output is a JSON with per-metric numbers; --format controls the form.
-- "Picking a dataset" — --dataset accepts a Hugging Face dataset path; --dataset-name picks a config; --split selects which split (validation by default); --samples caps the count for quick checks. Note --streaming for large datasets.
-- "Column mapping" — --column key=value to bind dataset columns to model inputs; --label-mapping for label index translation.
-- "Why eval after quantization" — quantization is lossy; the only way to know you didn't break the model is to check accuracy. Link to quantization.md.
-- "See also" — 2–4 links. Valid: quantization.md, perf-and-monitoring.md, ../commands/eval.md.
-```
-
-- [ ] **Step 2: Verify strict build**
-
-```bash
-uv run mkdocs build --strict 2>&1 | tail -3
-```
-
-Expected: exit 0, no WARNING lines.
-
-- [ ] **Step 3: Commit (WinML CLI batch)**
-
-```bash
-git add docs/concepts/
-git commit -m "docs(concepts/winml-cli): author 7 workflow-concept pages
-
-Each page covers a winml-cli workflow pair, explaining the WHY and
-WHEN of using the commands together. Pages: primitives-and-pipeline,
-config-and-build, load-and-export, analyze-and-optimize,
-compile-and-epcontext, perf-and-monitoring, eval-and-datasets.
-
-No flag tables (those live on the per-command reference pages).
-Every flag and symbol verified against src/winml/modelkit/."
-```
-
----
-
-## Task 4: Tutorials authoring (Batch D)
-
-**Files (full content authoring):**
-- `docs/tutorials/index.md` — short overview (~150 words)
-- `docs/tutorials/npu-convnext.md` — the long-form tutorial (1500–2500 words)
-
-### Why a single agent owns the tutorial
-
-The ConvNeXt-on-NPU tutorial is one long page where prose voice and step transitions matter. A single agent produces more consistent voice than splitting it.
-
-- [ ] **Step 1: Dispatch 1 agent for the tutorial + 1 agent for the index**
-
-Single message, 2 parallel agents (different files, no conflict).
-
-#### Agent D1 — Author `tutorials/npu-convnext.md`
-
-```
-You are authoring the flagship tutorial for the winml-cli docs site. Output: overwrite C:\Users\zhengte\BYOM\ModelKits\mvp\docs\tutorials\npu-convnext.md. DO NOT commit.
-
-Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
-
-Title: # ConvNeXt on NPU
-Model: facebook/convnext-tiny-224
-Length: 1500–2500 words of prose (excluding code blocks).
-Tone: classroom-style, prescriptive, every step has an explicit "what just happened" callout. Source: adapted from internal WinHECLab lab (saved at temp/winheclab-readme.md as background reference).
-
-Required structure:
-
-# ConvNeXt on NPU
-
-[Lead — 2–3 paragraphs:
-- Goal: take facebook/convnext-tiny-224 from Hugging Face to a benchmark-ready compiled model running on NPU.
-- Primary hardware: Copilot+PC with Snapdragon X-class NPU (or comparable). Explicit CPU/DirectML fallback documented throughout.
-- Two sections: Section A builds the model using primitive commands (so you understand each stage); Section B does the same thing with `winml build` (so you see the wrapper).]
-
-## Prerequisites
-
-- Windows 11 24H2 (required for NPU support)
-- Copilot+PC with NPU (40+ TOPS recommended; CPU/DirectML works as fallback)
-- Python 3.10, uv installed
-- winml-cli installed (see [Installation](../getting-started/installation.md))
-- For NPU: QNN SDK (set QNN_SDK_ROOT env var) or OpenVINO
-
-## Section A — Primitive commands
-
-### Step 1: Inspect the model
-
-[bash block: uv run winml inspect -m facebook/convnext-tiny-224]
-[text block: short abbreviated expected output]
-[!!! note "What we just did" — explains: confirmed task detection, model class, exporter compatibility before transformation.]
-
-### Step 2: Generate a build config
-
-[bash block: uv run winml config -m facebook/convnext-tiny-224 -o convnext_config.json]
-[!!! note callout: this is optional for primitives but useful for versioning.]
-
-### Step 3: Export to ONNX
-
-[bash block: uv run winml export -m facebook/convnext-tiny-224 -o convnext.onnx]
-[Link to ../concepts/hierarchy-and-metadata.md re: what hierarchy preservation adds.]
-
-### Step 4: Analyze for EP compatibility
-
-[bash block: uv run winml analyze -m convnext.onnx --ep qnn]
-(Show that analyze reports operator coverage and any flagged issues.)
-
-### Step 5: Optimize the graph
-
-[bash block: uv run winml optimize -m convnext.onnx -o convnext_optim.onnx]
-
-### Step 6: Quantize
-
-[bash block: uv run winml quantize -m convnext_optim.onnx -o convnext_int8.onnx --precision int8 --samples 32]
-[Link to ../concepts/quantization.md.]
-
-### Step 7: Compile for the target EP
-
-Use pymdownx.tabbed for QNN vs OpenVINO:
-
-=== "QNN (Snapdragon NPU)"
-
-    ```bash
-    # Requires QNN_SDK_ROOT env var set
-    uv run winml compile -m convnext_int8.onnx -o convnext_qnn.onnx --device npu
-    ```
-
-=== "OpenVINO (Intel CPU/GPU/NPU)"
-
-    ```bash
-    uv run winml compile -m convnext_int8.onnx -o convnext_ov.onnx --device npu --ep openvino
-    ```
-
-=== "CPU fallback"
-
-    ```bash
-    uv run winml compile -m convnext_int8.onnx -o convnext_cpu.onnx --device cpu
-    ```
-
-[Link to ../concepts/compile-and-epcontext.md.]
-
-### Step 8: Benchmark
-
-Tabbed by EP:
-
-=== "QNN NPU"
-
-    ```bash
-    uv run winml perf -m convnext_qnn.onnx --device npu --iterations 50 --monitor
-    ```
-
-=== "OpenVINO NPU"
-
-    ```bash
-    uv run winml perf -m convnext_ov.onnx --device npu --ep openvino --iterations 50 --monitor
-    ```
-
-=== "CPU"
-
-    ```bash
-    uv run winml perf -m convnext_cpu.onnx --device cpu --iterations 50
-    ```
-
-[text block: a short example latency/throughput snippet.]
-
-### Step 9 (optional): Evaluate accuracy
-
-[bash block: uv run winml eval -m convnext_int8.onnx --dataset imagenet-1k --split validation --samples 100 --device npu]
-[Link to ../concepts/eval-and-datasets.md.]
-
-## Section B — One-shot with `winml build`
-
-```bash
-uv run winml build -c convnext_config.json --output-dir convnext_out/
-```
-
-[Brief prose: this single command runs export → optimize → quantize → compile and produces the same final artifact. Use --no-quant / --no-compile / --no-optimize to skip stages.]
-
-[Show a benchmark step at the end using the artifact from convnext_out/.]
-
-## Where to go next
-
-- [Concepts → How winml-cli works](../concepts/how-it-works.md)
-- [Concepts → Compile and EPContext](../concepts/compile-and-epcontext.md)
-- [Samples → ConvNeXt primitives walkthrough](../samples/convnext-primitives.md) (the CPU/GPU/NPU device comparison version of this material)
-- [Commands → Overview](../commands/overview.md)
-
-## See also
-
-(2–4 relative links — pick the most relevant from above.)
-
-Rules:
-- Use winml-cli (never ModelKit/wmk).
-- Every flag and command must exist in src/winml/modelkit/. Verify by running uv run winml <command> --help.
-- For unverifiable claims (e.g. --device value names), DOUBLE-CHECK against source.
-- Use pymdownx.tabbed syntax verbatim: `=== "Label"` then blank line then 4-space-indented code block.
-- Output snippets use ```text and stay short (5–10 lines).
-- No "TBD", no placeholders.
-- Adapt the WinHECLab content but rewrite in our voice (drop "Step N" classroom numbering for primary headings; keep step numbering inside Section A only).
-- DO NOT reference Visual Studio, Windows App SDK, C#, or any GUI app — Python/CLI only.
-
-Verify: uv run mkdocs build --strict 2>&1 | tail -3
-
-Return: status, total word count (prose only, exclude code blocks), build output last 3 lines, and confirmation that tabbed blocks rendered (mkdocs --strict accepts them).
-```
-
-#### Agent D2 — Author `tutorials/index.md`
-
-```
-You are authoring the Tutorials chapter overview page. Output: overwrite C:\Users\zhengte\BYOM\ModelKits\mvp\docs\tutorials\index.md. DO NOT commit.
-
-Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
-
-Title: # Tutorials
-Length: 100–250 words.
-
-Required structure:
-
-# Tutorials
-
-[One paragraph framing: tutorials are linear, prescriptive, end-to-end walkthroughs. For lookup, use Concepts (the WHY/WHEN) or Commands (the WHAT). Tutorials sit alongside Samples (which are reference-style demos comparing options).]
-
-## Available tutorials
-
-| Tutorial | What you'll build | Hardware |
-|---|---|---|
-| [ConvNeXt on NPU](npu-convnext.md) | A quantized ConvNeXt image classifier compiled for Snapdragon NPU (with CPU/DirectML fallback) | Copilot+PC NPU primary; CPU works as fallback |
-
-[One short closing paragraph noting more tutorials coming.]
-
-Rules:
-- Use winml-cli (never ModelKit/wmk).
-- No "TBD", no placeholders.
-
-Verify: uv run mkdocs build --strict 2>&1 | tail -3
-
-Return: status, word count, build output last 3 lines.
-```
-
-- [ ] **Step 2: Verify strict build**
-
-```bash
-uv run mkdocs build --strict 2>&1 | tail -3
-```
-
-- [ ] **Step 3: Commit (Tutorials batch)**
-
-```bash
-git add docs/tutorials/
-git commit -m "docs(tutorials): add Tutorials chapter with ConvNeXt-on-NPU walkthrough
-
-- tutorials/index.md: chapter overview + tutorial table
-- tutorials/npu-convnext.md: end-to-end ConvNeXt build on NPU,
-  adapted from the internal WinHECLab lab. Primitives walkthrough
-  (Section A) covers each stage in turn; one-shot section (Section B)
-  shows the same result via winml build. QNN, OpenVINO, and CPU
-  paths shown via tabbed code blocks.
-
-Python/winml-cli only — Visual Studio / Windows App SDK / C# app
-content from the lab is deliberately out of scope for this iteration."
-```
-
----
-
-## Task 5: Getting Started polish (Batch E)
-
-**Files (content edits to existing pages):**
-- `docs/getting-started/installation.md`
-- `docs/getting-started/quickstart.md`
-- `docs/getting-started/end-to-end.md`
-
-- [ ] **Step 1: Dispatch 3 parallel agents**
-
-Single message, 3 Agent tool calls, `model: sonnet`. Agents write only.
-
-#### Agent E1 — Edit `installation.md`
-
-```
-You are editing the winml-cli Installation page. File: C:\Users\zhengte\BYOM\ModelKits\mvp\docs\getting-started\installation.md. DO NOT commit.
-
-Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
-
-Goal:
-1. Rewrite the prerequisites table to be more specific about NPU requirements.
-2. Add a fallback callout for users without NPU hardware.
-
-Specific edits:
-- Replace the existing "Prerequisites" section with a table that includes:
-  - Windows 11 24H2 or later (required for NPU support)
-  - Copilot+PC with NPU (40+ TOPS NPU recommended for NPU acceleration; not required for CPU/DirectML)
-  - Python 3.10 (the project pins requires-python = ">=3.10,<3.11"; verify before stating)
-  - uv (link https://github.com/astral-sh/uv)
-  - git
-- After the prereqs table, add a !!! note "No NPU?" callout: explain that --device auto falls back to CPU or DirectML, and the rest of the docs apply with minor flag differences.
-- Otherwise keep the page (Install, Verify, Optional extras, Next steps sections all stay).
-- Verify the existing extras text matches pyproject.toml lines 79–82.
-
-Rules:
-- Use winml-cli (never ModelKit/wmk).
-- Keep page under 600 words.
-
-Verify: uv run mkdocs build --strict 2>&1 | tail -3
-
-Return: status, word count, build output last 3 lines.
-```
-
-#### Agent E2 — Edit `quickstart.md`
-
-```
-You are editing the winml-cli Quickstart page. File: C:\Users\zhengte\BYOM\ModelKits\mvp\docs\getting-started\quickstart.md. DO NOT commit.
-
-Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
-
-Goal: add winml sys --list-device --list-ep to the verify step. Otherwise leave the page alone.
-
-Specific edit:
-- Wherever the page currently shows `uv run winml sys` as the verify command (probably in a "Verify the install" or similar section), replace it with:
-
-  ```bash
-  uv run winml sys --list-device --list-ep
-  ```
-
-- Update the surrounding prose to mention that this enumerates available devices and execution providers (versus `winml sys` alone, which shows everything).
-- No other changes.
-
-Rules:
-- Use winml-cli (never ModelKit/wmk).
-- Keep page under 600 words.
-
-Verify: uv run mkdocs build --strict 2>&1 | tail -3
-
-Return: status, word count, build output last 3 lines.
-```
-
-#### Agent E3 — Edit `end-to-end.md`
-
-```
-You are editing the winml-cli End-to-End page. File: C:\Users\zhengte\BYOM\ModelKits\mvp\docs\getting-started\end-to-end.md. DO NOT commit.
-
-Working dir: C:\Users\zhengte\BYOM\ModelKits\mvp. Branch: docs/v2.
-
-Goals:
-1. Add --monitor to the winml perf step.
-2. Add a short CPU-fallback section after the NPU section.
-3. Align prereqs callout with the updated installation.md.
-
-Specific edits:
-- Wherever the page shows `uv run winml perf ... --device npu`, add --monitor:
-
-  ```bash
-  uv run winml perf -m convnext_npu_out/<artifact>.onnx --device npu --iterations 50 --monitor
-  ```
-
-  Add a sentence: "The --monitor flag opens a live chart of NPU utilization while the benchmark runs — confirmation that the workload actually hit the NPU."
-- After the existing NPU perf step, add a new section:
-
-  ```
-  ## (Optional) CPU fallback
-
-  If you don't have NPU hardware, the same artifact runs on CPU via DirectML:
-
-  ```bash
-  uv run winml perf -m convnext_npu_out/<artifact>.onnx --device cpu --iterations 50
-  ```
-
-  Latency will be higher than NPU but the build pipeline is otherwise identical.
-  ```
-
-- In the prerequisites section, reference the updated installation page (link relative path: ../getting-started/installation.md is wrong from within getting-started/ — use installation.md).
-
-Rules:
-- Use winml-cli (never ModelKit/wmk).
-- Keep page under 1100 words.
-
-Verify: uv run mkdocs build --strict 2>&1 | tail -3
-
-Return: status, word count, build output last 3 lines.
-```
-
-- [ ] **Step 2: Verify strict build**
-
-```bash
-uv run mkdocs build --strict 2>&1 | tail -3
-```
-
-- [ ] **Step 3: Commit (Getting Started polish batch)**
-
-```bash
-git add docs/getting-started/
-git commit -m "docs(getting-started): polish prereqs, add NPU monitoring, document CPU fallback
-
-- installation.md: rewrite prereqs as a table (Windows 11 24H2,
-  Copilot+PC, Python 3.10, uv, git); add 'No NPU?' callout pointing
-  at --device auto and CPU/DirectML.
-- quickstart.md: verify step now uses 'winml sys --list-device
-  --list-ep' for a focused capability check.
-- end-to-end.md: add --monitor to the perf step and a short
-  CPU-fallback section after the NPU benchmark."
-```
-
----
-
-## Task 6: Cross-link sweep (Batch F)
-
-**Files:** any docs file referencing the renamed `onnx-and-eps.md` or `hierarchy.md`.
-
-- [ ] **Step 1: Find broken references**
-
-```bash
-echo "=== References to old onnx-and-eps.md ==="
-grep -rn "onnx-and-eps\.md" docs/ 2>/dev/null | grep -v "docs/superpowers/"
-
-echo ""
-echo "=== References to old hierarchy.md (not hierarchy-and-metadata.md) ==="
-grep -rn "hierarchy\.md" docs/ 2>/dev/null | grep -v "hierarchy-and-metadata\.md" | grep -v "docs/superpowers/"
-```
-
-Expected: zero or a small handful of matches. If empty, skip to Step 3.
-
-- [ ] **Step 2: Fix any matches**
-
-For each match, edit the file replacing:
-- `onnx-and-eps.md` → `eps-and-devices.md`
-- `hierarchy.md` → `hierarchy-and-metadata.md`
-
-If many matches exist (≥3), use sed:
-
-```bash
-files_with_old_eps=$(grep -rl "onnx-and-eps\.md" docs/ | grep -v "docs/superpowers/")
-files_with_old_hier=$(grep -rl "hierarchy\.md" docs/ | grep -v "hierarchy-and-metadata\.md" | grep -v "docs/superpowers/")
-for f in $files_with_old_eps; do sed -i 's|onnx-and-eps\.md|eps-and-devices.md|g' "$f"; done
-for f in $files_with_old_hier; do sed -i 's|\bhierarchy\.md|hierarchy-and-metadata.md|g' "$f"; done
-```
-
-- [ ] **Step 3: Verify strict build (final)**
-
-```bash
-uv run mkdocs build --strict 2>&1 | tail -3
-```
-
-Expected: exit 0, no WARNING lines.
-
-- [ ] **Step 4: Commit (if any link fixes happened)**
-
-```bash
-git add docs/
-git commit -m "docs: fix inbound links to renamed Fundamentals pages
-
-Updates references to onnx-and-eps.md -> eps-and-devices.md and
-hierarchy.md -> hierarchy-and-metadata.md across the docs tree.
-Internal docs and the design/plan files under docs/superpowers/
-are not touched."
-```
-
-If Step 1 found no matches, skip the commit — no changes to record.
-
-- [ ] **Step 5: Final smoke check**
-
-```bash
-echo "=== Page count by chapter ===" && ls docs/getting-started/*.md docs/concepts/*.md docs/commands/*.md docs/samples/*.md docs/tutorials/*.md 2>&1 | wc -l
-
-echo "=== Final commit log on docs/v2 (vs docs/v1) ===" && git log --oneline docs/v1..HEAD
-
-echo "=== Working tree clean? ===" && git status --short
-```
-
-Expected: page count = 32 (3 + 14 + 13 + 3 + 2 - wait, recompute) — actually:
-- getting-started: 3
-- concepts: 14 (the 5 existing + 2 renamed-and-already-existing + 9 new = 14 — but two of those are renamed (eps-and-devices, hierarchy-and-metadata) so net file count after renames is still 14)
-- commands: 13
-- samples: 3
-- tutorials: 2
-
-Total: **35 markdown files** under those chapters. Plus index.md = 36 user-facing markdown files in the site (excluding stubs in reference/, troubleshooting.md, contributing.md).
-
-If page count is off, investigate; otherwise the v2 expansion is complete.
-
----
-
-## Self-review notes
-
-- **Spec coverage:** Each section of `docs/superpowers/specs/2026-05-24-docs-expansion-v2-design.md` maps to a task. §4 IA → Task 1; §5.1 Getting Started → Task 5; §5.2 Tutorials → Task 4; §5.3 Concepts/Fundamentals → Task 2; §5.4 Concepts/WinML CLI → Task 3; §6 nav → Task 1; §7 file inventory → tasks 1–6; §8 implementation strategy → directly the 6 batches; §9 acceptance criteria → end of Task 6.
-- **Type/name consistency:** `winml-cli` used throughout; file paths use `concepts/`, `tutorials/`, `getting-started/`. Pair-page H1 titles match `mkdocs.yml` nav labels.
-- **No placeholders:** every step has actual content. Agent prompts are concrete and self-contained (no "see plan for details").
-- **Agent parallelism is explicit** at the start of each authoring task.
-- **One known acceptable hack:** in Task 2 Step 2, the dtype-content move is a surgical edit done by the orchestrator (not by an agent) because the edit is one paragraph or fewer.
diff --git a/docs/superpowers/specs/2026-05-20-modelkit-docs-site-design.md b/docs/superpowers/specs/2026-05-20-modelkit-docs-site-design.md
deleted file mode 100644
index bffe1398f..000000000
--- a/docs/superpowers/specs/2026-05-20-modelkit-docs-site-design.md
+++ /dev/null
@@ -1,239 +0,0 @@
-# ModelKit User-Facing Documentation Site — Design
-
-> **Date:** 2026-05-20
-> **Branch:** `docs/init` (based on `feat/mvp`)
-> **Status:** Design approved; ready for implementation plan.
-
-## 1. Goal
-
-Create a user-facing documentation site for ModelKit (the Python toolkit fronted by the `winml` CLI) targeted at external open-source users discovering the project on GitHub. The site must support markdown authoring, code-block-friendly rendering, mermaid diagrams, and optional Jupyter notebook embedding.
-
-## 2. Audience and scope
-
-- **Primary audience:** External OSS users (developers exporting/quantizing/compiling models for Windows ML deployment). No insider jargon; clear install-to-first-success path required.
-- **Out of scope:** Internal-only sections, MS-internal access controls.
-- **MVP scope:** Full content for the first four chapters (Getting Started, Concepts, Commands, Samples). Reference / Troubleshooting / Contributing exist as nav stubs only and are tracked as P2.
-
-## 3. Framework decision
-
-**MkDocs Material**, hosted on GitHub Pages, sources in `docs/`.
-
-| Considered | Outcome |
-|---|---|
-| **MkDocs Material** | Chosen. Python-native, single `uv add --dev mkdocs-material`, first-class mermaid, code-block tabs, instant search, dark mode. Matches existing toolchain. |
-| Sphinx + MyST + Furo | Rejected for MVP. Heavier config; autodoc not needed for a CLI tool. Revisit if we add a library API surface. |
-| Docusaurus | Rejected. Adds Node ecosystem to a Python repo; MDX features unused. |
-| GitHub Wiki | Rejected. No PR review, no code-search integration, weaker mermaid/notebook support. |
-
-**Notebook integration:** `mkdocs-jupyter` plugin, treated as nice-to-have. No notebooks required in MVP; plugin is installed so future samples can drop in `.ipynb` files.
-
-## 4. Hosting and deploy
-
-- Site lives in-repo under `docs/` (alongside existing internal docs, which remain untouched and excluded from the nav).
-- Built by GitHub Actions, published to the `gh-pages` branch, served by GitHub Pages.
-- **Deploy is held off for now:** the CI workflow is written but configured to require manual `workflow_dispatch`. No automatic pushes to remote during this MVP. All commits stay local on `docs/init` until the user decides to publish.
-
-## 5. Information architecture
-
-```
-ModelKit Docs
-├── Home (landing)
-│
-├── 1. Getting Started
-│   ├── Installation
-│   ├── Quickstart (5-min export)
-│   └── End-to-End: HF → NPU (15-min walkthrough)
-│
-├── 2. Concepts
-│   ├── How ModelKit Works (pipeline diagram)
-│   ├── ONNX & Execution Providers
-│   ├── Quantization & QDQ
-│   ├── Hierarchy Preservation
-│   └── BuildConfig & Kits
-│
-├── 3. Commands
-│   ├── Overview (12-command map, decision table)
-│   ├── Discover  → sys, inspect, hub, analyze
-│   ├── Configure → config, optimize
-│   ├── Build     → export, quantize, compile, build
-│   └── Measure   → perf, eval
-│
-├── 4. Samples
-│   ├── ConvNeXt — primitives walkthrough (all EPs, quantized)
-│   ├── BERT — config + build + perf (workflow focus)
-│   └── Qwen3 — Composite Models (placeholder, "coming soon")
-│
-├── 5. Reference          (P2 — nav stub only)
-├── 6. Troubleshooting    (P2 — nav stub only)
-└── 7. Contributing       (P2 — nav stub only)
-```
-
-### 5.1 Grouping rationale
-
-- Commands grouped by **user intent** (discover / configure / build / measure), not alphabetical — matches how a user actually progresses.
-- Concepts placed **before** Commands so users have a mental model before reading flag tables.
-- Existing `docs/design/`, `docs/naming-convention.md`, `docs/pytest-best-practices.md` stay where they are; they remain contributor-facing and are linked from Contributing (P2).
-
-## 6. Per-page outlines
-
-### 6.1 Getting Started
-
-- **Installation** — Prereqs (Win 10/11, Python 3.10, `uv`), `git clone` + `uv sync`, verify with `winml sys`.
-- **Quickstart** — 5-minute path: pick any HF classifier, run `winml export`, view the `.onnx`, run `winml inspect`. No EPs, no quantization — proves the install.
-- **End-to-End: HF → NPU** — 15-minute walkthrough: ConvNeXt + `winml build` with QNN, see artifacts, run `winml perf` against NPU. Sets the stage for the Samples chapter.
-
-### 6.2 Concepts
-
-- **How ModelKit Works** — Mermaid pipeline diagram (PyTorch → ONNX → QDQ → EP-compiled). One paragraph per stage with deep-links to its command page.
-- **ONNX & Execution Providers** — What ONNX is, what an EP is, EPs ModelKit supports (QNN, OpenVINO, DML, CPU/GPU), hardware mapping table.
-- **Quantization & QDQ** — Why quantize, INT8/INT16/FP16, calibration vs. static, QDQ node insertion, lossy trade-offs.
-- **Hierarchy Preservation** — Why ONNX needs PyTorch module info, how ModelKit embeds it as metadata, what it enables downstream (per-module benchmarking, targeted optimization).
-- **BuildConfig & Kits** — The unified config object, precision policies, per-task templates, where configs live (`MODEL_BUILD_CONFIGS`).
-
-### 6.3 Commands
-
-**Page template** (applied to all 12 command pages — sections kept as headings even if initially sparse; content filled in incrementally):
-
-```
-# winml <command>
-> one-line tagline
-
-## When to use this
-[1–2 sentences: user intent, place in pipeline]
-
-## Synopsis
-$ winml <command> [options]
-
-## Flags
-[Table: Flag | Short | Type | Default | Description; shared flags collapsed]
-
-## How it works
-[2–3 sentences; optional mermaid diagram for non-trivial commands]
-
-## Examples
-[3–5 progressively richer examples with expected output snippets]
-
-## Common pitfalls
-[Bullet list of gotchas]
-
-## See also
-[Links to related commands and concept pages]
-```
-
-The **Overview** sub-page contains the 12-command map (grouped) and a "which command for which task" decision table.
-
-The 12 command pages: `sys`, `inspect`, `hub`, `analyze`, `config`, `optimize`, `export`, `quantize`, `compile`, `build`, `perf`, `eval`.
-
-### 6.4 Samples
-
-Each sample has a distinct teaching purpose — together they form an abstraction ladder.
-
-- **ConvNeXt — primitives walkthrough**
-  - Style: invoke each command directly (`inspect` → `config` → `export` → `quantize` → `compile` → `perf` → `eval`).
-  - EP coverage: CPU, GPU, NPU. For each EP, document the flags that differ, expected outputs, and a "what we just did" callout per step.
-  - Goal: reader leaves understanding what each command does and how they compose.
-
-- **BERT — config + build + perf**
-  - Style: `winml config` to generate the BuildConfig, `winml build` to run the whole pipeline, `winml perf` on the artifact.
-  - EP focus de-emphasized — the page teaches the wrapper workflow, not the EP matrix.
-  - Goal: reader leaves understanding the production-style one-shot path and how config files become reusable.
-
-- **Qwen3 — Composite Models** (placeholder)
-  - Single page: 1-paragraph teaser, "coming soon" admonition, link to the in-progress feature branch.
-  - Goal: reserve the slot in the nav; signal where ModelKit is headed without blocking MVP on unmerged work.
-
-## 7. Reference handling (P2 — nav stubs in MVP)
-
-- **BuildConfig schema, hub catalog, EP/device matrix, precision options:** hand-written when the time comes (decided against autogeneration for MVP — maintenance burden traded against polish).
-- **Naming conventions:** existing `docs/naming-convention.md` will be linked from the Reference page when written.
-
-## 8. Repository layout
-
-```
-mvp/
-├── docs/
-│   ├── index.md                          ← landing
-│   ├── getting-started/
-│   │   ├── installation.md
-│   │   ├── quickstart.md
-│   │   └── end-to-end.md
-│   ├── concepts/
-│   │   ├── how-it-works.md
-│   │   ├── onnx-and-eps.md
-│   │   ├── quantization.md
-│   │   ├── hierarchy.md
-│   │   └── buildconfig.md
-│   ├── commands/
-│   │   ├── overview.md
-│   │   ├── sys.md
-│   │   ├── inspect.md
-│   │   ├── hub.md
-│   │   ├── analyze.md
-│   │   ├── config.md
-│   │   ├── optimize.md
-│   │   ├── export.md
-│   │   ├── quantize.md
-│   │   ├── compile.md
-│   │   ├── build.md
-│   │   ├── perf.md
-│   │   └── eval.md
-│   ├── samples/
-│   │   ├── convnext-primitives.md
-│   │   ├── bert-config-build.md
-│   │   └── qwen3-composite.md            ← placeholder
-│   ├── reference/                        ← P2 stubs
-│   ├── troubleshooting.md                ← P2 stub
-│   ├── contributing.md                   ← P2 stub
-│   │
-│   ├── design/                           ← UNCHANGED (internal)
-│   ├── naming-convention.md              ← UNCHANGED (internal)
-│   ├── pytest-best-practices.md          ← UNCHANGED (internal)
-│   └── superpowers/specs/                ← UNCHANGED (this file lives here)
-│
-├── mkdocs.yml                            ← new
-└── .github/workflows/docs.yml            ← new (manual dispatch only)
-```
-
-## 9. MkDocs configuration
-
-- **Theme:** `material` with palette toggle (light/dark), instant navigation, code-copy button, "Edit on GitHub" link per page.
-- **Plugins:** `search` (built-in), `mkdocs-jupyter` (notebooks; lazy install).
-- **Markdown extensions:** `pymdownx.superfences` (mermaid, tabbed code), `admonition`, `pymdownx.tabbed`, `pymdownx.details`, `pymdownx.tasklist`.
-- **Nav:** hand-written, mirroring section 5. Chapters 5-7 appear as stub pages in nav.
-- **Strict mode:** `mkdocs build --strict` to fail CI on broken links or missing nav entries.
-- **Excluded from nav:** `docs/design/`, `docs/superpowers/`, `docs/naming-convention.md`, `docs/pytest-best-practices.md` (they remain in the repo for contributors).
-
-## 10. CI workflow
-
-- **File:** `.github/workflows/docs.yml`.
-- **Triggers:** `workflow_dispatch` only (manual) until the user is ready to publish. No auto-trigger on `push` or `pull_request` for MVP.
-- **Steps:** checkout → install `uv` → `uv sync` → `uv run mkdocs build --strict` → `peaceiris/actions-gh-pages` deploy to `gh-pages`.
-- **Local equivalent:** `uv run mkdocs serve` for live preview during authoring.
-
-## 11. Implementation strategy (preview for the plan)
-
-The plan will batch work for parallel execution via subagents:
-
-- **Batch A — Site scaffold (sequential, foundation):** create `mkdocs.yml`, repo layout, landing page, nav stubs, CI workflow. Verify `mkdocs build --strict` succeeds with placeholder content.
-- **Batch B — Concepts pages (5 pages, parallel):** one subagent per concept page; each reads the relevant source module and drafts the page.
-- **Batch C — Command pages (12 command pages + 1 overview page, 4 parallel agents):** one agent per group (Discover / Configure / Build / Measure), each owning 3 commands; agents read source + `--help` output and draft pages using the section 6.3 template. The Commands → Overview page is authored after the 12 command pages settle (sequential), so its decision table reflects the real flag surfaces.
-- **Batch D — Sample pages (3 pages, parallel):** ConvNeXt agent runs the primitive command sequence end-to-end and captures real outputs; BERT agent runs `config + build + perf` and captures outputs; Qwen3 page is a static placeholder.
-- **Batch E — Getting Started (3 pages, sequential after Concepts and Commands):** authored last so it can cross-link to settled concept and command pages.
-
-Each batch ends with `mkdocs build --strict` to catch broken links before moving on.
-
-## 12. Open items / things explicitly punted
-
-- **Versioning:** Not added in MVP. `mike` plugin available if needed later.
-- **Search analytics, Algolia DocSearch:** Not in MVP; Material's built-in search is sufficient.
-- **API reference autogeneration:** Not in MVP. Reconsider if/when a stable library API emerges.
-- **i18n:** Not in MVP.
-
-## 13. Acceptance criteria
-
-- `uv run mkdocs serve` renders the site locally without errors.
-- `uv run mkdocs build --strict` succeeds (no broken links, no missing nav entries).
-- All chapters 1-4 have authored content; chapters 5-7 have stub pages.
-- Mermaid diagrams render in the "How it works" concept page.
-- Existing `docs/design/`, `docs/naming-convention.md`, `docs/pytest-best-practices.md` are unmodified and not in the user-facing nav.
-- All commits remain on local `docs/init`; nothing pushed to `origin`.
diff --git a/docs/superpowers/specs/2026-05-24-docs-expansion-v2-design.md b/docs/superpowers/specs/2026-05-24-docs-expansion-v2-design.md
deleted file mode 100644
index c77e11ece..000000000
--- a/docs/superpowers/specs/2026-05-24-docs-expansion-v2-design.md
+++ /dev/null
@@ -1,263 +0,0 @@
-# Docs Expansion v2 — Design
-
-> **Date:** 2026-05-24
-> **Branch:** `docs/v2` (based on `docs/v1`)
-> **Status:** Design approved verbally; ready for spec self-review and plan.
-
-## 1. Goal
-
-Expand the user-facing winml-cli docs site with: (a) a new **Tutorials** chapter seeded with a ConvNeXt-on-NPU walkthrough adapted from the internal WinHECLab lab, (b) a restructured **Concepts** chapter with two sub-groups (Fundamentals + WinML CLI) totaling 14 pages of pair-topic content, (c) targeted polish to the three existing Getting Started pages.
-
-## 2. Scope and non-goals
-
-### In scope
-
-- 3 Getting Started pages: targeted edits (prereqs alignment, new flag mentions, CPU/DirectML fallback).
-- 2 new Tutorial pages (chapter index + 1 ConvNeXt-on-NPU tutorial).
-- 14 Concepts pages: 5 renamed/touched, 9 newly authored. Sub-grouped into Fundamentals and WinML CLI.
-- `mkdocs.yml` nav restructure to expose Tutorials and the Concepts sub-groups.
-
-### Out of scope
-
-- The C# Windows App SDK demo app from WinHECLab Steps 9–19 (Python/winml-cli only this iteration).
-- Visual Studio / Windows App SDK prerequisites.
-- Hardware-specific lab paths (`C:\LabWinML\...`, `Start\`, `Final\`).
-- Pinned wheel/SDK versions (we use `>=` semantics).
-- Reference, Troubleshooting, Contributing chapters (still P2 stubs).
-- A second tutorial or further Concepts pages beyond the 14 listed.
-
-## 3. Source material
-
-- **WinHECLab README** (`we2-microsoft/WinHECLab`, fetched to `temp/winheclab-readme.md` for this design). External publish OK per design discussion.
-- **Existing winml-cli sources** at `src/winml/modelkit/` (canonical for any flag or behavior we describe).
-- **Existing docs** at `docs/getting-started/`, `docs/concepts/`, `docs/commands/`, `docs/samples/`.
-
-## 4. Information architecture changes
-
-### 4.1 New chapter: Tutorials
-
-A new top-level chapter between **Samples** and **Reference**:
-
-```
-- Samples
-- Tutorials              ← NEW
-    - Overview
-    - ConvNeXt on NPU
-- Reference
-```
-
-The chapter is the home for classroom-style, prescriptive, end-to-end walkthroughs. Distinct from **Samples** (which are reference-style, command-comparison demos) and from **Getting Started** (which is a short onboarding journey).
-
-### 4.2 Concepts restructure
-
-Concepts gets two sub-groups in the nav:
-
-```
-- Concepts
-    - Fundamentals
-        - How winml-cli works
-        - Models, graphs, and the ONNX IR
-        - Tensors and dtypes
-        - Execution Providers and devices
-        - Quantization and QDQ
-        - Hierarchy and ONNX metadata
-        - BuildConfig and kits
-    - WinML CLI
-        - Primitives and pipeline
-        - Config and build
-        - Load and export
-        - Analyze and optimize
-        - Compile and EPContext
-        - Perf and monitoring
-        - Eval and datasets
-```
-
-Every page uses the **pair-topic** framing — the H1 names two related concepts whose contrast or interplay structures the page.
-
-## 5. Per-page detail
-
-### 5.1 Getting Started — 3 pages, targeted edits
-
-#### `installation.md`
-- Rewrite prereqs table in lab style: Windows 11 24H2, Copilot+PC 40+ TOPS NPU (recommended for NPU acceleration), Python 3.10, uv, git. Drop the VS / App SDK lines (those were never in our installation anyway — confirming they stay out).
-- Add a one-paragraph **"No NPU? Use `--device auto`"** callout that explicitly names CPU and DirectML as the fallback.
-
-#### `quickstart.md`
-- Add `winml sys --list-device --list-ep` to the verify step.
-- No other changes — quickstart stays a 5-minute zero-to-export.
-
-#### `end-to-end.md`
-- Add `--monitor` to the `winml perf` step (live NPU utilization chart).
-- Add a short CPU-fallback section after the NPU section showing the same `winml perf` with `--device cpu`.
-- Align prereqs callout with the updated `installation.md`.
-- Model stays **ConvNeXt** (consistency with existing sample pairing).
-
-### 5.2 Tutorials — 2 new pages
-
-#### `tutorials/index.md` (Overview)
-- One paragraph framing: tutorials are linear, end-to-end, prescriptive; for lookup go to Concepts or Commands.
-- One-row table linking to the available tutorials.
-- ~150 words. Grows as more tutorials are added.
-
-#### `tutorials/npu-convnext.md` (ConvNeXt on NPU)
-- **Model:** `facebook/convnext-tiny-224`.
-- **Hardware:** Primary path is Copilot+PC NPU; explicit CPU/DirectML fallback documented throughout.
-- **Structure:**
-  1. **Prerequisites** — adopted from WinHECLab prereqs table.
-  2. **Section A — Primitives walkthrough**: `inspect → config → export → analyze → optimize → quantize → compile → perf`. EP-specific steps (`compile` and `perf`) use **`=== "QNN" / === "OpenVINO"` tabbed code blocks** so readers see both NPU backends inline.
-  3. **Section B — One-shot with `winml build`**: closing section showing the wrapper command produces the same artifact.
-  4. **(Optional) Eval** against an ImageNet sample using `winml eval`.
-  5. **Where to go next** — links to Concepts and Samples.
-- **Length target:** 1,500–2,500 words. This is the longest single page in the site.
-
-### 5.3 Concepts — Fundamentals (7 pages)
-
-Each page uses pair-topic framing. New = needs full authoring; touched = exists but renamed/expanded.
-
-| File | Status | Pair / focus |
-|---|---|---|
-| `concepts/how-it-works.md` | **touched** (rename in nav, content kept) | Pipeline overview, mermaid diagram |
-| `concepts/graphs-and-ir.md` | **new** | What is a model file, graph nodes/edges, opsets, ONNX as IR |
-| `concepts/tensors-and-dtypes.md` | **new** | Weights vs activations vs I/O tensors; fp32/fp16/int8/int16; static vs dynamic shapes |
-| `concepts/eps-and-devices.md` | **touched** (renamed from `onnx-and-eps.md`) | EP vs device, the EP matrix, when to use which |
-| `concepts/quantization.md` | **touched** (small content tightening; dtype family moves to Tensors page) | Why quantize, calibration, QDQ pattern |
-| `concepts/hierarchy-and-metadata.md` | **touched** (renamed from `hierarchy.md`, broadened) | `winml.hierarchy.tag` plus other metadata winml-cli writes |
-| `concepts/buildconfig.md` | **touched** (rename in nav, content kept) | WinMLBuildConfig structure, kits, MODEL_BUILD_CONFIGS |
-
-**Rename mapping:**
-- `onnx-and-eps.md` → `eps-and-devices.md`
-- `hierarchy.md` → `hierarchy-and-metadata.md`
-- The other three existing pages keep their file names; only the nav label changes.
-
-Any inbound links from other docs files (Commands, Samples, Getting Started, Tutorials) must be updated to the new file paths.
-
-### 5.4 Concepts — WinML CLI (7 new pages)
-
-All seven are new. Each is a workflow concept page (the **why** and **when**), not a command reference (the **what**). Cross-link to per-command pages under `docs/commands/`.
-
-| File | Pair / focus |
-|---|---|
-| `concepts/primitives-and-pipeline.md` | Staged commands (`export`, `quantize`, `compile`, …) vs the one-shot `winml build` wrapper. When to choose which. Opens the chapter. |
-| `concepts/config-and-build.md` | `winml config` produces a `WinMLBuildConfig`; `winml build` consumes it. The wrapper-flow pair — reproducibility, sharing configs across runs and CI, override flags vs config values. |
-| `concepts/load-and-export.md` | The "load model into memory, then transform it to ONNX" arc. Covers HF Hub loading, local PyTorch loading, the `winml inspect` pre-flight check, and `winml export` itself. (Note: `winml load` is not a CLI verb — "load" here is the conceptual stage in the loader module, paired with the `export` command that follows it.) |
-| `concepts/analyze-and-optimize.md` | Graph-quality commands. How analyze reports problems and how optimize applies fusions. Shared `--optim-config`. |
-| `concepts/compile-and-epcontext.md` | What `winml compile` produces. QNN EPContext binary blobs embedded in ONNX. Why compiled models load faster at runtime. |
-| `concepts/perf-and-monitoring.md` | `winml perf` plus `--monitor` (live NPU chart) and `--op-tracing`. When to use each. |
-| `concepts/eval-and-datasets.md` | `winml eval` plus dataset semantics (`--dataset`, `--split`, `--column`, `--label-mapping`). When eval matters. |
-
-**Length target per page:** 400–700 words of prose. Same shape as the existing Concepts pages.
-
-**Discipline:** workflow pages explain *why and when*; command pages document *what flags exist*. No flag-table duplication.
-
-## 6. `mkdocs.yml` nav changes
-
-Full updated nav structure:
-
-```yaml
-nav:
-  - Home: index.md
-  - Getting Started:
-      - Installation: getting-started/installation.md
-      - Quickstart: getting-started/quickstart.md
-      - End-to-End — HF → NPU: getting-started/end-to-end.md
-  - Concepts:
-      - Fundamentals:
-          - How winml-cli works: concepts/how-it-works.md
-          - Models, graphs, and the ONNX IR: concepts/graphs-and-ir.md
-          - Tensors and dtypes: concepts/tensors-and-dtypes.md
-          - Execution Providers and devices: concepts/eps-and-devices.md
-          - Quantization and QDQ: concepts/quantization.md
-          - Hierarchy and ONNX metadata: concepts/hierarchy-and-metadata.md
-          - BuildConfig and kits: concepts/buildconfig.md
-      - WinML CLI:
-          - Primitives and pipeline: concepts/primitives-and-pipeline.md
-          - Config and build: concepts/config-and-build.md
-          - Load and export: concepts/load-and-export.md
-          - Analyze and optimize: concepts/analyze-and-optimize.md
-          - Compile and EPContext: concepts/compile-and-epcontext.md
-          - Perf and monitoring: concepts/perf-and-monitoring.md
-          - Eval and datasets: concepts/eval-and-datasets.md
-  - Commands: (unchanged)
-  - Samples: (unchanged)
-  - Tutorials:
-      - Overview: tutorials/index.md
-      - ConvNeXt on NPU: tutorials/npu-convnext.md
-  - Reference: (unchanged P2 stub)
-  - Troubleshooting: (unchanged P2 stub)
-  - Contributing: (unchanged P2 stub)
-```
-
-## 7. File-system changes summary
-
-### New files (11)
-
-- `docs/tutorials/index.md`
-- `docs/tutorials/npu-convnext.md`
-- `docs/concepts/graphs-and-ir.md`
-- `docs/concepts/tensors-and-dtypes.md`
-- `docs/concepts/primitives-and-pipeline.md`
-- `docs/concepts/config-and-build.md`
-- `docs/concepts/load-and-export.md`
-- `docs/concepts/analyze-and-optimize.md`
-- `docs/concepts/compile-and-epcontext.md`
-- `docs/concepts/perf-and-monitoring.md`
-- `docs/concepts/eval-and-datasets.md`
-
-### Renamed files (2 — rename + content edit)
-
-- `docs/concepts/onnx-and-eps.md` → `docs/concepts/eps-and-devices.md` (small content reframe to match the new pair-topic title)
-- `docs/concepts/hierarchy.md` → `docs/concepts/hierarchy-and-metadata.md` (broaden content to cover other metadata winml-cli writes, not just `winml.hierarchy.tag`)
-
-### Modified files (5 — content edits, no rename)
-
-- `docs/getting-started/installation.md`
-- `docs/getting-started/quickstart.md`
-- `docs/getting-started/end-to-end.md`
-- `docs/concepts/quantization.md` (tightening — dtype content moves to the new Tensors page)
-- `mkdocs.yml` (full nav restructure to introduce the Concepts sub-groups and Tutorials chapter)
-
-### Inbound links to update
-
-Any reference to `onnx-and-eps.md` or `hierarchy.md` from other pages (Commands, Samples, Tutorials, Getting Started) must be updated to the new paths. Estimated 6–10 inbound links across the site (to be confirmed during implementation).
-
-## 8. Implementation strategy preview
-
-For the plan to formalize:
-
-- **Batch A — Scaffolding (sequential, foundation):** create stubs for all 11 new pages; rename the 2 renamed pages; update `mkdocs.yml` nav. Verify `mkdocs build --strict` passes with stubs. Single commit.
-- **Batch B — Concepts (Fundamentals) authoring (parallel, 4–5 agents):** new pages (`graphs-and-ir`, `tensors-and-dtypes`) authored in parallel with content-touch passes on the 5 existing pages.
-- **Batch C — Concepts (WinML CLI) authoring (parallel, 3–4 agents):** 7 new workflow pages, agents own 2 pages each (one agent owns 1).
-- **Batch D — Tutorials authoring (sequential, 1 agent):** the ConvNeXt-on-NPU tutorial. Single big page — best authored by one agent for consistency. Plus the small overview page.
-- **Batch E — Getting Started polish (parallel, 3 agents):** small edits to the 3 existing pages.
-- **Batch F — Cross-link fix-up (sequential):** sweep the rest of the docs site for inbound links to the renamed files and update them.
-
-Each batch ends with `uv run mkdocs build --strict` to catch broken links.
-
-## 9. Acceptance criteria
-
-- `uv run mkdocs build --strict` exits 0 with zero warnings on the final commit.
-- All 11 new pages exist and contain non-stub content of at least 300 words each (Tutorials index is exempt — it's a short overview).
-- All 2 renamed pages have been renamed at the filesystem level (not just nav).
-- No remaining inbound links reference the old paths `onnx-and-eps.md` or `hierarchy.md`.
-- Tutorial uses `facebook/convnext-tiny-224`, contains tabbed QNN/OpenVINO code blocks for EP-specific steps, contains both a primitives section and a one-shot `winml build` section.
-- Every flag mentioned in the new content is verified against `src/winml/modelkit/commands/` source (no invented flags).
-- Existing internal docs (`docs/design/`, `docs/naming-convention.md`, `docs/pytest-best-practices.md`, `docs/superpowers/`) are unmodified.
-- All commits remain on local `docs/v2` until publish.
-
-## 10. Risks and mitigations
-
-| Risk | Mitigation |
-|---|---|
-| 19 doc pages is a lot — author agents may drift from each other in tone | Provide every agent the same template and a short "voice guide" excerpt; require source-grounded claims; consider splitting Batch B into two waves so the first wave's voice anchors the second |
-| Inbound-link sweep is easy to miss | Dedicated final batch (F) with `grep` verification before commit |
-| `winml.hierarchy.tag` and other metadata details are real source claims | Each agent verifies via source path + line; reported in the agent's return summary |
-| Tutorial scope creep (toward classroom-style screenshots etc.) | Length cap (1,500–2,500 words); no screenshots in this iteration |
-| ConvNeXt + ConvNeXt overlap between `samples/convnext-primitives.md` and `tutorials/npu-convnext.md` | Sample focuses on **device comparison** (CPU/GPU/NPU); tutorial focuses on **NPU production path** (QNN vs OpenVINO). Different teaching purposes documented in each page's intro paragraph |
-
-## 11. Open items explicitly punted
-
-- A second tutorial (e.g. BERT-config-build on a fresh model). Available content-wise from WinHECLab but deferred to v3.
-- Screenshots and embedded outputs. Not in this iteration; can add later under `docs/tutorials/images/`.
-- Reference, Troubleshooting, Contributing chapter content. Still P2.
-- Versioning (mike plugin). Still P2 from the v1 spec.
-- Migration of internal `docs/design/` content into the public docs. Not in scope.
diff --git a/mkdocs.yml b/mkdocs.yml
index bd00618cd..ba0bc373c 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -9,10 +9,6 @@ docs_dir: docs
 
 exclude_docs: |
   /design/
-  /superpowers/
-  /naming-convention.md
-  /pytest-best-practices.md
-  /README.md
 
 extra:
   version:
@@ -134,3 +130,4 @@ nav:
       - Supported Models: reference/supported-models.md
   - Troubleshooting: troubleshooting.md
   - Contributing: contributing.md
+  - Privacy: Privacy.md

From b55c64b2b619ab3aec28dade77257972015094be Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Mon, 8 Jun 2026 14:52:14 +0800
Subject: [PATCH 084/143] docs: restore naming-convention,
 pytest-best-practices, and README (pre-existing files)

---
 docs/README.md                |  130 ++
 docs/naming-convention.md     |  104 ++
 docs/pytest-best-practices.md | 2832 +++++++++++++++++++++++++++++++++
 3 files changed, 3066 insertions(+)
 create mode 100644 docs/README.md
 create mode 100644 docs/naming-convention.md
 create mode 100644 docs/pytest-best-practices.md

diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 000000000..6fa5d68bf
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,130 @@
+# Contributing to winml-cli docs
+
+This folder hosts the source for the [winml-cli](https://github.com/microsoft/winml-cli) documentation site, built with [MkDocs Material](https://squidfunk.github.io/mkdocs-material/).
+
+## Quick reference
+
+| Task | Command |
+|---|---|
+| Install dev deps | `uv sync --extra dev` |
+| Live preview | `uv run mkdocs serve` |
+| Build for CI | `uv run mkdocs build --strict` |
+| Publish (one-shot from laptop) | `uv run mkdocs gh-deploy --force` |
+| Publish (CI workflow) | GitHub Actions → "Build & Publish Docs" → Run workflow |
+
+## What's in here
+
+```
+docs/
+├── index.md                          ← landing page
+├── getting-started/                  ← 3 onboarding pages
+├── concepts/                         ← 12 conceptual pages in two sub-groups
+│   ├── how-it-works.md, graphs-and-ir.md, weight-and-activation.md,
+│   │     eps-and-devices.md, quantization.md         (Fundamentals)
+│   └── primitives-and-pipeline.md, load-and-export.md, analyze-and-optimize.md,
+│         compile-and-epcontext.md, perf-and-monitoring.md, eval-and-datasets.md,
+│         config-and-build.md                         (WinML CLI workflows)
+├── commands/                         ← per-command reference (overview + 12 commands)
+├── samples/                          ← reference-style walkthroughs
+├── tutorials/                        ← classroom-style walkthroughs
+├── reference/                        ← P2 stubs
+├── troubleshooting.md                ← P2 stub
+├── contributing.md                   ← P2 stub
+│
+├── superpowers/                      ← specs, plans, review notes (excluded from build)
+├── design/                           ← internal ADRs and design docs (excluded)
+├── naming-convention.md              ← internal style guide (excluded)
+└── pytest-best-practices.md          ← internal style guide (excluded)
+```
+
+The site config (`mkdocs.yml`) lives at the repo root, not inside `docs/`. The build outputs to `site/` (gitignored).
+
+## Local development
+
+### Prerequisites
+
+Python 3.10+ and [uv](https://github.com/astral-sh/uv).
+
+### Setup and preview
+
+```bash
+# from the repo root
+uv sync --extra dev
+uv run mkdocs serve
+```
+
+Open http://127.0.0.1:8000/ in a browser. The server auto-reloads when you edit any `.md` file under `docs/`. Changes to `mkdocs.yml` (nav, theme, plugins) require a manual server restart.
+
+### Validate before pushing
+
+```bash
+uv run mkdocs build --strict
+```
+
+`--strict` must exit 0 with no `WARNING` lines. Common causes of strict-mode failures:
+
+- A new page added without an entry in `nav:` (gives a "not included in nav" warning)
+- A nav entry pointing at a file that doesn't exist
+- A relative link like `[text](other-page.md)` whose target file is missing
+- A markdown anchor like `[link](#section-heading)` that doesn't match any heading slug
+
+## Publishing
+
+The site publishes to **GitHub Pages** from the `gh-pages` branch. The repo's `Settings → Pages` source is set to "Deploy from a branch" → `gh-pages` → `/ (root)`.
+
+### One-shot publish from your laptop
+
+```bash
+uv run mkdocs gh-deploy --force
+```
+
+This builds the site locally, commits the static HTML to a local `gh-pages` branch, and force-pushes it to `origin/gh-pages`. GitHub Pages picks up the new commit within ~30–60 seconds.
+
+### Publish via CI
+
+The workflow at `.github/workflows/docs.yml` does the same thing in CI:
+
+1. `Settings → Actions → Build & Publish Docs → Run workflow`
+2. Select the branch you want to publish from (typically `main`)
+
+The workflow is `workflow_dispatch` only — there is no automatic publish on push. If you want auto-publish on every push to `main`, change the trigger:
+
+```yaml
+on:
+  push:
+    branches: [main]
+    paths:
+      - 'docs/**'
+      - 'mkdocs.yml'
+      - 'pyproject.toml'
+      - '.github/workflows/docs.yml'
+  workflow_dispatch:
+```
+
+## Authoring conventions
+
+- **Product name**: `winml-cli` (lowercase, hyphenated) throughout user-facing prose. Use `WinML CLI` (or `Windows ML`) only where the broader Microsoft brand is meant.
+- **Command name**: the CLI invocation is always `winml <subcommand>`. Never `wmk`.
+- **Flag verification**: every flag mentioned in docs must exist in `src/winml/modelkit/commands/<cmd>.py`. Run `uv run winml <cmd> --help` to confirm.
+- **Source citations**: when documenting source-grounded behavior (e.g., "the default opset is 17"), cite the file path and ideally the symbol name. Avoid line numbers — they drift fast.
+- **Mermaid diagrams**: use `pymdownx.superfences` syntax (already configured in `mkdocs.yml`).
+- **Tabbed code blocks**: use `pymdownx.tabbed` (`=== "Label"` followed by a blank line and 4-space-indented code block).
+- **Admonitions**: `!!! note "Title"`, `!!! warning "Title"`, `!!! info "Title"`.
+- **No emojis** in pages unless they're part of an external attribution (e.g., a GitHub badge).
+
+## Excluded paths
+
+The following are present in `docs/` but **excluded from the published site** via the `exclude_docs:` block in `mkdocs.yml`. They are kept in-repo for contributors:
+
+- `docs/design/` — internal architecture decision records and design notes
+- `docs/superpowers/` — specs, plans, and review notes accumulated during doc development
+- `docs/naming-convention.md` — internal naming conventions for code review
+- `docs/pytest-best-practices.md` — internal testing style guide
+
+If you add new internal-only content, either place it under one of these excluded paths or add a new entry to `exclude_docs` in `mkdocs.yml`.
+
+## See also
+
+- [MkDocs Material reference](https://squidfunk.github.io/mkdocs-material/reference/)
+- [MkDocs Material navigation setup](https://squidfunk.github.io/mkdocs-material/setup/setting-up-navigation/)
+- [MkDocs Material color palette](https://squidfunk.github.io/mkdocs-material/setup/changing-the-colors/)
diff --git a/docs/naming-convention.md b/docs/naming-convention.md
new file mode 100644
index 000000000..f1cd3a9a5
--- /dev/null
+++ b/docs/naming-convention.md
@@ -0,0 +1,104 @@
+# WinML CLI Naming Convention
+
+This document defines the naming rules for the WinML CLI codebase. All new code and refactored code must follow these conventions.
+
+## 1. Acronyms in Class Names
+
+Domain acronyms in PascalCase class names **retain their uppercase form**, except for two-letter abbreviations used as generic prefixes.
+
+### Canonical Acronym Table
+
+| Acronym | Meaning | Class Casing | Example |
+|---------|---------|--------------|---------|
+| ONNX | Open Neural Network Exchange | `ONNX` | `ONNXStaticAnalyzer`, `ONNXLoader` |
+| EP | Execution Provider | `EP` | `EPChecker`, `EPConfig`, `EPMonitor` |
+| QDQ | Quantize-Dequantize | `QDQ` | `QDQParameterConfig`, `QDQGenerator` |
+| QNN | Qualcomm Neural Network | `QNN` | `QNNMonitor` |
+| Op | Operator (2-letter prefix) | `Op` | `OpUnsupportedError` |
+| IO | Input/Output | `IO` | `IOConfigInfo` |
+| HTP | Hexagon Tensor Processor | `HTP` | `HTPConfig`, `HTPExporter`, `HTPMetadataBuilder` |
+
+### Why `Op` Not `OP`
+
+Two-letter acronyms used as **class name prefixes** use PascalCase:
+
+- `OPUnsupported` reads ambiguously as three tokens (O-P-Unsupported)
+- `OpUnsupported` reads clearly as two tokens (Op-Unsupported)
+- Consistent with conventions like `Id` vs `ID`
+
+All-caps is acceptable in **constants** (e.g., `SUPPORTED_OPS`).
+
+### Canonical Execution Provider Names
+
+Execution providers appear mainly in constants, EP-name strings, and config keys rather than as class prefixes. Each EP has a fixed canonical short name (used in our code) and an ORT full name (the `*ExecutionProvider` symbol).
+
+| Short name | ORT full name | Device | Vendor / Notes |
+|------------|---------------|--------|----------------|
+| `CPU` | `CPUExecutionProvider` | CPU | Default fallback. |
+| `CUDA` | `CUDAExecutionProvider` | GPU | NVIDIA. All caps. |
+| `DML` | `DmlExecutionProvider` | GPU | DirectML. Use `DML` in our code; do not write `DirectML` as the EP name. |
+| `MIGraphX` | `MIGraphXExecutionProvider` | GPU | AMD. Exact casing (mixed case). |
+| `NvTensorRTRTX` | `NvTensorRTRTXExecutionProvider` | GPU | NVIDIA TensorRT-RTX. Exact casing; do not shorten to `TensorRT`. |
+| `OpenVINO` | `OpenVINOExecutionProvider` | CPU / GPU / NPU | Intel. Exact casing. Alias: `ov`. |
+| `QNN` | `QNNExecutionProvider` | NPU | Qualcomm. All caps. |
+| `VitisAI` | `VitisAIExecutionProvider` | NPU | AMD Ryzen AI. Exact casing. Alias: `vitis`. |
+
+### Other Canonical Identifiers
+
+| Token | Meaning | Notes |
+|-------|---------|-------|
+| `HF_` | HuggingFace (constant/variable prefix) | e.g., `HF_MODEL_CLASS_MAPPING`, `HF_TASK_DEFAULTS`. Not used as a class prefix. |
+
+## 2. Module and Package Names
+
+Follow PEP 8: all lowercase with underscores.
+
+```
+correct:   onnx_op.py, ep_checker.py, qdq_fix.py
+wrong:     OnnxOp.py, EP_Checker.py
+```
+
+## 3. Function and Method Names
+
+Snake_case, lowercase.
+
+```
+correct:   normalize_ep_name(), generate_build_config()
+wrong:     normalizeEPName(), GenerateBuildConfig()
+```
+
+## 4. Constants
+
+UPPER_CASE with underscores.
+
+```
+correct:   SUPPORTED_EPS, EP_ALIASES, DEVICE_TO_DEVICE_TYPE
+wrong:     supportedEps, ep_aliases
+```
+
+## 5. Directory Abbreviation Policy
+
+The codebase uses a mix of abbreviated and full directory names. The established names are frozen — do not rename existing directories for consistency alone. For **new** directories, prefer full names unless the abbreviation is widely recognized in the domain (e.g., `optim`, `eval`, `quant`).
+
+| Established Abbreviation | Full Form |
+|---|---|
+| `optim` | optimization |
+| `quant` | quantization |
+| `eval` | evaluation |
+| `sysinfo` | system information |
+| `optracing` | operator tracing |
+
+## 6. Avoid Name Collisions Across Hierarchy
+
+Do not reuse a parent or sibling package name at a deeper level. When creating new subpackages, verify the name does not already exist elsewhere in the tree.
+
+Known collisions to be aware of:
+
+| Name | Locations | Issue |
+|---|---|---|
+| `winml` | top-level namespace, `modelkit/winml.py`, `models/winml/` | 3-level collision |
+| `core` | `modelkit/core/`, `analyze/core/` | same name, different content |
+| `models` | `modelkit/models/`, `analyze/models/` | ML models vs data models |
+| `utils` | `modelkit/utils/`, `analyze/utils/` | no shared content |
+| `pattern` | `modelkit/pattern/`, `analyze/pattern/` | active vs near-empty |
+| `inspect` | `modelkit/inspect/` | shadows Python stdlib |
diff --git a/docs/pytest-best-practices.md b/docs/pytest-best-practices.md
new file mode 100644
index 000000000..30142d2df
--- /dev/null
+++ b/docs/pytest-best-practices.md
@@ -0,0 +1,2832 @@
+# Complete Pytest Best Practices Guide (2025)
+
+A comprehensive guide covering all aspects of pytest, from basic usage to advanced patterns and project organization.
+
+## Table of Contents
+
+1. [Project Structure & Organization](#project-structure--organization)
+2. [Test Discovery & Naming Conventions](#test-discovery--naming-conventions)
+3. [Fixtures: The Heart of Pytest](#fixtures-the-heart-of-pytest)
+4. [Markers & Test Categorization](#markers--test-categorization)
+5. [Parametrization: Data-Driven Testing](#parametrization-data-driven-testing)
+6. [Assertions & Error Handling](#assertions--error-handling)
+7. [Configuration & Settings](#configuration--settings)
+8. [Conftest.py: Shared Test Logic](#conftest-py-shared-test-logic)
+9. [Mocking & Monkeypatching](#mocking--monkeypatching)
+10. [Database Testing Patterns](#database-testing-patterns)
+11. [Performance & Optimization](#performance--optimization)
+12. [CI/CD Integration](#cicd-integration)
+13. [Plugin Ecosystem](#plugin-ecosystem)
+14. [Snapshot & Regression Testing](#snapshot--regression-testing)
+15. [Property-Based Testing with Hypothesis](#property-based-testing-with-hypothesis)
+16. [Test Asset Generation & Management](#test-asset-generation--management)
+17. [Common Patterns & Anti-Patterns](#common-patterns--anti-patterns)
+18. [Debugging & Troubleshooting](#debugging--troubleshooting)
+19. [Best Practices Checklist](#best-practices-checklist)
+
+---
+
+## Project Structure & Organization
+
+### Recommended Layout
+
+```
+project/
+├── src/                        # Source code
+│   └── myproject/
+│       ├── __init__.py
+│       ├── core/
+│       │   ├── __init__.py
+│       │   └── engine.py
+│       ├── utils/
+│       │   ├── __init__.py
+│       │   └── helpers.py
+│       └── api/
+│           ├── __init__.py
+│           └── endpoints.py
+├── tests/                      # Test directory
+│   ├── __init__.py            # Makes tests a package (optional - see note below)
+│   ├── conftest.py            # Shared fixtures and configuration
+│   ├── unit/                  # Unit tests
+│   │   ├── __init__.py
+│   │   ├── test_engine.py
+│   │   └── test_helpers.py
+│   ├── integration/           # Integration tests
+│   │   ├── __init__.py
+│   │   └── test_api.py
+│   ├── e2e/                   # End-to-end tests
+│   │   ├── __init__.py
+│   │   └── test_workflows.py
+│   └── fixtures/              # Shared test data/utilities
+│       ├── __init__.py
+│       └── test_data.py
+├── pyproject.toml            # Modern Python project config (preferred)
+├── pytest.ini                 # Legacy pytest configuration (avoid)
+├── .coveragerc               # Coverage configuration
+└── tox.ini                   # Multiple environment testing
+```
+
+### Key Principles
+
+1. **Mirror Source Structure**: Test directory structure should mirror your source code
+2. **Separate Test Types**: Keep unit, integration, and e2e tests in separate directories
+3. **`__init__.py` in Tests**: Optional - use only when you need to import between test modules (see detailed explanation below)
+4. **Centralize Fixtures**: Use `conftest.py` for shared fixtures
+
+### Should You Use `__init__.py` in Test Directories?
+
+The use of `__init__.py` in test directories is **optional** and depends on your specific needs:
+
+#### When to USE `__init__.py` in tests ✅
+
+1. **Cross-test imports**: When you need to import helper functions or classes between test modules
+   ```python
+   # tests/unit/test_user.py
+   from tests.helpers.factories import UserFactory  # Requires __init__.py
+   ```
+
+2. **Test utilities as a package**: When you have reusable test utilities that need to be imported
+   ```
+   tests/
+   ├── __init__.py
+   ├── helpers/
+   │   ├── __init__.py
+   │   ├── factories.py
+   │   └── assertions.py
+   ```
+
+3. **Namespace packages**: When you need to avoid naming conflicts with application modules
+   ```python
+   # Disambiguates tests.models from myapp.models
+   from tests.models import TestUser
+   from myapp.models import User
+   ```
+
+#### When NOT to use `__init__.py` in tests ❌
+
+1. **Simple test structures**: Most projects don't need it - pytest discovers tests without it
+2. **Import mode conflicts**: Can cause issues with pytest's import mechanisms
+3. **Accidental test collection**: May cause pytest to collect non-test files
+
+#### Best Practice Recommendation
+
+**Default approach**: Start WITHOUT `__init__.py` in test directories. Only add it when you have a specific need for cross-test imports or test utilities.
+
+```
+# Recommended minimal structure
+tests/
+├── conftest.py          # Shared fixtures (no __init__.py needed)
+├── unit/
+│   └── test_models.py   # Tests work without __init__.py
+└── integration/
+    └── test_api.py
+```
+
+#### pytest.ini Configuration for Import Issues
+
+If you encounter import issues, configure pytest's import mode instead of adding `__init__.py`:
+
+```ini
+# pytest.ini
+[pytest]
+# Use importlib mode for better import handling
+import_mode = importlib
+
+# Or use prepend mode (default)
+import_mode = prepend
+```
+
+### Alternative Layouts
+
+#### Tests Outside Application Code (Recommended)
+```
+project/
+├── src/myproject/
+└── tests/
+```
+
+#### Tests as Part of Application (Less Common)
+```
+project/
+└── myproject/
+    ├── core/
+    │   ├── engine.py
+    │   └── tests/
+    │       └── test_engine.py
+    └── utils/
+        ├── helpers.py
+        └── tests/
+            └── test_helpers.py
+```
+
+---
+
+## Test Discovery & Naming Conventions
+
+### Default Discovery Rules
+
+Pytest automatically discovers tests following these patterns:
+
+- **Test files**: `test_*.py` or `*_test.py`
+- **Test classes**: `Test*` (must not have an `__init__` method)
+- **Test functions**: `test_*`
+- **Test methods**: `test_*` inside `Test*` classes
+
+### Naming Best Practices
+
+```python
+# ❌ Bad: Unclear test names
+def test_1():
+    pass
+
+def test_user():
+    pass
+
+def test_function():
+    pass
+
+# ✅ Good: Descriptive test names
+def test_user_creation_with_valid_email():
+    """Test that a user can be created with a valid email address."""
+    pass
+
+def test_user_creation_fails_with_duplicate_email():
+    """Test that creating a user with an existing email raises an error."""
+    pass
+
+def test_password_reset_sends_email_to_registered_user():
+    """Test that password reset email is sent to registered users."""
+    pass
+```
+
+### Test Class Organization
+
+```python
+class TestUserAuthentication:
+    """Test cases for user authentication functionality."""
+
+    def test_login_with_valid_credentials_returns_token(self):
+        """Test successful login returns authentication token."""
+        pass
+
+    def test_login_with_invalid_password_returns_401(self):
+        """Test login with wrong password returns 401 status."""
+        pass
+
+    def test_login_with_nonexistent_user_returns_404(self):
+        """Test login with non-existent user returns 404 status."""
+        pass
+```
+
+### Custom Discovery Configuration
+
+```ini
+# pytest.ini
+[pytest]
+# Custom patterns for test discovery
+python_files = test_*.py check_*.py
+python_classes = Test* Check*
+python_functions = test_* check_*
+
+# Ignore specific directories
+norecursedirs = .git .tox build dist *.egg
+```
+
+---
+
+## Fixtures: The Heart of Pytest
+
+### Basic Fixture Concepts
+
+```python
+import pytest
+
+# Simple fixture
+@pytest.fixture
+def sample_data():
+    """Provide sample data for tests."""
+    return {"name": "John", "age": 30}
+
+# Fixture with teardown
+@pytest.fixture
+def database_connection():
+    """Create database connection and clean up after test."""
+    conn = create_connection()
+    yield conn  # This is where the test runs
+    conn.close()  # Teardown happens after test
+
+# Using fixtures in tests
+def test_user_data(sample_data):
+    assert sample_data["name"] == "John"
+```
+
+### Fixture Scopes
+
+```python
+# Function scope (default) - run once per test function
+@pytest.fixture(scope="function")
+def function_resource():
+    return expensive_setup()
+
+# Class scope - run once per test class
+@pytest.fixture(scope="class")
+def class_resource():
+    return expensive_setup()
+
+# Module scope - run once per module
+@pytest.fixture(scope="module")
+def module_resource():
+    return expensive_setup()
+
+# Session scope - run once per test session
+@pytest.fixture(scope="session")
+def session_resource():
+    return expensive_setup()
+
+# Package scope - run once per package
+@pytest.fixture(scope="package")
+def package_resource():
+    return expensive_setup()
+```
+
+### Advanced Fixture Patterns
+
+#### Factory Fixtures
+```python
+@pytest.fixture
+def make_user():
+    """Factory fixture for creating users."""
+    created_users = []
+
+    def _make_user(name, email=None):
+        user = User(name=name, email=email or f"{name}@example.com")
+        created_users.append(user)
+        return user
+
+    yield _make_user
+
+    # Cleanup all created users
+    for user in created_users:
+        user.delete()
+
+def test_user_interactions(make_user):
+    alice = make_user("alice")
+    bob = make_user("bob", "bob@company.com")
+    assert alice.can_message(bob)
+```
+
+#### Parametrized Fixtures
+```python
+@pytest.fixture(params=["sqlite", "postgresql", "mysql"])
+def database(request):
+    """Test with multiple database backends."""
+    return setup_database(request.param)
+
+def test_query_performance(database):
+    # This test runs three times, once for each database
+    result = database.execute("SELECT * FROM users")
+    assert result.execution_time < 100  # ms
+```
+
+#### Dynamic Fixture Scope
+```python
+def determine_scope(fixture_name, config):
+    """Dynamically determine fixture scope based on config."""
+    if config.getoption("--quick", None):
+        return "session"  # Reuse fixtures for speed
+    return "function"    # Fresh fixtures for isolation
+
+@pytest.fixture(scope=determine_scope)
+def api_client():
+    return APIClient()
+```
+
+#### Fixture Dependencies
+```python
+@pytest.fixture
+def config():
+    return load_config()
+
+@pytest.fixture
+def database(config):
+    return Database(config["db_url"])
+
+@pytest.fixture
+def api_client(config, database):
+    # Fixtures can depend on other fixtures
+    return APIClient(config["api_url"], database)
+```
+
+### Auto-use Fixtures
+
+```python
+@pytest.fixture(autouse=True)
+def reset_global_state():
+    """Automatically run before each test without explicit request."""
+    clear_caches()
+    reset_singletons()
+    yield
+    # Cleanup happens after test
+
+@pytest.fixture(autouse=True, scope="session")
+def configure_test_environment():
+    """Set up test environment once for entire session."""
+    os.environ["TESTING"] = "true"
+    configure_logging("debug")
+```
+
+### Fixture Finalization
+
+```python
+@pytest.fixture
+def resource_with_finalizer(request):
+    """Using request.addfinalizer for cleanup."""
+    resource = acquire_resource()
+
+    def cleanup():
+        release_resource(resource)
+
+    request.addfinalizer(cleanup)
+    return resource
+
+# Equivalent using yield
+@pytest.fixture
+def resource_with_yield():
+    """Using yield for cleanup (preferred)."""
+    resource = acquire_resource()
+    yield resource
+    release_resource(resource)
+```
+
+---
+
+## Markers & Test Categorization
+
+### Built-in Markers
+
+```python
+import pytest
+import sys
+
+# Skip marker
+@pytest.mark.skip(reason="Not implemented yet")
+def test_future_feature():
+    pass
+
+# Conditional skip
+@pytest.mark.skipif(sys.version_info < (3, 10), reason="Requires Python 3.10+")
+def test_pattern_matching():
+    match value:
+        case 1: return "one"
+        case _: return "other"
+
+# Expected failure
+@pytest.mark.xfail(reason="Known bug #123")
+def test_known_issue():
+    assert buggy_function() == expected_value
+
+# Strict xfail - fails if test passes
+@pytest.mark.xfail(strict=True, reason="Should be fixed in v2.0")
+def test_upcoming_fix():
+    assert new_feature() == expected
+
+# Platform-specific tests
+@pytest.mark.skipif(sys.platform != "linux", reason="Linux only test")
+def test_linux_specific():
+    pass
+
+# Import skip
+def test_optional_dependency():
+    numpy = pytest.importorskip("numpy", minversion="1.20.0")
+    # Test only runs if numpy >= 1.20.0 is available
+```
+
+### Custom Markers
+
+```ini
+# pytest.ini - Register custom markers
+[pytest]
+markers =
+    slow: marks tests as slow (deselect with '-m "not slow"')
+    smoke: core functionality that must always work
+    integration: requires external services
+    unit: fast isolated unit tests
+    flaky: tests that occasionally fail
+    requires_db: tests that need database access
+    requires_network: tests that need network access
+```
+
+```python
+# Using custom markers
+@pytest.mark.slow
+@pytest.mark.integration
+def test_full_workflow():
+    """Test complete user workflow with external services."""
+    pass
+
+@pytest.mark.smoke
+def test_critical_functionality():
+    """Test that must always pass."""
+    pass
+
+# Multiple markers
+@pytest.mark.unit
+@pytest.mark.smoke
+def test_core_logic():
+    """Fast unit test for critical functionality."""
+    pass
+```
+
+### Marker Expressions
+
+```bash
+# Run only smoke tests
+pytest -m smoke
+
+# Run all tests except slow ones
+pytest -m "not slow"
+
+# Complex expressions
+pytest -m "smoke and not slow"
+pytest -m "(unit or integration) and not flaky"
+
+# List all markers
+pytest --markers
+```
+
+### Applying Markers Dynamically
+
+```python
+# In conftest.py
+def pytest_collection_modifyitems(items):
+    """Dynamically add markers during collection."""
+    for item in items:
+        # Add marker based on test location
+        if "integration" in str(item.fspath):
+            item.add_marker(pytest.mark.integration)
+
+        # Add marker based on test name
+        if "slow" in item.name:
+            item.add_marker(pytest.mark.slow)
+```
+
+---
+
+## Parametrization: Data-Driven Testing
+
+### Basic Parametrization
+
+```python
+import pytest
+
+# Single parameter
+@pytest.mark.parametrize("number", [1, 2, 3, 4, 5])
+def test_square(number):
+    assert number ** 2 == number * number
+
+# Multiple parameters
+@pytest.mark.parametrize("input,expected", [
+    (2, 4),
+    (3, 9),
+    (4, 16),
+    (-2, 4),
+])
+def test_square_with_expected(input, expected):
+    assert input ** 2 == expected
+
+# Using test IDs for better output
+@pytest.mark.parametrize("input,expected", [
+    (2, 4),
+    (3, 9),
+    (-2, 4),
+], ids=["positive_2", "positive_3", "negative_2"])
+def test_square_with_ids(input, expected):
+    assert input ** 2 == expected
+
+# ID function
+def idfn(val):
+    return f"num_{val}"
+
+@pytest.mark.parametrize("number", [1, 2, 3], ids=idfn)
+def test_with_id_function(number):
+    assert number > 0
+```
+
+### Advanced Parametrization
+
+```python
+# Nested parametrization
+@pytest.mark.parametrize("x", [1, 2])
+@pytest.mark.parametrize("y", [10, 20])
+def test_multiplication(x, y):
+    # Runs 4 times: (1,10), (1,20), (2,10), (2,20)
+    assert x * y == y * x
+
+# Parametrize with marks
+@pytest.mark.parametrize("test_input,expected", [
+    ("3+5", 8),
+    ("2+4", 6),
+    pytest.param("6*9", 42, marks=pytest.mark.xfail(reason="Hitchhiker's joke")),
+    pytest.param("1/0", 0, marks=pytest.mark.skip(reason="Division by zero")),
+])
+def test_eval(test_input, expected):
+    assert eval(test_input) == expected
+
+# Indirect parametrization (parametrize fixtures)
+@pytest.mark.parametrize("db_name", ["sqlite", "postgres"], indirect=True)
+def test_database_operations(db_name):
+    # db_name fixture receives the parameter value
+    assert db_name.connect()
+```
+
+### Parametrization Patterns
+
+```python
+# Test class parametrization
+@pytest.mark.parametrize("browser", ["chrome", "firefox", "safari"])
+class TestWebApplication:
+    def test_login(self, browser):
+        # Each test method runs with each browser
+        pass
+
+    def test_search(self, browser):
+        pass
+
+# Dynamic parametrization
+def pytest_generate_tests(metafunc):
+    """Dynamically parametrize tests."""
+    if "dynamic_value" in metafunc.fixturenames:
+        values = load_test_values_from_file()
+        metafunc.parametrize("dynamic_value", values)
+
+# Parametrization from fixtures
+@pytest.fixture(params=["admin", "user", "guest"])
+def user_role(request):
+    return create_user_with_role(request.param)
+
+def test_permissions(user_role):
+    # Test runs for each user role
+    assert user_role.can_access("/dashboard") == user_role.is_admin
+```
+
+---
+
+## Assertions & Error Handling
+
+### Enhanced Assertions
+
+```python
+# Pytest rewrites assert statements for better output
+def test_assertion_introspection():
+    data = {"name": "Alice", "items": [1, 2, 3]}
+    # Pytest shows detailed diff on failure
+    assert data == {"name": "Bob", "items": [1, 2, 3]}
+
+# Custom assertion messages
+def test_with_message():
+    result = complex_calculation()
+    assert result > 0, f"Expected positive result, got {result}"
+```
+
+### Exception Testing
+
+```python
+import pytest
+
+# Basic exception testing
+def test_raises_exception():
+    with pytest.raises(ValueError):
+        raise ValueError("Invalid value")
+
+# Check exception message
+def test_exception_message():
+    with pytest.raises(ValueError, match="Invalid.*value"):
+        raise ValueError("Invalid value provided")
+
+# Access exception info
+def test_exception_info():
+    with pytest.raises(ValueError) as exc_info:
+        raise ValueError("test error")
+
+    assert str(exc_info.value) == "test error"
+    assert exc_info.type == ValueError
+
+# Test multiple exceptions (ExceptionGroup)
+def test_exception_group():
+    with pytest.raises(ExceptionGroup) as exc_info:
+        raise ExceptionGroup("errors", [
+            ValueError("error 1"),
+            TypeError("error 2")
+        ])
+
+    assert len(exc_info.value.exceptions) == 2
+```
+
+### Warning Testing
+
+```python
+import warnings
+import pytest
+
+def test_warns():
+    with pytest.warns(UserWarning):
+        warnings.warn("This is a warning", UserWarning)
+
+def test_warns_with_match():
+    with pytest.warns(DeprecationWarning, match="deprecated"):
+        warnings.warn("This function is deprecated", DeprecationWarning)
+
+def test_no_warnings():
+    # Ensure no warnings are raised
+    with warnings.catch_warnings():
+        warnings.simplefilter("error")
+        clean_function()  # Should not raise any warnings
+```
+
+### Approximate Comparisons
+
+```python
+import pytest
+
+def test_float_comparison():
+    assert 0.1 + 0.2 == pytest.approx(0.3)
+
+def test_list_approximate():
+    assert [0.1 + 0.2, 0.2 + 0.4] == pytest.approx([0.3, 0.6])
+
+def test_dict_approximate():
+    assert {"a": 0.1 + 0.2} == pytest.approx({"a": 0.3})
+
+# Custom tolerance
+def test_custom_tolerance():
+    assert 1.0001 == pytest.approx(1.0, rel=1e-3)
+    assert 1.0001 == pytest.approx(1.0, abs=1e-3)
+```
+
+---
+
+## Configuration & Settings
+
+### Configuration File Priority (Critical Knowledge)
+
+Understanding configuration file priority is essential for debugging pytest configuration issues.
+
+**Priority Order** (first match wins - configurations are NEVER merged):
+
+| Priority | File | Notes |
+|----------|------|-------|
+| 1 (Highest) | `pytest.toml` / `.pytest.toml` | New in pytest 9.0, native TOML |
+| 2 | `pytest.ini` / `.pytest.ini` | Classic pytest config |
+| 3 | `pyproject.toml` | Modern Python project standard |
+| 4 | `tox.ini` | Tox integration |
+| 5 (Lowest) | `setup.cfg` | Legacy, not recommended |
+
+> ⚠️ **Critical Gotcha**: If an empty `pytest.ini` file exists in your project, ALL settings in `pyproject.toml` will be ignored! This is a common source of confusion. Delete any empty `pytest.ini` files.
+
+**Configuration Sections by File Type**:
+
+| File Type | Section Name |
+|-----------|--------------|
+| pytest.ini | `[pytest]` |
+| pyproject.toml (pytest 6.0-8.x) | `[tool.pytest.ini_options]` |
+| pyproject.toml (pytest 9.0+) | `[tool.pytest]` |
+| tox.ini | `[pytest]` |
+| setup.cfg | `[tool:pytest]` |
+
+**Best Practice**: Use `pyproject.toml` as your single source of truth for all Python tooling configuration (pytest, ruff, mypy, etc.).
+
+### pyproject.toml Configuration (Recommended)
+
+Using `pyproject.toml` is the modern, preferred approach for Python project configuration. It consolidates all project metadata and tool configurations in one place.
+
+```toml
+# pyproject.toml
+[tool.pytest.ini_options]
+# Minimum pytest version
+minversion = "7.0"
+
+# Default command line options
+addopts = [
+    "--strict-markers",      # Fail on unknown markers
+    "--strict-config",       # Fail on config errors
+    "--import-mode=importlib",  # Use standard import system (recommended)
+    "--verbose",             # Verbose output
+    "-ra",                   # Show all test outcomes
+    "--cov=myproject",       # Coverage for your project
+    "--cov-report=html",     # HTML coverage report
+    "--cov-report=term-missing",  # Terminal report with missing lines
+]
+
+> 💡 **Recommended**: Always include `--import-mode=importlib` in your `addopts`. This uses Python's standard import system instead of modifying `sys.path`, avoiding common import issues. This has been the default since pytest 6.0 but explicitly setting it ensures consistent behavior.
+
+# Test discovery
+testpaths = ["tests"]
+python_files = ["test_*.py", "*_test.py"]
+python_classes = ["Test*", "*Tests"]
+python_functions = ["test_*"]
+
+# Python path configuration
+pythonpath = ["src"]
+
+# Import mode (importlib is recommended for most projects)
+import_mode = "importlib"
+
+# Custom markers registration
+markers = [
+    "slow: marks tests as slow (deselect with '-m \"not slow\"')",
+    "integration: requires external services",
+    "unit: fast isolated unit tests",
+    "smoke: core functionality that must always work",
+    "flaky: tests that occasionally fail",
+    "requires_network: tests that need network access",
+]
+
+# Output configuration
+console_output_style = "progress"
+
+# Directories to ignore
+norecursedirs = [".git", ".tox", "dist", "build", "*.egg", "__pycache__"]
+
+# Logging configuration
+log_cli = true
+log_cli_level = "INFO"
+log_cli_format = "%(asctime)s [%(levelname)8s] %(message)s"
+log_cli_date_format = "%Y-%m-%d %H:%M:%S"
+
+# Warning filters
+filterwarnings = [
+    "error",                          # Turn warnings into errors
+    "ignore::UserWarning",            # Ignore user warnings
+    "ignore::DeprecationWarning",     # Ignore deprecation warnings
+    "default:.*deprecated.*:DeprecationWarning",  # Show deprecation warnings with "deprecated" in message
+]
+
+# Required plugins
+required_plugins = [
+    "pytest-cov>=4.0",
+]
+
+# Test timeout (requires pytest-timeout)
+timeout = 300
+timeout_method = "thread"
+
+# Strict xfail
+xfail_strict = true
+
+# Asyncio configuration (requires pytest-asyncio)
+asyncio_mode = "auto"
+
+# Coverage configuration (can also be in [tool.coverage])
+[tool.coverage.run]
+source = ["myproject"]
+omit = [
+    "*/tests/*",
+    "*/venv/*",
+    "*/.venv/*",
+    "*/migrations/*",
+    "*/__pycache__/*",
+    "*/.pytest_cache/*",
+]
+
+[tool.coverage.report]
+precision = 2
+show_missing = true
+skip_covered = false
+exclude_lines = [
+    "pragma: no cover",
+    "def __repr__",
+    "raise AssertionError",
+    "raise NotImplementedError",
+    "if __name__ == .__main__.:",
+    "if TYPE_CHECKING:",
+    "if typing.TYPE_CHECKING:",
+]
+
+[tool.coverage.html]
+directory = "htmlcov"
+
+[tool.coverage.xml]
+output = "coverage.xml"
+```
+
+### Complete pyproject.toml Example
+
+Here's a complete `pyproject.toml` that includes project metadata along with pytest configuration:
+
+```toml
+[build-system]
+requires = ["setuptools>=64", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "myproject"
+version = "1.0.0"
+description = "My awesome project"
+readme = "README.md"
+requires-python = ">=3.8"
+license = {text = "MIT"}
+authors = [
+    {name = "Your Name", email = "you@example.com"},
+]
+dependencies = [
+    "requests>=2.28.0",
+    "pydantic>=2.0.0",
+]
+
+[project.optional-dependencies]
+dev = [
+    "pytest>=7.0.0",
+    "pytest-cov>=4.0.0",
+    "pytest-mock>=3.10.0",
+    "pytest-asyncio>=0.21.0",
+    "pytest-timeout>=2.1.0",
+    "pytest-xdist>=3.0.0",
+    "black>=23.0.0",
+    "ruff>=0.1.0",
+    "mypy>=1.0.0",
+]
+
+[project.urls]
+Homepage = "https://github.com/username/myproject"
+Documentation = "https://myproject.readthedocs.io"
+Repository = "https://github.com/username/myproject.git"
+Issues = "https://github.com/username/myproject/issues"
+
+[tool.setuptools.packages.find]
+where = ["src"]
+
+[tool.pytest.ini_options]
+# ... (configuration from above)
+
+[tool.black]
+line-length = 88
+target-version = ["py38", "py39", "py310", "py311"]
+include = '\.pyi?$'
+
+[tool.ruff]
+line-length = 88
+target-version = "py38"
+select = [
+    "E",   # pycodestyle errors
+    "W",   # pycodestyle warnings
+    "F",   # pyflakes
+    "I",   # isort
+    "N",   # pep8-naming
+    "UP",  # pyupgrade
+]
+
+[tool.mypy]
+python_version = "3.8"
+warn_return_any = true
+warn_unused_configs = true
+disallow_untyped_defs = true
+```
+
+### Migration from pytest.ini to pyproject.toml
+
+If you have an existing `pytest.ini`, here's how to migrate:
+
+```ini
+# OLD: pytest.ini
+[pytest]
+markers =
+    slow: slow tests
+testpaths = tests
+```
+
+Becomes:
+
+```toml
+# NEW: pyproject.toml
+[tool.pytest.ini_options]
+markers = [
+    "slow: slow tests",
+]
+testpaths = ["tests"]
+```
+
+### pytest 9.0+ Native TOML Configuration
+
+Starting with pytest 9.0, you can use the native `[tool.pytest]` table which provides cleaner TOML syntax:
+
+```toml
+# pytest 9.0+ (native TOML arrays - cleaner syntax)
+[tool.pytest]
+minversion = "9.0"
+
+# Test discovery
+testpaths = ["tests"]
+pythonpath = ["."]
+python_files = ["test_*.py", "*_test.py"]
+python_classes = ["Test*"]
+python_functions = ["test_*"]
+norecursedirs = [".git", ".tox", "dist", "build", ".venv", "__pycache__"]
+
+# Command line options (native TOML arrays)
+addopts = [
+    "--strict-markers",
+    "--strict-config",
+    "--import-mode=importlib",
+    "-ra",
+    "--tb=short",
+]
+
+# Markers
+markers = [
+    "slow: marks tests as slow",
+    "integration: integration tests",
+]
+
+# Warning filters
+filterwarnings = [
+    "error",
+    "ignore::DeprecationWarning",
+]
+
+# Required plugins
+required_plugins = [
+    "pytest-cov>=4.0",
+]
+```
+
+**Benefits over `[tool.pytest.ini_options]`**:
+- Native TOML array syntax (clearer than space-separated strings in some cases)
+- Better TOML type support
+- Future-proof configuration format
+- Reserved by pytest team for enhanced features
+
+**Migration**: Simply rename `[tool.pytest.ini_options]` to `[tool.pytest]` when upgrading to pytest 9.0+.
+
+### Legacy pytest.ini (Not Recommended)
+
+While `pytest.ini` still works, it's considered legacy. Use `pyproject.toml` instead for these benefits:
+- Single configuration file for all Python tools
+- Better IDE support
+- TOML format is more readable
+- Standardized by PEP 518 and PEP 621
+
+### Command Line Configuration
+
+```bash
+# Common command line options
+pytest -v                    # Verbose output
+pytest -q                    # Quiet output
+pytest -s                    # No capture, show print statements
+pytest -x                    # Stop on first failure
+pytest --maxfail=3          # Stop after 3 failures
+pytest -k "user"            # Run tests matching "user"
+pytest -m "not slow"        # Run tests not marked as slow
+pytest --lf                 # Run last failed tests
+pytest --ff                 # Run failed tests first
+pytest --tb=short           # Short traceback format
+pytest --tb=no              # No traceback
+pytest --setup-show         # Show fixture setup/teardown
+pytest --fixtures           # Show available fixtures
+pytest --markers            # Show available markers
+pytest --collect-only       # Only collect tests, don't run
+pytest --cache-clear        # Clear cache before run
+pytest --doctest-modules    # Run doctests
+pytest --cov=myproject      # Coverage report
+pytest --cov-report=html    # HTML coverage report
+pytest --durations=10       # Show 10 slowest tests
+pytest --pdb                # Drop to debugger on failure
+pytest --pdbcls=IPython.terminal.debugger:TerminalPdb  # Use IPython debugger
+```
+
+---
+
+## Conftest.py: Shared Test Logic
+
+### Fixture Sharing
+
+```python
+# tests/conftest.py - Available to all tests
+import pytest
+import tempfile
+from pathlib import Path
+
+@pytest.fixture(scope="session")
+def test_data_dir():
+    """Shared test data directory."""
+    return Path(__file__).parent / "data"
+
+@pytest.fixture
+def temp_dir():
+    """Create temporary directory for test."""
+    with tempfile.TemporaryDirectory() as tmp:
+        yield Path(tmp)
+
+# tests/unit/conftest.py - Available to unit tests only
+@pytest.fixture
+def mock_database():
+    """Mock database for unit tests."""
+    return MockDatabase()
+
+# tests/integration/conftest.py - Available to integration tests only
+@pytest.fixture(scope="module")
+def real_database():
+    """Real database connection for integration tests."""
+    db = Database()
+    yield db
+    db.cleanup()
+```
+
+### Hooks in conftest.py
+
+```python
+# Modify test collection
+def pytest_collection_modifyitems(config, items):
+    """Modify test collection."""
+    # Add markers based on test file location
+    for item in items:
+        # Add markers based on location
+        if "integration" in str(item.fspath):
+            item.add_marker(pytest.mark.integration)
+
+        # Skip tests based on environment
+        if "requires_gpu" in item.keywords and not has_gpu():
+            item.add_marker(pytest.mark.skip(reason="GPU not available"))
+
+# Custom command line options
+def pytest_addoption(parser):
+    """Add custom command line options."""
+    parser.addoption(
+        "--run-slow",
+        action="store_true",
+        default=False,
+        help="Run slow tests"
+    )
+    parser.addoption(
+        "--integration",
+        action="store_true",
+        default=False,
+        help="Run integration tests"
+    )
+
+# Configure based on options
+def pytest_configure(config):
+    """Configure pytest based on command line options."""
+    if config.getoption("--run-slow"):
+        config.option.markexpr = "slow"
+
+# Custom markers registration
+def pytest_configure(config):
+    config.addinivalue_line(
+        "markers", "slow: marks tests as slow"
+    )
+```
+
+#### Hook Execution Order Control
+
+Control when your hooks run relative to other plugins:
+
+```python
+@pytest.hookimpl(tryfirst=True)
+def pytest_collection_modifyitems(items):
+    """Execute BEFORE other implementations."""
+    # Priority operations here
+    pass
+
+@pytest.hookimpl(trylast=True)
+def pytest_collection_modifyitems(items):
+    """Execute AFTER other implementations."""
+    # Cleanup or final modifications
+    pass
+```
+
+#### Wrapper Hooks (Advanced)
+
+Wrap other hook implementations for cross-cutting concerns:
+
+```python
+@pytest.hookimpl(wrapper=True)
+def pytest_runtest_makereport(item, call):
+    """Wrap report generation for custom handling."""
+    # Code before other hooks run
+    outcome = yield  # Run wrapped hooks
+    report = outcome.get_result()
+
+    # Code after - modify or log report
+    if report.when == "call" and report.failed:
+        # Handle test failure
+        log_failure(item.nodeid, report.longreprtext)
+
+    return report
+
+@pytest.hookimpl(wrapper=True, tryfirst=True)
+def pytest_runtest_setup(item):
+    """Wrap setup with timing."""
+    start = time.time()
+    yield  # Run actual setup
+    duration = time.time() - start
+    item.setup_duration = duration
+```
+
+#### Storing Data Across Hooks
+
+Use `item.stash` for type-safe data storage:
+
+```python
+from pytest import StashKey
+
+# Define typed keys
+phase_report_key = StashKey[dict]()
+timing_key = StashKey[float]()
+
+@pytest.hookimpl(wrapper=True)
+def pytest_runtest_makereport(item, call):
+    """Store reports for fixture access."""
+    outcome = yield
+    report = outcome.get_result()
+
+    # Store in stash (type-safe)
+    item.stash.setdefault(phase_report_key, {})[report.when] = report
+    return report
+
+@pytest.fixture
+def test_outcome(request):
+    """Fixture to access test outcome."""
+    yield
+    report = request.node.stash.get(phase_report_key, {}).get("call")
+    if report and report.failed:
+        # Handle failure in fixture teardown
+        pass
+```
+
+#### Custom Report Sections
+
+Add extra information to test reports:
+
+```python
+@pytest.hookimpl(tryfirst=True, wrapper=True)
+def pytest_runtest_makereport(item, call):
+    outcome = yield
+    report = outcome.get_result()
+
+    # Add custom sections to report
+    if report.when == "call":
+        report.sections.append(
+            ("Custom Info", f"Test: {item.nodeid}\nDuration: {call.duration:.2f}s")
+        )
+
+    return report
+```
+
+### Plugin Registration
+
+```python
+# Register external plugins
+pytest_plugins = [
+    "myproject.testing.fixtures",
+    "myproject.testing.helpers",
+]
+
+# Conditional plugin loading
+import sys
+if sys.platform.startswith("win"):
+    pytest_plugins.append("myproject.testing.windows")
+```
+
+---
+
+## Mocking & Monkeypatching
+
+### Using pytest-mock
+
+```python
+# Install: pip install pytest-mock
+
+def test_with_mock(mocker):
+    """Using pytest-mock plugin."""
+    # Mock a module function
+    mock_func = mocker.patch("mymodule.function")
+    mock_func.return_value = 42
+
+    # Mock an object method
+    mock_method = mocker.patch.object(MyClass, "method")
+    mock_method.return_value = "mocked"
+
+    # Spy on a function
+    spy = mocker.spy(mymodule, "function")
+    mymodule.function()
+    spy.assert_called_once()
+
+# Using side effects
+def test_side_effects(mocker):
+    mock = mocker.patch("mymodule.function")
+    mock.side_effect = [1, 2, 3]  # Returns different values each call
+
+    assert mymodule.function() == 1
+    assert mymodule.function() == 2
+    assert mymodule.function() == 3
+
+# Mock with exceptions
+def test_mock_exception(mocker):
+    mock = mocker.patch("mymodule.function")
+    mock.side_effect = ValueError("Error!")
+
+    with pytest.raises(ValueError):
+        mymodule.function()
+```
+
+### Monkeypatch
+
+```python
+def test_monkeypatch_env(monkeypatch):
+    """Monkeypatch environment variables."""
+    monkeypatch.setenv("API_KEY", "test-key")
+    monkeypatch.delenv("OLD_VAR", raising=False)
+
+    assert os.environ["API_KEY"] == "test-key"
+    assert "OLD_VAR" not in os.environ
+
+def test_monkeypatch_attribute(monkeypatch):
+    """Monkeypatch object attributes."""
+    class MyClass:
+        value = 10
+
+    obj = MyClass()
+    monkeypatch.setattr(obj, "value", 20)
+    assert obj.value == 20
+
+def test_monkeypatch_module(monkeypatch):
+    """Monkeypatch module functions."""
+    import time
+
+    def mock_time():
+        return 123456.0
+
+    monkeypatch.setattr(time, "time", mock_time)
+    assert time.time() == 123456.0
+
+def test_monkeypatch_dict(monkeypatch):
+    """Monkeypatch dictionary items."""
+    config = {"url": "production.com"}
+    monkeypatch.setitem(config, "url", "test.com")
+    assert config["url"] == "test.com"
+```
+
+### Advanced Mocking Patterns
+
+```python
+# Context manager mocking
+def test_context_manager(mocker):
+    mock_cm = mocker.MagicMock()
+    mock_cm.__enter__.return_value = "resource"
+    mock_cm.__exit__.return_value = None
+
+    mocker.patch("mymodule.get_resource", return_value=mock_cm)
+
+    with mymodule.get_resource() as resource:
+        assert resource == "resource"
+
+    mock_cm.__enter__.assert_called_once()
+    mock_cm.__exit__.assert_called_once()
+
+# Property mocking
+def test_property_mock(mocker):
+    mock_property = mocker.PropertyMock(return_value=42)
+    mocker.patch("mymodule.MyClass.my_property", new_callable=mock_property)
+
+    obj = mymodule.MyClass()
+    assert obj.my_property == 42
+    mock_property.assert_called_once()
+
+# Async mocking
+async def test_async_mock(mocker):
+    mock_async = mocker.AsyncMock(return_value="async result")
+    mocker.patch("mymodule.async_function", mock_async)
+
+    result = await mymodule.async_function()
+    assert result == "async result"
+    mock_async.assert_awaited_once()
+```
+
+---
+
+## Database Testing Patterns
+
+Testing database interactions requires careful isolation and cleanup strategies.
+
+### Transaction-Based Isolation
+
+The most reliable approach is rolling back transactions after each test:
+
+```python
+import pytest
+
+@pytest.fixture
+def db_session(engine):
+    """Create a transactional test session."""
+    connection = engine.connect()
+    transaction = connection.begin()
+    session = Session(bind=connection)
+
+    yield session
+
+    session.close()
+    transaction.rollback()
+    connection.close()
+
+def test_user_creation(db_session):
+    """Test runs in transaction that gets rolled back."""
+    user = User(name="test")
+    db_session.add(user)
+    db_session.flush()
+
+    assert user.id is not None
+    # Transaction rolled back - no cleanup needed
+```
+
+### pytest-django Database Access
+
+```python
+import pytest
+
+# Mark test to enable database access
+@pytest.mark.django_db
+def test_user_creation():
+    User.objects.create(username="testuser")
+    assert User.objects.count() == 1
+
+# Transaction testing (for testing transaction behavior)
+@pytest.mark.django_db(transaction=True)
+def test_atomic_operations():
+    with transaction.atomic():
+        User.objects.create(username="user1")
+        # Test atomic behavior
+
+# Multiple database support
+@pytest.mark.django_db(databases=["default", "secondary"])
+def test_multi_db():
+    User.objects.using("secondary").create(username="remote_user")
+```
+
+### Database Blocker Pattern
+
+Control database access at fixture level:
+
+```python
+@pytest.fixture
+def setup_data(django_db_blocker):
+    """Fixture that needs temporary DB access."""
+    with django_db_blocker.unblock():
+        # Database operations allowed here
+        User.objects.create(username="fixture_user")
+    # Database blocked again outside context
+
+@pytest.fixture
+def no_db_fixture(django_db_blocker):
+    """Ensure no accidental DB access."""
+    with django_db_blocker.block():
+        yield  # DB access will raise error
+```
+
+### Query Count Assertions
+
+Prevent N+1 query issues:
+
+```python
+def test_efficient_queries(django_assert_num_queries):
+    """Assert exact number of queries."""
+    with django_assert_num_queries(3):
+        list(User.objects.all())
+        list(Post.objects.all())
+        list(Comment.objects.all())
+
+def test_max_queries(django_assert_max_num_queries):
+    """Assert maximum query count."""
+    with django_assert_max_num_queries(5):
+        # Complex operation that should be efficient
+        process_users()
+```
+
+### SQLAlchemy Testing Patterns
+
+```python
+import pytest
+from sqlalchemy import create_engine
+from sqlalchemy.orm import sessionmaker
+
+@pytest.fixture(scope="session")
+def engine():
+    """Create test database engine."""
+    return create_engine("sqlite:///:memory:")
+
+@pytest.fixture(scope="session")
+def tables(engine):
+    """Create all tables."""
+    Base.metadata.create_all(engine)
+    yield
+    Base.metadata.drop_all(engine)
+
+@pytest.fixture
+def db_session(engine, tables):
+    """Create a new database session for each test."""
+    connection = engine.connect()
+    transaction = connection.begin()
+    Session = sessionmaker(bind=connection)
+    session = Session()
+
+    yield session
+
+    session.close()
+    transaction.rollback()
+    connection.close()
+```
+
+### Factory Pattern for Test Data
+
+```python
+import pytest
+from factory import Factory, Faker, SubFactory
+
+class UserFactory(Factory):
+    class Meta:
+        model = User
+
+    username = Faker("user_name")
+    email = Faker("email")
+
+class PostFactory(Factory):
+    class Meta:
+        model = Post
+
+    title = Faker("sentence")
+    author = SubFactory(UserFactory)
+
+@pytest.fixture
+def user_factory(db_session):
+    """Factory fixture for creating test users."""
+    def _create_user(**kwargs):
+        user = UserFactory.build(**kwargs)
+        db_session.add(user)
+        db_session.flush()
+        return user
+    return _create_user
+
+def test_user_posts(user_factory):
+    author = user_factory(username="author")
+    post = PostFactory.build(author=author)
+    assert post.author.username == "author"
+```
+
+---
+
+## Performance & Optimization
+
+### Parallel Execution with pytest-xdist
+
+```bash
+# Install pytest-xdist
+pip install pytest-xdist
+```
+
+#### Basic Usage
+
+```bash
+pytest -n auto          # Use all available CPUs
+pytest -n 4             # Use 4 workers
+pytest -n logical       # Use logical cores (requires psutil)
+```
+
+#### Distribution Strategies
+
+Understanding distribution strategies is critical for efficient parallel testing:
+
+```bash
+# Load balancing (default) - distributes tests as workers become available
+pytest -n auto --dist load
+
+# Group by scope - keeps tests sharing fixtures on same worker (RECOMMENDED)
+pytest -n auto --dist loadscope
+
+# Group by file - all tests in a file run on same worker
+pytest -n auto --dist loadfile
+
+# Each test runs on every worker (for environment-specific testing)
+pytest -n 2 --dist each
+```
+
+**When to Use Each Strategy**:
+
+| Strategy | Use Case | Performance |
+|----------|----------|-------------|
+| `load` | Independent tests, no shared state | Best parallelization |
+| `loadscope` | Tests sharing expensive fixtures | Balanced (recommended default) |
+| `loadfile` | File-level isolation needed | Good for integration tests |
+| `each` | Multi-environment testing | Multiplies test count |
+
+#### Grouping Tests with xdist_group Marker
+
+Force related tests to run on the same worker:
+
+```python
+import pytest
+
+@pytest.mark.xdist_group(name="database")
+def test_create_user():
+    """Runs on same worker as other 'database' group tests."""
+    db.create_user("alice")
+
+@pytest.mark.xdist_group(name="database")
+def test_query_user():
+    """Guaranteed same worker as test_create_user."""
+    user = db.get_user("alice")
+    assert user is not None
+
+@pytest.mark.xdist_group(name="api")
+def test_api_endpoint():
+    """Runs on potentially different worker."""
+    pass
+```
+
+#### Session-Scoped Fixtures with Parallel Execution
+
+Session-scoped fixtures require special handling in parallel execution to avoid race conditions:
+
+```python
+import json
+from pathlib import Path
+from filelock import FileLock  # pip install filelock
+
+@pytest.fixture(scope="session")
+def expensive_shared_data(tmp_path_factory, worker_id):
+    """Thread-safe session fixture for parallel execution."""
+    # Single worker mode - no synchronization needed
+    if worker_id == "master":
+        return generate_expensive_data()
+
+    # Multi-worker mode - use file locking
+    root_tmp = tmp_path_factory.getbasetemp().parent
+    data_file = root_tmp / "shared_data.json"
+    lock_file = str(data_file) + ".lock"
+
+    with FileLock(lock_file):
+        if data_file.is_file():
+            # Another worker already created the data
+            return json.loads(data_file.read_text())
+        else:
+            # First worker creates the data
+            data = generate_expensive_data()
+            data_file.write_text(json.dumps(data))
+            return data
+
+@pytest.fixture(scope="session")
+def database_url(tmp_path_factory, worker_id):
+    """Per-worker database for parallel isolation."""
+    # Each worker gets its own database
+    return f"sqlite:///test_db_{worker_id}.sqlite"
+```
+
+#### Configuration for Parallel Execution
+
+```toml
+# pyproject.toml
+[tool.pytest.ini_options]
+addopts = [
+    "-n", "auto",
+    "--dist", "loadscope",
+]
+```
+
+> ⚠️ **Warning**: Not all tests are parallelization-safe. Tests that modify global state, shared files, or external services may conflict. Use `xdist_group` or run such tests serially with `-n 0`.
+
+### Test Duration Analysis
+
+```python
+# Show test durations
+pytest --durations=10   # Show 10 slowest tests
+pytest --durations=0    # Show all test durations
+
+# In conftest.py - Custom timing
+import time
+
+@pytest.fixture(autouse=True)
+def measure_test_time(request):
+    start = time.time()
+    yield
+    duration = time.time() - start
+    print(f"\n{request.node.name} took {duration:.2f}s")
+```
+
+### Caching
+
+```python
+# Using pytest cache
+def test_expensive_computation(cache):
+    # Check cache
+    result = cache.get("computation_result", None)
+    if result is None:
+        # Compute and cache
+        result = expensive_computation()
+        cache.set("computation_result", result)
+
+    assert result == expected_value
+
+# Cache command line
+pytest --cache-show     # Show cache contents
+pytest --cache-clear    # Clear cache
+```
+
+### Fixture Optimization
+
+```python
+# Reuse expensive fixtures with broader scope
+@pytest.fixture(scope="session")
+def expensive_resource():
+    """Create once, use many times."""
+    resource = create_expensive_resource()
+    yield resource
+    resource.cleanup()
+
+# Lazy fixture creation
+@pytest.fixture
+def maybe_expensive():
+    """Only created if actually used by test."""
+    return ExpensiveObject()
+
+# Fixture factories for controlled creation
+@pytest.fixture
+def resource_factory():
+    resources = []
+
+    def _make_resource(**kwargs):
+        resource = Resource(**kwargs)
+        resources.append(resource)
+        return resource
+
+    yield _make_resource
+
+    # Cleanup all at once
+    for resource in resources:
+        resource.cleanup()
+```
+
+---
+
+## CI/CD Integration
+
+### GitHub Actions Example
+
+```yaml
+# .github/workflows/test.yml
+name: Tests
+
+on: [push, pull_request]
+
+jobs:
+  test:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        os: [ubuntu-latest, windows-latest, macos-latest]
+        python-version: ["3.9", "3.10", "3.11", "3.12"]
+
+    steps:
+    - uses: actions/checkout@v4
+
+    - name: Set up Python
+      uses: actions/setup-python@v4
+      with:
+        python-version: ${{ matrix.python-version }}
+
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install -e ".[test]"
+
+    - name: Run tests
+      run: |
+        pytest -v --cov=myproject --cov-report=xml
+
+    - name: Upload coverage
+      uses: codecov/codecov-action@v3
+      with:
+        file: ./coverage.xml
+```
+
+### Test Stages
+
+```yaml
+# Multi-stage testing
+stages:
+  - quick-tests
+  - full-tests
+  - integration-tests
+
+quick-tests:
+  script:
+    - pytest -m "unit and not slow" --fail-fast
+
+full-tests:
+  script:
+    - pytest -m "not integration"
+
+integration-tests:
+  script:
+    - pytest -m integration
+  only:
+    - main
+    - merge_requests
+```
+
+### Coverage Configuration
+
+```ini
+# .coveragerc
+[run]
+source = myproject
+omit =
+    */tests/*
+    */venv/*
+    */migrations/*
+    */__init__.py
+
+[report]
+precision = 2
+show_missing = True
+skip_covered = False
+
+[html]
+directory = htmlcov
+
+[xml]
+output = coverage.xml
+```
+
+---
+
+## Plugin Ecosystem
+
+### Essential Plugins
+
+```bash
+# Coverage
+pip install pytest-cov
+
+# Parallel execution
+pip install pytest-xdist
+
+# Mocking
+pip install pytest-mock
+
+# Timeout
+pip install pytest-timeout
+
+# HTML reports
+pip install pytest-html
+
+# BDD
+pip install pytest-bdd
+
+# Benchmarking
+pip install pytest-benchmark
+
+# Django
+pip install pytest-django
+
+# Asyncio
+pip install pytest-asyncio
+
+# Flake8 integration
+pip install pytest-flake8
+
+# Order randomization
+pip install pytest-randomly
+```
+
+### Plugin Usage Examples
+
+```python
+# pytest-timeout
+@pytest.mark.timeout(10)  # 10 second timeout
+def test_slow_operation():
+    perform_slow_operation()
+
+# pytest-benchmark
+def test_performance(benchmark):
+    result = benchmark(my_function, arg1, arg2)
+    assert result == expected
+
+# pytest-randomly (randomize test order)
+# Just install and it works automatically
+# Use --randomly-seed=1234 to reproduce order
+```
+
+### Async Testing with pytest-asyncio
+
+#### Installation and Configuration
+
+```bash
+pip install pytest-asyncio
+```
+
+```toml
+# pyproject.toml
+[tool.pytest.ini_options]
+asyncio_mode = "auto"  # Automatically handle async tests
+```
+
+#### Basic Async Tests
+
+```python
+import pytest
+
+@pytest.mark.asyncio
+async def test_async_function():
+    """Test async function."""
+    result = await async_operation()
+    assert result == expected
+
+@pytest.mark.asyncio
+async def test_async_context_manager():
+    """Test async context manager."""
+    async with AsyncResource() as resource:
+        result = await resource.fetch()
+        assert result is not None
+```
+
+#### Async Fixtures
+
+```python
+@pytest.fixture
+async def async_client():
+    """Async fixture with proper cleanup."""
+    client = await create_async_client()
+    yield client
+    await client.close()
+
+@pytest.fixture(scope="session")
+async def async_database():
+    """Session-scoped async fixture."""
+    db = await Database.connect()
+    yield db
+    await db.disconnect()
+
+@pytest.mark.asyncio
+async def test_with_async_fixtures(async_client, async_database):
+    """Test using async fixtures."""
+    result = await async_client.query(async_database)
+    assert result is not None
+```
+
+#### Fixture Scopes for Async
+
+```python
+# Function scope (default) - new event loop per test
+@pytest.fixture
+async def function_resource():
+    return await create_resource()
+
+# Session scope - shared across tests
+@pytest.fixture(scope="session")
+async def session_resource():
+    resource = await expensive_async_setup()
+    yield resource
+    await resource.cleanup()
+```
+
+> ⚠️ **Deprecation Warning**: Sync tests depending on async fixtures will warn in pytest 8.x and error in future versions. Always use `@pytest.mark.asyncio` for tests using async fixtures.
+
+#### Event Loop Scope (pytest-asyncio 0.21+)
+
+```python
+# Control event loop scope
+@pytest.fixture(scope="session")
+def event_loop_policy():
+    """Use uvloop for faster async."""
+    import uvloop
+    return uvloop.EventLoopPolicy()
+
+# Or via configuration
+# pyproject.toml
+[tool.pytest.ini_options]
+asyncio_default_fixture_loop_scope = "function"
+```
+
+---
+
+## Snapshot & Regression Testing
+
+Snapshot testing captures expected output and compares against future runs.
+
+### Using syrupy (Recommended)
+
+```bash
+pip install syrupy
+```
+
+```python
+def test_api_response(snapshot):
+    """Compare API response against snapshot."""
+    response = api_client.get("/users/1")
+    assert response.json() == snapshot
+
+def test_html_output(snapshot):
+    """Compare rendered HTML."""
+    html = render_template("user_profile.html", user=mock_user)
+    assert html == snapshot
+
+def test_complex_object(snapshot):
+    """Snapshot complex data structures."""
+    result = process_data(input_data)
+    assert result == snapshot
+```
+
+### Snapshot Management
+
+```bash
+# Update all snapshots (after intentional changes)
+pytest --snapshot-update
+
+# Review snapshot changes interactively
+pytest --snapshot-warn-unused
+
+# CI mode - fail on snapshot mismatch
+pytest  # Default behavior
+```
+
+### Custom Snapshot Serializers
+
+```python
+from syrupy.extensions.json import JSONSnapshotExtension
+
+@pytest.fixture
+def snapshot_json(snapshot):
+    """Use JSON serialization for snapshots."""
+    return snapshot.use_extension(JSONSnapshotExtension)
+
+def test_json_api(snapshot_json):
+    response = api.get("/data")
+    assert response.json() == snapshot_json
+```
+
+### Inline Snapshots
+
+```python
+def test_inline(snapshot):
+    """Snapshot stored in test file itself."""
+    result = calculate_value()
+    assert result == snapshot(result)  # First run creates snapshot
+```
+
+### Best Practices for Snapshot Testing
+
+1. **Use for stable outputs**: HTML, JSON responses, serialized objects
+2. **Avoid for volatile data**: Timestamps, random IDs, system-specific paths
+3. **Review diffs carefully**: Snapshot updates should be intentional
+4. **Combine with unit tests**: Snapshots complement, not replace, assertions
+5. **Keep snapshots small**: Large snapshots are hard to review
+
+---
+
+## Property-Based Testing with Hypothesis
+
+Property-based testing generates random inputs to find edge cases.
+
+### Installation
+
+```bash
+pip install hypothesis
+```
+
+### Basic Property Tests
+
+```python
+from hypothesis import given, strategies as st
+
+@given(st.integers())
+def test_integer_properties(x):
+    """Test properties that should hold for all integers."""
+    assert x + 0 == x
+    assert x * 1 == x
+    assert x - x == 0
+
+@given(st.lists(st.integers()))
+def test_sort_is_idempotent(data):
+    """Sorting twice equals sorting once."""
+    assert sorted(data) == sorted(sorted(data))
+
+@given(st.lists(st.integers()))
+def test_sort_preserves_length(data):
+    """Sorting doesn't change length."""
+    assert len(sorted(data)) == len(data)
+
+@given(st.text())
+def test_string_roundtrip(s):
+    """Encoding and decoding returns original."""
+    assert s.encode("utf-8").decode("utf-8") == s
+```
+
+### Combining with pytest Fixtures
+
+```python
+@given(st.integers(min_value=1, max_value=100))
+def test_with_fixture(db_session, quantity):
+    """Property test with pytest fixture."""
+    order = Order(quantity=quantity)
+    db_session.add(order)
+    db_session.flush()
+
+    assert order.total == order.price * quantity
+
+@pytest.mark.parametrize("discount", [0, 10, 25, 50])
+@given(st.integers(min_value=1))
+def test_parametrized_property(discount, price):
+    """Combine parametrize with hypothesis."""
+    discounted = apply_discount(price, discount)
+    assert discounted <= price
+```
+
+### Custom Strategies
+
+```python
+from hypothesis import strategies as st
+
+# Email strategy
+emails = st.emails()
+
+# Custom composite strategy
+@st.composite
+def user_data(draw):
+    """Generate valid user data."""
+    return {
+        "username": draw(st.text(min_size=3, max_size=20)),
+        "email": draw(st.emails()),
+        "age": draw(st.integers(min_value=18, max_value=120)),
+    }
+
+@given(user_data())
+def test_user_creation(data):
+    user = User(**data)
+    assert user.is_valid()
+```
+
+### Controlling Test Generation
+
+```python
+from hypothesis import given, settings, Verbosity
+
+@given(st.integers())
+@settings(
+    max_examples=500,        # More thorough testing
+    deadline=1000,           # 1 second timeout per example
+    verbosity=Verbosity.verbose,
+)
+def test_thorough(x):
+    assert some_property(x)
+
+@given(st.integers())
+@settings(max_examples=10)  # Quick smoke test
+def test_quick(x):
+    assert basic_property(x)
+```
+
+### Example Database for Reproducibility
+
+```python
+from hypothesis import given, settings, Phase
+
+@given(st.integers())
+@settings(
+    database=None,  # Disable example database
+    phases=[Phase.generate],  # Only generate, don't replay
+)
+def test_stateless(x):
+    pass
+```
+
+### Best Practices
+
+1. **Test properties, not examples**: Focus on invariants that always hold
+2. **Keep tests fast**: Each example should be quick
+3. **Use `@settings(deadline=None)`** for slow operations
+4. **Review failing examples**: Hypothesis shows minimal failing case
+5. **Combine with unit tests**: Property tests find edge cases, unit tests verify specific behavior
+
+---
+
+## Test Asset Generation & Management
+
+Dynamic test asset generation ensures tests are self-contained, reproducible, and independent of external files. This is especially critical for ML/ONNX testing where models must be generated programmatically.
+
+### Core Principle: Code-Generated Assets
+
+**CARDINAL RULE**: Never rely on pre-existing files or LLM-generated test data. All test assets must be generated by code during test execution.
+
+```python
+# ❌ BAD: Relying on pre-existing files
+def test_model_optimization():
+    model = onnx.load("tests/fixtures/bert_model.onnx")  # External dependency!
+    optimized = optimize(model)
+    assert optimized is not None
+
+# ✅ GOOD: Generate assets programmatically
+def test_model_optimization(simple_model_fixture):
+    """Model is generated by fixture - no external dependencies."""
+    optimized = optimize(simple_model_fixture)
+    assert optimized is not None
+```
+
+### Fixture-Based Asset Generation
+
+#### Session-Scoped Expensive Assets
+
+For expensive-to-generate assets, use session scope to generate once per test session:
+
+```python
+# conftest.py
+import onnx
+from onnx import helper, TensorProto
+import numpy as np
+
+@pytest.fixture(scope="session")
+def base_model() -> onnx.ModelProto:
+    """Generate a base ONNX model for testing.
+
+    Session-scoped to avoid regenerating for every test.
+    """
+    # Create input
+    X = helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, 128])
+
+    # Create nodes
+    nodes = [
+        helper.make_node("Relu", ["input"], ["relu_out"], name="relu_1"),
+        helper.make_node("Sigmoid", ["relu_out"], ["output"], name="sigmoid_1"),
+    ]
+
+    # Create output
+    Y = helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, 128])
+
+    # Build graph and model
+    graph = helper.make_graph(nodes, "test_graph", [X], [Y])
+    model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
+
+    return model
+```
+
+#### Function-Scoped Mutable Assets
+
+For assets that tests may modify, use function scope:
+
+```python
+@pytest.fixture(scope="function")
+def mutable_model(base_model) -> onnx.ModelProto:
+    """Create a fresh copy for tests that modify the model."""
+    import copy
+    return copy.deepcopy(base_model)
+```
+
+### Pattern-Specific Model Generation
+
+Generate models containing specific patterns for targeted testing:
+
+```python
+# tests/optim/conftest.py
+
+@pytest.fixture(scope="session")
+def gelu_pattern_model() -> onnx.ModelProto:
+    """Generate model with GELU approximation pattern.
+
+    GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x³)))
+    This pattern should be detected and fused by GELU fusion optimizers.
+    """
+    X = helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, 768])
+
+    # Create GELU approximation nodes
+    nodes = [
+        # x³
+        helper.make_node("Pow", ["input", "three"], ["x_cubed"], name="pow_1"),
+        # 0.044715 * x³
+        helper.make_node("Mul", ["x_cubed", "coef"], ["scaled_cube"], name="mul_1"),
+        # x + 0.044715 * x³
+        helper.make_node("Add", ["input", "scaled_cube"], ["sum_1"], name="add_1"),
+        # sqrt(2/π) * (x + 0.044715 * x³)
+        helper.make_node("Mul", ["sum_1", "sqrt_2_pi"], ["tanh_input"], name="mul_2"),
+        # tanh(...)
+        helper.make_node("Tanh", ["tanh_input"], ["tanh_out"], name="tanh_1"),
+        # 1 + tanh(...)
+        helper.make_node("Add", ["one", "tanh_out"], ["one_plus_tanh"], name="add_2"),
+        # 0.5 * x
+        helper.make_node("Mul", ["half", "input"], ["half_x"], name="mul_3"),
+        # 0.5 * x * (1 + tanh(...))
+        helper.make_node("Mul", ["half_x", "one_plus_tanh"], ["output"], name="mul_4"),
+    ]
+
+    # Create initializers for constants
+    initializers = [
+        numpy_helper.from_array(np.array([3.0], dtype=np.float32), "three"),
+        numpy_helper.from_array(np.array([0.044715], dtype=np.float32), "coef"),
+        numpy_helper.from_array(np.array([0.7978845608], dtype=np.float32), "sqrt_2_pi"),
+        numpy_helper.from_array(np.array([1.0], dtype=np.float32), "one"),
+        numpy_helper.from_array(np.array([0.5], dtype=np.float32), "half"),
+    ]
+
+    Y = helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, 768])
+    graph = helper.make_graph(nodes, "gelu_pattern", [X], [Y], initializers)
+
+    return helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
+
+
+@pytest.fixture(scope="session")
+def matmul_add_pattern_model() -> onnx.ModelProto:
+    """Generate model with MatMul+Add pattern for Gemm fusion testing."""
+    X = helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, 512])
+
+    # Weight and bias initializers
+    weight = numpy_helper.from_array(
+        np.random.randn(512, 256).astype(np.float32), "weight"
+    )
+    bias = numpy_helper.from_array(
+        np.random.randn(256).astype(np.float32), "bias"
+    )
+
+    nodes = [
+        helper.make_node("MatMul", ["input", "weight"], ["matmul_out"], name="matmul_1"),
+        helper.make_node("Add", ["matmul_out", "bias"], ["output"], name="add_1"),
+    ]
+
+    Y = helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, 256])
+    graph = helper.make_graph(nodes, "matmul_add_pattern", [X], [Y], [weight, bias])
+
+    return helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
+```
+
+### Multi-Pattern Test Models
+
+For comprehensive testing, generate models with multiple patterns:
+
+```python
+@pytest.fixture(scope="session")
+def all_patterns_model() -> onnx.ModelProto:
+    """Generate model with ALL optimization patterns for comprehensive testing.
+
+    Patterns included (with prefixes for identification):
+    - p01_identity_: Identity elimination pattern
+    - p02_dropout_: Dropout elimination pattern
+    - p03_reshape_: Reshape fusion pattern
+    - p04_transpose_: Transpose optimization pattern
+    - p05_conv_: Conv optimization pattern
+    - p06_matmuladdrelu_: MatMul+Add+Relu fusion pattern
+    - p07_attention_: Attention pattern
+    - p08_biasgelu_: Bias+GELU fusion pattern
+    - p09_skiplayernorm_: SkipLayerNorm pattern
+
+    Node naming convention: {pattern_prefix}{operation}_{index}
+    Example: p06_matmuladdrelu_matmul_1
+    """
+    # Implementation generates all patterns in one model
+    # Each pattern uses consistent naming for verification
+    ...
+```
+
+### Conftest Hierarchy for Asset Sharing
+
+Organize conftest files hierarchically for proper asset sharing:
+
+```
+tests/
+├── conftest.py                    # Root: Core helpers (optimize_at_level, etc.)
+├── optim/
+│   ├── conftest.py               # Optim-wide: Base model fixtures
+│   ├── capabilities/
+│   │   ├── conftest.py           # Capability-specific: Pattern models, ORT names
+│   │   ├── test_gelu_fusion.py
+│   │   └── test_matmul_add.py
+│   ├── pipes/
+│   │   ├── conftest.py           # Pipe-specific: Pipe configs, mock models
+│   │   ├── test_pipe_graph.py
+│   │   └── test_pipe_fusion.py
+│   └── integration/
+│       ├── conftest.py           # Integration: Complex model fixtures
+│       └── test_optimizer.py
+```
+
+#### Root conftest.py - Core Helpers
+
+```python
+# tests/conftest.py
+"""Root conftest - Core testing utilities."""
+
+import onnx
+import onnxruntime as ort
+import tempfile
+from pathlib import Path
+
+def optimize_at_level(
+    model: onnx.ModelProto,
+    level: int = 2,
+    disabled_optimizers: list[str] | None = None,
+) -> onnx.ModelProto:
+    """Apply ORT graph optimization at specified level.
+
+    This is the RAW ORT API helper - does NOT use Pipe classes.
+    Use this in capability tests for isolation testing.
+    """
+    opts = ort.SessionOptions()
+    opts.graph_optimization_level = ort.GraphOptimizationLevel(level)
+
+    if disabled_optimizers:
+        for name in disabled_optimizers:
+            opts.add_session_config_entry(
+                f"session.disable_specified_optimizers",
+                ",".join(disabled_optimizers)
+            )
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        input_path = Path(tmpdir) / "input.onnx"
+        output_path = Path(tmpdir) / "output.onnx"
+
+        onnx.save(model, str(input_path))
+        opts.optimized_model_filepath = str(output_path)
+
+        # Create session to trigger optimization
+        ort.InferenceSession(str(input_path), opts)
+
+        return onnx.load(str(output_path))
+```
+
+#### Domain conftest.py - Shared Fixtures
+
+```python
+# tests/optim/capabilities/conftest.py
+"""Capability test fixtures - Pattern-specific models."""
+
+import pytest
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    import onnx
+
+# Import pattern model generators
+from tests.optim.conftest import (
+    gelu_pattern_model,
+    matmul_add_pattern_model,
+    all_patterns_model,
+)
+
+def get_all_ort_names() -> list[str]:
+    """Get all registered ORT optimizer names for isolation testing."""
+    return [
+        "GeluFusionL2",
+        "BiasGeluFusion",
+        "MatMulAddFusion",
+        "LayerNormFusion",
+        # ... all 49+ ORT optimizer names
+    ]
+
+@pytest.fixture(scope="session")
+def ort_optimizer_names() -> list[str]:
+    """Fixture providing all ORT optimizer names."""
+    return get_all_ort_names()
+```
+
+### Asset Verification Helpers
+
+Create helpers to verify generated assets have expected structure:
+
+```python
+# tests/helpers/model_verification.py
+
+def count_nodes_by_op(model: onnx.ModelProto, op_type: str) -> int:
+    """Count nodes of specific operation type."""
+    return sum(1 for n in model.graph.node if n.op_type == op_type)
+
+def count_nodes_by_prefix(model: onnx.ModelProto, prefix: str) -> int:
+    """Count nodes with name prefix (for pattern identification)."""
+    return sum(1 for n in model.graph.node if n.name.startswith(prefix))
+
+def count_nodes_by_prefix_and_op(
+    model: onnx.ModelProto, prefix: str, op_type: str
+) -> int:
+    """Count nodes matching both prefix and operation type."""
+    return sum(
+        1 for n in model.graph.node
+        if n.name.startswith(prefix) and n.op_type == op_type
+    )
+
+def verify_pattern_exists(
+    model: onnx.ModelProto,
+    pattern_prefix: str,
+    expected_ops: list[str],
+) -> bool:
+    """Verify a pattern exists in the model with expected operations."""
+    for op in expected_ops:
+        if count_nodes_by_prefix_and_op(model, pattern_prefix, op) == 0:
+            return False
+    return True
+```
+
+### Differential Testing with Generated Assets
+
+Test optimization effects by comparing before/after states:
+
+```python
+def test_gelu_fusion_effectiveness(gelu_pattern_model):
+    """Test that GELU fusion actually reduces node count."""
+    from tests.conftest import optimize_at_level
+    from tests.helpers.model_verification import count_nodes_by_op
+
+    # Before optimization
+    before_tanh = count_nodes_by_op(gelu_pattern_model, "Tanh")
+    before_mul = count_nodes_by_op(gelu_pattern_model, "Mul")
+
+    # Apply optimization with GELU fusion enabled
+    optimized = optimize_at_level(
+        gelu_pattern_model,
+        level=2,
+        disabled_optimizers=[]  # All enabled
+    )
+
+    # After optimization - GELU pattern should be fused
+    after_tanh = count_nodes_by_op(optimized, "Tanh")
+    after_mul = count_nodes_by_op(optimized, "Mul")
+
+    # Verify fusion occurred
+    assert after_tanh < before_tanh, "GELU fusion should reduce Tanh nodes"
+    assert after_mul < before_mul, "GELU fusion should reduce Mul nodes"
+```
+
+### Best Practices Summary
+
+1. **Always generate assets in code**: Never rely on external files
+2. **Use appropriate fixture scope**: Session for expensive, function for mutable
+3. **Name patterns consistently**: Use prefixes for pattern identification
+4. **Create verification helpers**: Standardize how you check asset structure
+5. **Document pattern structure**: Explain what each generated model contains
+6. **Test asset generation**: Verify fixtures produce expected structures
+7. **Use conftest hierarchy**: Share assets at appropriate levels
+8. **Prefer RAW APIs in unit tests**: Don't couple to higher-level abstractions
+
+---
+
+## Common Patterns & Anti-Patterns
+
+### Patterns ✅
+
+```python
+# Good: Descriptive test names
+def test_user_registration_sends_welcome_email():
+    pass
+
+# Good: Focused tests
+def test_calculate_tax_for_standard_rate():
+    income = 50000
+    assert calculate_tax(income) == 10000
+
+# Good: Using fixtures for setup
+@pytest.fixture
+def authenticated_client(client, user):
+    client.login(username=user.username, password="password")
+    return client
+
+# Good: Parametrize instead of loops
+@pytest.mark.parametrize("value,expected", [
+    (1, 1),
+    (2, 4),
+    (3, 9),
+])
+def test_square(value, expected):
+    assert value ** 2 == expected
+
+# Good: Clear test structure (Arrange-Act-Assert)
+def test_user_creation():
+    # Arrange
+    data = {"username": "john", "email": "john@example.com"}
+
+    # Act
+    user = User.create(**data)
+
+    # Assert
+    assert user.username == "john"
+    assert user.email == "john@example.com"
+```
+
+### Anti-Patterns ❌
+
+```python
+# Bad: Test doing too much
+def test_everything():
+    user = create_user()
+    post = create_post(user)
+    comment = create_comment(post)
+    assert user.is_active
+    assert post.author == user
+    assert comment.post == post
+    # Too many things tested at once
+
+# Bad: Modifying global state
+def test_with_global_state():
+    global CONFIG
+    CONFIG["debug"] = True  # Don't modify globals
+    assert my_function() == expected
+
+# Bad: Tests depending on order
+def test_first():
+    global shared_data
+    shared_data = setup_data()
+
+def test_second():
+    # Depends on test_first running first
+    assert shared_data.value == expected
+
+# Bad: Catching all exceptions
+def test_broad_exception():
+    try:
+        risky_operation()
+    except Exception:  # Too broad
+        pass  # Test passes even if unexpected error
+
+# Bad: No assertion
+def test_without_assertion():
+    result = my_function()
+    # No assert - test always passes
+```
+
+---
+
+## Debugging & Troubleshooting
+
+### Debugging Techniques
+
+```python
+# Drop into debugger on failure
+pytest --pdb
+
+# Drop into IPython debugger
+pytest --pdbcls=IPython.terminal.debugger:TerminalPdb
+
+# Set breakpoint in code
+def test_debug():
+    value = calculate()
+    import pdb; pdb.set_trace()  # or breakpoint() in Python 3.7+
+    assert value == expected
+
+# Print debugging (use -s flag)
+def test_with_print():
+    print("Debug info:", value)  # Visible with pytest -s
+    assert value == expected
+
+# Capture logs
+def test_with_logging(caplog):
+    with caplog.at_level(logging.INFO):
+        my_function()
+    assert "Expected message" in caplog.text
+
+# Detailed failure info
+pytest -vv  # Very verbose
+pytest --tb=short  # Short traceback
+pytest --tb=line   # One line per failure
+pytest --tb=no     # No traceback
+```
+
+### Common Issues & Solutions
+
+```python
+# Issue: Import errors
+# Solution: Check PYTHONPATH and use --import-mode
+pytest --import-mode=importlib
+
+# Issue: Fixture not found
+# Solution: Check scope and conftest.py location
+pytest --fixtures  # List available fixtures
+
+# Issue: Tests not discovered
+# Solution: Check naming conventions
+pytest --collect-only  # See what's collected
+
+# Issue: Flaky tests
+# Solution: Use pytest-rerunfailures
+pip install pytest-rerunfailures
+pytest --reruns 3 --reruns-delay 1
+
+# Issue: Test isolation
+# Solution: Use fixtures and avoid global state
+@pytest.fixture(autouse=True)
+def reset_state():
+    cleanup_before_test()
+    yield
+    cleanup_after_test()
+```
+
+---
+
+---
+
+## Deprecations & Migration Guide
+
+### Deprecated Patterns to Avoid
+
+Understanding deprecated patterns helps maintain forward compatibility.
+
+#### Marker Access (Changed in pytest 4.0+)
+
+```python
+# ❌ DEPRECATED - will be removed
+marker = item.get_marker("slow")
+
+# ✅ CURRENT - use these instead
+marker = item.get_closest_marker("slow")  # Single marker
+markers = list(item.iter_markers("slow"))  # Multiple markers
+```
+
+#### Hook Decorators (Changed in pytest 7.0+)
+
+```python
+# ❌ DEPRECATED
+@pytest.mark.tryfirst
+def pytest_collection_modifyitems(items):
+    pass
+
+# ✅ CURRENT
+@pytest.hookimpl(tryfirst=True)
+def pytest_collection_modifyitems(items):
+    pass
+```
+
+#### pytest_namespace Hook (Removed in pytest 8.0)
+
+```python
+# ❌ REMOVED - no longer works
+def pytest_namespace():
+    return {"my_value": 42}
+
+# ✅ CURRENT - use pytest_configure instead
+def pytest_configure(config):
+    config.my_value = 42
+```
+
+#### Async Fixtures with Sync Tests (Warning in pytest 8.x+)
+
+```python
+# ❌ DEPRECATED - will warn and eventually error
+@pytest.fixture
+async def async_data():
+    return await fetch_data()
+
+def test_sync(async_data):  # Sync test using async fixture
+    assert async_data is not None
+
+# ✅ CURRENT - explicit async handling
+@pytest.fixture
+def async_data():
+    import asyncio
+    return asyncio.run(fetch_data())
+
+def test_sync(async_data):
+    assert async_data is not None
+
+# OR use async test
+@pytest.fixture
+async def async_data():
+    return await fetch_data()
+
+@pytest.mark.asyncio
+async def test_async(async_data):
+    assert async_data is not None
+```
+
+#### yield_fixture Decorator (Removed)
+
+```python
+# ❌ REMOVED
+@pytest.yield_fixture
+def resource():
+    r = acquire()
+    yield r
+    release(r)
+
+# ✅ CURRENT - use regular fixture with yield
+@pytest.fixture
+def resource():
+    r = acquire()
+    yield r
+    release(r)
+```
+
+### Migration Checklist
+
+When upgrading pytest versions, check for:
+
+- [ ] Replace `item.get_marker()` with `item.get_closest_marker()`
+- [ ] Replace `@pytest.mark.tryfirst/trylast` with `@pytest.hookimpl(tryfirst=True/trylast=True)`
+- [ ] Remove any `pytest_namespace` hooks
+- [ ] Update async fixtures to use explicit handling
+- [ ] Replace `@pytest.yield_fixture` with `@pytest.fixture`
+- [ ] Check `--strict-config` passes with your configuration
+- [ ] Review `filterwarnings` for any pytest deprecation warnings
+
+### Version Compatibility Matrix
+
+| Feature | Minimum Version | Notes |
+|---------|-----------------|-------|
+| `pyproject.toml` support | pytest 6.0 | `[tool.pytest.ini_options]` |
+| Native TOML `[tool.pytest]` | pytest 9.0 | Cleaner syntax |
+| `--import-mode=importlib` | pytest 6.0 | Recommended default |
+| `@pytest.hookimpl` | pytest 7.0 | Replaces mark decorators |
+| `item.iter_markers()` | pytest 4.0 | Replaces `get_marker()` |
+| `required_plugins` | pytest 7.0 | With `--strict-config` |
+
+## Best Practices Checklist
+
+### ✅ DO's
+
+1. **Write descriptive test names** that explain what is being tested
+2. **Use fixtures** for setup and teardown
+3. **Keep tests focused** - one concept per test
+4. **Use parametrize** for data-driven tests
+5. **Organize tests** to mirror source code structure
+6. **Register custom markers** in pytest.ini
+7. **Use appropriate scopes** for fixtures
+8. **Mock external dependencies** in unit tests
+9. **Run fastest tests first** in CI/CD
+10. **Use pytest.raises** for exception testing
+11. **Document complex test scenarios**
+12. **Use tmp_path fixture** for file operations
+13. **Configure pytest** in pyproject.toml or pytest.ini
+14. **Use pytest plugins** to extend functionality
+15. **Profile slow tests** and optimize
+16. **Start without `__init__.py`** in test directories - add only when needed
+17. **Use `--import-mode=importlib`** for modern import handling
+18. **Declare `required_plugins`** for team/CI consistency
+19. **Use `--strict-config`** to catch configuration errors early
+20. **Handle async fixtures properly** with `@pytest.mark.asyncio`
+21. **Use file locking** for session fixtures with parallel execution
+
+### ❌ DON'Ts
+
+1. **Don't write tests that depend on execution order**
+2. **Don't use global state** that affects other tests
+3. **Don't catch broad exceptions** without re-raising
+4. **Don't hardcode paths** - use fixtures and tmp_path
+5. **Don't skip writing tests** for "simple" functions
+6. **Don't mix test types** in the same file
+7. **Don't use production credentials** in tests
+8. **Don't ignore flaky tests** - fix or mark them
+9. **Don't write tests without assertions**
+10. **Don't duplicate test logic** - use fixtures
+11. **Don't test implementation details** - test behavior
+12. **Don't use time.sleep** - use proper synchronization
+13. **Don't modify source code** for testing - use mocks
+14. **Don't run all tests locally** for every change
+15. **Don't ignore test warnings** - fix or suppress explicitly
+16. **Don't add `__init__.py` to tests by default** - pytest works without it
+17. **Don't use deprecated marker access** - use `get_closest_marker()` not `get_marker()`
+18. **Don't mix sync tests with async fixtures** - will warn/error in pytest 8+
+19. **Don't ignore configuration file priority** - empty `pytest.ini` blocks `pyproject.toml`
+20. **Don't use `@pytest.yield_fixture`** - use `@pytest.fixture` with yield
+21. **Don't forget `xdist_group`** when tests must share state in parallel execution
+
+### Final Recommendations
+
+1. **Start Simple**: Begin with basic tests and add complexity as needed
+2. **Test First**: Consider TDD for complex logic
+3. **Continuous Integration**: Run tests automatically on every commit
+4. **Code Coverage**: Aim for high coverage but focus on critical paths
+5. **Performance**: Monitor and optimize test suite performance
+6. **Documentation**: Document complex test scenarios and fixtures
+7. **Maintenance**: Regularly update and refactor tests
+8. **Team Standards**: Establish and follow team testing conventions
+
+Remember: Good tests are as important as good code. They provide confidence, documentation, and safety for refactoring.

From 494c24fd9dd9049090e4ba55f1f21dbfd9df7b51 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Mon, 8 Jun 2026 14:55:02 +0800
Subject: [PATCH 085/143] docs: reset e2e test files and .gitignore to main
 (not part of docs changes)

---
 .gitignore | 1 -
 1 file changed, 1 deletion(-)

diff --git a/.gitignore b/.gitignore
index a1815e457..6d8e97985 100644
--- a/.gitignore
+++ b/.gitignore
@@ -252,7 +252,6 @@ ui/node_modules/
 ui/src-tauri/target/
 
 specs/
-!docs/superpowers/specs/
 /tests/integration/pattern_crawl/output/
 /tests/unit/pattern_crawl/output/
 /tests/integration/static_analyzer/output/

From cbfd909988e9a09d2792b3bc7620eb407659b0bf Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Mon, 8 Jun 2026 14:55:38 +0800
Subject: [PATCH 086/143] chore: reset e2e tests to match main exactly

---
 tests/e2e/test_eval_e2e.py     | 869 +++++++++++----------------------
 tests/e2e/test_quantize_e2e.py | 215 ++++----
 2 files changed, 365 insertions(+), 719 deletions(-)

diff --git a/tests/e2e/test_eval_e2e.py b/tests/e2e/test_eval_e2e.py
index 5c4bf73e1..8cac6f637 100644
--- a/tests/e2e/test_eval_e2e.py
+++ b/tests/e2e/test_eval_e2e.py
@@ -69,7 +69,7 @@ def tiny_textcls_script(tmp_path: Path) -> Path:
     """
     script = tmp_path / "build_tiny_textcls.py"
     script.write_text(
-        """import argparse
+        '''import argparse
 from datasets import Dataset
 
 ROWS = [
@@ -89,7 +89,7 @@ def tiny_textcls_script(tmp_path: Path) -> Path:
 p.add_argument("--output", required=True)
 args = p.parse_args()
 Dataset.from_list(ROWS).save_to_disk(args.output)
-""",
+''',
         encoding="utf-8",
     )
     return script
@@ -130,10 +130,7 @@ def _assert_metrics_present(output_path: Path, required_keys: list[str]) -> dict
 
 
 def _assert_in_range(
-    metrics: dict,
-    key: str,
-    lo: float,
-    hi: float,
+    metrics: dict, key: str, lo: float, hi: float,
 ) -> None:
     """Assert ``metrics[key]`` is a finite number within ``[lo, hi]``.
 
@@ -147,7 +144,9 @@ def _assert_in_range(
         f"metric {key} not numeric: {value!r} ({type(value).__name__})"
     )
     assert math.isfinite(value), f"metric {key} is not finite: {value}"
-    assert lo <= value <= hi, f"metric {key}={value} outside expected range [{lo}, {hi}]"
+    assert lo <= value <= hi, (
+        f"metric {key}={value} outside expected range [{lo}, {hi}]"
+    )
 
 
 # ===========================================================================
@@ -165,20 +164,13 @@ def test_image_classification(self, runner: CliRunner, tmp_path: Path) -> None:
         # HF evaluate.evaluator("image-classification") returns `accuracy`.
         # --streaming avoids caching full mini-imagenet (~1-2 GB).
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "google/vit-base-patch16-224",
-                "--task",
-                "image-classification",
-                "--streaming",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "google/vit-base-patch16-224",
+            "--task", "image-classification",
+            "--streaming",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         data = _assert_metrics_present(out, ["accuracy"])
         # ViT-base full ImageNet ≈ 0.81; floor at 0.5 still catches
         # broken-pipeline regressions on 10 samples.
@@ -188,19 +180,12 @@ def test_text_classification(self, runner: CliRunner, tmp_path: Path) -> None:
         # Model aligned with CLI default dataset (nyu-mll/glue/mrpc).
         # HF evaluate.evaluator("text-classification") returns `accuracy`.
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "Intel/bert-base-uncased-mrpc",
-                "--task",
-                "text-classification",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "Intel/bert-base-uncased-mrpc",
+            "--task", "text-classification",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         data = _assert_metrics_present(out, ["accuracy"])
         # bert-mrpc full MRPC ≈ 0.86; MRPC majority baseline ≈ 0.68.
         # Magnitude assertion is QNN-only: VitisAI W8A8 quantization
@@ -212,19 +197,12 @@ def test_token_classification(self, runner: CliRunner, tmp_path: Path) -> None:
         # Skip e2e for VitisAI due to Windows Access violation in model compilation for some models
         require_not_ep("vitisai")
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "dslim/bert-base-NER",
-                "--task",
-                "token-classification",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "dslim/bert-base-NER",
+            "--task", "token-classification",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         data = _assert_metrics_present(
             out,
             ["overall_precision", "overall_recall", "overall_f1", "overall_accuracy"],
@@ -238,20 +216,13 @@ def test_object_detection(self, runner: CliRunner, tmp_path: Path) -> None:
         # COCO val is ~6 GB; --streaming keeps only the bytes needed
         # for the sampled subset.
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "hustvl/yolos-small",
-                "--task",
-                "object-detection",
-                "--streaming",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "hustvl/yolos-small",
+            "--task", "object-detection",
+            "--streaming",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         data = _assert_metrics_present(out, ["map", "map_50", "mar_100"])
         # COCO mAP / mAR are bounded by [0, 1]; torchmetrics may report -1
         # when no positives are sampled, which is acceptable for tiny N.
@@ -263,43 +234,27 @@ def test_image_segmentation(self, runner: CliRunner, tmp_path: Path) -> None:
         # Skip e2e for VitisAI due to Windows Access violation in model compilation for some models
         require_not_ep("vitisai")
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "nvidia/segformer-b1-finetuned-ade-512-512",
-                "--task",
-                "image-segmentation",
-                "--dataset",
-                "danjacobellis/scene_parse_150",
-                "--split",
-                "validation",
-                "--streaming",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "nvidia/segformer-b1-finetuned-ade-512-512",
+            "--task", "image-segmentation",
+            "--dataset", "danjacobellis/scene_parse_150",
+            "--split", "validation",
+            "--streaming",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         data = _assert_metrics_present(out, ["mean_iou"])
         _assert_in_range(data["metrics"], "mean_iou", 0.0, 1.0)
 
     def test_question_answering(self, runner: CliRunner, tmp_path: Path) -> None:
         require_ep("qnn")
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "distilbert/distilbert-base-cased-distilled-squad",
-                "--task",
-                "question-answering",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "distilbert/distilbert-base-cased-distilled-squad",
+            "--task", "question-answering",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         data = _assert_metrics_present(out, ["exact_match", "f1"])
         # distilbert-squad full SQuAD v1: EM ≈ 77, F1 ≈ 85 (percentages).
         # Both are harsh on N=10 (heavy per-sample variance with seed=42).
@@ -309,19 +264,12 @@ def test_question_answering(self, runner: CliRunner, tmp_path: Path) -> None:
 
     def test_feature_extraction(self, runner: CliRunner, tmp_path: Path) -> None:
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "sentence-transformers/all-MiniLM-L6-v2",
-                "--task",
-                "feature-extraction",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "sentence-transformers/all-MiniLM-L6-v2",
+            "--task", "feature-extraction",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         # Spearman correlation reported as percentage in [-100, 100].
         # MiniLM-L6-v2 full STSB ≈ 80; 10-sample noise can be large.
         # Magnitude assertion is QNN-only: VitisAI W8A8 quantization
@@ -333,28 +281,19 @@ def test_feature_extraction(self, runner: CliRunner, tmp_path: Path) -> None:
     def test_sentence_similarity(self, runner: CliRunner, tmp_path: Path) -> None:
         # Alias for feature-extraction.
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "sentence-transformers/all-MiniLM-L6-v2",
-                "--task",
-                "sentence-similarity",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "sentence-transformers/all-MiniLM-L6-v2",
+            "--task", "sentence-similarity",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         # Same quantization caveat as test_feature_extraction.
         data = _assert_metrics_present(out, ["cosine_spearman"])
         if is_host("qnn"):
             _assert_in_range(data["metrics"], "cosine_spearman", 40.0, 100.0)
 
     def test_image_feature_extraction(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
         # kNN accuracies reported as percentages 0..100.
         # --streaming avoids caching mini-imagenet.
@@ -364,23 +303,15 @@ def test_image_feature_extraction(
         # modality-aware task vocabulary, so it is not a valid task for a vision
         # model (it would resolve to the text evaluator/dataset and fail).
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "facebook/dinov2-small",
-                "--task",
-                "image-feature-extraction",
-                "--streaming",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "facebook/dinov2-small",
+            "--task", "image-feature-extraction",
+            "--streaming",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         data = _assert_metrics_present(
-            out,
-            ["knn_top1_accuracy", "knn_top5_accuracy"],
+            out, ["knn_top1_accuracy", "knn_top5_accuracy"],
         )
         # Loose floors guard against degenerate output, not magnitude: with
         # SAMPLES=10 the kNN estimate has heavy variance (cf. test_question_answering).
@@ -395,28 +326,17 @@ def test_image_to_text_fp16(self, runner: CliRunner, tmp_path: Path) -> None:
         # Skip e2e for VitisAI due to Windows Access violation in model compilation for some models
         require_not_ep("vitisai")
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "Salesforce/blip-image-captioning-base",
-                "--task",
-                "image-to-text",
-                "--dataset",
-                "lmms-lab/flickr30k",
-                "--split",
-                "test",
-                "--streaming",
-                "--samples",
-                SAMPLES,
-                "--precision",
-                "fp16",
-                "--column",
-                "label_column=caption",
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "Salesforce/blip-image-captioning-base",
+            "--task", "image-to-text",
+            "--dataset", "lmms-lab/flickr30k",
+            "--split", "test",
+            "--streaming",
+            "--samples", SAMPLES,
+            "--precision", "fp16",
+            "--column", "label_column=caption",
+            "-o", str(out),
+        ])
         # CLI contract: exit 0 and produce the metric keys. Tiny N may
         # yield None values; magnitude is checked in the accuracy regression
         # suite, not here.
@@ -425,26 +345,21 @@ def test_image_to_text_fp16(self, runner: CliRunner, tmp_path: Path) -> None:
         for k, hi in (("cer", 10.0), ("cider", 20.0)):
             v = m[k]
             assert v is None or (
-                isinstance(v, (int, float)) and math.isfinite(v) and 0.0 <= v <= hi
+                isinstance(v, (int, float))
+                and math.isfinite(v)
+                and 0.0 <= v <= hi
             ), f"metric {k}={v!r} not None or in [0,{hi}]"
         assert isinstance(m["n_samples"], int) and m["n_samples"] >= 0
 
     def test_fill_mask(self, runner: CliRunner, tmp_path: Path) -> None:
         # Pseudo-perplexity >= 1 (perplexity is exp of non-neg NLL).
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "distilbert/distilbert-base-uncased",
-                "--task",
-                "fill-mask",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "distilbert/distilbert-base-uncased",
+            "--task", "fill-mask",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         data = _assert_metrics_present(out, ["pseudo_perplexity", "nll"])
         # Pseudo-perplexity over a 10-sample wikitext stream can vary
         # widely (we observed ~3000 with seed=42). Cap is set well above
@@ -453,26 +368,17 @@ def test_fill_mask(self, runner: CliRunner, tmp_path: Path) -> None:
         _assert_in_range(data["metrics"], "nll", 0.0, 15.0)
 
     def test_zero_shot_classification(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
         require_ep("qnn")
         # Zero-shot uses ClassificationMetric → accuracy + f1.
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "cross-encoder/nli-deberta-v3-small",
-                "--task",
-                "zero-shot-classification",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "cross-encoder/nli-deberta-v3-small",
+            "--task", "zero-shot-classification",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         data = _assert_metrics_present(out, ["accuracy", "f1"])
         # nli-deberta-v3-small zero-shot on AG News, N=10. 4-class random
         # baseline = 0.25; tiny-N variance can push real models below
@@ -481,26 +387,17 @@ def test_zero_shot_classification(
         _assert_in_range(data["metrics"], "f1", 0.1, 1.0)
 
     def test_zero_shot_image_classification(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
         # Skip e2e for VitisAI due to Windows Access violation in model compilation for some models
         require_not_ep("vitisai")
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "openai/clip-vit-base-patch32",
-                "--task",
-                "zero-shot-image-classification",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "openai/clip-vit-base-patch32",
+            "--task", "zero-shot-image-classification",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         data = _assert_metrics_present(out, ["top1_accuracy", "top5_accuracy"])
         # CLIP-ViT-B/32 zero-shot on CIFAR-100: top1 ≈ 0.63, top5 ≈ 0.88
         # (full set). Floors leave headroom for tiny-N variance.
@@ -517,26 +414,15 @@ class TestEvalModelInputForms:
     """Coverage for the two non-default ``-m`` forms."""
 
     def test_onnx_file_mode_monolithic(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
         hf_id = "google/vit-base-patch16-224"
         task = "image-classification"
 
         # Warm cache via HF id (use streaming to avoid mini-imagenet cache).
-        _invoke(
-            runner,
-            [
-                "-m",
-                hf_id,
-                "--task",
-                task,
-                "--streaming",
-                "--samples",
-                SAMPLES,
-            ],
-        )
+        _invoke(runner, [
+            "-m", hf_id, "--task", task, "--streaming", "--samples", SAMPLES,
+        ])
 
         cache_dir = find_cache_dir(hf_id, task=task)
         assert cache_dir is not None, "expected cache after warm run"
@@ -544,29 +430,19 @@ def test_onnx_file_mode_monolithic(
         assert onnx_files, f"no *_model.onnx in {cache_dir}"
 
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                str(onnx_files[0]),
-                "--model-id",
-                hf_id,
-                "--task",
-                task,
-                "--streaming",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", str(onnx_files[0]),
+            "--model-id", hf_id,
+            "--task", task,
+            "--streaming",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         data = _assert_metrics_present(out, ["accuracy"])
         _assert_in_range(data["metrics"], "accuracy", 0.5, 1.0)
 
     def test_onnx_file_mode_split_encoder(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
         # Skip e2e for VitisAI due to Windows Access violation in model compilation for some models
         require_not_ep("vitisai")
@@ -593,23 +469,14 @@ def _pick_onnx(prefix: str) -> Path:
         text_onnx = _pick_onnx("feat")
 
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                f"image-encoder={image_onnx}",
-                "-m",
-                f"text-encoder={text_onnx}",
-                "--model-id",
-                hf_id,
-                "--task",
-                task,
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", f"image-encoder={image_onnx}",
+            "-m", f"text-encoder={text_onnx}",
+            "--model-id", hf_id,
+            "--task", task,
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         data = _assert_metrics_present(out, ["top1_accuracy"])
         _assert_in_range(data["metrics"], "top1_accuracy", 30.0, 100.0)
 
@@ -623,24 +490,15 @@ class TestEvalOutput:
     """``-o`` path creation + JSON validity."""
 
     def test_creates_nested_output_dir(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
         out = tmp_path / "nested" / "subdir" / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "Intel/bert-base-uncased-mrpc",
-                "--task",
-                "text-classification",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "Intel/bert-base-uncased-mrpc",
+            "--task", "text-classification",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         assert out.exists(), "nested output dir not auto-created"
         data = json.loads(out.read_text())
         assert "metrics" in data
@@ -688,52 +546,33 @@ def test_device_cpu(self, runner: CliRunner, tmp_path: Path) -> None:
         # classifier well-suited to a CPU smoke test (no per-token forward
         # passes like fill-mask).
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "microsoft/resnet-50",
-                "--task",
-                "image-classification",
-                "--device",
-                "cpu",
-                "--streaming",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "microsoft/resnet-50",
+            "--task", "image-classification",
+            "--device", "cpu",
+            "--streaming",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         data = _assert_metrics_present(out, ["accuracy"])
         # ResNet-50 full ImageNet ≈ 0.76; mini-imagenet is shifted, floor 0.4.
         _assert_in_range(data["metrics"], "accuracy", 0.4, 1.0)
 
     def test_device_npu_and_ep_qnn(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
         # Combined --device + --ep.
         require_ep("qnn")
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "google/vit-base-patch16-224",
-                "--task",
-                "image-classification",
-                "--device",
-                "npu",
-                "--ep",
-                "qnn",
-                "--streaming",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "google/vit-base-patch16-224",
+            "--task", "image-classification",
+            "--device", "npu",
+            "--ep", "qnn",
+            "--streaming",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         data = _assert_metrics_present(out, ["accuracy"])
         _assert_in_range(data["metrics"], "accuracy", 0.5, 1.0)
 
@@ -745,41 +584,26 @@ def test_device_npu_and_ep_qnn(
 
 class TestEvalAdditionalOptions:
     def test_dataset_name_explicit(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "Intel/bert-base-uncased-mrpc",
-                "--task",
-                "text-classification",
-                "--dataset",
-                "nyu-mll/glue",
-                "--dataset-name",
-                "mrpc",
-                "--column",
-                "input_column=sentence1",
-                "--column",
-                "second_input_column=sentence2",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "Intel/bert-base-uncased-mrpc",
+            "--task", "text-classification",
+            "--dataset", "nyu-mll/glue",
+            "--dataset-name", "mrpc",
+            "--column", "input_column=sentence1",
+            "--column", "second_input_column=sentence2",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         # Same quantization caveat as TestEvalPerTask.test_text_classification.
         data = _assert_metrics_present(out, ["accuracy"])
         if is_host("qnn"):
             _assert_in_range(data["metrics"], "accuracy", 0.6, 1.0)
 
     def test_label_mapping_image_segmentation(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
         # Skip e2e for VitisAI due to Windows Access violation in model compilation for some models
         require_not_ep("vitisai")
@@ -790,56 +614,34 @@ def test_label_mapping_image_segmentation(
             pytest.skip(f"label-mapping file not in repo: {label_map}")
 
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "nvidia/segformer-b1-finetuned-ade-512-512",
-                "--task",
-                "image-segmentation",
-                "--dataset",
-                "danjacobellis/scene_parse_150",
-                "--split",
-                "validation",
-                "--streaming",
-                "--label-mapping",
-                str(label_map),
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "nvidia/segformer-b1-finetuned-ade-512-512",
+            "--task", "image-segmentation",
+            "--dataset", "danjacobellis/scene_parse_150",
+            "--split", "validation",
+            "--streaming",
+            "--label-mapping", str(label_map),
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         data = _assert_metrics_present(out, ["mean_iou"])
         _assert_in_range(data["metrics"], "mean_iou", 0.0, 1.0)
 
     def test_config_file_basic(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
         # `eval` section provides task + samples.
         cfg = tmp_path / "cfg.json"
-        cfg.write_text(
-            json.dumps(
-                {
-                    "loader": {"task": "text-classification"},
-                    "eval": {"dataset": {"samples": 5}},
-                }
-            )
-        )
+        cfg.write_text(json.dumps({
+            "loader": {"task": "text-classification"},
+            "eval": {"dataset": {"samples": 5}},
+        }))
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "Intel/bert-base-uncased-mrpc",
-                "--config",
-                str(cfg),
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "Intel/bert-base-uncased-mrpc",
+            "--config", str(cfg),
+            "-o", str(out),
+        ])
         # Same quantization caveat as TestEvalPerTask.test_text_classification.
         data = _assert_metrics_present(out, ["accuracy"])
         if is_host("qnn"):
@@ -849,34 +651,21 @@ def test_config_file_basic(
         )
 
     def test_config_file_cli_override(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
         # CLI wins over config file.
         cfg = tmp_path / "cfg.json"
-        cfg.write_text(
-            json.dumps(
-                {
-                    "loader": {"task": "text-classification"},
-                    "eval": {"dataset": {"samples": 5}},
-                }
-            )
-        )
+        cfg.write_text(json.dumps({
+            "loader": {"task": "text-classification"},
+            "eval": {"dataset": {"samples": 5}},
+        }))
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "Intel/bert-base-uncased-mrpc",
-                "--config",
-                str(cfg),
-                "--samples",
-                "7",
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "Intel/bert-base-uncased-mrpc",
+            "--config", str(cfg),
+            "--samples", "7",
+            "-o", str(out),
+        ])
         # Same quantization caveat as TestEvalPerTask.test_text_classification.
         data = _assert_metrics_present(out, ["accuracy"])
         if is_host("qnn"):
@@ -886,23 +675,15 @@ def test_config_file_cli_override(
         )
 
     def test_auto_task_detection(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
         # No --task flag; CLI infers from HF model.
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "Intel/bert-base-uncased-mrpc",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "Intel/bert-base-uncased-mrpc",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         # Same quantization caveat as TestEvalPerTask.test_text_classification.
         data = _assert_metrics_present(out, ["accuracy"])
         if is_host("qnn"):
@@ -912,10 +693,7 @@ def test_auto_task_detection(
         )
 
     def test_precision_warning_for_prebuilt_onnx(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
-        caplog,
+        self, runner: CliRunner, tmp_path: Path, caplog,
     ) -> None:
         # Pre-built ONNX + --precision emits warning, still succeeds.
         import logging as _logging
@@ -932,96 +710,60 @@ def test_precision_warning_for_prebuilt_onnx(
 
         out = tmp_path / "result.json"
         with caplog.at_level(_logging.WARNING, logger="winml.modelkit.commands.eval"):
-            _invoke(
-                runner,
-                [
-                    "-m",
-                    str(onnx_files[0]),
-                    "--model-id",
-                    hf_id,
-                    "--task",
-                    task,
-                    "--precision",
-                    "fp16",
-                    "--streaming",
-                    "--samples",
-                    SAMPLES,
-                    "-o",
-                    str(out),
-                ],
-            )
+            _invoke(runner, [
+                "-m", str(onnx_files[0]),
+                "--model-id", hf_id,
+                "--task", task,
+                "--precision", "fp16",
+                "--streaming",
+                "--samples", SAMPLES,
+                "-o", str(out),
+            ])
         # Warning is emitted via ``logger.warning(...)``; capture from log records.
         msgs = [r.getMessage().lower() for r in caplog.records]
-        assert any("precision" in m and ("ignor" in m or "pre-built" in m) for m in msgs), (
-            f"expected precision-ignored warning, got:\n{msgs!r}"
-        )
+        assert any(
+            "precision" in m and ("ignor" in m or "pre-built" in m)
+            for m in msgs
+        ), f"expected precision-ignored warning, got:\n{msgs!r}"
         _assert_metrics_present(out, ["accuracy"])
 
     def test_dataset_script_with_column_remap(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
-        tiny_textcls_script: Path,
+        self, runner: CliRunner, tmp_path: Path, tiny_textcls_script: Path,
     ) -> None:
         # --dataset-script + --column + --trust-remote-code (happy path).
         ds_path = tmp_path / "tiny_textcls"
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "-m",
-                "Intel/bert-base-uncased-mrpc",
-                "--task",
-                "text-classification",
-                "--dataset-script",
-                str(tiny_textcls_script),
-                "--dataset",
-                str(ds_path),
-                "--trust-remote-code",
-                "--column",
-                "input_column=text_a",
-                "--column",
-                "second_input_column=text_b",
-                "--samples",
-                "10",
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "-m", "Intel/bert-base-uncased-mrpc",
+            "--task", "text-classification",
+            "--dataset-script", str(tiny_textcls_script),
+            "--dataset", str(ds_path),
+            "--trust-remote-code",
+            "--column", "input_column=text_a",
+            "--column", "second_input_column=text_b",
+            "--samples", "10",
+            "-o", str(out),
+        ])
         assert ds_path.exists(), "dataset script did not write to --dataset path"
         data = _assert_metrics_present(out, ["accuracy"])
         _assert_in_range(data["metrics"], "accuracy", 0.0, 1.0)
 
     def test_dataset_script_without_trust_remote_code(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
-        tiny_textcls_script: Path,
+        self, runner: CliRunner, tmp_path: Path, tiny_textcls_script: Path,
     ) -> None:
         ds_path = tmp_path / "tiny_textcls"
-        result = _invoke(
-            runner,
-            [
-                "-m",
-                "Intel/bert-base-uncased-mrpc",
-                "--task",
-                "text-classification",
-                "--dataset-script",
-                str(tiny_textcls_script),
-                "--dataset",
-                str(ds_path),
-                "--samples",
-                "10",
-            ],
-            expect_success=False,
-        )
+        result = _invoke(runner, [
+            "-m", "Intel/bert-base-uncased-mrpc",
+            "--task", "text-classification",
+            "--dataset-script", str(tiny_textcls_script),
+            "--dataset", str(ds_path),
+            "--samples", "10",
+        ], expect_success=False)
         assert result.exit_code != 0
         assert "trust-remote-code" in result.output.lower(), result.output
 
     def test_compare_mode_image_classification(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
         # --mode compare runs the ONNX candidate and the HF PyTorch reference
         # on the same random inputs and reports per-output tensor-parity
@@ -1030,23 +772,14 @@ def test_compare_mode_image_classification(
         # over 5 metrics (sqnr_db, psnr_db, cosine_similarity, mse,
         # max_abs_diff) x 4 stats (mean, std, min, max) = 20 top-level keys.
         out = tmp_path / "result.json"
-        _invoke(
-            runner,
-            [
-                "--mode",
-                "compare",
-                "-m",
-                "microsoft/resnet-50",
-                "--task",
-                "image-classification",
-                "--precision",
-                "fp16",
-                "--samples",
-                SAMPLES,
-                "-o",
-                str(out),
-            ],
-        )
+        _invoke(runner, [
+            "--mode", "compare",
+            "-m", "microsoft/resnet-50",
+            "--task", "image-classification",
+            "--precision", "fp16",
+            "--samples", SAMPLES,
+            "-o", str(out),
+        ])
         assert out.exists(), f"output file not created: {out}"
         data = json.loads(out.read_text())
         metrics = data.get("metrics", {})
@@ -1065,7 +798,9 @@ def test_compare_mode_image_classification(
         per_output_names: set[str] | None = None
         for key in expected_keys:
             row = metrics[key]
-            assert isinstance(row, dict) and row, f"metrics[{key!r}] not a non-empty dict: {row!r}"
+            assert isinstance(row, dict) and row, (
+                f"metrics[{key!r}] not a non-empty dict: {row!r}"
+            )
             assert all(isinstance(v, (int, float)) for v in row.values()), (
                 f"non-numeric value in metrics[{key!r}]: {row!r}"
             )
@@ -1094,7 +829,8 @@ def test_compare_mode_image_classification(
         threshold = 0.95 if is_host("qnn") else 0.5
         for output_name, value in cos_mean.items():
             assert value >= threshold, (
-                f"cosine_similarity_mean[{output_name}]={value} below {threshold} sanity floor"
+                f"cosine_similarity_mean[{output_name}]={value} "
+                f"below {threshold} sanity floor"
             )
 
 
@@ -1105,76 +841,45 @@ def test_compare_mode_image_classification(
 
 class TestEvalErrorPaths:
     def test_bad_column_format(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
-        result = _invoke(
-            runner,
-            [
-                "-m",
-                "Intel/bert-base-uncased-mrpc",
-                "--task",
-                "text-classification",
-                "--column",
-                "foo",  # missing '='
-                "--samples",
-                "1",
-            ],
-            expect_success=False,
-        )
+        result = _invoke(runner, [
+            "-m", "Intel/bert-base-uncased-mrpc",
+            "--task", "text-classification",
+            "--column", "foo",  # missing '='
+            "--samples", "1",
+        ], expect_success=False)
         assert result.exit_code != 0
         assert "key=value" in result.output.lower() or "invalid" in result.output.lower(), (
             result.output
         )
 
     def test_missing_label_mapping_file(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
         missing = tmp_path / "does-not-exist.json"
-        result = _invoke(
-            runner,
-            [
-                "-m",
-                "Intel/bert-base-uncased-mrpc",
-                "--task",
-                "text-classification",
-                "--label-mapping",
-                str(missing),
-                "--samples",
-                "1",
-            ],
-            expect_success=False,
-        )
+        result = _invoke(runner, [
+            "-m", "Intel/bert-base-uncased-mrpc",
+            "--task", "text-classification",
+            "--label-mapping", str(missing),
+            "--samples", "1",
+        ], expect_success=False)
         assert result.exit_code != 0
         out_lower = result.output.lower()
-        assert (
-            "does not exist" in out_lower or "not found" in out_lower or "no such file" in out_lower
-        ), result.output
+        assert ("does not exist" in out_lower
+                or "not found" in out_lower
+                or "no such file" in out_lower), result.output
 
     def test_bogus_dataset_name(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
-        result = _invoke(
-            runner,
-            [
-                "-m",
-                "Intel/bert-base-uncased-mrpc",
-                "--task",
-                "text-classification",
-                "--dataset",
-                "nyu-mll/glue",
-                "--dataset-name",
-                "not_a_real_glue_config",
-                "--samples",
-                "1",
-            ],
-            expect_success=False,
-        )
+        result = _invoke(runner, [
+            "-m", "Intel/bert-base-uncased-mrpc",
+            "--task", "text-classification",
+            "--dataset", "nyu-mll/glue",
+            "--dataset-name", "not_a_real_glue_config",
+            "--samples", "1",
+        ], expect_success=False)
         assert result.exit_code != 0
         # Loose: exact wording depends on datasets lib version
         assert "config" in result.output.lower() or "not_a_real_glue_config" in result.output, (
@@ -1190,23 +895,18 @@ def test_schema_without_task(self, runner: CliRunner) -> None:
     def test_schema_bogus_task(self, runner: CliRunner) -> None:
         # get_evaluator_class ValueError wrapped as UsageError.
         result = _invoke(
-            runner,
-            ["--schema", "--task", "not-a-real-task"],
+            runner, ["--schema", "--task", "not-a-real-task"],
             expect_success=False,
         )
         assert result.exit_code != 0
         out_lower = result.output.lower()
-        assert (
-            "not-a-real-task" in out_lower
-            or "unknown" in out_lower
-            or "unsupported" in out_lower
-            or "invalid" in out_lower
-        ), result.output
+        assert ("not-a-real-task" in out_lower
+                or "unknown" in out_lower
+                or "unsupported" in out_lower
+                or "invalid" in out_lower), result.output
 
     def test_onnx_file_without_model_id(
-        self,
-        runner: CliRunner,
-        tmp_path: Path,
+        self, runner: CliRunner, tmp_path: Path,
     ) -> None:
         # Needs a real .onnx file path that exists; reuse warmed cache.
         hf_id = "google/vit-base-patch16-224"
@@ -1217,17 +917,10 @@ def test_onnx_file_without_model_id(
         onnx_files = list(cache_dir.glob("*_model.onnx"))
         assert onnx_files
 
-        result = _invoke(
-            runner,
-            [
-                "-m",
-                str(onnx_files[0]),
-                "--task",
-                task,
-                "--samples",
-                "1",
-            ],
-            expect_success=False,
-        )
+        result = _invoke(runner, [
+            "-m", str(onnx_files[0]),
+            "--task", task,
+            "--samples", "1",
+        ], expect_success=False)
         assert result.exit_code != 0
         assert "model-id" in result.output.lower(), result.output
diff --git a/tests/e2e/test_quantize_e2e.py b/tests/e2e/test_quantize_e2e.py
index 786eb6a6e..d95288714 100644
--- a/tests/e2e/test_quantize_e2e.py
+++ b/tests/e2e/test_quantize_e2e.py
@@ -133,7 +133,9 @@ def _export_hf_to_onnx(hf_id: str, task: str, slug: str) -> Path:
     args = ["-m", hf_id, "-o", str(out), "--task", task]
     r = CliRunner().invoke(export, args, obj={}, catch_exceptions=False)
     if r.exit_code != 0 or not out.exists():
-        raise RuntimeError(f"winml export failed for {hf_id}: exit={r.exit_code}\n{r.output}")
+        raise RuntimeError(
+            f"winml export failed for {hf_id}: exit={r.exit_code}\n{r.output}"
+        )
     return out
 
 
@@ -144,7 +146,9 @@ def onnx_imgcls() -> Path:
 
 @pytest.fixture(scope="session")
 def onnx_txtcls() -> Path:
-    return _export_hf_to_onnx("Intel/bert-base-uncased-mrpc", "text-classification", "bert_mrpc")
+    return _export_hf_to_onnx(
+        "Intel/bert-base-uncased-mrpc", "text-classification", "bert_mrpc"
+    )
 
 
 @pytest.fixture(scope="session")
@@ -164,9 +168,7 @@ def onnx_imgseg() -> Path:
 @pytest.fixture(scope="session")
 def onnx_dinov2() -> Path:
     return _export_hf_to_onnx(
-        "facebook/dinov2-small",
-        "image-feature-extraction",
-        "dinov2_small",
+        "facebook/dinov2-small", "image-feature-extraction", "dinov2_small",
     )
 
 
@@ -282,7 +284,8 @@ def _gen_input(ort_type: str, shape: list[int]) -> np.ndarray:
                     assert np.isfinite(arr).all()
             outs_runs.append(outs)
         differ = any(
-            not np.array_equal(a, b) for a, b in zip(outs_runs[0], outs_runs[1], strict=True)
+            not np.array_equal(a, b)
+            for a, b in zip(outs_runs[0], outs_runs[1], strict=True)
         )
         assert differ, "outputs identical across two distinct inputs (degenerate)"
 
@@ -356,24 +359,19 @@ def test_explicit_weight_activation_type_override_precision(
         r = _invoke(
             runner,
             [
-                "-m",
-                str(tiny_onnx),
-                "-o",
-                str(out),
-                "--precision",
-                "int8",
-                "--weight-type",
-                "int8",
-                "--activation-type",
-                "uint8",
-                "--samples",
-                "4",
+                "-m", str(tiny_onnx), "-o", str(out),
+                "--precision", "int8",
+                "--weight-type", "int8",
+                "--activation-type", "uint8",
+                "--samples", "4",
             ],
         )
         model = _assert_quantized_output(input_onnx=tiny_onnx, output_onnx=out, stdout=r.output)
         assert _weight_dq_zero_point_dtype(model) == onnx.TensorProto.INT8
 
-    def test_non_quant_precision_rejected(self, runner: CliRunner, tiny_onnx: Path, tmp_path: Path):
+    def test_non_quant_precision_rejected(
+        self, runner: CliRunner, tiny_onnx: Path, tmp_path: Path
+    ):
         """Float precisions like fp16 must be rejected at CLI parse time.
 
         Replaces the legacy ``test_unknown_precision_falls_back_to_uint8`` which
@@ -446,15 +444,10 @@ def test_symmetric_int8_zero_point_is_zero(
         r = _invoke(
             runner,
             [
-                "-m",
-                str(tiny_onnx),
-                "-o",
-                str(out),
+                "-m", str(tiny_onnx), "-o", str(out),
                 "--symmetric",
-                "--weight-type",
-                "int8",
-                "--samples",
-                "4",
+                "--weight-type", "int8",
+                "--samples", "4",
             ],
         )
         model = _assert_quantized_output(input_onnx=tiny_onnx, output_onnx=out, stdout=r.output)
@@ -482,15 +475,8 @@ def test_task_random_uses_random_dataset(
         r = _invoke(
             runner,
             [
-                "-m",
-                str(tiny_onnx),
-                "-o",
-                str(out),
-                "--task",
-                "random",
-                "--samples",
-                "4",
-                "-v",
+                "-m", str(tiny_onnx), "-o", str(out),
+                "--task", "random", "--samples", "4", "-v",
             ],
         )
         _assert_quantized_output(input_onnx=tiny_onnx, output_onnx=out, stdout=r.output)
@@ -504,17 +490,10 @@ def test_task_image_classification_dataset(
         r = _invoke(
             runner,
             [
-                "-m",
-                str(onnx_imgcls),
-                "-o",
-                str(out),
-                "--task",
-                "image-classification",
-                "--model-name",
-                "microsoft/resnet-50",
-                "--samples",
-                "4",
-                "-v",
+                "-m", str(onnx_imgcls), "-o", str(out),
+                "--task", "image-classification",
+                "--model-name", "microsoft/resnet-50",
+                "--samples", "4", "-v",
             ],
         )
         _assert_quantized_output(input_onnx=onnx_imgcls, output_onnx=out, stdout=r.output)
@@ -528,17 +507,10 @@ def test_task_text_classification_dataset(
         r = _invoke(
             runner,
             [
-                "-m",
-                str(onnx_txtcls),
-                "-o",
-                str(out),
-                "--task",
-                "text-classification",
-                "--model-name",
-                "Intel/bert-base-uncased-mrpc",
-                "--samples",
-                "4",
-                "-v",
+                "-m", str(onnx_txtcls), "-o", str(out),
+                "--task", "text-classification",
+                "--model-name", "Intel/bert-base-uncased-mrpc",
+                "--samples", "4", "-v",
             ],
         )
         _assert_quantized_output(input_onnx=onnx_txtcls, output_onnx=out, stdout=r.output)
@@ -552,21 +524,16 @@ def test_task_object_detection_dataset(
         r = _invoke(
             runner,
             [
-                "-m",
-                str(onnx_objdet),
-                "-o",
-                str(out),
-                "--task",
-                "object-detection",
-                "--model-name",
-                "hustvl/yolos-small",
-                "--samples",
-                "4",
-                "-v",
+                "-m", str(onnx_objdet), "-o", str(out),
+                "--task", "object-detection",
+                "--model-name", "hustvl/yolos-small",
+                "--samples", "4", "-v",
             ],
         )
         _assert_quantized_output(input_onnx=onnx_objdet, output_onnx=out, stdout=r.output)
-        assert "Creating object-detection dataset with ObjectDetectionDataset" in r.output, r.output
+        assert (
+            "Creating object-detection dataset with ObjectDetectionDataset" in r.output
+        ), r.output
 
     @pytest.mark.network
     def test_task_image_segmentation_dataset(
@@ -576,23 +543,16 @@ def test_task_image_segmentation_dataset(
         r = _invoke(
             runner,
             [
-                "-m",
-                str(onnx_imgseg),
-                "-o",
-                str(out),
-                "--task",
-                "image-segmentation",
-                "--model-name",
-                "nvidia/segformer-b0-finetuned-ade-512-512",
-                "--samples",
-                "4",
-                "-v",
+                "-m", str(onnx_imgseg), "-o", str(out),
+                "--task", "image-segmentation",
+                "--model-name", "nvidia/segformer-b0-finetuned-ade-512-512",
+                "--samples", "4", "-v",
             ],
         )
         _assert_quantized_output(input_onnx=onnx_imgseg, output_onnx=out, stdout=r.output)
-        assert "Creating image-segmentation dataset with ImageSegmentationDataset" in r.output, (
-            r.output
-        )
+        assert (
+            "Creating image-segmentation dataset with ImageSegmentationDataset" in r.output
+        ), r.output
 
     def test_unsupported_task_falls_back_to_random_dataset(
         self, runner: CliRunner, tiny_onnx: Path, tmp_path: Path
@@ -601,14 +561,9 @@ def test_unsupported_task_falls_back_to_random_dataset(
         r = _invoke(
             runner,
             [
-                "-m",
-                str(tiny_onnx),
-                "-o",
-                str(out),
-                "--task",
-                "automatic-speech-recognition",
-                "--samples",
-                "4",
+                "-m", str(tiny_onnx), "-o", str(out),
+                "--task", "automatic-speech-recognition",
+                "--samples", "4",
             ],
         )
         _assert_quantized_output(input_onnx=tiny_onnx, output_onnx=out, stdout=r.output)
@@ -630,21 +585,16 @@ def test_image_feature_extraction_uses_image_dataset(
         r = _invoke(
             runner,
             [
-                "-m",
-                str(onnx_dinov2),
-                "-o",
-                str(out),
-                "--task",
-                "image-feature-extraction",
-                "--model-name",
-                "facebook/dinov2-small",
-                "--samples",
-                "4",
-                "-v",
+                "-m", str(onnx_dinov2), "-o", str(out),
+                "--task", "image-feature-extraction",
+                "--model-name", "facebook/dinov2-small",
+                "--samples", "4", "-v",
             ],
         )
         _assert_quantized_output(input_onnx=onnx_dinov2, output_onnx=out, stdout=r.output)
-        assert "Creating image-feature-extraction dataset with ImageDataset" in r.output, r.output
+        assert (
+            "Creating image-feature-extraction dataset with ImageDataset" in r.output
+        ), r.output
 
 
 # ===========================================================================
@@ -689,7 +639,9 @@ def test_external_data_sidecar_written(
         out_dir = tmp_path / "out_ext"
         out_dir.mkdir()
         out = out_dir / "quant_ext.onnx"
-        r = _invoke(runner, ["-m", str(tiny_onnx_external), "-o", str(out), "--samples", "4"])
+        r = _invoke(
+            runner, ["-m", str(tiny_onnx_external), "-o", str(out), "--samples", "4"]
+        )
         assert out.exists()
         assert (out_dir / f"{out.name}.data").exists()
         _assert_quantized_output(input_onnx=tiny_onnx_external, output_onnx=out, stdout=r.output)
@@ -721,14 +673,9 @@ def test_cli_samples_overrides_config_and_config_method_used(
         r = _invoke(
             runner,
             [
-                "-m",
-                str(tiny_onnx),
-                "-o",
-                str(out),
-                "--config",
-                str(bc),
-                "--samples",
-                "4",
+                "-m", str(tiny_onnx), "-o", str(out),
+                "--config", str(bc),
+                "--samples", "4",
             ],
         )
         _assert_quantized_output(input_onnx=tiny_onnx, output_onnx=out, stdout=r.output)
@@ -749,16 +696,10 @@ def test_cli_precision_wins_over_empty_config(
         r = _invoke(
             runner,
             [
-                "-m",
-                str(tiny_onnx),
-                "-o",
-                str(out),
-                "--config",
-                str(bc),
-                "--precision",
-                "int16",
-                "--samples",
-                "4",
+                "-m", str(tiny_onnx), "-o", str(out),
+                "--config", str(bc),
+                "--precision", "int16",
+                "--samples", "4",
             ],
         )
         # uint16 activations may not run on CPU EP â€” skip S7/S9
@@ -822,9 +763,10 @@ def test_malformed_onnx_input(self, runner: CliRunner, tmp_path: Path):
         assert "Quantization failed" in r.output
         # Must surface a parse-related cause, not just the generic prefix.
         lowered = r.output.lower()
-        assert any(kw in lowered for kw in ("parse", "protobuf", "decode", "load", "invalid")), (
-            f"expected parse-related cause in output, got:\n{r.output}"
-        )
+        assert any(
+            kw in lowered
+            for kw in ("parse", "protobuf", "decode", "load", "invalid")
+        ), f"expected parse-related cause in output, got:\n{r.output}"
 
 
 # ===========================================================================
@@ -838,7 +780,9 @@ class TestConfigPrecedenceSweep:
     Verifies via structural inspection of the produced model, not stdout.
     """
 
-    def test_weight_type_from_config(self, runner: CliRunner, tiny_onnx: Path, tmp_path: Path):
+    def test_weight_type_from_config(
+        self, runner: CliRunner, tiny_onnx: Path, tmp_path: Path
+    ):
         bc = tmp_path / "bc.json"
         _write_build_config(bc, {"weight_type": "int8"})
         out = tmp_path / "f3a.onnx"
@@ -849,7 +793,9 @@ def test_weight_type_from_config(self, runner: CliRunner, tiny_onnx: Path, tmp_p
         model = _assert_quantized_output(input_onnx=tiny_onnx, output_onnx=out, stdout=r.output)
         assert _weight_dq_zero_point_dtype(model) == onnx.TensorProto.INT8
 
-    def test_per_channel_from_config(self, runner: CliRunner, tiny_onnx: Path, tmp_path: Path):
+    def test_per_channel_from_config(
+        self, runner: CliRunner, tiny_onnx: Path, tmp_path: Path
+    ):
         bc = tmp_path / "bc.json"
         _write_build_config(bc, {"per_channel": True})
         out = tmp_path / "f3b.onnx"
@@ -874,7 +820,9 @@ def test_per_channel_from_config(self, runner: CliRunner, tiny_onnx: Path, tmp_p
                     break
         assert has_vector, "per_channel from config not applied (scales are scalar)"
 
-    def test_symmetric_from_config(self, runner: CliRunner, tiny_onnx: Path, tmp_path: Path):
+    def test_symmetric_from_config(
+        self, runner: CliRunner, tiny_onnx: Path, tmp_path: Path
+    ):
         bc = tmp_path / "bc.json"
         # symmetric only unambiguously yields zp==0 with int8 weights
         _write_build_config(bc, {"symmetric": True, "weight_type": "int8"})
@@ -894,7 +842,9 @@ def test_symmetric_from_config(self, runner: CliRunner, tiny_onnx: Path, tmp_pat
                 arr = onnx.numpy_helper.to_array(init)
                 assert np.all(arr == 0), f"symmetric from config not applied; zp={arr}"
 
-    def test_task_from_config(self, runner: CliRunner, tiny_onnx: Path, tmp_path: Path):
+    def test_task_from_config(
+        self, runner: CliRunner, tiny_onnx: Path, tmp_path: Path
+    ):
         """task='automatic-speech-recognition' from config must trigger fallback warning."""
         bc = tmp_path / "bc.json"
         _write_build_config(bc, {"task": "automatic-speech-recognition"})
@@ -915,7 +865,9 @@ def test_task_from_config(self, runner: CliRunner, tiny_onnx: Path, tmp_path: Pa
 
 
 class TestVerbose:
-    def test_verbose_emits_more_output(self, runner: CliRunner, tiny_onnx: Path, tmp_path: Path):
+    def test_verbose_emits_more_output(
+        self, runner: CliRunner, tiny_onnx: Path, tmp_path: Path
+    ):
         out_q = tmp_path / "quiet.onnx"
         out_v = tmp_path / "verbose.onnx"
         r_quiet = _invoke(runner, ["-m", str(tiny_onnx), "-o", str(out_q), "--samples", "4"])
@@ -926,3 +878,4 @@ def test_verbose_emits_more_output(self, runner: CliRunner, tiny_onnx: Path, tmp
             f"verbose did not increase output\n--- quiet ---\n{r_quiet.output}\n"
             f"--- verbose ---\n{r_verbose.output}"
         )
+

From 93a19a3421b9840d4099f9b7a16343005e51df1b Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Mon, 8 Jun 2026 15:02:20 +0800
Subject: [PATCH 087/143] docs: exclude internal files from build, gitignore
 versions.json

---
 .gitignore         | 1 +
 docs/versions.json | 3 ---
 mkdocs.yml         | 2 ++
 3 files changed, 3 insertions(+), 3 deletions(-)
 delete mode 100644 docs/versions.json

diff --git a/.gitignore b/.gitignore
index 6d8e97985..0dc405bae 100644
--- a/.gitignore
+++ b/.gitignore
@@ -264,3 +264,4 @@ specs/
 
 # Runtime check rule artifacts (hosted in external repo)
 src/winml/modelkit/analyze/rules/runtime_check_rules/**/*.parquet
+docs/versions.json
diff --git a/docs/versions.json b/docs/versions.json
deleted file mode 100644
index 7b51a8ab9..000000000
--- a/docs/versions.json
+++ /dev/null
@@ -1,3 +0,0 @@
-[
-  {"version": "0.1", "title": "0.1", "aliases": ["latest"]}
-]
diff --git a/mkdocs.yml b/mkdocs.yml
index ba0bc373c..fdde06f82 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -9,6 +9,8 @@ docs_dir: docs
 
 exclude_docs: |
   /design/
+  /naming-convention.md
+  /pytest-best-practices.md
 
 extra:
   version:

From 606cd267ddafec54ad370e54e9f805400815c2f8 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Mon, 8 Jun 2026 15:57:07 +0800
Subject: [PATCH 088/143] docs: generalize NPU prerequisites (not
 Qualcomm-only, QAIRT SDK only needed for --compiler qairt)

---
 docs/getting-started/end-to-end.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/getting-started/end-to-end.md b/docs/getting-started/end-to-end.md
index 7d387c8fa..8f6d27dc2 100644
--- a/docs/getting-started/end-to-end.md
+++ b/docs/getting-started/end-to-end.md
@@ -20,11 +20,11 @@ reading from that device.
 - winml-cli installed (see [Installation](installation.md))
 
 !!! note "NPU users only"
-    To target the Qualcomm NPU you also need:
+    To target an NPU you also need:
 
-    - A Qualcomm Snapdragon X device
-    - QAIRT SDK installed; `QNN_SDK_ROOT` env var pointing at it
-    - `--extra qnn` installed (Python 3.11+)
+    - A device with an NPU (e.g., Qualcomm Snapdragon X, Intel Core Ultra)
+    - `--extra qnn` installed (for Qualcomm NPU, Python 3.11+)
+    - Only if using `--compiler qairt`: QAIRT SDK installed with `QNN_SDK_ROOT` env var set. The default `--compiler ort` does not require any external SDK.
 
     Everything else on this page works without these.
 

From 044893a4f0920da8e3043f4742269e23f3829958 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 10:42:13 +0800
Subject: [PATCH 089/143] docs: address review comments on perf pages

- Fix monitor description: NPU / GPU utilization (not just NPU)
- Remove awkward 'no --compare-devices flag' phrasing
- Mention --output in perf-and-monitoring concept page
---
 docs/commands/perf.md                | 4 ++--
 docs/concepts/perf-and-monitoring.md | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/commands/perf.md b/docs/commands/perf.md
index f070469ff..b4d3948ef 100644
--- a/docs/commands/perf.md
+++ b/docs/commands/perf.md
@@ -35,7 +35,7 @@ $ winml perf [options]
 
 ## How it works
 
-`winml perf` loads the model through `WinMLAutoModel` — accepting both HuggingFace IDs and local ONNX files — then generates random input tensors from the model's I/O configuration. It runs the specified number of warm-up iterations (excluded from statistics) followed by the timed iterations, collecting per-sample latency. The final report includes mean, min, max, P50, P90, P95, P99, standard deviation, and throughput in samples per second. When `--monitor` is active, a hardware polling loop runs in parallel and records NPU utilization, CPU usage, and device memory alongside the timing data.
+`winml perf` loads the model through `WinMLAutoModel` — accepting both HuggingFace IDs and local ONNX files — then generates random input tensors from the model's I/O configuration. It runs the specified number of warm-up iterations (excluded from statistics) followed by the timed iterations, collecting per-sample latency. The final report includes mean, min, max, P50, P90, P95, P99, standard deviation, and throughput in samples per second. When `--monitor` is active, a hardware polling loop runs in parallel and records NPU / GPU utilization, CPU usage, and device memory alongside the timing data.
 
 ## Examples
 
@@ -91,7 +91,7 @@ $ winml perf -m bert-base-uncased --module BertAttention --iterations 200
 - **`--shape-config` is silently ignored in two cases.** It has no effect on pre-exported ONNX files (shapes are baked into the graph) and is ignored in `--module` mode. The command prints a warning in both situations.
 - **`--op-tracing` requires `onnxruntime-qnn`.** The flag activates the QNN profiler, which is only present in the `onnxruntime-qnn` package. If that package is not installed, the benchmark still runs but the op-trace step exits with an error.
 - **Random inputs do not represent real data distributions.** Latency numbers are accurate, but memory access patterns may differ from production because the generated tensors are uniform random values. For memory-bandwidth-sensitive models this can understate real-world latency.
-- **Cross-device comparison.** There is no `--compare-devices` flag. To compare performance across devices, run `winml perf` separately with different `--device` values and compare the resulting JSON files.
+- **Cross-device comparison.** To compare performance across devices, run `winml perf` separately with different `--device` values and compare the resulting JSON reports.
 
 ## See also
 
diff --git a/docs/concepts/perf-and-monitoring.md b/docs/concepts/perf-and-monitoring.md
index 511fb791f..3f96ced93 100644
--- a/docs/concepts/perf-and-monitoring.md
+++ b/docs/concepts/perf-and-monitoring.md
@@ -10,7 +10,7 @@ At its core, `winml perf` runs a configurable number of inference iterations and
 
 You can control the run length with `--iterations` and the input shape with `--batch-size` or a `--shape-config` JSON file for models with dynamic axes. The `--device` flag selects the target EP — `cpu`, `gpu`, `npu`, or `auto` (default) — allowing you to collect numbers on each target with the same command and compare them directly. For fine-grained EP control, `--ep` lets you name a specific provider such as `qnn` or `dml`.
 
-The results are written to a JSON file at `~/.cache/winml/perf/<slug>/<timestamp>.json` so they can be archived and compared across builds.
+The results are written to a JSON file at `~/.cache/winml/perf/<slug>/<timestamp>.json` (or a custom path via `--output`) so they can be archived and compared across builds.
 
 ## Live monitoring
 

From 60d61e21eecd6b8bba80c54978874fe40ad5d757 Mon Sep 17 00:00:00 2001
From: ssss141414 <407748083@qq.com>
Date: Tue, 9 Jun 2026 12:41:01 +0800
Subject: [PATCH 090/143]  Fix: link 404 in doc (#840)

Fix link 404 in doc

---------

Co-authored-by: Zhenchao Ni <zhenni@microsoft.com>
Co-authored-by: xieofxie <xieofxie@126.com>
Co-authored-by: hualxie <hualxie@microsoft.com>
Co-authored-by: Yue Sun <yuesu@microsoft.com>
---
 docs/getting-started/agent-skill.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/getting-started/agent-skill.md b/docs/getting-started/agent-skill.md
index 756bd766d..518a77c24 100644
--- a/docs/getting-started/agent-skill.md
+++ b/docs/getting-started/agent-skill.md
@@ -26,7 +26,7 @@ The skill teaches the agent:
 
 ### With GitHub Copilot Coding Agent
 
-The [Copilot Coding Agent](https://docs.github.com/en/copilot/using-github-copilot/using-the-copilot-coding-agent)
+The [Copilot Coding Agent](https://docs.github.com/en/copilot/how-tos/copilot-on-github/use-copilot-agents/overview)
 (the cloud agent that creates PRs) automatically reads `skills/use-winml-cli/SKILL.md`
 when working on this repository. No setup needed — assign an issue or ask
 Copilot to build/optimize a model and it will follow the skill's guidance to

From 958178b1d9bca7b1c1c63c2d0a678096c245653c Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 14:00:46 +0800
Subject: [PATCH 091/143] docs: address timenick review comments

- Fix sys.md example: use realistic Snapdragon device/EP combo
- installation.md: recommend PyPI install, source install as tip
- installation.md: collapse optional extras into expandable section
- CONTRIBUTING.md: add full clone+sync+rules setup steps
---
 CONTRIBUTING.md                      | 14 +++++++++-
 docs/commands/sys.md                 |  6 ++--
 docs/getting-started/installation.md | 41 ++++++++++++++++------------
 3 files changed, 40 insertions(+), 21 deletions(-)

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index ff4ce5e10..312f01a57 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -14,7 +14,9 @@ We're always looking for your help to improve the product (bug fixes, new featur
 See the [README](./README.md#getting-started) for prerequisites and installation instructions. Then set up your development environment:
 
 ```bash
-uv sync
+git clone https://github.com/microsoft/winml-cli.git
+cd winml-cli
+uv sync --extra dev
 uv run pre-commit install
 ```
 
@@ -24,6 +26,16 @@ This installs all dependencies and enables [pre-commit hooks](https://pre-commit
 
 When running WinML CLI from a source tree (`uv run winml ...`), you need to populate the runtime check rule zips locally. See [`src/winml/modelkit/analyze/rules/runtime_check_rules/README.md`](./src/winml/modelkit/analyze/rules/runtime_check_rules/README.md) for setup options (GitHub release for external contributors, `gim-home` script for Microsoft internal, `WINMLCLI_RULES_DIR` override).
 
+For external contributors, download from a GitHub release:
+
+```bash
+gh release download <tag> --repo microsoft/winml-cli --pattern 'rules-v*.zip' --dir .
+# Windows:
+Expand-Archive -Path .\rules-v*.zip -DestinationPath src\winml\modelkit\analyze\rules\runtime_check_rules -Force
+# Linux/macOS:
+# unzip -o rules-v*.zip -d src/winml/modelkit/analyze/rules/runtime_check_rules
+```
+
 ## Coding conventions and standards
 
 ### Python code style
diff --git a/docs/commands/sys.md b/docs/commands/sys.md
index 3cbdd1e08..442e83012 100644
--- a/docs/commands/sys.md
+++ b/docs/commands/sys.md
@@ -64,9 +64,9 @@ ML Libraries
   ...
 
 Available Devices (priority order)
-  #1  NPU   Qualcomm(R) AI 100
-  #2  GPU   NVIDIA GeForce RTX 4090
-  #3  CPU   AMD Ryzen 9 7940HS
+  #1  NPU   Qualcomm(R) Hexagon NPU
+  #2  GPU   Qualcomm(R) Adreno GPU
+  #3  CPU   Snapdragon(R) X Elite
 
 Available Execution Providers
   QNNExecutionProvider           -> NPU
diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md
index 364c4791e..589f5c9b3 100644
--- a/docs/getting-started/installation.md
+++ b/docs/getting-started/installation.md
@@ -16,18 +16,25 @@
 ## Install
 
 ```bash
-git clone https://github.com/microsoft/winml-cli.git
-cd winml-cli
 uv python install 3.11
-uv sync
+uv pip install winml-cli
 ```
 
-Cloning the repository pulls down all source code and configuration. `uv python install 3.11` downloads and pins the exact Python version the project requires. `uv sync` creates an isolated virtual environment and installs all declared dependencies from `pyproject.toml` in a single step. No separate `pip install` or manual venv activation is needed.
+`uv python install 3.11` downloads and pins the exact Python version the project requires. `uv pip install winml-cli` installs the latest release from PyPI into a managed environment. No separate venv activation is needed.
+
+!!! tip "Install from source (for development)"
+    If you want to contribute or run the latest unreleased code:
+
+    ```bash
+    git clone https://github.com/microsoft/winml-cli.git
+    cd winml-cli
+    uv sync
+    ```
 
 ## Verify
 
 ```bash
-uv run winml sys
+winml sys
 ```
 
 Expected output (abbreviated):
@@ -60,24 +67,24 @@ Available Execution Providers
 
 This command enumerates available compute devices and execution providers on your machine. If an expected device or SDK is missing, `winml sys` is the right place to diagnose it. See [winml sys](../commands/sys.md) for the full flag reference and troubleshooting tips.
 
-## Optional extras
+??? note "Optional extras (hardware-specific backends)"
 
-Two optional dependency groups are available for hardware-specific backends:
+    Two optional dependency groups are available:
 
-- `--extra openvino` — installs [OpenVINO](https://docs.openvino.ai/) for inference on Intel CPU and GPU targets.
-- `--extra qnn` — installs `onnxruntime-qnn` for Qualcomm NPU support.
+    - `--extra openvino` — installs [OpenVINO](https://docs.openvino.ai/) for inference on Intel CPU and GPU targets.
+    - `--extra qnn` — installs `onnxruntime-qnn` for Qualcomm NPU support.
 
-To install an extra:
+    To install an extra:
 
-```bash
-uv sync --extra openvino
-```
+    ```bash
+    uv pip install winml-cli[qnn]
+    ```
 
-Both extras can be combined:
+    Both extras can be combined:
 
-```bash
-uv sync --extra openvino --extra qnn
-```
+    ```bash
+    uv pip install winml-cli[openvino,qnn]
+    ```
 
 ## Next steps
 

From 81e54fbcd68b2283475ae5825669cb09b3e1b0d1 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 15:00:21 +0800
Subject: [PATCH 092/143] docs: rename 'Build Config Schema' to 'Config Schema'
 (it's general-purpose)

---
 docs/reference/index.md         | 2 +-
 docs/reference/output-layout.md | 2 +-
 docs/reference/python-api.md    | 2 +-
 mkdocs.yml                      | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/reference/index.md b/docs/reference/index.md
index 23c169ecb..5c3dd8c71 100644
--- a/docs/reference/index.md
+++ b/docs/reference/index.md
@@ -1,4 +1,4 @@
-# Reference — Build Configuration Schema
+# Reference — Config Schema
 
 This page documents the full schema for `WinMLBuildConfig`, the JSON configuration
 file that drives `winml build` and related commands. Generate a config with
diff --git a/docs/reference/output-layout.md b/docs/reference/output-layout.md
index c560ae8f5..d2aee17cd 100644
--- a/docs/reference/output-layout.md
+++ b/docs/reference/output-layout.md
@@ -233,5 +233,5 @@ Key fields:
 ## See also
 
 - [winml build](../commands/build.md) — build command reference
-- [Reference — Build Configuration Schema](index.md) — config file format
+- [Reference — Config Schema](index.md) — config file format
 - [How winml-cli Works](../concepts/how-it-works.md) — pipeline stages explained
diff --git a/docs/reference/python-api.md b/docs/reference/python-api.md
index be80045bd..8b76acb95 100644
--- a/docs/reference/python-api.md
+++ b/docs/reference/python-api.md
@@ -253,6 +253,6 @@ print(f"P99 latency: {stats.p99_ms:.2f} ms")
 
 ## See also
 
-- [Reference — Build Configuration Schema](index.md) — full config field reference
+- [Reference — Config Schema](index.md) — full config field reference
 - [winml build](../commands/build.md) — CLI equivalent
 - [How winml-cli Works](../concepts/how-it-works.md) — pipeline overview
diff --git a/mkdocs.yml b/mkdocs.yml
index fdde06f82..a2e5a2acb 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -126,7 +126,7 @@ nav:
       - ConvNeXt on NPU: tutorials/npu-convnext.md
       - Bring Your Own ONNX Model: tutorials/build-from-onnx.md
   - Reference:
-      - Build Config Schema: reference/index.md
+      - Config Schema: reference/index.md
       - Python API: reference/python-api.md
       - Output Layout: reference/output-layout.md
       - Supported Models: reference/supported-models.md

From c42702a8447aca8a470da32ee5ae89de61e0a16f Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 15:02:01 +0800
Subject: [PATCH 093/143] docs: move Python API to last in Reference nav

---
 mkdocs.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mkdocs.yml b/mkdocs.yml
index a2e5a2acb..61d086e59 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -127,9 +127,9 @@ nav:
       - Bring Your Own ONNX Model: tutorials/build-from-onnx.md
   - Reference:
       - Config Schema: reference/index.md
-      - Python API: reference/python-api.md
       - Output Layout: reference/output-layout.md
       - Supported Models: reference/supported-models.md
+      - Python API: reference/python-api.md
   - Troubleshooting: troubleshooting.md
   - Contributing: contributing.md
   - Privacy: Privacy.md

From f5b7ecb6f4e3673026b86e53a229b6f85ad3cbbe Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 15:16:09 +0800
Subject: [PATCH 094/143] docs: fix Qwen3->CLIP in samples description; add
 doc-check extension

- Fix index.md: samples are ConvNeXt, BERT, and CLIP (not Qwen3)
- Add .github/extensions/doc-check with 3 tools:
  - doc_check_consistency: broken links, orphan files, nav mismatches
  - doc_check_code_alignment: CLI flags/EPs/config vs source code
  - doc_check_samples: sample commands use documented flags
---
 .github/extensions/doc-check/extension.mjs | 274 +++++++++++++++++++++
 docs/index.md                              |   2 +-
 2 files changed, 275 insertions(+), 1 deletion(-)
 create mode 100644 .github/extensions/doc-check/extension.mjs

diff --git a/.github/extensions/doc-check/extension.mjs b/.github/extensions/doc-check/extension.mjs
new file mode 100644
index 000000000..c82a27583
--- /dev/null
+++ b/.github/extensions/doc-check/extension.mjs
@@ -0,0 +1,274 @@
+// Extension: doc-check
+// Verify docs are self-consistent, cross-referenced correctly, and match source code
+
+import { joinSession } from "@github/copilot-sdk/extension";
+import { readFile } from "fs/promises";
+import { resolve, join, dirname } from "path";
+import { glob } from "fs/promises";
+
+const ROOT = resolve(import.meta.dirname, "../../..");
+const DOCS_DIR = join(ROOT, "docs");
+const SRC_DIR = join(ROOT, "src");
+
+async function readTextFile(path) {
+    try {
+        return await readFile(path, "utf-8");
+    } catch {
+        return null;
+    }
+}
+
+async function findFiles(pattern, cwd) {
+    const results = [];
+    for await (const entry of glob(pattern, { cwd })) {
+        results.push(entry);
+    }
+    return results;
+}
+
+const session = await joinSession({
+    tools: [
+        {
+            name: "doc_check_consistency",
+            description:
+                "Audit the docs/ folder for internal consistency issues: broken cross-references between markdown files, nav entries in mkdocs.yml that point to missing files, and markdown files in docs/ that are not listed in nav or exclude_docs. Returns a report of findings.",
+            parameters: { type: "object", properties: {} },
+            skipPermission: true,
+            handler: async () => {
+                const findings = [];
+
+                // 1. Read mkdocs.yml nav and check all referenced files exist
+                const mkdocsContent = await readTextFile(join(ROOT, "mkdocs.yml"));
+                if (!mkdocsContent) return "Error: mkdocs.yml not found";
+
+                const navFileRefs = [...mkdocsContent.matchAll(/:\s+(\S+\.md)/g)].map((m) => m[1]);
+                for (const ref of navFileRefs) {
+                    const fullPath = join(DOCS_DIR, ref);
+                    const content = await readTextFile(fullPath);
+                    if (content === null) {
+                        findings.push(`[NAV_MISSING_FILE] mkdocs.yml references '${ref}' but file does not exist`);
+                    }
+                }
+
+                // 2. Check internal markdown links in all docs files
+                const mdFiles = await findFiles("**/*.md", DOCS_DIR);
+                for (const relPath of mdFiles) {
+                    const fullPath = join(DOCS_DIR, relPath);
+                    const content = await readTextFile(fullPath);
+                    if (!content) continue;
+
+                    // Find markdown links like [text](../path/to/file.md) or [text](./file.md#anchor)
+                    const linkPattern = /\[([^\]]*)\]\(([^)]+)\)/g;
+                    let match;
+                    while ((match = linkPattern.exec(content)) !== null) {
+                        const target = match[2];
+                        // Skip external URLs and anchors-only
+                        if (target.startsWith("http") || target.startsWith("#") || target.startsWith("mailto:")) continue;
+                        // Strip anchor
+                        const filePart = target.split("#")[0];
+                        if (!filePart) continue;
+                        // Resolve relative to current file's directory
+                        const resolvedPath = resolve(dirname(fullPath), filePart);
+                        const exists = await readTextFile(resolvedPath);
+                        if (exists === null) {
+                            findings.push(`[BROKEN_LINK] ${relPath}: link to '${filePart}' resolves to non-existent file`);
+                        }
+                    }
+                }
+
+                // 3. Check for md files not in nav and not in exclude_docs
+                const excludeMatch = mkdocsContent.match(/exclude_docs:\s*\|([\s\S]*?)(?=\n\S|\n*$)/);
+                const excludePatterns = excludeMatch
+                    ? excludeMatch[1]
+                        .split("\n")
+                        .map((l) => l.trim())
+                        .filter(Boolean)
+                    : [];
+                const navSet = new Set(navFileRefs);
+                for (const relPath of mdFiles) {
+                    const normalized = relPath.replace(/\\/g, "/");
+                    if (navSet.has(normalized)) continue;
+                    if (normalized === "index.md") continue;
+                    // Check exclude patterns (simple glob: just filename or path prefix)
+                    const excluded = excludePatterns.some((pat) => {
+                        const cleanPat = pat.replace(/^\//, "");
+                        return normalized === cleanPat || normalized.startsWith(cleanPat);
+                    });
+                    if (excluded) continue;
+                    findings.push(`[ORPHAN_FILE] ${normalized} is not in mkdocs.yml nav or exclude_docs`);
+                }
+
+                if (findings.length === 0) return "✅ No consistency issues found.";
+                return `Found ${findings.length} issue(s):\n\n${findings.join("\n")}`;
+            },
+        },
+        {
+            name: "doc_check_code_alignment",
+            description:
+                "Cross-reference documentation claims against source code. Checks: CLI flag names documented in command pages match actual Click/Typer parameter definitions in src/; EP names and device mappings in docs match source; config schema fields match the WinMLBuildConfig dataclass. Returns mismatches.",
+            parameters: {
+                type: "object",
+                properties: {
+                    scope: {
+                        type: "string",
+                        description: "Which aspect to check: 'flags' (CLI flags vs source), 'eps' (EP table vs source), 'config' (config schema fields vs dataclass), or 'all'",
+                        enum: ["flags", "eps", "config", "all"],
+                    },
+                },
+            },
+            skipPermission: true,
+            handler: async (args) => {
+                const scope = args.scope || "all";
+                const findings = [];
+
+                // Helper: find Python files containing a pattern
+                async function searchSrc(pattern) {
+                    const pyFiles = await findFiles("**/*.py", SRC_DIR);
+                    const results = [];
+                    for (const f of pyFiles) {
+                        const content = await readTextFile(join(SRC_DIR, f));
+                        if (content && content.includes(pattern)) {
+                            results.push({ file: f, content });
+                        }
+                    }
+                    return results;
+                }
+
+                if (scope === "flags" || scope === "all") {
+                    // Check command docs for flags and verify they exist in source
+                    const cmdFiles = await findFiles("*.md", join(DOCS_DIR, "commands"));
+                    for (const cmdFile of cmdFiles) {
+                        const content = await readTextFile(join(DOCS_DIR, "commands", cmdFile));
+                        if (!content) continue;
+                        // Extract flags from markdown tables: | `--flag-name` |
+                        const flagPattern = /\|\s*`(--[\w-]+)`/g;
+                        let match;
+                        const docFlags = [];
+                        while ((match = flagPattern.exec(content)) !== null) {
+                            docFlags.push(match[1]);
+                        }
+                        if (docFlags.length === 0) continue;
+
+                        // Try to find the command source file
+                        const cmdName = cmdFile.replace(".md", "");
+                        const srcFiles = await searchSrc(`def ${cmdName}`);
+                        if (srcFiles.length === 0) continue;
+
+                        // Check each documented flag exists in source (as click option or argument)
+                        const srcContent = srcFiles.map((s) => s.content).join("\n");
+                        for (const flag of docFlags) {
+                            const paramName = flag.replace(/^--/, "").replace(/-/g, "_");
+                            const altName = flag; // --flag-name form
+                            if (!srcContent.includes(paramName) && !srcContent.includes(altName)) {
+                                findings.push(`[FLAG_NOT_IN_SRC] ${cmdFile}: '${flag}' not found in source for '${cmdName}' command`);
+                            }
+                        }
+                    }
+                }
+
+                if (scope === "eps" || scope === "all") {
+                    // Check EP table in docs/concepts/eps-and-devices.md
+                    const epDoc = await readTextFile(join(DOCS_DIR, "concepts", "eps-and-devices.md"));
+                    if (epDoc) {
+                        const epPattern = /\|\s*`(\w+ExecutionProvider)`\s*\|\s*`(\w+)`/g;
+                        let match;
+                        while ((match = epPattern.exec(epDoc)) !== null) {
+                            const epName = match[1];
+                            const shortName = match[2];
+                            // Verify EP short name exists somewhere in source
+                            const srcHits = await searchSrc(shortName);
+                            if (srcHits.length === 0) {
+                                findings.push(`[EP_SHORT_NAME_MISSING] EP '${epName}' short name '${shortName}' not found in source`);
+                            }
+                        }
+                    }
+                }
+
+                if (scope === "config" || scope === "all") {
+                    // Check config schema fields in docs/reference/index.md against source dataclass
+                    const refDoc = await readTextFile(join(DOCS_DIR, "reference", "index.md"));
+                    if (refDoc) {
+                        // Extract field names from table rows: | `field_name` |
+                        const fieldPattern = /\|\s*`([\w.]+)`\s*\|/g;
+                        let match;
+                        const docFields = new Set();
+                        while ((match = fieldPattern.exec(refDoc)) !== null) {
+                            docFields.add(match[1].split(".").pop()); // Get leaf field name
+                        }
+                        // Find WinMLBuildConfig in source
+                        const configFiles = await searchSrc("WinMLBuildConfig");
+                        if (configFiles.length > 0) {
+                            const configSrc = configFiles.map((f) => f.content).join("\n");
+                            // Check each doc field appears in source
+                            for (const field of docFields) {
+                                if (!configSrc.includes(field)) {
+                                    findings.push(`[CONFIG_FIELD_MISSING] Field '${field}' documented but not found in WinMLBuildConfig source`);
+                                }
+                            }
+                        }
+                    }
+                }
+
+                if (findings.length === 0) return "✅ Documentation aligns with source code.";
+                return `Found ${findings.length} mismatch(es):\n\n${findings.join("\n")}`;
+            },
+        },
+        {
+            name: "doc_check_samples",
+            description:
+                "Verify that sample pages (docs/samples/) use correct model IDs, command flags, and pipeline steps that match the current CLI capabilities. Checks model IDs are valid HuggingFace references and command examples use documented flags.",
+            parameters: { type: "object", properties: {} },
+            skipPermission: true,
+            handler: async () => {
+                const findings = [];
+                const sampleFiles = await findFiles("*.md", join(DOCS_DIR, "samples"));
+
+                // Load all documented flags from command pages
+                const cmdFiles = await findFiles("*.md", join(DOCS_DIR, "commands"));
+                const allFlags = new Map(); // command -> Set of flags
+                for (const cmdFile of cmdFiles) {
+                    const content = await readTextFile(join(DOCS_DIR, "commands", cmdFile));
+                    if (!content) continue;
+                    const cmdName = cmdFile.replace(".md", "");
+                    const flags = new Set();
+                    const flagPattern = /\|\s*`(--[\w-]+)`/g;
+                    let match;
+                    while ((match = flagPattern.exec(content)) !== null) {
+                        flags.add(match[1]);
+                    }
+                    allFlags.set(cmdName, flags);
+                }
+
+                for (const sampleFile of sampleFiles) {
+                    const content = await readTextFile(join(DOCS_DIR, "samples", sampleFile));
+                    if (!content) continue;
+
+                    // Check command examples use valid flags
+                    const codeBlocks = content.match(/```bash\n([\s\S]*?)```/g) || [];
+                    for (const block of codeBlocks) {
+                        // Find winml commands
+                        const cmdPattern = /winml\s+(\w+)(.*)/g;
+                        let match;
+                        while ((match = cmdPattern.exec(block)) !== null) {
+                            const cmd = match[1];
+                            const argsStr = match[2];
+                            const docFlags = allFlags.get(cmd);
+                            if (!docFlags || docFlags.size === 0) continue;
+
+                            // Extract flags used
+                            const usedFlags = argsStr.match(/--[\w-]+/g) || [];
+                            for (const flag of usedFlags) {
+                                if (!docFlags.has(flag)) {
+                                    findings.push(`[UNDOCUMENTED_FLAG] ${sampleFile}: 'winml ${cmd} ${flag}' uses flag not in docs/commands/${cmd}.md`);
+                                }
+                            }
+                        }
+                    }
+                }
+
+                if (findings.length === 0) return "✅ All sample commands use documented flags.";
+                return `Found ${findings.length} issue(s):\n\n${findings.join("\n")}`;
+            },
+        },
+    ],
+});
diff --git a/docs/index.md b/docs/index.md
index 885ebc057..ef792d612 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -24,7 +24,7 @@ winml-cli is a CLI toolkit to build portable, performant, and high-quality model
 
 - **[How winml-cli Works](concepts/how-it-works.md)** — the pipeline from a PyTorch model to an EP-compiled artifact.
 - **[Commands](commands/overview.md)** — reference for all 12 `winml` subcommands.
-- **[Samples](samples/convnext-primitives.md)** — end-to-end walkthroughs for ConvNeXt, BERT, and Qwen3.
+- **[Samples](samples/convnext-primitives.md)** — end-to-end walkthroughs for ConvNeXt, BERT, and CLIP.
 
 ## License
 

From d8182868506b0b287177ad4944385551a554050e Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 15:18:42 +0800
Subject: [PATCH 095/143] docs: simplify installation prerequisites

- Hardware: just 'Device with CPU, GPU, or NPU'
- Remove Python version pin detail
- Remove optional extras (qnn/openvino) section
---
 docs/getting-started/installation.md | 23 ++---------------------
 1 file changed, 2 insertions(+), 21 deletions(-)

diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md
index 589f5c9b3..c5039a488 100644
--- a/docs/getting-started/installation.md
+++ b/docs/getting-started/installation.md
@@ -5,8 +5,8 @@
 | Component | Details |
 |---|---|
 | Windows | Windows 11 24H2 or later (required for NPU support) |
-| Hardware | Copilot+PC with NPU (40+ TOPS recommended for NPU acceleration; CPU/DirectML works without an NPU) |
-| Python | 3.11 (the project pins `requires-python = ">=3.11,<3.12"`) |
+| Hardware | Device with CPU, GPU, or NPU |
+| Python | 3.11 |
 | Package manager | [`uv`](https://github.com/astral-sh/uv) |
 | Version control | `git` |
 
@@ -67,25 +67,6 @@ Available Execution Providers
 
 This command enumerates available compute devices and execution providers on your machine. If an expected device or SDK is missing, `winml sys` is the right place to diagnose it. See [winml sys](../commands/sys.md) for the full flag reference and troubleshooting tips.
 
-??? note "Optional extras (hardware-specific backends)"
-
-    Two optional dependency groups are available:
-
-    - `--extra openvino` — installs [OpenVINO](https://docs.openvino.ai/) for inference on Intel CPU and GPU targets.
-    - `--extra qnn` — installs `onnxruntime-qnn` for Qualcomm NPU support.
-
-    To install an extra:
-
-    ```bash
-    uv pip install winml-cli[qnn]
-    ```
-
-    Both extras can be combined:
-
-    ```bash
-    uv pip install winml-cli[openvino,qnn]
-    ```
-
 ## Next steps
 
 - **[Quickstart](quickstart.md)** — export your first model in 5 minutes.

From dfec878bcce76a9fdaebd6e4dffac763dbeb30af Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 15:22:40 +0800
Subject: [PATCH 096/143] docs: remove QNN/OpenVINO SDK installation references

Remove --extra qnn/openvino, QAIRT SDK setup, QNN_SDK_ROOT env var
requirements, --compiler qairt examples/tabs, and onnxruntime-qnn
package mentions. Keep EP names and --compiler flag but without
SDK-specific setup guidance.
---
 docs/commands/compile.md               | 17 +++--------------
 docs/commands/overview.md              |  2 +-
 docs/commands/perf.md                  |  2 --
 docs/commands/sys.md                   | 16 ++++++----------
 docs/concepts/compile-and-epcontext.md |  2 +-
 docs/concepts/perf-and-monitoring.md   |  2 +-
 docs/getting-started/end-to-end.md     |  9 +--------
 docs/getting-started/installation.md   |  2 +-
 docs/getting-started/quickstart.md     |  2 +-
 docs/reference/index.md                |  2 +-
 docs/reference/supported-models.md     |  4 ++--
 docs/samples/convnext-primitives.md    | 16 ++++------------
 docs/tutorials/build-from-onnx.md      |  8 +++-----
 docs/tutorials/npu-convnext.md         | 20 +++++---------------
 14 files changed, 30 insertions(+), 74 deletions(-)

diff --git a/docs/commands/compile.md b/docs/commands/compile.md
index 46126f1f4..6c3d7dc9d 100644
--- a/docs/commands/compile.md
+++ b/docs/commands/compile.md
@@ -24,7 +24,7 @@ $ winml compile [options]
 | `--ep` | | choice | `None` | Force a specific execution provider, overriding device-to-provider mapping. Choices: `cpu`, `cuda`, `dml`, `migraphx`, `openvino`, `qnn`, `tensorrt`, `vitisai`. |
 | `--no-validate` | | flag | `false` | Skip validation of the compiled model after compilation. |
 | `--compiler` | | choice | `ort` | Compiler backend: `ort` (ONNX Runtime) or `qairt` (Qualcomm AI Runtime Tools). |
-| `--qnn-sdk-root` | | path | `None` | Path to the QAIRT/QNN SDK root directory. Required when `--compiler qairt` is set. |
+| `--qnn-sdk-root` | | path | `None` | Path to the QNN SDK root directory. |
 | `--embed` | | flag | `false` | Embed the EP context blob inside the ONNX file instead of writing a separate `.bin` file. |
 | `--list` | | flag | `false` | List available compiler backends for the selected device and exit without compiling. |
 | `--help` | `-h` | flag | | Show this message and exit. |
@@ -37,9 +37,8 @@ EP's offline compilation toolchain. When `--device auto` (the default), the
 target EP is determined by auto-detecting available hardware. For NPU targets,
 ONNX Runtime's QNN EP generates a binary `.bin` context file (or embeds it
 inline with `--embed`) that encodes the hardware-optimized execution plan,
-eliminating graph partitioning at load time. When `--compiler qairt` is used,
-the Qualcomm AI Runtime Tools SDK is invoked directly (requires `--qnn-sdk-root`).
-An optional post-compilation validation pass runs a forward pass through the
+eliminating graph partitioning at load time. An optional post-compilation
+validation pass runs a forward pass through the
 target EP; skip it with `--no-validate` when the target hardware is absent.
 
 ## Examples
@@ -78,18 +77,8 @@ winml compile -m bert-base-uncased_qdq.onnx --embed
 winml compile -m microsoft_resnet50.onnx --device gpu --ep migraphx
 ```
 
-```bash
-# Compile using the QAIRT SDK and skip post-compilation validation
-winml compile -m facebook_convnext_qdq.onnx \
-  --compiler qairt \
-  --qnn-sdk-root /opt/qnn-sdk \
-  --no-validate
-```
-
 ## Common pitfalls
 
-- **`--compiler qairt` requires `--qnn-sdk-root`.** Without a valid SDK path,
-  compilation will fail immediately with a missing-executable error.
 - **`--embed` inflates the `.onnx` file significantly.** Embedding the EP
   context produces a single portable file but can make it impractical to open or
   inspect the ONNX graph with standard tooling.
diff --git a/docs/commands/overview.md b/docs/commands/overview.md
index e3ca6d603..e867e1622 100644
--- a/docs/commands/overview.md
+++ b/docs/commands/overview.md
@@ -23,7 +23,7 @@ measure speed and accuracy.
 
 | Command | Group | Purpose |
 |---|---|---|
-| [`sys`](sys.md) | Discover | Inspect your machine — devices, EPs, SDKs, runtime versions at a glance. |
+| [`sys`](sys.md) | Discover | Inspect your machine — devices, EPs, and runtime versions at a glance. |
 | [`inspect`](inspect.md) | Discover | Inspect a model's tasks, classes, and hierarchy before committing to an export. |
 | [`catalog`](catalog.md) | Discover | Browse the curated winml-cli catalog of validated models and benchmarks. |
 | [`analyze`](analyze.md) | Discover | Verify an ONNX model is compatible with a target execution provider before deployment. |
diff --git a/docs/commands/perf.md b/docs/commands/perf.md
index b4d3948ef..f88230aac 100644
--- a/docs/commands/perf.md
+++ b/docs/commands/perf.md
@@ -31,7 +31,6 @@ $ winml perf [options]
 | `--ignore-cache` | | flag | `false` | Build from scratch in a temporary folder and discard the artifact after benchmarking. Implies `--rebuild`. |
 | `--module` | | `TEXT` | — | PyTorch module class name for per-module benchmarking (e.g., `BertAttention`). Builds and times each matching instance separately. See [Load and export](../concepts/load-and-export.md). |
 | `--monitor` | | flag | `false` | Show a live NPU/CPU utilization chart while the benchmark runs and include hardware metrics in the JSON report. |
-| `--op-tracing` | | `basic\|detail` | — | *(Advanced, hidden=True)* Enable operator-level profiling. Requires `onnxruntime-qnn`. Hidden from `--help` by design; gated on QNN-only profiling support. |
 
 ## How it works
 
@@ -89,7 +88,6 @@ $ winml perf -m bert-base-uncased --module BertAttention --iterations 200
 
 - **Warm-up too low on NPU.** The first several inferences on an NPU EP can be significantly slower due to kernel compilation and caching. The default of 10 warm-up iterations is usually enough for vision models, but transformer models with many operators may need `--warmup 30` or higher to reach steady-state latency.
 - **`--shape-config` is silently ignored in two cases.** It has no effect on pre-exported ONNX files (shapes are baked into the graph) and is ignored in `--module` mode. The command prints a warning in both situations.
-- **`--op-tracing` requires `onnxruntime-qnn`.** The flag activates the QNN profiler, which is only present in the `onnxruntime-qnn` package. If that package is not installed, the benchmark still runs but the op-trace step exits with an error.
 - **Random inputs do not represent real data distributions.** Latency numbers are accurate, but memory access patterns may differ from production because the generated tensors are uniform random values. For memory-bandwidth-sensitive models this can understate real-world latency.
 - **Cross-device comparison.** To compare performance across devices, run `winml perf` separately with different `--device` values and compare the resulting JSON reports.
 
diff --git a/docs/commands/sys.md b/docs/commands/sys.md
index 442e83012..2c73177d4 100644
--- a/docs/commands/sys.md
+++ b/docs/commands/sys.md
@@ -1,6 +1,6 @@
 # winml sys
 
-> Inspect your machine — devices, EPs, SDKs, runtime versions at a glance.
+> Inspect your machine — devices, EPs, and runtime versions at a glance.
 
 ## When to use this
 
@@ -21,7 +21,7 @@ $ winml sys [options]
 | `--format` | `-f` | `text` \| `json` \| `compact` | `text` | Output format. `text` renders rich tables, `json` emits machine-readable JSON, `compact` prints a single-line summary. |
 | `--list-device` | — | flag | `false` | List available compute devices (NPU, GPU, CPU) in priority order instead of showing the full system report. |
 | `--list-ep` | — | flag | `false` | List available ONNX Runtime execution providers instead of showing the full system report. Can be combined with `--list-device`. |
-| `--verbose` | `-v` | flag | `false` | Surface additional diagnostic sections: Backend SDKs and Export Readiness. |
+| `--verbose` | `-v` | flag | `false` | Surface additional diagnostic sections: backend availability and Export Readiness. |
 | `--help` | `-h` | flag | — | Show help and exit. |
 
 > `winml sys` takes no `--model`, `--device`, `--ep`, `--task`, or `--precision`
@@ -31,10 +31,10 @@ $ winml sys [options]
 
 `winml sys` queries Python's `platform` and `importlib.metadata` modules to report
 library versions, then probes PyTorch for CUDA availability and GPU device names.
-Backend SDK detection checks for `QNN_SDK_ROOT` / `QAIRT_SDK_ROOT` environment
-variables (QNN) and attempts to import `openvino` (OpenVINO). Device enumeration
-queries hardware directly in NPU > GPU > CPU priority order, while EP enumeration
-merges the WinML EP registry with ONNX Runtime's `get_available_providers()`. When
+Backend availability checks use the installed runtime environment, while device
+enumeration queries hardware directly in NPU > GPU > CPU priority order, and EP
+enumeration merges the WinML EP registry with ONNX Runtime's
+`get_available_providers()`. When
 `--format json` is used the full report — including devices and EPs — is emitted as
 a single JSON object, making it easy to capture in CI pipelines.
 
@@ -96,10 +96,6 @@ $ winml sys --list-ep --format json
 
 ## Common pitfalls
 
-- **QNN SDK not found even though it is installed.** The detection relies on the
-  `QNN_SDK_ROOT` or `QAIRT_SDK_ROOT` environment variables. If neither is set,
-  `winml sys` will report the SDK as absent even if the binaries exist on disk.
-  Set the variable and re-run.
 - **`--list-device` and `--list-ep` suppress the full report.** When either flag is
   present, only the requested section is printed. Omit both flags to see the
   complete system report.
diff --git a/docs/concepts/compile-and-epcontext.md b/docs/concepts/compile-and-epcontext.md
index f5c37521c..da2288f33 100644
--- a/docs/concepts/compile-and-epcontext.md
+++ b/docs/concepts/compile-and-epcontext.md
@@ -10,7 +10,7 @@ For EPs that are fully integrated into ONNX Runtime — CPU, DirectML, and simil
 
 For QNN-family EPs (the `--ep qnn` and `--ep vitisai` targets used for NPU inference), the compiler goes further. QNN takes the ONNX graph and produces a binary artifact — the **EP context blob** — that encodes the fully compiled, hardware-ready version of the network. This blob is then associated with the ONNX model file. On subsequent loads, the QNN EP reads the blob rather than re-compiling the graph, which makes session creation dramatically faster.
 
-The default compiler backend is `ort` (ONNX Runtime). If you have a QAIRT SDK installed you can select `--compiler qairt` and point `--qnn-sdk-root` at the SDK root for direct QAIRT compilation instead.
+The default compiler backend is `ort` (ONNX Runtime).
 
 ## Embedded vs external EPContext
 
diff --git a/docs/concepts/perf-and-monitoring.md b/docs/concepts/perf-and-monitoring.md
index 3f96ced93..f507e179a 100644
--- a/docs/concepts/perf-and-monitoring.md
+++ b/docs/concepts/perf-and-monitoring.md
@@ -28,7 +28,7 @@ When end-to-end latency is higher than expected, per-operator tracing lets you f
 
 `--op-tracing detail` goes further, collecting timing for every individual operator node in the graph. This is useful when the same operator type appears in different parts of the model with very different costs — for instance, early-layer convolutions versus late-layer convolutions in a ResNet-style architecture.
 
-Both levels require an `onnxruntime-qnn` build with profiling support. If the requirement is not met, `winml-cli` will tell you at startup rather than silently running without tracing.
+If tracing is unavailable, `winml-cli` will tell you at startup rather than silently running without tracing.
 
 ## Per-module benchmarking
 
diff --git a/docs/getting-started/end-to-end.md b/docs/getting-started/end-to-end.md
index 8f6d27dc2..c60de322b 100644
--- a/docs/getting-started/end-to-end.md
+++ b/docs/getting-started/end-to-end.md
@@ -23,10 +23,8 @@ reading from that device.
     To target an NPU you also need:
 
     - A device with an NPU (e.g., Qualcomm Snapdragon X, Intel Core Ultra)
-    - `--extra qnn` installed (for Qualcomm NPU, Python 3.11+)
-    - Only if using `--compiler qairt`: QAIRT SDK installed with `QNN_SDK_ROOT` env var set. The default `--compiler ort` does not require any external SDK.
 
-    Everything else on this page works without these.
+    Everything else on this page works without it.
 
 ## Step 0: See what your machine has
 
@@ -110,11 +108,6 @@ deeper look at how each stage works, see
 [Concepts → How winml-cli works](../concepts/how-it-works.md) and
 [Config and Build](../concepts/config-and-build.md).
 
-!!! warning "NPU users"
-    `winml build` reads `QNN_SDK_ROOT` from the environment. Make sure it
-    points at your QAIRT SDK before this step, or the compile stage will fail
-    with *"QAIRT SDK path not found"*.
-
 ## Step 3: Benchmark on your device
 
 ```bash
diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md
index c5039a488..c83036b4a 100644
--- a/docs/getting-started/installation.md
+++ b/docs/getting-started/installation.md
@@ -65,7 +65,7 @@ Available Execution Providers
   CPUExecutionProvider           -> CPU
 ```
 
-This command enumerates available compute devices and execution providers on your machine. If an expected device or SDK is missing, `winml sys` is the right place to diagnose it. See [winml sys](../commands/sys.md) for the full flag reference and troubleshooting tips.
+This command enumerates available compute devices and execution providers on your machine. If an expected device or execution provider is missing, `winml sys` is the right place to diagnose it. See [winml sys](../commands/sys.md) for the full flag reference and troubleshooting tips.
 
 ## Next steps
 
diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md
index 22aefa807..fd52ae6f7 100644
--- a/docs/getting-started/quickstart.md
+++ b/docs/getting-started/quickstart.md
@@ -15,7 +15,7 @@ uv run winml sys --list-device --list-ep
 ```
 
 `--list-device` and `--list-ep` print only the hardware and EP inventory,
-skipping SDK versions and Python environment details that plain `winml sys`
+skipping runtime-version and Python environment details that plain `winml sys`
 would include. If the command exits without error, your winml-cli install is
 ready. See [`winml sys`](../commands/sys.md) for the full flag reference.
 
diff --git a/docs/reference/index.md b/docs/reference/index.md
index 5c3dd8c71..9c98c2751 100644
--- a/docs/reference/index.md
+++ b/docs/reference/index.md
@@ -116,7 +116,7 @@ Set to `null` to skip compilation.
 | `ep_config.embed_context` | `bool` | `false` | Embed binary in ONNX (true) or external .bin (false). |
 | `ep_config.compiler` | `str` | `"ort"` | Compiler backend: `ort` or `qairt`. |
 | `ep_config.provider_options` | `dict` | `{}` | EP-specific options. |
-| `ep_config.qnn_sdk_root` | `str \| null` | `null` | QAIRT SDK path (required for `compiler: "qairt"`). |
+| `ep_config.qnn_sdk_root` | `str \| null` | `null` | QNN SDK path for QAIRT compiler backend. |
 | `validate` | `bool` | `true` | Validate compiled model. |
 | `verbose` | `bool` | `false` | Verbose compilation logging. |
 
diff --git a/docs/reference/supported-models.md b/docs/reference/supported-models.md
index 9054bb97f..95215cc32 100644
--- a/docs/reference/supported-models.md
+++ b/docs/reference/supported-models.md
@@ -128,8 +128,8 @@ Each validated model is tested against available EPs:
 | NvTensorRTRTXExecutionProvider | `nvtensorrtrtx`, `nv_tensorrt_rtx` | GPU | NVIDIA TensorRT-RTX; NVIDIA GPU with TensorRT runtime |
 | CUDAExecutionProvider | `cuda` | GPU | NVIDIA CUDA; any CUDA-capable GPU |
 | MIGraphXExecutionProvider | `migraphx` | GPU | AMD ROCm MIGraphX |
-| QNNExecutionProvider | `qnn` | NPU, GPU | Qualcomm Snapdragon; bundled in ORT (`--compiler qairt` needs QNN SDK) |
-| OpenVINOExecutionProvider | `openvino` | NPU, GPU, CPU | Intel hardware; install with `--extra openvino` |
+| QNNExecutionProvider | `qnn` | NPU, GPU | Qualcomm Snapdragon; bundled in ORT |
+| OpenVINOExecutionProvider | `openvino` | NPU, GPU, CPU | Intel hardware |
 | DmlExecutionProvider | `dml` | GPU | DirectML; any DirectX 12 GPU |
 | CPUExecutionProvider | `cpu` | CPU | Always available |
 | VitisAIExecutionProvider | `vitisai` | NPU | AMD/Xilinx |
diff --git a/docs/samples/convnext-primitives.md b/docs/samples/convnext-primitives.md
index d556de464..f2eea1353 100644
--- a/docs/samples/convnext-primitives.md
+++ b/docs/samples/convnext-primitives.md
@@ -13,7 +13,6 @@ This walkthrough drives the full pipeline using the primitive commands directly:
 
 - winml-cli installed and `winml` available on your PATH — see [Installation](../getting-started/installation.md).
 - Internet access so HuggingFace Hub can download the model weights on first run.
-- Optional: QNN SDK installed on a Snapdragon Copilot+ PC for the NPU section.
 
 ## Step 1: Inspect the model
 
@@ -84,7 +83,7 @@ Saved: convnext_int8.onnx
 
 ## Step 5: Compile for each EP
 
-Compilation pre-bakes an EP-specific binary cache into the ONNX graph so the runtime can skip per-session JIT compilation. Two compiler backends are available: `ort` (default, uses ONNX Runtime's built-in compiler) and `qairt` (uses the QAIRT SDK's offline compiler directly). Pass `--compiler qairt` if you have the QAIRT SDK and want direct QNN compilation.
+Compilation pre-bakes an EP-specific binary cache into the ONNX graph so the runtime can skip per-session JIT compilation. The examples below use the default `ort` compiler backend, which uses ONNX Runtime's built-in compiler.
 
 === "CPU"
 
@@ -104,15 +103,8 @@ Compilation pre-bakes an EP-specific binary cache into the ONNX graph so the run
     winml compile -m convnext_int8.onnx --output-dir . --device npu
     ```
 
-=== "NPU (QAIRT SDK)"
-
-    ```bash
-    # Requires QNN_SDK_ROOT env var or --qnn-sdk-root
-    winml compile -m convnext_int8.onnx --output-dir . --device npu --compiler qairt --qnn-sdk-root <path-to-qnn-sdk>
-    ```
-
-!!! note "NPU compiler backends"
-    The default `--compiler ort` backend uses ONNX Runtime's built-in QNN compilation — no separate SDK installation is needed. The `--compiler qairt` backend calls the QAIRT SDK's offline compiler directly and requires either `--qnn-sdk-root` or the `QNN_SDK_ROOT` environment variable. Both produce the same EPContext output format. For a full explanation of how EPs relate to device targets see [ONNX & Execution Providers](../concepts/eps-and-devices.md).
+!!! note "NPU compiler backend"
+    The default `--compiler ort` backend uses ONNX Runtime's built-in compilation. For a full explanation of how EPs relate to device targets see [ONNX & Execution Providers](../concepts/eps-and-devices.md).
 
 Only the NPU invocation writes a new compiled artifact — `convnext_int8_npu_ctx.onnx` — which contains an EPContext node embedding the pre-compiled binary. CPU and GPU compile with `enable_ep_context=False` by default: the compile step validates the model against the target EP but does not produce a new file. For CPU and GPU perf benchmarks (Step 6), use the quantized `convnext_int8.onnx` directly.
 
@@ -180,4 +172,4 @@ To compare quantized accuracy against the floating-point baseline, run the same
 - [BERT — Config + Build + Perf](bert-config-build.md) — the same pipeline driven through `winml build` with a config file
 - [How winml-cli Works](../concepts/how-it-works.md) — pipeline overview and stage descriptions
 - [Quantization & QDQ](../concepts/quantization.md) — calibration methods and accuracy trade-offs
-- [ONNX & Execution Providers](../concepts/eps-and-devices.md) — EP selection, device flags, and QNN SDK setup
+- [ONNX & Execution Providers](../concepts/eps-and-devices.md) — EP selection and device flags
diff --git a/docs/tutorials/build-from-onnx.md b/docs/tutorials/build-from-onnx.md
index fe6898e3c..0fcfa55f9 100644
--- a/docs/tutorials/build-from-onnx.md
+++ b/docs/tutorials/build-from-onnx.md
@@ -14,8 +14,6 @@ The tutorial is split into two sections. Section A walks through the analyze →
 - **Python 3.11** and **uv** installed (`pip install uv` or follow [astral.sh/uv](https://astral.sh/uv))
 - **winml-cli** installed — see [Installation](../getting-started/installation.md)
 - **An ONNX model file** — this tutorial uses `my_model.onnx` as a placeholder; substitute your own file
-- **For QNN (Snapdragon NPU):** QAIRT SDK installed and `QNN_SDK_ROOT` set to its root directory
-- **For OpenVINO (Intel CPU/GPU/NPU):** OpenVINO runtime installed and registered as an ONNX Runtime EP
 
 > No NPU? Set `--device cpu` wherever you see `--device npu`. Every other flag stays the same.
 
@@ -151,7 +149,7 @@ If your target is NPU deployment, continue the pipeline with quantization and co
 # Quantize (INT8, QDQ format)
 uv run winml quantize -m my_model_optimized.onnx -o my_model_int8.onnx --precision int8 --samples 32
 
-# Compile for NPU (default --compiler ort; use --compiler qairt for QAIRT SDK)
+# Compile for NPU
 uv run winml compile -m my_model_int8.onnx --device npu
 ```
 
@@ -229,7 +227,7 @@ By default when auto-generating config (no `-c` flag):
 Override flags:
 
 - `--no-quant` — force skip quantization (even on NPU)
-- `--compile` — force enable compilation (requires EP SDK)
+- `--compile` — force enable compilation
 - `--no-compile` — force skip compilation (default when no config file)
 
 ```bash
@@ -284,7 +282,7 @@ print(f"Final model: {result.final_onnx_path}")
 | Analyzer reports unsupported ops | Check if an optimization fusion resolves them; if not, the model needs modification for that EP |
 | Optimization loop doesn't converge | The default max is 3 iterations; if patterns persist, they may not be fusible — use `--no-quant --no-compile` and inspect |
 | Quantization accuracy regression | Try `--precision int16`, `--per-channel`, or increase `--samples` for better calibration |
-| EP compilation fails | Ensure the target EP SDK is installed (`QNN_SDK_ROOT` for QNN, OpenVINO runtime for Intel) |
+| EP compilation fails | Check the selected EP, model compatibility, and target device availability |
 | Model too large for memory | Use `--no-compile` and compile on the target device |
 
 ---
diff --git a/docs/tutorials/npu-convnext.md b/docs/tutorials/npu-convnext.md
index 91b71ae17..0e4d46a1f 100644
--- a/docs/tutorials/npu-convnext.md
+++ b/docs/tutorials/npu-convnext.md
@@ -21,8 +21,6 @@ The tutorial is split into two sections. Section A runs through eight primitive
 - **Copilot+PC with NPU** — 40+ TOPS recommended; CPU and DirectML work as fallback throughout
 - **Python 3.11** and **uv** installed (`pip install uv` or follow [astral.sh/uv](https://astral.sh/uv))
 - **winml-cli** installed — see [Installation](../getting-started/installation.md)
-- **For QNN (Snapdragon NPU):** QAIRT SDK installed and `QNN_SDK_ROOT` set to its root directory
-- **For OpenVINO (Intel CPU/GPU/NPU):** OpenVINO runtime installed and registered as an ONNX Runtime EP
 
 > No NPU? Set `--device cpu` wherever you see `--device npu` and drop `--monitor` from perf commands. Every other flag stays the same.
 
@@ -140,12 +138,11 @@ The quantizer generates 32 random calibration samples, runs them through the mod
 
 ### Step 7: Compile for the target EP
 
-Compilation converts the portable quantized ONNX into an EP-specific binary format that the execution provider can load directly, skipping JIT compilation at inference time. This is the step that produces a device-locked artifact — the output is tied to the specific EP and, for QNN, to the QNN SDK version.
+Compilation converts the portable quantized ONNX into an EP-specific binary format that the execution provider can load directly, skipping JIT compilation at inference time. This is the step that produces a device-locked artifact tied to the selected EP.
 
-Two compiler backends are available:
+The examples below use the default compiler backend:
 
-- **`--compiler ort`** (default) — uses ONNX Runtime's built-in EP context compiler. Works for QNN and OpenVINO targets without needing the vendor SDK on the build machine (the ORT package bundles the necessary libraries).
-- **`--compiler qairt`** — uses the QAIRT SDK's offline compiler directly. Requires `QNN_SDK_ROOT` to point at a local QAIRT SDK installation. Produces the same EPContext output but goes through Qualcomm's native toolchain.
+- **`--compiler ort`** (default) — uses ONNX Runtime's built-in EP context compiler.
 
 === "QNN via ORT (default)"
 
@@ -153,13 +150,6 @@ Two compiler backends are available:
     uv run winml compile -m convnext_int8.onnx --device npu
     ```
 
-=== "QNN via QAIRT SDK"
-
-    ```bash
-    # Requires QNN_SDK_ROOT env var set to your QAIRT SDK root
-    uv run winml compile -m convnext_int8.onnx --device npu --compiler qairt
-    ```
-
 === "OpenVINO (Intel CPU/GPU/NPU)"
 
     ```bash
@@ -169,7 +159,7 @@ Two compiler backends are available:
 The compiled output file appears in the same directory as the input model. The file name follows the pattern `convnext_int8_npu_ctx.onnx` (using the resolved device string `npu`, not the EP name) and an accompanying `.bin` context binary is written alongside it (unless `--embed` is passed, which embeds the binary inside the ONNX file). CPU builds do not produce a new artifact — the compile step validates EP compatibility but writes no output file; use `convnext_int8.onnx` directly for CPU inference.
 
 !!! note "What we just did"
-    Compilation embeds EP context — the compiled binary — inside or alongside the ONNX file using the `EPContext` node convention. At inference time the runtime loads the pre-compiled binary directly rather than re-compiling from the ONNX graph, eliminating the 15–60 second JIT penalty on first load. The default `--compiler ort` backend bundles compilation within ONNX Runtime itself. The `--compiler qairt` backend calls the QAIRT SDK directly and requires `QNN_SDK_ROOT` (set as an environment variable, or passed with `--qnn-sdk-root` on `winml compile`). `winml build` reads only the env var. See [Concepts → Compile and EPContext](../concepts/compile-and-epcontext.md) for the full picture of what gets embedded and how the context is consumed at runtime.
+    Compilation embeds EP context — the compiled binary — inside or alongside the ONNX file using the `EPContext` node convention. At inference time the runtime loads the pre-compiled binary directly rather than re-compiling from the ONNX graph, eliminating the 15–60 second JIT penalty on first load. The default `--compiler ort` backend bundles compilation within ONNX Runtime itself. See [Concepts → Compile and EPContext](../concepts/compile-and-epcontext.md) for the full picture of what gets embedded and how the context is consumed at runtime.
 
 ---
 
@@ -259,7 +249,7 @@ uv run winml build -c convnext_config.json -m facebook/convnext-tiny-224 -o conv
 ```
 
 !!! note "What we just did"
-    `winml build` is the production workflow. It guarantees that stages run in the correct order, passes intermediate artifacts through the pipeline automatically, and records which stages completed or were skipped in the result summary. The config file you pass with `-c` fully specifies the device target, precision, and EP — so you get an NPU-targeted INT8 compiled model without needing to repeat those flags on every primitive command. The QNN SDK path is read from the `QNN_SDK_ROOT` environment variable, not from the config or CLI flags.
+    `winml build` is the production workflow. It guarantees that stages run in the correct order, passes intermediate artifacts through the pipeline automatically, and records which stages completed or were skipped in the result summary. The config file you pass with `-c` fully specifies the device target, precision, and EP — so you get an NPU-targeted INT8 compiled model without needing to repeat those flags on every primitive command.
 
 Once the build completes, benchmark the final artifact from `convnext_out/`:
 

From ed7e9abd5a5ecd4b387f8e4513c59502b6db27bd Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 15:26:01 +0800
Subject: [PATCH 097/143] docs: replace Unicode box-drawing chars with ASCII
 for consistent rendering

---
 docs/commands/catalog.md             | 14 +++++++-------
 docs/commands/inspect.md             | 14 +++++++-------
 docs/commands/sys.md                 |  6 +++---
 docs/getting-started/installation.md |  6 +++---
 docs/getting-started/quickstart.md   | 14 +++++++-------
 docs/samples/convnext-primitives.md  | 14 +++++++-------
 6 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/docs/commands/catalog.md b/docs/commands/catalog.md
index 941cf85fd..fd9b00468 100644
--- a/docs/commands/catalog.md
+++ b/docs/commands/catalog.md
@@ -46,13 +46,13 @@ $ winml catalog
 ```
 
 ```text
-╭─── winml-cli Catalog  |  12 validated model(s) ───────────────────────────╮
-│  Model                             Task                    Model Type     │
-│ ├ microsoft/resnet-50              image-classification    resnet         │
-│ ├ bert-base-uncased                fill-mask               bert           │
-│ ├ ProsusAI/finbert                 text-classification     bert           │
-│ └ ...                                                                     │
-╰────────────────────────────────────────────────────────────────────────────╯
++--- winml-cli Catalog  |  12 validated model(s) --------------------------+
+|  Model                             Task                    Model Type     |
+|  microsoft/resnet-50              image-classification    resnet          |
+|  bert-base-uncased                fill-mask               bert            |
+|  ProsusAI/finbert                 text-classification     bert            |
+|  ...                                                                      |
++---------------------------------------------------------------------------+
 Use  --ep  or  --device  to filter by execution provider or target device.
 ```
 
diff --git a/docs/commands/inspect.md b/docs/commands/inspect.md
index 863c78947..4cdc5c9f0 100644
--- a/docs/commands/inspect.md
+++ b/docs/commands/inspect.md
@@ -51,13 +51,13 @@ $ winml inspect -m microsoft/resnet-50
 ```
 
 ```text
-╭─────────────────────────── microsoft/resnet-50 ───────────────────────────╮
-│ Task          image-classification                                         │
-│ Model Class   ResNetForImageClassification                                 │
-│ Exporter      OptimumExporter                                              │
-│ WinML Class   WinMLImageClassificationModel                                │
-│ Status        Supported                                                    │
-╰────────────────────────────────────────────────────────────────────────────╯
++--------------------------- microsoft/resnet-50 ---------------------------+
+| Task          image-classification                                         |
+| Model Class   ResNetForImageClassification                                 |
+| Exporter      OptimumExporter                                              |
+| WinML Class   WinMLImageClassificationModel                                |
+| Status        Supported                                                    |
++---------------------------------------------------------------------------+
 ```
 
 ```bash
diff --git a/docs/commands/sys.md b/docs/commands/sys.md
index 2c73177d4..167692846 100644
--- a/docs/commands/sys.md
+++ b/docs/commands/sys.md
@@ -46,9 +46,9 @@ $ winml sys
 ```
 
 ```text
-╭──────────────────────────────────╮
-│   winml-cli System Information    │
-╰──────────────────────────────────╯
++------------------------------------+
+|   winml-cli System Information     |
++------------------------------------+
 
 Environment
   Python Version    3.11.9
diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md
index c83036b4a..7d96b875d 100644
--- a/docs/getting-started/installation.md
+++ b/docs/getting-started/installation.md
@@ -40,9 +40,9 @@ winml sys
 Expected output (abbreviated):
 
 ```text
-╭──────────────────────────────────╮
-│   winml-cli System Information    │
-╰──────────────────────────────────╯
++------------------------------------+
+|   winml-cli System Information     |
++------------------------------------+
 
 Environment
   Python Version    3.11.x
diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md
index fd52ae6f7..06bc5a16b 100644
--- a/docs/getting-started/quickstart.md
+++ b/docs/getting-started/quickstart.md
@@ -28,13 +28,13 @@ uv run winml inspect -m microsoft/resnet-50
 ```
 
 ```text
-╭─────────────────────────── microsoft/resnet-50 ───────────────────────────╮
-│ Task          image-classification                                         │
-│ Model Class   ResNetForImageClassification                                 │
-│ Exporter      OptimumExporter                                              │
-│ WinML Class   WinMLImageClassificationModel                                │
-│ Status        Supported                                                    │
-╰────────────────────────────────────────────────────────────────────────────╯
++--------------------------- microsoft/resnet-50 ---------------------------+
+| Task          image-classification                                         |
+| Model Class   ResNetForImageClassification                                 |
+| Exporter      OptimumExporter                                              |
+| WinML Class   WinMLImageClassificationModel                                |
+| Status        Supported                                                    |
++---------------------------------------------------------------------------+
 ```
 
 !!! note "What just happened"
diff --git a/docs/samples/convnext-primitives.md b/docs/samples/convnext-primitives.md
index f2eea1353..5093dfaa9 100644
--- a/docs/samples/convnext-primitives.md
+++ b/docs/samples/convnext-primitives.md
@@ -23,13 +23,13 @@ winml inspect -m facebook/convnext-tiny-224
 ```
 
 ```text
-╭─────────────────────────── facebook/convnext-tiny-224 ────────────────────────╮
-│ Task          image-classification                                              │
-│ Model Class   ConvNextForImageClassification                                   │
-│ Exporter      OptimumExporter                                                  │
-│ WinML Class   WinMLImageClassificationModel                                    │
-│ Status        Supported                                                        │
-╰────────────────────────────────────────────────────────────────────────────────╯
++------------------------- facebook/convnext-tiny-224 --------------------------+
+| Task          image-classification                                             |
+| Model Class   ConvNextForImageClassification                                   |
+| Exporter      OptimumExporter                                                  |
+| WinML Class   WinMLImageClassificationModel                                    |
+| Status        Supported                                                        |
++-------------------------------------------------------------------------------+
 ```
 
 !!! note "What we just did"

From bf01055367e462bb1564dc6bab9d1beaf3951e75 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 16:25:06 +0800
Subject: [PATCH 098/143] docs: fix incorrect claim about Coding Agent
 auto-reading skills/

Coding Agent reads .github/copilot-instructions.md, not skills/ directly.
---
 docs/getting-started/agent-skill.md | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/docs/getting-started/agent-skill.md b/docs/getting-started/agent-skill.md
index 518a77c24..6b0e73e3a 100644
--- a/docs/getting-started/agent-skill.md
+++ b/docs/getting-started/agent-skill.md
@@ -26,11 +26,10 @@ The skill teaches the agent:
 
 ### With GitHub Copilot Coding Agent
 
-The [Copilot Coding Agent](https://docs.github.com/en/copilot/how-tos/copilot-on-github/use-copilot-agents/overview)
-(the cloud agent that creates PRs) automatically reads `skills/use-winml-cli/SKILL.md`
-when working on this repository. No setup needed — assign an issue or ask
-Copilot to build/optimize a model and it will follow the skill's guidance to
-run the correct `winml` commands.
+To make the [Copilot Coding Agent](https://docs.github.com/en/copilot/how-tos/copilot-on-github/use-copilot-agents/overview)
+(the cloud agent that creates PRs) follow the skill's guidance, reference it in
+`.github/copilot-instructions.md`. The Coding Agent reads that file automatically
+when working on this repository.
 
 ### With other AI agents
 

From 8ba1a468eb5cd994acb01866d78c6675af97cc5d Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 16:25:53 +0800
Subject: [PATCH 099/143] docs: remove redundant 'Key principles' section
 (already in table above)

---
 docs/getting-started/agent-skill.md | 19 -------------------
 1 file changed, 19 deletions(-)

diff --git a/docs/getting-started/agent-skill.md b/docs/getting-started/agent-skill.md
index 6b0e73e3a..89cbdc945 100644
--- a/docs/getting-started/agent-skill.md
+++ b/docs/getting-started/agent-skill.md
@@ -59,25 +59,6 @@ winml-cli/
 
 ---
 
-## Key principles encoded in the skill
-
-1. **Inspect first** — always run `winml inspect` before building to catch
-   unsupported architectures early.
-
-2. **Don't fabricate flags** — if a flag isn't in `--help`, it doesn't exist.
-   The skill enforces this as a hard rule.
-
-3. **Published outputs only** — each command has an explicit `-o` output; never
-   fish artifacts from internal cache.
-
-4. **EP-compiled models are EP-bound** — don't benchmark a QNN-compiled model on
-   the CPU EP. Use the pre-compile optimized ONNX for cross-EP comparison.
-
-5. **Scope gate** — the agent will refuse to attempt generative/decoder-only
-   models (GPT, LLaMA, Phi, Stable Diffusion) and explain they're out of scope.
-
----
-
 ## Example agent interaction
 
 ```

From 48081c7638cee635ed35c8e414e81f73ed2c8def Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 16:28:01 +0800
Subject: [PATCH 100/143] docs: remove unnecessary 'Updating the skill' section

---
 docs/getting-started/agent-skill.md | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/docs/getting-started/agent-skill.md b/docs/getting-started/agent-skill.md
index 89cbdc945..6d09c222d 100644
--- a/docs/getting-started/agent-skill.md
+++ b/docs/getting-started/agent-skill.md
@@ -72,11 +72,3 @@ Agent (with skill):
 5. Runs `winml perf -m output/model.onnx -d npu --monitor`
 6. Reports latency + NPU utilization to user
 ```
-
----
-
-## Updating the skill
-
-The skill lives at `skills/use-winml-cli/SKILL.md` in the repository root.
-When commands or flags change, update both the docs site and the skill file to
-keep agent behavior aligned with the CLI.

From b884b25048a1d27330ac4852adcbb5e05f8b7990 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 17:07:25 +0800
Subject: [PATCH 101/143] docs: add eval to config tree, link to Config Schema
 reference page

---
 docs/concepts/how-it-works.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/docs/concepts/how-it-works.md b/docs/concepts/how-it-works.md
index f18fa40ce..3b5ee13e6 100644
--- a/docs/concepts/how-it-works.md
+++ b/docs/concepts/how-it-works.md
@@ -110,7 +110,8 @@ WinMLBuildConfig
 ├── export    — input tensor specs, opset, backend
 ├── optim     — fusion flags, optimization level
 ├── quant     — precision, calibration settings (null = skip stage)
-└── compile   — target EP, device (null = skip stage)
+├── compile   — target EP, device (null = skip stage)
+└── eval      — evaluation settings
 ```
 
 Setting `quant` or `compile` to `null` in the JSON file is equivalent to passing
@@ -123,6 +124,8 @@ completes, capturing any autoconf-adjusted fusion flags so the build is reproduc
 This persisted `winml_build_config.json` is a self-contained pipeline specification that
 you can check into version control and run in CI/CD (`winml build -c winml_build_config.json -m <model> -o output/`) for repeatable, unattended builds across environments.
 
+For the full field-by-field schema, see [Reference — Config Schema](../reference/index.md).
+
 ## See Also
 
 - [winml build](../commands/build.md) — full reference for the build command

From 77fa880676b0ef154014e3b914447225313cc70d Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 17:09:40 +0800
Subject: [PATCH 102/143] docs: clarify -c config.json is optional for winml
 build

---
 docs/concepts/how-it-works.md | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/docs/concepts/how-it-works.md b/docs/concepts/how-it-works.md
index 3b5ee13e6..7a1d10b75 100644
--- a/docs/concepts/how-it-works.md
+++ b/docs/concepts/how-it-works.md
@@ -81,6 +81,15 @@ normal workflow is `winml build`, which orchestrates the full pipeline in a sing
 command:
 
 ```bash
+winml build -m microsoft/resnet-50 -o output/
+```
+
+The `-c config.json` flag is optional. If omitted, `winml build` auto-generates a
+default config internally. To customize pipeline settings, generate a config first
+with `winml config` and then pass it:
+
+```bash
+winml config -m microsoft/resnet-50 -o config.json
 winml build -c config.json -m microsoft/resnet-50 -o output/
 ```
 
@@ -93,10 +102,10 @@ Individual stages can be bypassed from the command line without editing the conf
 
 ```bash
 # Skip quantization and compilation
-winml build -c config.json -m bert-base-uncased -o output/ --no-quant --no-compile
+winml build -m bert-base-uncased -o output/ --no-quant --no-compile
 
 # Skip optimization (for pre-quantized input)
-winml build -c config.json -m model_qdq.onnx -o output/ --no-optimize
+winml build -m model_qdq.onnx -o output/ --no-optimize
 ```
 
 ## Configuration: `WinMLBuildConfig` vs CLI Flags

From d018281f0a1533b3090a5867b2b199dba2766476 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 17:12:05 +0800
Subject: [PATCH 103/143] docs: add cross-references to Reference pages (Config
 Schema, Output Layout, Supported Models)

---
 docs/commands/build.md             | 2 ++
 docs/commands/config.md            | 1 +
 docs/commands/export.md            | 6 ++++++
 docs/commands/inspect.md           | 1 +
 docs/concepts/config-and-build.md  | 1 +
 docs/getting-started/end-to-end.md | 6 ++++++
 docs/samples/clip-composite.md     | 1 +
 7 files changed, 18 insertions(+)

diff --git a/docs/commands/build.md b/docs/commands/build.md
index a825ed4a6..89039bc5e 100644
--- a/docs/commands/build.md
+++ b/docs/commands/build.md
@@ -111,3 +111,5 @@ winml build -c config.json -m microsoft/resnet-50 \
 - [winml compile](compile.md)
 - [Config and build](../concepts/config-and-build.md)
 - [How it works](../concepts/how-it-works.md)
+- [Config Schema](../reference/index.md) — full field-by-field config reference
+- [Output Layout](../reference/output-layout.md) — what each output file contains
diff --git a/docs/commands/config.md b/docs/commands/config.md
index d3eccb41a..2c3cadf73 100644
--- a/docs/commands/config.md
+++ b/docs/commands/config.md
@@ -92,6 +92,7 @@ $ winml config -m facebook/convnext-tiny-224.onnx --no-quant -o convnext_optim_o
 ## See also
 
 - [Config and build](../concepts/config-and-build.md) — structure of `WinMLBuildConfig` and how stages interact
+- [Config Schema](../reference/index.md) — full field-by-field config reference
 - [build.md](build.md) — run the full pipeline using a generated config
 - [export.md](export.md) — export a HuggingFace model to ONNX as a standalone step
 - [optimize.md](optimize.md) — apply graph optimizations to an existing ONNX file
diff --git a/docs/commands/export.md b/docs/commands/export.md
index 23014a472..2a16473df 100644
--- a/docs/commands/export.md
+++ b/docs/commands/export.md
@@ -81,6 +81,12 @@ winml export -m bert-base-uncased -o bert.onnx --input-specs inputs.json
 winml export -m microsoft/resnet-50 -o resnet50_clean.onnx --clean-onnx
 ```
 
+## See also
+
+- [Output Layout](../reference/output-layout.md) — what each output file contains
+- [winml optimize](optimize.md) — the next pipeline stage after export
+- [Load and export concept](../concepts/load-and-export.md) — details on the export process
+
 ## Common pitfalls
 
 - **Task detection fails on unusual model IDs.** If auto-detection picks the
diff --git a/docs/commands/inspect.md b/docs/commands/inspect.md
index 4cdc5c9f0..8f92de579 100644
--- a/docs/commands/inspect.md
+++ b/docs/commands/inspect.md
@@ -100,6 +100,7 @@ $ winml inspect -m facebook/convnext-tiny-224 -v -H
 
 - [catalog.md](catalog.md) — browse the curated catalog and check accuracy verdicts before
   inspecting
+- [Supported Models](../reference/supported-models.md) — full list of validated model architectures
 - [Load and export concept](../concepts/load-and-export.md) — how `winml.hierarchy.tag`
   metadata is written and what you can do with the module tree
 - [How winml-cli Works](../concepts/how-it-works.md) — pipeline overview showing where
diff --git a/docs/concepts/config-and-build.md b/docs/concepts/config-and-build.md
index 8c9265d66..4e53eb3bb 100644
--- a/docs/concepts/config-and-build.md
+++ b/docs/concepts/config-and-build.md
@@ -157,5 +157,6 @@ benefits:
 
 - [Primitives and pipeline](primitives-and-pipeline.md) — when to use `winml build`
   vs individual primitive commands
+- [Config Schema](../reference/index.md) — full field-by-field config reference
 - [winml config command reference](../commands/config.md)
 - [winml build command reference](../commands/build.md)
diff --git a/docs/getting-started/end-to-end.md b/docs/getting-started/end-to-end.md
index c60de322b..2b1d36fa0 100644
--- a/docs/getting-started/end-to-end.md
+++ b/docs/getting-started/end-to-end.md
@@ -155,6 +155,12 @@ artifact path from the build output to get the exact name.
     Throughput: 80.45 samples/sec
     ```
 
+## See also
+
+- [Config Schema](../reference/index.md) — full field-by-field config reference
+- [Output Layout](../reference/output-layout.md) — what each output file contains
+- [How winml-cli Works](../concepts/how-it-works.md) — pipeline overview
+
 === "CPU"
 
     ```text
diff --git a/docs/samples/clip-composite.md b/docs/samples/clip-composite.md
index f0d79015a..04b12e610 100644
--- a/docs/samples/clip-composite.md
+++ b/docs/samples/clip-composite.md
@@ -157,4 +157,5 @@ The same composite model pattern is used for:
 
 - [BERT — Config + Build + Perf](bert-config-build.md) — single-model workflow
 - [ConvNeXt — Primitive commands](convnext-primitives.md) — step-by-step pipeline
+- [Supported Models](../reference/supported-models.md) — full list of validated architectures
 - [Config and build](../concepts/config-and-build.md) — concept overview

From 5fd701f84e2365d148e2793272f352fb20cc21a9 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 17:14:11 +0800
Subject: [PATCH 104/143] docs: fix cross-references - remove Output Layout
 from export, add Supported Models to build/config/export

---
 docs/commands/build.md  | 1 +
 docs/commands/config.md | 1 +
 docs/commands/export.md | 2 +-
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/commands/build.md b/docs/commands/build.md
index 89039bc5e..4a5b79cc3 100644
--- a/docs/commands/build.md
+++ b/docs/commands/build.md
@@ -113,3 +113,4 @@ winml build -c config.json -m microsoft/resnet-50 \
 - [How it works](../concepts/how-it-works.md)
 - [Config Schema](../reference/index.md) — full field-by-field config reference
 - [Output Layout](../reference/output-layout.md) — what each output file contains
+- [Supported Models](../reference/supported-models.md) — validated model architectures
diff --git a/docs/commands/config.md b/docs/commands/config.md
index 2c3cadf73..0226b193a 100644
--- a/docs/commands/config.md
+++ b/docs/commands/config.md
@@ -93,6 +93,7 @@ $ winml config -m facebook/convnext-tiny-224.onnx --no-quant -o convnext_optim_o
 
 - [Config and build](../concepts/config-and-build.md) — structure of `WinMLBuildConfig` and how stages interact
 - [Config Schema](../reference/index.md) — full field-by-field config reference
+- [Supported Models](../reference/supported-models.md) — validated model architectures
 - [build.md](build.md) — run the full pipeline using a generated config
 - [export.md](export.md) — export a HuggingFace model to ONNX as a standalone step
 - [optimize.md](optimize.md) — apply graph optimizations to an existing ONNX file
diff --git a/docs/commands/export.md b/docs/commands/export.md
index 2a16473df..88e936ef8 100644
--- a/docs/commands/export.md
+++ b/docs/commands/export.md
@@ -83,8 +83,8 @@ winml export -m microsoft/resnet-50 -o resnet50_clean.onnx --clean-onnx
 
 ## See also
 
-- [Output Layout](../reference/output-layout.md) — what each output file contains
 - [winml optimize](optimize.md) — the next pipeline stage after export
+- [Supported Models](../reference/supported-models.md) — full list of validated architectures
 - [Load and export concept](../concepts/load-and-export.md) — details on the export process
 
 ## Common pitfalls

From 32351f875aba84c9d065c5f9fc2db3b688778984 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 17:28:47 +0800
Subject: [PATCH 105/143] docs: update CLI flags for --flag/--no-flag pairs
 (per PR #844)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Boolean is_flag options now show both --flag/--no-flag forms
- Negative-only flags (--no-quant) → --quant/--no-quant with default true
- --clean-onnx deprecated in favor of --no-hierarchy
- Updated flag tables, defaults, and descriptions across all command pages
---
 docs/commands/build.md              | 13 +++++++------
 docs/commands/compile.md            |  2 +-
 docs/commands/config.md             |  4 ++--
 docs/commands/eval.md               |  2 +-
 docs/commands/export.md             | 12 +++++++-----
 docs/commands/inspect.md            |  2 +-
 docs/commands/perf.md               |  8 ++++----
 docs/commands/quantize.md           |  4 ++--
 docs/concepts/load-and-export.md    |  2 +-
 docs/samples/convnext-primitives.md |  2 +-
 docs/tutorials/npu-convnext.md      |  2 +-
 11 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/docs/commands/build.md b/docs/commands/build.md
index 4a5b79cc3..39cd6a8c9 100644
--- a/docs/commands/build.md
+++ b/docs/commands/build.md
@@ -23,16 +23,17 @@ $ winml build [options]
 | `--config` | `-c` | path | `None` | `WinMLBuildConfig` JSON file, generated by `winml config`. If omitted, config is auto-generated from `-m`. |
 | `--model` | `-m` | string | `None` | Hugging Face model ID or path to an existing `.onnx` file. |
 | `--output-dir` | `-o` | path | `None` | Directory for all build artifacts. Mutually exclusive with `--use-cache`. |
-| `--use-cache` | | flag | `false` | Store artifacts in the winml-cli global cache (`~/.cache/winml/`). Mutually exclusive with `--output-dir`. |
-| `--rebuild` | | flag | `false` | Overwrite existing artifacts and re-run the full pipeline. |
-| `--no-quant` | | flag | `false` | Skip the quantization stage, overriding the config. |
+| `--use-cache/--no-use-cache` | | flag | `false` | Store artifacts in the winml-cli global cache (`~/.cache/winml/`). Mutually exclusive with `--output-dir`. |
+| `--rebuild/--no-rebuild` | | flag | `false` | Overwrite existing artifacts and re-run the full pipeline. |
+| `--quant/--no-quant` | | flag | `true` | Run the quantization stage (use `--no-quant` to skip), overriding the config. |
 | `--no-compile` / `--compile` | | flag | `None` | Override compilation. `--compile` forces enable (config must have a compile section). `--no-compile` forces skip. Default: inherit from config. |
-| `--no-optimize` | | flag | `false` | Skip the optimization stage (for pre-quantized ONNX input models). |
+| `--optimize/--no-optimize` | | flag | `true` | Run the optimization stage (use `--no-optimize` to skip). |
 | `--ep` | | string | `None` | Target execution provider for the analyzer (e.g., `qnn`). Falls back to the compile config EP if not set. |
 | `--device` | `-d` | string | `auto` | Target device for the analyzer (e.g., `npu`, `gpu`). Default: `auto` (auto-detect). |
-| `--no-analyze` | | flag | `false` | Skip the analyzer loop during build. |
+| `--analyze/--no-analyze` | | flag | `true` | Run the analyzer loop during build (use `--no-analyze` to skip). |
 | `--max-optim-iterations` | | integer | `None` | Maximum autoconf re-optimization rounds (3 enforced internally when not set). `--no-analyze` implicitly sets this to 0. |
-| `--trust-remote-code` | | flag | `false` | Allow executing custom code from model repositories. Use only with trusted sources. |
+| `--trust-remote-code/--no-trust-remote-code` | | flag | `false` | Allow executing custom code from model repositories. Use only with trusted sources. |
+| `--allow-unsupported-nodes/--no-allow-unsupported-nodes` | | flag | `false` | Allow unsupported nodes to remain in the graph instead of failing the build. |
 | `--help` | `-h` | flag | | Show this message and exit. |
 
 ## How it works
diff --git a/docs/commands/compile.md b/docs/commands/compile.md
index 6c3d7dc9d..ca69a5267 100644
--- a/docs/commands/compile.md
+++ b/docs/commands/compile.md
@@ -25,7 +25,7 @@ $ winml compile [options]
 | `--no-validate` | | flag | `false` | Skip validation of the compiled model after compilation. |
 | `--compiler` | | choice | `ort` | Compiler backend: `ort` (ONNX Runtime) or `qairt` (Qualcomm AI Runtime Tools). |
 | `--qnn-sdk-root` | | path | `None` | Path to the QNN SDK root directory. |
-| `--embed` | | flag | `false` | Embed the EP context blob inside the ONNX file instead of writing a separate `.bin` file. |
+| `--embed/--no-embed` | | flag | `false` | Embed the EP context blob inside the ONNX file instead of writing a separate `.bin` file. |
 | `--list` | | flag | `false` | List available compiler backends for the selected device and exit without compiling. |
 | `--help` | `-h` | flag | | Show this message and exit. |
 
diff --git a/docs/commands/config.md b/docs/commands/config.md
index 0226b193a..63c5bdf0e 100644
--- a/docs/commands/config.md
+++ b/docs/commands/config.md
@@ -28,9 +28,9 @@ $ winml config [options]
 | `--precision` | `-p` | `TEXT` | `auto` | Target precision: `auto`, `fp32`, `fp16`, `int8`, `int16`, or a mixed format such as `w8a16`. `auto` selects the precision based on the chosen device. |
 | `--output` | `-o` | `PATH` | *(stdout)* | Write the generated JSON to this file instead of printing to stdout. |
 | `--library` | | `TEXT` | `transformers` | Source library for `TasksManager` task lookup. Defaults to `transformers`; set to `diffusers` or another Optimum-supported library when needed. |
-| `--no-quant` | | flag | off | Omit quantization from the generated config (sets `quant` to `null`). Equivalent to removing the `quant` section before passing to `winml build`. |
+| `--quant/--no-quant` | | flag | `true` | Include quantization in the generated config (use `--no-quant` to omit it and set `quant` to `null`). |
 | `--no-compile` / `--compile` | | flag | `--no-compile` (compile excluded by default) | Controls whether compilation is included in the generated config. By default compilation is **excluded** (`compile: null`). Pass `--compile` to include a compile section. |
-| `--trust-remote-code` | | flag | off | Allow execution of custom model code from the HuggingFace repository. Required for some community models. Only enable for repositories you trust. |
+| `--trust-remote-code/--no-trust-remote-code` | | flag | `false` | Allow execution of custom model code from the HuggingFace repository. Required for some community models. Only enable for repositories you trust. |
 
 ## How it works
 
diff --git a/docs/commands/eval.md b/docs/commands/eval.md
index b27da4b0c..9615ad102 100644
--- a/docs/commands/eval.md
+++ b/docs/commands/eval.md
@@ -25,7 +25,7 @@ $ winml eval [options]
 | `--samples` | | `INTEGER` | `100` | Number of dataset samples to evaluate. |
 | `--split` | | `TEXT` | `validation` | Dataset split to use (e.g., `validation`, `test`, `train`). |
 | `--shuffle / --no-shuffle` | | flag | `shuffle` | Shuffle the dataset before sampling. Disable with `--no-shuffle` for reproducible sample ordering. |
-| `--streaming` | | flag | `false` | Stream the dataset from the Hub instead of downloading the full split. Useful for large datasets. |
+| `--streaming/--no-streaming` | | flag | `false` | Stream the dataset from the Hub instead of downloading the full split. Useful for large datasets. |
 | `--column` | | `TEXT` (multiple) | — | Column mapping as `key=value` pairs (e.g., `--column input_column=image`). Can be specified multiple times. |
 | `--label-mapping` | | `PATH` | — | Path to a JSON file mapping label names to integer IDs: `{"label_name": id}`. |
 | `--output` | `-o` | `PATH` | — | Output JSON file path for the evaluation results. |
diff --git a/docs/commands/export.md b/docs/commands/export.md
index 88e936ef8..a18bdebf1 100644
--- a/docs/commands/export.md
+++ b/docs/commands/export.md
@@ -20,14 +20,16 @@ $ winml export [options]
 |---|---|---|---|---|
 | `--model` | `-m` | string | *(required)* | Hugging Face model name or local path (e.g., `prajjwal1/bert-tiny`). |
 | `--output` | `-o` | path | *(required)* | Output ONNX file path (e.g., `model.onnx`). |
-| `--with-report` | | flag | `false` | Generate full export reports: Markdown, JSON, and a console tree. |
-| `--clean-onnx` / `--no-hierarchy` | | flag | `false` | Skip embedding `hierarchy_tag` metadata in ONNX nodes, producing a clean ONNX file. |
-| `--dynamo` | | flag | `false` | Enable PyTorch 2.9+ dynamo export for richer node metadata. (Experimental — currently logs a warning.) |
+| `--with-report/--no-with-report` | | flag | `false` | Generate full export reports: Markdown, JSON, and a console tree. |
+| `--hierarchy/--no-hierarchy` | | flag | `true` | Preserve `hierarchy_tag` metadata in ONNX nodes (use `--no-hierarchy` for a clean ONNX file). |
+| `--dynamo/--no-dynamo` | | flag | `false` | Enable PyTorch 2.9+ dynamo export for richer node metadata. (Experimental — currently logs a warning.) |
 | `--torch-module` | | string | `None` | Comma-separated list of `torch.nn` module types to include in hierarchy (e.g., `LayerNorm,Embedding`). (Experimental — currently logs a warning.) |
 | `--input-specs` | | path | `None` | JSON file with explicit input tensor specifications. Auto-generated when omitted. |
 | `--task` | `-t` | string | `None` | Override auto-detected Hugging Face task (e.g., `image-feature-extraction`). |
 | `--export-config` | | path | `None` | JSON file with ONNX export parameters such as `opset_version` and `do_constant_folding`. |
 | `--shape-config` | | path | `None` | JSON object mapping symbolic dimension names to concrete sizes (e.g., `{"sequence_length": 2048}`). Ignored when `--input-specs` is provided. |
+| `--trust-remote-code/--no-trust-remote-code` | | flag | `false` | Allow executing custom code from model repositories during export. Use only with trusted sources. |
+| `--allow-unsupported-nodes/--no-allow-unsupported-nodes` | | flag | `false` | Allow unsupported nodes to remain in the exported graph instead of failing export. |
 | `--help` | `-h` | flag | | Show this message and exit. |
 
 ## How it works
@@ -38,7 +40,7 @@ generation, module-hierarchy tracing, TorchScript ONNX export, node-tagger
 creation, per-node tagging, tag injection into ONNX `metadata_props`, and
 optional report generation. The hierarchy metadata allows downstream tools to
 reason about operators grouped by their originating module rather than flat
-graph position. When `--clean-onnx` is specified, hierarchy steps are bypassed
+graph position. When `--no-hierarchy` is specified, hierarchy steps are bypassed
 and a bare ONNX file is written, useful for third-party tools that do not
 understand custom metadata.
 
@@ -78,7 +80,7 @@ winml export -m bert-base-uncased -o bert.onnx --input-specs inputs.json
 
 ```bash
 # Produce clean ONNX without hierarchy metadata (for third-party optimizers)
-winml export -m microsoft/resnet-50 -o resnet50_clean.onnx --clean-onnx
+winml export -m microsoft/resnet-50 -o resnet50_clean.onnx --no-hierarchy
 ```
 
 ## See also
diff --git a/docs/commands/inspect.md b/docs/commands/inspect.md
index 8f92de579..4877ee2a9 100644
--- a/docs/commands/inspect.md
+++ b/docs/commands/inspect.md
@@ -22,7 +22,7 @@ $ winml inspect -m <model_id> [options]
 | `--model` | `-m` | string | **required** | HuggingFace model ID (e.g. `openai/clip-vit-base-patch32`). Required unless `--list-tasks` or `--help` is used. |
 | `--format` | `-f` | `table` \| `json` | `table` | Output format. `table` renders rich panels; `json` emits a machine-readable object. |
 | `--task` | `-t` | string | `null` | Override the auto-detected task (e.g. `image-classification`, `feature-extraction`). |
-| `--hierarchy` | `-H` | flag | `false` | Print the PyTorch module tree. Instantiates the model with random weights — no weight download required. |
+| `--hierarchy/--no-hierarchy` | `-H/-N` | flag | `false` | Print the PyTorch module tree. Instantiates the model with random weights — no weight download required. |
 | `--verbose` | `-v` | flag | `false` | Show full configuration details. |
 | `--list-tasks` | | flag | `false` | List all known tasks and exit. Does not require `--model`. |
 | `--model-type` | | string | `null` | Override model type (e.g. `bert`, `resnet`). Can be used without `--model`. |
diff --git a/docs/commands/perf.md b/docs/commands/perf.md
index f88230aac..962488996 100644
--- a/docs/commands/perf.md
+++ b/docs/commands/perf.md
@@ -26,11 +26,11 @@ $ winml perf [options]
 | `--output` | `-o` | `PATH` | `~/.cache/winml/perf/<slug>/<timestamp>.json` | Output JSON file path for the benchmark report. |
 | `--batch-size` | | `INTEGER` | `1` | Batch size used when generating synthetic input tensors. |
 | `--shape-config` | | `PATH` | — | Path to a JSON file containing shape overrides (e.g., `{"height": 480, "width": 480}`). Ignored for pre-exported ONNX files and in `--module` mode. |
-| `--no-quantize` | | flag | `false` | Skip quantization during model build. Useful for measuring the fp32 baseline. |
-| `--rebuild` | | flag | `false` | Force model rebuild even if a cached artifact already exists. |
-| `--ignore-cache` | | flag | `false` | Build from scratch in a temporary folder and discard the artifact after benchmarking. Implies `--rebuild`. |
+| `--quantize/--no-quantize` | | flag | `true` | Run quantization during model build (use `--no-quantize` to skip it). Useful for measuring the fp32 baseline. |
+| `--rebuild/--no-rebuild` | | flag | `false` | Force model rebuild even if a cached artifact already exists. |
+| `--ignore-cache/--no-ignore-cache` | | flag | `false` | Build from scratch in a temporary folder and discard the artifact after benchmarking. Implies `--rebuild`. |
 | `--module` | | `TEXT` | — | PyTorch module class name for per-module benchmarking (e.g., `BertAttention`). Builds and times each matching instance separately. See [Load and export](../concepts/load-and-export.md). |
-| `--monitor` | | flag | `false` | Show a live NPU/CPU utilization chart while the benchmark runs and include hardware metrics in the JSON report. |
+| `--monitor/--no-monitor` | | flag | `false` | Show a live NPU/CPU utilization chart while the benchmark runs and include hardware metrics in the JSON report. |
 
 ## How it works
 
diff --git a/docs/commands/quantize.md b/docs/commands/quantize.md
index cfa9c5e1f..046723b0e 100644
--- a/docs/commands/quantize.md
+++ b/docs/commands/quantize.md
@@ -26,8 +26,8 @@ $ winml quantize [options]
 | `--method` | | choice | `minmax` | Calibration algorithm: `minmax`, `entropy`, or `percentile`. |
 | `--weight-type` | | choice | `None` | Per-tensor type for weights: `uint8`, `int8`, `uint16`, or `int16`. Overrides `--precision`. |
 | `--activation-type` | | choice | `None` | Per-tensor type for activations: `uint8`, `int8`, `uint16`, or `int16`. Overrides `--precision`. |
-| `--per-channel` | | flag | `false` | Apply per-channel (rather than per-tensor) quantization to weight tensors. |
-| `--symmetric` | | flag | `false` | Use symmetric quantization (zero-point fixed at 0). |
+| `--per-channel/--no-per-channel` | | flag | `false` | Apply per-channel (rather than per-tensor) quantization to weight tensors. |
+| `--symmetric/--no-symmetric` | | flag | `false` | Use symmetric quantization (zero-point fixed at 0). |
 | `--help` | `-h` | flag | | Show this message and exit. |
 
 ## How it works
diff --git a/docs/concepts/load-and-export.md b/docs/concepts/load-and-export.md
index cd28c3a79..fa72e5931 100644
--- a/docs/concepts/load-and-export.md
+++ b/docs/concepts/load-and-export.md
@@ -20,7 +20,7 @@ Some community models host custom Python code in their repositories. The loader
 
 By default the exporter runs an eight-step process that includes hierarchy tracing and tag injection. Every ONNX node carries a `winml.hierarchy.tag` metadata entry recording the PyTorch module path it came from (e.g. `/BertModel/BertEncoder/BertLayer.3/BertAttention`), plus a companion `winml.hierarchy.depth` integer. The model itself also carries `winml.io.inputs` and `winml.io.outputs` JSON metadata describing the I/O tensor specs. Together these power per-module benchmarking with `winml perf --module`, inspector views with `winml inspect --hierarchy`, and optimizer scoping.
 
-If you need a clean, standard-compliant ONNX without custom metadata — to hand off to a third-party tool, for example — pass `--no-hierarchy` (alias `--clean-onnx`). The graph behaviour is unchanged, but hierarchy-dependent features will not work against that file.
+If you need a clean, standard-compliant ONNX without custom metadata — to hand off to a third-party tool, for example — pass `--no-hierarchy`. (The old `--clean-onnx` spelling remains as a deprecated hidden alias.) The graph behaviour is unchanged, but hierarchy-dependent features will not work against that file.
 
 Use `--with-report` to generate companion markdown and JSON reports alongside the output.
 
diff --git a/docs/samples/convnext-primitives.md b/docs/samples/convnext-primitives.md
index 5093dfaa9..5cb0f2ea9 100644
--- a/docs/samples/convnext-primitives.md
+++ b/docs/samples/convnext-primitives.md
@@ -62,7 +62,7 @@ Success! Model exported to: convnext.onnx
 ```
 
 !!! note "Hierarchy metadata"
-    By default `winml export` embeds `hierarchy_tag` metadata in each ONNX node, recording which PyTorch module the node originated from. This lets downstream tools like `winml perf --module` and `winml analyze` reason about operator groups rather than flat graph positions. To skip the metadata and produce a clean ONNX file, add `--clean-onnx`. For more detail see [Load and export](../concepts/load-and-export.md).
+    By default `winml export` embeds `hierarchy_tag` metadata in each ONNX node, recording which PyTorch module the node originated from. This lets downstream tools like `winml perf --module` and `winml analyze` reason about operator groups rather than flat graph positions. To skip the metadata and produce a clean ONNX file, add `--no-hierarchy`. For more detail see [Load and export](../concepts/load-and-export.md).
 
 ## Step 4: Quantize
 
diff --git a/docs/tutorials/npu-convnext.md b/docs/tutorials/npu-convnext.md
index 0e4d46a1f..ebbc8b7e1 100644
--- a/docs/tutorials/npu-convnext.md
+++ b/docs/tutorials/npu-convnext.md
@@ -81,7 +81,7 @@ uv run winml export -m facebook/convnext-tiny-224 -o convnext.onnx
 This runs an eight-stage export pipeline: model preparation, input generation, hierarchy building, ONNX conversion, node tagging, tag injection, and metadata generation. The result is a standards-compliant ONNX file with winml-cli's Hierarchy-preserving Tags Protocol (HTP) metadata embedded in node `metadata_props`. That metadata is what lets downstream tools make architecture-aware optimization decisions without hardcoded model knowledge.
 
 !!! note "What we just did"
-    The default export embeds hierarchy tags — a tree of source module names mapped onto ONNX nodes — so that the optimizer and analyzer can reason about the graph in terms of the original model structure rather than flat node lists. If you need a clean ONNX without that metadata (for compatibility with other tools), add `--clean-onnx`. See [Concepts → Load and export](../concepts/load-and-export.md) for what hierarchy preservation adds and when it matters.
+    The default export embeds hierarchy tags — a tree of source module names mapped onto ONNX nodes — so that the optimizer and analyzer can reason about the graph in terms of the original model structure rather than flat node lists. If you need a clean ONNX without that metadata (for compatibility with other tools), add `--no-hierarchy`. See [Concepts → Load and export](../concepts/load-and-export.md) for what hierarchy preservation adds and when it matters.
 
 ---
 

From 5c2471149f02e61b1021c799b6bf57816519e9e0 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 17:35:33 +0800
Subject: [PATCH 106/143] docs: fix catalog flag short forms (-t is for --task,
 not --model-type)

---
 docs/commands/catalog.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/commands/catalog.md b/docs/commands/catalog.md
index fd9b00468..12cd64c0c 100644
--- a/docs/commands/catalog.md
+++ b/docs/commands/catalog.md
@@ -19,8 +19,8 @@ $ winml catalog [options]
 
 | Flag | Short | Type | Default | Description |
 |------|-------|------|---------|-------------|
-| `--model-type` | `-t` | string | `null` | Filter the catalog by model architecture (case-insensitive). Examples: `bert`, `roberta`, `vit`. |
-| `--task` | `-k` | string | `null` | Filter by HuggingFace task (case-insensitive). Examples: `text-classification`, `image-segmentation`. |
+| `--model-type` | | string | `null` | Filter the catalog by model architecture (case-insensitive). Examples: `bert`, `roberta`, `vit`. |
+| `--task` | `-t` | string | `null` | Filter by HuggingFace task (case-insensitive). Examples: `text-classification`, `image-segmentation`. |
 | `--ep` | | string | `null` | Filter by execution provider (e.g., `qnn`, `dml`). If not specified, shows all EPs. |
 | `--device` | | string | `null` | Filter by target device (e.g., `npu`, `gpu`). If not specified, shows all devices. |
 | `--output` | `-o` | path | `null` | Save the displayed results to a JSON file. |

From 61123688cafad487c5a85c4e9199713415b57369 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 17:42:10 +0800
Subject: [PATCH 107/143] =?UTF-8?q?docs:=20remove=20incorrect=20-k=20pitfa?=
 =?UTF-8?q?ll=20and=20fix=20-k=20=E2=86=92=20-t=20for=20--task?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 docs/commands/catalog.md           | 3 ---
 docs/reference/supported-models.md | 2 +-
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/docs/commands/catalog.md b/docs/commands/catalog.md
index 12cd64c0c..2717aaa2a 100644
--- a/docs/commands/catalog.md
+++ b/docs/commands/catalog.md
@@ -78,9 +78,6 @@ $ winml catalog --task image-classification --output results/image_catalog.json
 
 ## Common pitfalls
 
-- **`--task` short flag is `-k`, not `-t`.** The `-t` short flag is taken by
-  `--model-type`. Using `-t text-classification` will set the architecture filter,
-  not the task filter. Use `-k` or the full `--task` flag.
 - **The catalog reflects a point-in-time snapshot.** Models listed in the catalog
   were validated against a specific version of winml-cli, ONNX Runtime, and the
   relevant EP driver. Accuracy and latency may differ on your hardware or with
diff --git a/docs/reference/supported-models.md b/docs/reference/supported-models.md
index 95215cc32..8b5cba023 100644
--- a/docs/reference/supported-models.md
+++ b/docs/reference/supported-models.md
@@ -12,7 +12,7 @@ lists what's validated and how to discover model support.
 uv run winml catalog
 
 # Filter by task
-uv run winml catalog -k image-classification
+uv run winml catalog -t image-classification
 
 # Check if a specific model is supported
 uv run winml inspect -m microsoft/resnet-50

From 141181c698ed3d21837d659cbde7f78638e2d07b Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 17:58:23 +0800
Subject: [PATCH 108/143] docs: add winml analyze to primitives-and-pipeline
 page

---
 docs/concepts/primitives-and-pipeline.md | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/docs/concepts/primitives-and-pipeline.md b/docs/concepts/primitives-and-pipeline.md
index a14ffbaad..ffff606e1 100644
--- a/docs/concepts/primitives-and-pipeline.md
+++ b/docs/concepts/primitives-and-pipeline.md
@@ -2,10 +2,11 @@
 
 winml-cli exposes two ways to turn a Hugging Face model or ONNX file into a
 Windows ML-ready artifact. You can invoke each stage of the pipeline as an
-individual primitive command — `winml export`, `winml optimize`, `winml quantize`,
-`winml compile`, `winml perf`, `winml eval` — running one step at a time with
-full control over inputs and outputs. Alternatively, `winml build` wraps all of
-those stages into a single command driven by a `WinMLBuildConfig` JSON file.
+individual primitive command — `winml export`, `winml analyze`, `winml optimize`,
+`winml quantize`, `winml compile`, `winml perf`, `winml eval` — running one step
+at a time with full control over inputs and outputs. Alternatively, `winml build`
+wraps all of those stages into a single command driven by a `WinMLBuildConfig`
+JSON file.
 
 Understanding when to reach for a primitive versus the pipeline wrapper is the
 central workflow decision in winml-cli. Both paths produce the same artifacts;
@@ -21,6 +22,9 @@ file that the next stage consumes:
 - **`winml export`** — loads a Hugging Face model, traces it with PyTorch and the
   Optimum exporter, and writes a portable float32 ONNX file with no EP-specific
   nodes.
+- **`winml analyze`** — runs compatibility and runtime checks on the exported ONNX
+  graph, detecting unsupported operators, QDQ issues, and device-specific
+  constraints before further pipeline stages.
 - **`winml optimize`** — applies graph transformations (operator fusion, constant
   folding, graph pruning) and runs an autoconf loop to maximize EP-compatible
   coverage.
@@ -56,8 +60,8 @@ achieves the same effect at runtime without editing the file.
 
 When the model argument points to an existing ONNX file instead of a Hugging Face
 ID, `winml build` detects this and skips the export stage, running
-optimize → quantize → compile directly. This mirrors how each primitive command
-handles the same case.
+analyze → optimize → quantize → compile directly. This mirrors how each primitive
+command handles the same case.
 
 `winml build` also accepts `--use-cache` in place of `-o`/`--output-dir`, routing
 artifacts to the winml-cli global cache at `~/.cache/winml/` instead of a local

From 018adee8ce0469b9899efc2dd41e4bd825a6caf8 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 18:05:42 +0800
Subject: [PATCH 109/143] docs: fix EP context description - not QNN-family,
 separate vendors

---
 docs/concepts/compile-and-epcontext.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/concepts/compile-and-epcontext.md b/docs/concepts/compile-and-epcontext.md
index da2288f33..f5dda50b5 100644
--- a/docs/concepts/compile-and-epcontext.md
+++ b/docs/concepts/compile-and-epcontext.md
@@ -8,7 +8,7 @@ Compilation is an offline, one-time step. The artifact it creates is what you sh
 
 For EPs that are fully integrated into ONNX Runtime — CPU, DirectML, and similar providers — the compile step writes a new `.onnx` file that the runtime loads directly. The ONNX graph has been prepared and, in some cases, partitioned so that the EP's session initializer has less work to do when the application starts.
 
-For QNN-family EPs (the `--ep qnn` and `--ep vitisai` targets used for NPU inference), the compiler goes further. QNN takes the ONNX graph and produces a binary artifact — the **EP context blob** — that encodes the fully compiled, hardware-ready version of the network. This blob is then associated with the ONNX model file. On subsequent loads, the QNN EP reads the blob rather than re-compiling the graph, which makes session creation dramatically faster.
+For EPs that support ahead-of-time compilation (e.g. `--ep qnn` for Qualcomm NPUs and `--ep vitisai` for AMD NPUs), the compiler goes further. It takes the ONNX graph and produces a binary artifact — the **EP context blob** — that encodes the fully compiled, hardware-ready version of the network. This blob is then associated with the ONNX model file. On subsequent loads, the EP reads the blob rather than re-compiling the graph, which makes session creation dramatically faster.
 
 The default compiler backend is `ort` (ONNX Runtime).
 

From 2a4e34cab9be4de8484a1933935de4c58bd6c698 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 18:11:45 +0800
Subject: [PATCH 110/143] docs: clarify ort compiler backend uses ORT built-in
 EP context

---
 docs/concepts/compile-and-epcontext.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/concepts/compile-and-epcontext.md b/docs/concepts/compile-and-epcontext.md
index f5dda50b5..e8ad9523b 100644
--- a/docs/concepts/compile-and-epcontext.md
+++ b/docs/concepts/compile-and-epcontext.md
@@ -10,7 +10,7 @@ For EPs that are fully integrated into ONNX Runtime — CPU, DirectML, and simil
 
 For EPs that support ahead-of-time compilation (e.g. `--ep qnn` for Qualcomm NPUs and `--ep vitisai` for AMD NPUs), the compiler goes further. It takes the ONNX graph and produces a binary artifact — the **EP context blob** — that encodes the fully compiled, hardware-ready version of the network. This blob is then associated with the ONNX model file. On subsequent loads, the EP reads the blob rather than re-compiling the graph, which makes session creation dramatically faster.
 
-The default compiler backend is `ort` (ONNX Runtime).
+The default compiler backend is `ort` — it uses ONNX Runtime's built-in EP context generation to produce the compiled blob, with no external SDK required.
 
 ## Embedded vs external EPContext
 

From c2346e3a7c7f282e38f348c1a7a0274dd8a2eb95 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 18:12:29 +0800
Subject: [PATCH 111/143] Revert "docs: clarify ort compiler backend uses ORT
 built-in EP context"

This reverts commit 2a4e34cab9be4de8484a1933935de4c58bd6c698.
---
 docs/concepts/compile-and-epcontext.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/concepts/compile-and-epcontext.md b/docs/concepts/compile-and-epcontext.md
index e8ad9523b..f5dda50b5 100644
--- a/docs/concepts/compile-and-epcontext.md
+++ b/docs/concepts/compile-and-epcontext.md
@@ -10,7 +10,7 @@ For EPs that are fully integrated into ONNX Runtime — CPU, DirectML, and simil
 
 For EPs that support ahead-of-time compilation (e.g. `--ep qnn` for Qualcomm NPUs and `--ep vitisai` for AMD NPUs), the compiler goes further. It takes the ONNX graph and produces a binary artifact — the **EP context blob** — that encodes the fully compiled, hardware-ready version of the network. This blob is then associated with the ONNX model file. On subsequent loads, the EP reads the blob rather than re-compiling the graph, which makes session creation dramatically faster.
 
-The default compiler backend is `ort` — it uses ONNX Runtime's built-in EP context generation to produce the compiled blob, with no external SDK required.
+The default compiler backend is `ort` (ONNX Runtime).
 
 ## Embedded vs external EPContext
 

From 7f781a039be6bdfe0a404171f462e1ad30c851df Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 18:54:48 +0800
Subject: [PATCH 112/143] docs: align tutorials, fix nav categorization,
 simplify EP tables

- Move analyze/optimize to Build group in nav and command map
- Align build-from-onnx Section A/B with npu-convnext structure
- Add tabbed compile/benchmark examples in build-from-onnx
- Simplify npu-convnext compile tabs (Qualcomm/Intel/AMD/CPU)
- Mark config as optional in build sections, remove hardcoded -c
- Change EPs Tested to All EPs in supported-models
- Clarify config schema is accepted by all pipeline commands
- Fix EP context description (not QNN-family, separate vendors)
---
 docs/commands/overview.md          |   4 +-
 docs/reference/index.md            |   9 +-
 docs/reference/supported-models.md |  40 +++----
 docs/tutorials/build-from-onnx.md  | 172 ++++++++++++-----------------
 docs/tutorials/npu-convnext.md     |  33 ++++--
 mkdocs.yml                         |   4 +-
 6 files changed, 126 insertions(+), 136 deletions(-)

diff --git a/docs/commands/overview.md b/docs/commands/overview.md
index e867e1622..473a5bb73 100644
--- a/docs/commands/overview.md
+++ b/docs/commands/overview.md
@@ -26,10 +26,10 @@ measure speed and accuracy.
 | [`sys`](sys.md) | Discover | Inspect your machine — devices, EPs, and runtime versions at a glance. |
 | [`inspect`](inspect.md) | Discover | Inspect a model's tasks, classes, and hierarchy before committing to an export. |
 | [`catalog`](catalog.md) | Discover | Browse the curated winml-cli catalog of validated models and benchmarks. |
-| [`analyze`](analyze.md) | Discover | Verify an ONNX model is compatible with a target execution provider before deployment. |
 | [`config`](config.md) | Configure | Generate a reusable build configuration for a Hugging Face model or ONNX file. |
-| [`optimize`](optimize.md) | Configure | Apply graph optimizations and fusions to an ONNX model to reduce node count and improve inference speed. |
 | [`export`](export.md) | Build | Convert a PyTorch / Hugging Face model to ONNX, preserving module hierarchy. |
+| [`analyze`](analyze.md) | Build | Verify an ONNX model is compatible with a target execution provider before deployment. |
+| [`optimize`](optimize.md) | Build | Apply graph optimizations and fusions to an ONNX model to reduce node count and improve inference speed. |
 | [`quantize`](quantize.md) | Build | Quantize an ONNX model with QDQ insertion and calibration-based scaling. |
 | [`compile`](compile.md) | Build | Compile an ONNX model to an EP-specific format for fast runtime loading. |
 | [`build`](build.md) | Build | Run the entire winml-cli pipeline (export → optimize → quantize → compile) in one command. |
diff --git a/docs/reference/index.md b/docs/reference/index.md
index 9c98c2751..3c57085b3 100644
--- a/docs/reference/index.md
+++ b/docs/reference/index.md
@@ -1,8 +1,13 @@
 # Reference — Config Schema
 
 This page documents the full schema for `WinMLBuildConfig`, the JSON configuration
-file that drives `winml build` and related commands. Generate a config with
-`winml config`, then customize it before feeding it to `winml build -c config.json`.
+file that drives the winml-cli pipeline. Generate a config with
+`winml config`, then pass it to any command with `-c config.json`.
+
+The config is accepted by **all pipeline commands** — not just `winml build`. For
+example, `winml export -c config.json`, `winml quantize -c config.json`, and
+`winml compile -c config.json` each read the relevant section of the same config
+file. This lets you use a single config as the source of truth across all stages.
 
 ## Top-Level Structure
 
diff --git a/docs/reference/supported-models.md b/docs/reference/supported-models.md
index 8b5cba023..add514ea3 100644
--- a/docs/reference/supported-models.md
+++ b/docs/reference/supported-models.md
@@ -75,47 +75,47 @@ testing. Use `winml catalog` to browse the full list interactively.
 
 ### Image Classification
 
-| Model | Architecture | EPs Tested |
+| Model | Architecture | EPs |
 |-------|-------------|------------|
-| `microsoft/resnet-50` | ResNet | CPU, QNN (GPU/NPU), OpenVINO |
-| `facebook/convnext-tiny-224` | ConvNeXt | CPU, QNN (GPU/NPU), OpenVINO |
-| `google/vit-base-patch16-224` | ViT | CPU, QNN (GPU/NPU), OpenVINO |
+| `microsoft/resnet-50` | ResNet | All EPs |
+| `facebook/convnext-tiny-224` | ConvNeXt | All EPs |
+| `google/vit-base-patch16-224` | ViT | All EPs |
 
 ### Text Classification & NLU
 
-| Model | Architecture | EPs Tested |
+| Model | Architecture | EPs |
 |-------|-------------|------------|
-| `bert-base-uncased` | BERT | CPU, QNN (GPU/NPU), OpenVINO |
-| `FacebookAI/roberta-base` | RoBERTa | CPU, QNN, OpenVINO |
-| `FacebookAI/xlm-roberta-base` | XLM-RoBERTa | CPU, QNN, OpenVINO |
+| `bert-base-uncased` | BERT | All EPs |
+| `FacebookAI/roberta-base` | RoBERTa | All EPs |
+| `FacebookAI/xlm-roberta-base` | XLM-RoBERTa | All EPs |
 
 ### Feature Extraction & Embeddings
 
-| Model | Architecture | EPs Tested |
+| Model | Architecture | EPs |
 |-------|-------------|------------|
-| `BAAI/bge-base-en-v1.5` | BERT | CPU, QNN (GPU/NPU), OpenVINO |
-| `BAAI/bge-small-en-v1.5` | BERT | CPU, QNN (GPU/NPU), OpenVINO |
-| `sentence-transformers/all-MiniLM-L6-v2` | BERT | CPU, QNN, OpenVINO |
+| `BAAI/bge-base-en-v1.5` | BERT | All EPs |
+| `BAAI/bge-small-en-v1.5` | BERT | All EPs |
+| `sentence-transformers/all-MiniLM-L6-v2` | BERT | All EPs |
 
 ### Vision-Language
 
-| Model | Architecture | EPs Tested |
+| Model | Architecture | EPs |
 |-------|-------------|------------|
-| `openai/clip-vit-base-patch32` | CLIP | CPU, QNN, OpenVINO |
-| `openai/clip-vit-large-patch14` | CLIP | CPU, QNN, OpenVINO |
+| `openai/clip-vit-base-patch32` | CLIP | All EPs |
+| `openai/clip-vit-large-patch14` | CLIP | All EPs |
 
 ### Segmentation
 
-| Model | Architecture | EPs Tested |
+| Model | Architecture | EPs |
 |-------|-------------|------------|
-| `nvidia/segformer-b0-finetuned-ade-512-512` | Segformer | CPU, QNN, OpenVINO |
-| `nvidia/segformer-b1-finetuned-cityscapes-1024-1024` | Segformer | CPU, QNN, OpenVINO |
+| `nvidia/segformer-b0-finetuned-ade-512-512` | Segformer | All EPs |
+| `nvidia/segformer-b1-finetuned-cityscapes-1024-1024` | Segformer | All EPs |
 
 ### Object Detection
 
-| Model | Architecture | EPs Tested |
+| Model | Architecture | EPs |
 |-------|-------------|------------|
-| `microsoft/table-transformer-detection` | Table-Transformer | CPU, OpenVINO |
+| `microsoft/table-transformer-detection` | Table-Transformer | All EPs |
 
 ---
 
diff --git a/docs/tutorials/build-from-onnx.md b/docs/tutorials/build-from-onnx.md
index 0fcfa55f9..69d181032 100644
--- a/docs/tutorials/build-from-onnx.md
+++ b/docs/tutorials/build-from-onnx.md
@@ -43,28 +43,9 @@ The output shows per-EP compatibility results:
 
 ```text
 ══════════════════════════════════════════════════════════════════════════
-📊 OP CHECK
+ ANALYSIS SUMMARY
 ══════════════════════════════════════════════════════════════════════════
-   📚 Model: my_model.onnx
-   🔺 Opset: 17  Producer: pytorch v2.12.0
-   📏 Operators: 122 total, 7 unique types
-   🏗️ Analysis targets: QNNExecutionProvider (NPU), QNNExecutionProvider (GPU)
-────────────────────────────────────────────────────────────────────────
-👻 EP 1: QNNExecutionProvider on NPU
-────────────────────────────────────────────────────────────────────────
- Op Type                       S/P/U/Unk
- 🃓 Conv (53)                  53/0/0/0
- 🃓 Relu (49)                  49/0/0/0
- 🃓 Add (16)                   16/0/0/0
- 🃓 MaxPool (1)                1/0/0/0
- 🃓 GlobalAveragePool (1)      1/0/0/0
- 🃓 Flatten (1)                1/0/0/0
- 🃓 Gemm (1)                   1/0/0/0
- TOTAL (122)                   122/0/0/0
-══════════════════════════════════════════════════════════════════════════
-📊 ANALYSIS SUMMARY
-══════════════════════════════════════════════════════════════════════════
-   🃓 QNNExecutionProvider (NPU): 122/0/0/0
+   QNNExecutionProvider (NPU): 122/0/0/0
       Ready to deploy
 ```
 
@@ -75,7 +56,7 @@ If the analyzer detects fusible patterns (GeLU, LayerNorm, etc.), they will appe
 
 ---
 
-### Step 2: Optimize with the generated config
+### Step 2: Optimize the graph
 
 Pass the analyzer's output config directly to the optimizer:
 
@@ -83,14 +64,11 @@ Pass the analyzer's output config directly to the optimizer:
 uv run winml optimize -m my_model.onnx -c optim_config.json -o my_model_optimized.onnx
 ```
 
-The optimizer applies the fusions specified in the config. Output:
+The optimizer applies the fusions specified in the config and reports how many nodes it reduced:
 
 ```text
 Input: my_model.onnx
 Output: my_model_optimized.onnx
-Loading model...
-Running optimizer...
-Saving optimized model...
 
 Success! Model optimized: my_model_optimized.onnx
 Nodes: 122 -> 122 (0.0% reduction)
@@ -99,12 +77,6 @@ Nodes: 122 -> 122 (0.0% reduction)
 !!! tip
     The node reduction depends on your model's architecture. Simple models like ResNet (only Conv, Relu, Add) have no fusible patterns. Transformer-based models (BERT, ViT) typically see 10–30% node reduction from GeLU, LayerNorm, and Attention fusions.
 
-To see all available optimization capabilities:
-
-```bash
-uv run winml optimize --list-capabilities
-```
-
 !!! note "What we just did"
     Graph optimization fuses multi-node patterns (like the 5-node GeLU/Erf sequence) into single high-level operators that EPs can execute more efficiently. The optimizer is purely a graph transformation — it doesn't change the model's numerical behavior or require calibration data. Running it before quantization is important: calibration should be performed on the already-fused topology, not the verbose original graph.
 
@@ -125,85 +97,96 @@ If the original analysis found fusible patterns that were optimized away, this r
 
 ---
 
-### Step 4: Benchmark the optimized model
+### Step 4 (optional): Quantize
 
-Measure the performance improvement from optimization:
+Insert QDQ (Quantize-Dequantize) nodes into the optimized graph using static calibration:
 
 ```bash
-uv run winml perf -m my_model_optimized.onnx --device cpu --warmup 5 --iterations 50
+uv run winml quantize -m my_model_optimized.onnx -o my_model_int8.onnx --precision int8 --samples 32
 ```
 
-For NPU (if you have the compiled model from a later step):
+The quantizer generates 32 random calibration samples, runs them through the model to collect activation statistics, and uses those statistics to set the quantization scale and zero-point for each tensor.
 
-```bash
-uv run winml perf -m my_model_optimized.onnx --device npu --warmup 5 --iterations 50
-```
+!!! note "What we just did"
+    `--precision int8` sets both weights and activations to 8-bit integers, which is the precision most NPU compilers expect. The output model still contains standard `QuantizeLinear` and `DequantizeLinear` ONNX nodes, so it is portable and can run on any ONNX Runtime backend. See [Concepts → Quantization and QDQ](../concepts/quantization.md) for calibration methods and per-channel options.
 
 ---
 
-### Step 5 (optional): Quantize and compile for NPU
+### Step 5 (optional): Compile for the target EP
 
-If your target is NPU deployment, continue the pipeline with quantization and compilation:
+Compilation converts the portable quantized ONNX into an EP-specific binary format that the execution provider can load directly, skipping JIT compilation at inference time:
 
-```bash
-# Quantize (INT8, QDQ format)
-uv run winml quantize -m my_model_optimized.onnx -o my_model_int8.onnx --precision int8 --samples 32
+=== "Qualcomm NPU"
 
-# Compile for NPU
-uv run winml compile -m my_model_int8.onnx --device npu
-```
+    ```bash
+    uv run winml compile -m my_model_int8.onnx --device npu --ep qnn
+    ```
 
-Then benchmark the final compiled artifact:
+=== "Intel NPU"
 
-```bash
-uv run winml perf -m my_model_int8_npu_ctx.onnx --device npu --iterations 50 --monitor
-```
+    ```bash
+    uv run winml compile -m my_model_int8.onnx --device npu --ep openvino
+    ```
+
+=== "AMD NPU"
+
+    ```bash
+    uv run winml compile -m my_model_int8.onnx --device npu --ep vitisai
+    ```
+
+=== "CPU"
+
+    ```bash
+    uv run winml compile -m my_model_int8.onnx --device cpu
+    ```
+
+!!! note "What we just did"
+    Compilation embeds EP context — the compiled binary — inside or alongside the ONNX file using the `EPContext` node convention. At inference time the runtime loads the pre-compiled binary directly rather than re-compiling from the ONNX graph. See [Concepts → Compile and EPContext](../concepts/compile-and-epcontext.md) for details.
 
 ---
 
-## Section B — One-shot with `winml build`
+### Step 6: Benchmark
 
-Once you understand the analyze → optimize → re-analyze loop (which you now do), you can let `winml build` handle everything in one command. When you pass a `.onnx` file, winml-cli auto-detects it and skips the export stage — running the optimization loop, quantization, and compilation automatically.
+Measure the performance of your model:
 
-### CPU target (optimize only)
+=== "Optimized (CPU)"
 
-```bash
-uv run winml build -m my_model.onnx -d cpu -o output/
-```
+    ```bash
+    uv run winml perf -m my_model_optimized.onnx --device cpu --warmup 5 --iterations 50
+    ```
 
-Since `-d cpu` resolves to fp16 precision (no quantization) and compilation is off by default, this just runs the analyze–optimize convergence loop:
+=== "Compiled (NPU)"
 
-```text
-output/
-├── model.onnx                     ← Deploy this
-├── my_model.onnx                  ← Copy of your input
-├── my_model_optimized.onnx        ← After graph optimization
-├── winml_build_config.json        ← Auto-generated build config
-└── analyze_result.json            ← Final analysis output
-```
+    ```bash
+    uv run winml perf -m my_model_int8_npu_ctx.onnx --device npu --iterations 50 --monitor
+    ```
 
-### NPU target (full pipeline)
+!!! note "What we just did"
+    `winml perf` generates random inputs matching the model's I/O spec, runs warmup iterations (excluded from statistics), then the benchmark iterations, and reports full latency percentiles alongside throughput. The `--monitor` flag activates live hardware utilization polling. See [Concepts → Perf and monitoring](../concepts/perf-and-monitoring.md) for details.
 
-To get a quantized, compiled model for NPU in one shot, pass `--compile`:
+---
 
-```bash
-uv run winml build -m my_model.onnx -d npu --compile -o output/
-```
+## Section B — One-shot with `winml build`
 
-Or generate a config first for more control:
+Once you understand the analyze → optimize → re-analyze loop (which you now do), you can let `winml build` handle everything in one command. When you pass a `.onnx` file, winml-cli auto-detects it and skips the export stage — running the optimization loop, quantization, and compilation automatically.
 
 ```bash
-uv run winml config --onnx my_model.onnx -d npu --precision int8 -o config.json
-uv run winml build -m my_model.onnx -c config.json -o output/
+uv run winml build -m my_model.onnx -o output/ --device npu --precision int8
 ```
 
-The pipeline runs: **analyze → optimize → (re-analyze → re-optimize if needed) → quantize → compile → model.onnx**.
+!!! tip "Config file is optional"
+    The `-c config.json` flag is optional. Without it, `winml build` auto-generates an internal config from the flags you pass (like `--device` and `--precision`). If you need a reusable config, generate one with [`winml config`](../commands/config.md):
+
+    ```bash
+    uv run winml config --onnx my_model.onnx -d npu --precision int8 -o config.json
+    uv run winml build -m my_model.onnx -c config.json -o output/
+    ```
 
-The output directory for a full NPU build looks like:
+The pipeline runs: **analyze → optimize → (re-analyze → re-optimize if needed) → quantize → compile → model.onnx**. The output directory looks like:
 
 ```text
 output/
-├── model.onnx                     ← FINAL: compiled NPU artifact
+├── model.onnx                     ← FINAL: deploy this
 ├── my_model.onnx                  ← Copy of your input
 ├── my_model_optimized.onnx        ← After optimization loop converged
 ├── my_model_quantized.onnx        ← After INT8 quantization
@@ -212,36 +195,25 @@ output/
 └── analyze_result.json            ← Analysis from optimize stage
 ```
 
-!!! note "What we just did"
-    `winml build` with an ONNX input runs the same analyze → optimize → re-analyze convergence loop from Section A, but automatically. It reads the analyzer's recommendations, applies them, re-runs the analyzer, and repeats until no new recommendations appear (max 3 iterations by default). The config file specifies device, precision, and EP — so `--device npu --precision int8` in the config causes quantize and compile stages to run automatically.
-
-### Selectively skip stages
+You can selectively skip stages using the override flags:
 
-By default when auto-generating config (no `-c` flag):
+- `--no-optimize` — skip graph optimization (rarely needed; useful if you have a pre-optimized ONNX)
+- `--no-quant` — skip quantization (produces a floating-point compiled model)
+- `--no-compile` — skip compilation (produces a quantized but not device-locked ONNX)
 
-- **Compilation is OFF** — pass `--compile` to enable it
-- **Quantization depends on device**:
-    - `-d npu` → quantization ON (w8a16 precision by default)
-    - `-d gpu` / `-d cpu` → quantization OFF (fp16, no quantization)
-
-Override flags:
-
-- `--no-quant` — force skip quantization (even on NPU)
-- `--compile` — force enable compilation
-- `--no-compile` — force skip compilation (default when no config file)
+For example, to produce an optimized model without quantization or compilation:
 
 ```bash
-# NPU: optimize + quantize (w8a16), skip compilation
-uv run winml build -m my_model.onnx -d npu -o output/
+uv run winml build -m my_model.onnx -o output/ --device cpu
+```
 
-# NPU: full pipeline including compilation
-uv run winml build -m my_model.onnx -d npu --compile -o output/
+!!! note "What we just did"
+    `winml build` is the production workflow. It guarantees that stages run in the correct order, passes intermediate artifacts through the pipeline automatically, and records which stages completed or were skipped in the result summary.
 
-# NPU: optimize only, no quantize, no compile
-uv run winml build -m my_model.onnx -d npu --no-quant -o output/
+Once the build completes, benchmark the final artifact:
 
-# CPU/GPU: optimize only (quantize and compile are already off)
-uv run winml build -m my_model.onnx -d cpu -o output/
+```bash
+uv run winml perf -m output/model.onnx --device npu --iterations 50 --monitor
 ```
 
 ---
diff --git a/docs/tutorials/npu-convnext.md b/docs/tutorials/npu-convnext.md
index ebbc8b7e1..b59d288fe 100644
--- a/docs/tutorials/npu-convnext.md
+++ b/docs/tutorials/npu-convnext.md
@@ -140,22 +140,32 @@ The quantizer generates 32 random calibration samples, runs them through the mod
 
 Compilation converts the portable quantized ONNX into an EP-specific binary format that the execution provider can load directly, skipping JIT compilation at inference time. This is the step that produces a device-locked artifact tied to the selected EP.
 
-The examples below use the default compiler backend:
+The examples below use the default compiler backend (`--compiler ort`), which uses ONNX Runtime's built-in EP context compiler:
 
-- **`--compiler ort`** (default) — uses ONNX Runtime's built-in EP context compiler.
-
-=== "QNN via ORT (default)"
+=== "Qualcomm NPU"
 
     ```bash
-    uv run winml compile -m convnext_int8.onnx --device npu
+    uv run winml compile -m convnext_int8.onnx --device npu --ep qnn
     ```
 
-=== "OpenVINO (Intel CPU/GPU/NPU)"
+=== "Intel NPU"
 
     ```bash
     uv run winml compile -m convnext_int8.onnx --device npu --ep openvino
     ```
 
+=== "AMD NPU"
+
+    ```bash
+    uv run winml compile -m convnext_int8.onnx --device npu --ep vitisai
+    ```
+
+=== "CPU"
+
+    ```bash
+    uv run winml compile -m convnext_int8.onnx --device cpu
+    ```
+
 The compiled output file appears in the same directory as the input model. The file name follows the pattern `convnext_int8_npu_ctx.onnx` (using the resolved device string `npu`, not the EP name) and an accompanying `.bin` context binary is written alongside it (unless `--embed` is passed, which embeds the binary inside the ONNX file). CPU builds do not produce a new artifact — the compile step validates EP compatibility but writes no output file; use `convnext_int8.onnx` directly for CPU inference.
 
 !!! note "What we just did"
@@ -228,12 +238,15 @@ The `--model-id` flag is required when passing an ONNX file, because the evaluat
 
 ## Section B — One-shot with `winml build`
 
-Once you understand what each primitive stage does (which you now do), you can collapse the entire pipeline into a single command. `winml build` orchestrates export, optimize, quantize, and compile in sequence using the config file you generated in Step 2.
+Once you understand what each primitive stage does (which you now do), you can collapse the entire pipeline into a single command. `winml build` orchestrates export, optimize, quantize, and compile in sequence.
 
 ```bash
-uv run winml build -c convnext_config.json -m facebook/convnext-tiny-224 -o convnext_out/
+uv run winml build -m facebook/convnext-tiny-224 -o convnext_out/ --device npu --precision int8
 ```
 
+!!! tip "Config file is optional"
+    The `-c config.json` flag is optional. Without it, `winml build` auto-generates an internal config from the flags you pass (like `--device` and `--precision`). If you need a reusable config, generate one with [`winml config`](../commands/config.md).
+
 The command downloads the pretrained weights, runs all four pipeline stages, and writes every intermediate and final artifact into `convnext_out/`. The stage timing is printed as each stage completes, and the final line tells you the path of the compiled model.
 
 You can selectively skip stages using the override flags:
@@ -245,11 +258,11 @@ You can selectively skip stages using the override flags:
 For example, to produce an optimized and quantized model without the compile step:
 
 ```bash
-uv run winml build -c convnext_config.json -m facebook/convnext-tiny-224 -o convnext_out/ --no-compile
+uv run winml build -m facebook/convnext-tiny-224 -o convnext_out/ --device npu --precision int8 --no-compile
 ```
 
 !!! note "What we just did"
-    `winml build` is the production workflow. It guarantees that stages run in the correct order, passes intermediate artifacts through the pipeline automatically, and records which stages completed or were skipped in the result summary. The config file you pass with `-c` fully specifies the device target, precision, and EP — so you get an NPU-targeted INT8 compiled model without needing to repeat those flags on every primitive command.
+    `winml build` is the production workflow. It guarantees that stages run in the correct order, passes intermediate artifacts through the pipeline automatically, and records which stages completed or were skipped in the result summary.
 
 Once the build completes, benchmark the final artifact from `convnext_out/`:
 
diff --git a/mkdocs.yml b/mkdocs.yml
index 61d086e59..ab7fbc39f 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -105,12 +105,12 @@ nav:
           - sys: commands/sys.md
           - inspect: commands/inspect.md
           - catalog: commands/catalog.md
-          - analyze: commands/analyze.md
       - Configure:
           - config: commands/config.md
-          - optimize: commands/optimize.md
       - Build:
           - export: commands/export.md
+          - analyze: commands/analyze.md
+          - optimize: commands/optimize.md
           - quantize: commands/quantize.md
           - compile: commands/compile.md
           - build: commands/build.md

From 662f173bd955db3a3c6671f4c48551bde40129f3 Mon Sep 17 00:00:00 2001
From: "Ziyuan Guo (WE TEAM)" <ziyuanguo@microsoft.com>
Date: Tue, 9 Jun 2026 19:04:13 +0800
Subject: [PATCH 113/143] docs: correct option flags and aliases in command
 tables

- catalog: add --execution-provider alias for --ep, add -d alias for --device
- export: simplify --hierarchy entry to --no-hierarchy
- inspect: simplify --hierarchy flag to single -H form
---
 docs/commands/catalog.md | 4 ++--
 docs/commands/export.md  | 2 +-
 docs/commands/inspect.md | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/commands/catalog.md b/docs/commands/catalog.md
index 2717aaa2a..87039b061 100644
--- a/docs/commands/catalog.md
+++ b/docs/commands/catalog.md
@@ -21,8 +21,8 @@ $ winml catalog [options]
 |------|-------|------|---------|-------------|
 | `--model-type` | | string | `null` | Filter the catalog by model architecture (case-insensitive). Examples: `bert`, `roberta`, `vit`. |
 | `--task` | `-t` | string | `null` | Filter by HuggingFace task (case-insensitive). Examples: `text-classification`, `image-segmentation`. |
-| `--ep` | | string | `null` | Filter by execution provider (e.g., `qnn`, `dml`). If not specified, shows all EPs. |
-| `--device` | | string | `null` | Filter by target device (e.g., `npu`, `gpu`). If not specified, shows all devices. |
+| `--ep/--execution-provider` | | string | `null` | Filter by execution provider (e.g., `qnn`, `dml`). If not specified, shows all EPs. |
+| `--device` | `-d` | string | `null` | Filter by target device (e.g., `npu`, `gpu`). If not specified, shows all devices. |
 | `--output` | `-o` | path | `null` | Save the displayed results to a JSON file. |
 | `--help` | `-h` | flag | — | Show help and exit. |
 
diff --git a/docs/commands/export.md b/docs/commands/export.md
index a18bdebf1..db20e917f 100644
--- a/docs/commands/export.md
+++ b/docs/commands/export.md
@@ -21,7 +21,7 @@ $ winml export [options]
 | `--model` | `-m` | string | *(required)* | Hugging Face model name or local path (e.g., `prajjwal1/bert-tiny`). |
 | `--output` | `-o` | path | *(required)* | Output ONNX file path (e.g., `model.onnx`). |
 | `--with-report/--no-with-report` | | flag | `false` | Generate full export reports: Markdown, JSON, and a console tree. |
-| `--hierarchy/--no-hierarchy` | | flag | `true` | Preserve `hierarchy_tag` metadata in ONNX nodes (use `--no-hierarchy` for a clean ONNX file). |
+| `--no-hierarchy` | | flag | `true` | Preserve `hierarchy_tag` metadata in ONNX nodes (use `--no-hierarchy` for a clean ONNX file). |
 | `--dynamo/--no-dynamo` | | flag | `false` | Enable PyTorch 2.9+ dynamo export for richer node metadata. (Experimental — currently logs a warning.) |
 | `--torch-module` | | string | `None` | Comma-separated list of `torch.nn` module types to include in hierarchy (e.g., `LayerNorm,Embedding`). (Experimental — currently logs a warning.) |
 | `--input-specs` | | path | `None` | JSON file with explicit input tensor specifications. Auto-generated when omitted. |
diff --git a/docs/commands/inspect.md b/docs/commands/inspect.md
index 4877ee2a9..8f92de579 100644
--- a/docs/commands/inspect.md
+++ b/docs/commands/inspect.md
@@ -22,7 +22,7 @@ $ winml inspect -m <model_id> [options]
 | `--model` | `-m` | string | **required** | HuggingFace model ID (e.g. `openai/clip-vit-base-patch32`). Required unless `--list-tasks` or `--help` is used. |
 | `--format` | `-f` | `table` \| `json` | `table` | Output format. `table` renders rich panels; `json` emits a machine-readable object. |
 | `--task` | `-t` | string | `null` | Override the auto-detected task (e.g. `image-classification`, `feature-extraction`). |
-| `--hierarchy/--no-hierarchy` | `-H/-N` | flag | `false` | Print the PyTorch module tree. Instantiates the model with random weights — no weight download required. |
+| `--hierarchy` | `-H` | flag | `false` | Print the PyTorch module tree. Instantiates the model with random weights — no weight download required. |
 | `--verbose` | `-v` | flag | `false` | Show full configuration details. |
 | `--list-tasks` | | flag | `false` | List all known tasks and exit. Does not require `--model`. |
 | `--model-type` | | string | `null` | Override model type (e.g. `bert`, `resnet`). Can be used without `--model`. |

From 744720dd2977e9b5250e37a63667560936fd0e7d Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Tue, 9 Jun 2026 19:13:30 +0800
Subject: [PATCH 114/143] docs: restore --hierarchy/--no-hierarchy pairs per
 merged PR #844

---
 docs/commands/export.md  | 2 +-
 docs/commands/inspect.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/commands/export.md b/docs/commands/export.md
index db20e917f..a18bdebf1 100644
--- a/docs/commands/export.md
+++ b/docs/commands/export.md
@@ -21,7 +21,7 @@ $ winml export [options]
 | `--model` | `-m` | string | *(required)* | Hugging Face model name or local path (e.g., `prajjwal1/bert-tiny`). |
 | `--output` | `-o` | path | *(required)* | Output ONNX file path (e.g., `model.onnx`). |
 | `--with-report/--no-with-report` | | flag | `false` | Generate full export reports: Markdown, JSON, and a console tree. |
-| `--no-hierarchy` | | flag | `true` | Preserve `hierarchy_tag` metadata in ONNX nodes (use `--no-hierarchy` for a clean ONNX file). |
+| `--hierarchy/--no-hierarchy` | | flag | `true` | Preserve `hierarchy_tag` metadata in ONNX nodes (use `--no-hierarchy` for a clean ONNX file). |
 | `--dynamo/--no-dynamo` | | flag | `false` | Enable PyTorch 2.9+ dynamo export for richer node metadata. (Experimental — currently logs a warning.) |
 | `--torch-module` | | string | `None` | Comma-separated list of `torch.nn` module types to include in hierarchy (e.g., `LayerNorm,Embedding`). (Experimental — currently logs a warning.) |
 | `--input-specs` | | path | `None` | JSON file with explicit input tensor specifications. Auto-generated when omitted. |
diff --git a/docs/commands/inspect.md b/docs/commands/inspect.md
index 8f92de579..df0f18d85 100644
--- a/docs/commands/inspect.md
+++ b/docs/commands/inspect.md
@@ -22,7 +22,7 @@ $ winml inspect -m <model_id> [options]
 | `--model` | `-m` | string | **required** | HuggingFace model ID (e.g. `openai/clip-vit-base-patch32`). Required unless `--list-tasks` or `--help` is used. |
 | `--format` | `-f` | `table` \| `json` | `table` | Output format. `table` renders rich panels; `json` emits a machine-readable object. |
 | `--task` | `-t` | string | `null` | Override the auto-detected task (e.g. `image-classification`, `feature-extraction`). |
-| `--hierarchy` | `-H` | flag | `false` | Print the PyTorch module tree. Instantiates the model with random weights — no weight download required. |
+| `--hierarchy/--no-hierarchy` | `-H` | flag | `false` | Print the PyTorch module tree. Instantiates the model with random weights — no weight download required. |
 | `--verbose` | `-v` | flag | `false` | Show full configuration details. |
 | `--list-tasks` | | flag | `false` | List all known tasks and exit. Does not require `--model`. |
 | `--model-type` | | string | `null` | Override model type (e.g. `bert`, `resnet`). Can be used without `--model`. |

From 32644c2d66b27a70bb3fcec2511af6e9b8bec480 Mon Sep 17 00:00:00 2001
From: Chao Zhang <zhangchao@microsoft.com>
Date: Wed, 10 Jun 2026 15:08:20 +0800
Subject: [PATCH 115/143] adjust config-and-build.md order

---
 docs/README.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/docs/README.md b/docs/README.md
index 6fa5d68bf..040fb3423 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -21,9 +21,8 @@ docs/
 ├── concepts/                         ← 12 conceptual pages in two sub-groups
 │   ├── how-it-works.md, graphs-and-ir.md, weight-and-activation.md,
 │   │     eps-and-devices.md, quantization.md         (Fundamentals)
-│   └── primitives-and-pipeline.md, load-and-export.md, analyze-and-optimize.md,
-│         compile-and-epcontext.md, perf-and-monitoring.md, eval-and-datasets.md,
-│         config-and-build.md                         (WinML CLI workflows)
+│   └── primitives-and-pipeline.md, config-and-build.md, load-and-export.md, analyze-and-optimize.md,
+│         compile-and-epcontext.md, perf-and-monitoring.md, eval-and-datasets.md                         (WinML CLI workflows)
 ├── commands/                         ← per-command reference (overview + 12 commands)
 ├── samples/                          ← reference-style walkthroughs
 ├── tutorials/                        ← classroom-style walkthroughs

From c846a7dfc35eb3d4e156d9fc5480ae149878406a Mon Sep 17 00:00:00 2001
From: Brenda Bai <yiba@microsoft.com>
Date: Wed, 10 Jun 2026 18:33:08 +0800
Subject: [PATCH 116/143] docs: update quickstart and index landing page

---
 docs/getting-started/quickstart.md | 63 ++++++++++++++++++------------
 docs/index.md                      | 18 +++++----
 2 files changed, 48 insertions(+), 33 deletions(-)

diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md
index 06bc5a16b..7b33d98d2 100644
--- a/docs/getting-started/quickstart.md
+++ b/docs/getting-started/quickstart.md
@@ -1,9 +1,6 @@
 # Quickstart
 
-This page proves your winml-cli install works end-to-end. You will inspect a
-Hugging Face image classifier, then export it to ONNX. No quantization, no
-execution-provider selection — just the commands you need to confirm everything
-is wired up correctly. Estimated time: 5 minutes.
+This guide walks you through verifying your install, inspecting a model from Hugging Face, running a full build pipeline to produce an optimized ONNX, and benchmarking the model on your device. Estimated time: 5 minutes.
 
 ## Verify the install
 
@@ -14,14 +11,12 @@ on your machine:
 uv run winml sys --list-device --list-ep
 ```
 
-`--list-device` and `--list-ep` print only the hardware and EP inventory,
-skipping runtime-version and Python environment details that plain `winml sys`
-would include. If the command exits without error, your winml-cli install is
+`--list-device` and `--list-ep` print only the hardware and EP inventory. If the command exits without error, your winml-cli install is
 ready. See [`winml sys`](../commands/sys.md) for the full flag reference.
 
 ## Inspect the model
 
-Before downloading any weights, confirm that winml-cli recognises the model:
+Before downloading any models, confirm that winml-cli recognises the model:
 
 ```bash
 uv run winml inspect -m microsoft/resnet-50
@@ -40,34 +35,52 @@ uv run winml inspect -m microsoft/resnet-50
 !!! note "What just happened"
     `winml inspect` read only the model's `config.json` from Hugging Face Hub —
     no weights downloaded — and confirmed that `microsoft/resnet-50` maps to a
-    supported task, a known model class, and a compatible ONNX exporter. Always
-    inspect before export to catch unsupported architectures early. See
-    [`winml inspect`](../commands/inspect.md) for output-format and hierarchy
-    options.
+    supported task, a known model class, and a compatible ONNX exporter.
 
-## Export the model
+!!! tip
+    Always inspect before build to catch unsupported architectures early.
+
+## Build the model
 
 ```bash
-uv run winml export -m microsoft/resnet-50 -o resnet50.onnx
+uv run winml build -m microsoft/resnet-50 -o resnet_out/ --no-quant
 ```
 
-!!! note "What just happened"
-    winml-cli downloaded the `microsoft/resnet-50` weights from Hugging Face,
-    ran the eight-step Hierarchy-preserving Tags Protocol (HTP) to trace the
-    PyTorch module tree, and wrote an ONNX file to `resnet50.onnx`. Each ONNX
-    node carries a `hierarchy_tag` metadata property recording its full PyTorch
-    ancestry, which downstream quantization and compilation steps use to reason
-    about the graph. See [`winml export`](../commands/export.md) for the full
-    flag reference.
+`winml build` runs all pipeline steps in sequence — export, optimize, quantize (when an NPU is detected on your device), and compile (disabled by default). You can start a model build without a config file, or provide one to configure each step in the sequence (see [`winml config`](../commands/config.md) to customize).
+All intermediate artifacts land in `resnet_out/`, so you can reuse any stage independently.
+
+After a successful build, you will find the following outputs in `resnet_out/`:
+
+- **A standard ONNX file for each completed stage** — load, inspect, or pass any of these to a downstream tool independently.
+- **`analyze_result.json`** — detailed model compatibility insights for each Windows ML EP, including supported, partially supported, and unsupported operators, detected optimization patterns, and recommended optimization workflows.
+- **A declarative `winml_build_config` file** — automatically generated after the build step to capture the full workflow end-to-end.
+
+!!! tip "CI/CD integration"
+    The declarative `winml_build_config` makes it easy to integrate the model build workflow into CI/CD pipelines — the same file drives reproducible, portable build workflows across environments.
+
+!!! note "--no-quant"
+    `--no-quant` tells the pipeline to skip the quantize stage. Quantization is a valuable step for NPU targets, but skipping it here for the output model run on any device.
+
+!!! note "Why compile is disabled by default"
+    Compilation embeds a pre-compiled binary optimized for your specific device. Skip this step to keep the ONNX output portable — it will run on any device using just-in-time (JIT) compilation.
+
+## Benchmark the model
+
+```bash
+uv run winml perf -m resnet_out/model.onnx --device auto --iterations 50 --monitor
+```
+
+`--device auto` lets the CLI resolve the best available device on your machine — NPU first, then GPU, then CPU.
 
 ## What's next
 
-- **[End-to-End walkthrough](end-to-end.md)** — full pipeline from Hugging Face to NPU.
 - **[How winml-cli Works](../concepts/how-it-works.md)** — understand what each command does under the hood.
 - **[ConvNeXt primitives sample](../samples/convnext-primitives.md)** — see every pipeline stage in detail with a representative model.
 
 ## See also
 
-- [`winml export`](../commands/export.md)
+- [`winml build`](../commands/build.md)
 - [`winml inspect`](../commands/inspect.md)
-- [Load and export](../concepts/load-and-export.md)
+- [`winml perf`](../commands/perf.md)
+- [`winml sys`](../commands/sys.md)
+
diff --git a/docs/index.md b/docs/index.md
index ef792d612..0484a1841 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,24 +1,26 @@
 # winml-cli
 
-winml-cli is a CLI toolkit to build portable, performant, and high-quality models for [Windows ML](https://learn.microsoft.com/en-us/windows/ai/windows-ml/).
+Windows ML CLI is a command line tool for building portable, performant, and high-quality AI models for Windows ML. It takes you from a source model — whether from Hugging Face or your own pipeline — to a hardware-optimized artifact in a reproducible workflow.
+
+Purpose-built for Windows hardware diversity, the CLI handles conversion, graph optimization, and compilation across AMD, Intel, NVIDIA, and Qualcomm targets. The CLI fits naturally into CI/CD pipelines so teams can validate and ship models easily.
 
 ## What you can do
 
-- **Build once, run anywhere.** Compose your own workflow from primitive commands (`export`, `analyze`, `optimize`, `quantize`, `compile`), or hand a config to the built-in pipeline. Same portable ONNX, two complementary paths.
-- **Drill into the details.** Inspect operators, pinpoint compatibility errors, and trace performance bottlenecks at any stage of the pipeline.
-- **AI-ready.** Built-in agent skills work with mainstream coding agents — let the agent drive the pipeline for you.
+- **Build once, run across devices and EPs.** Compose your own workflow from primitive commands (`export`, `analyze`, `optimize`, `quantize`, `compile`), or hand a config to the built-in pipeline. Same portable ONNX, two complementary paths — with a repeatable and traceable workflow.
+- **Drill into the details.** Deep insights into operator compatibility, shape mismatches, graph optimizations, and EP-aware tuning at any stage of the pipeline.
+- **AI-ready.** CLI-driven tools with built-in skills, friendly to work with mainstream agents.
 
 ## What you get out of the box
 
-- **One toolkit, every EP.** All [supported execution providers](concepts/eps-and-devices.md#eps-winml-cli-supports) live behind the same commands.
-- **Repeatable and traceable.** Configs are deterministic; every pipeline run records inputs, outputs, and decisions at each stage.
-- **Quality gates built in.** The analyzer catches operator-compatibility issues before deployment and suggests fixes automatically.
+- **All Windows ML EPs supported.** Every [supported execution provider](concepts/eps-and-devices.md#eps-winml-cli-supports) is available behind the same commands.
+- **Curated model catalog.** A verified set of models that run across all Windows ML EPs — a reliable starting point.
+- **Repeatable and traceable workflow.** Configs are auto-generated — no hand-crafting required. Every pipeline run records inputs, outputs, and decisions at each stage.
+- **Bring your own ONNX.** Not only for converting from PyTorch — bring an existing ONNX model to get operator-compatibility insights and optimize it based on the analysis.
 
 ## Where to start
 
 - **[Installation](getting-started/installation.md)** — get the `winml` CLI running locally.
 - **[Quickstart](getting-started/quickstart.md)** — export a Hugging Face model in five minutes.
-- **[End-to-End Tour](getting-started/end-to-end.md)** — full pipeline targeting whatever hardware you have (NPU / GPU / CPU).
 
 ## Learn the model
 

From 9be4f9f96a0519811cfbf476e58c50374603be02 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Wed, 10 Jun 2026 18:52:28 +0800
Subject: [PATCH 117/143] docs: update supported-models to match actual catalog
 data

---
 docs/reference/supported-models.md | 148 +++++++++++++++++++++--------
 1 file changed, 110 insertions(+), 38 deletions(-)

diff --git a/docs/reference/supported-models.md b/docs/reference/supported-models.md
index add514ea3..78a144f81 100644
--- a/docs/reference/supported-models.md
+++ b/docs/reference/supported-models.md
@@ -70,52 +70,124 @@ winml-cli recognizes **35 task types** across vision, NLP, audio, and multimodal
 
 ## Validated Model Catalog
 
-The following architectures have been validated end-to-end with EP compatibility
+The following models have been validated end-to-end with EP compatibility
 testing. Use `winml catalog` to browse the full list interactively.
 
 ### Image Classification
 
-| Model | Architecture | EPs |
-|-------|-------------|------------|
-| `microsoft/resnet-50` | ResNet | All EPs |
-| `facebook/convnext-tiny-224` | ConvNeXt | All EPs |
-| `google/vit-base-patch16-224` | ViT | All EPs |
+| Model | Architecture | Size |
+|-------|-------------|------|
+| `AdamCodd/vit-base-nsfw-detector` | ViT | 83.4 MB |
+| `Falconsai/nsfw_image_detection` | ViT | 82.8 MB |
+| `amunchet/rorshark-vit-base` | ViT | 82.8 MB |
+| `apple/mobilevit-small` | MobileViT | 6.1 MB |
+| `dima806/fairface_age_image_detection` | ViT | 82.8 MB |
+| `google/vit-base-patch16-224` | ViT | 83.6 MB |
+| `microsoft/resnet-18` | ResNet | 11.2 MB |
+| `rizvandwiki/gender-classification` | ViT | 82.8 MB |
+
+### Image Feature Extraction
+
+| Model | Architecture | Size |
+|-------|-------------|------|
+| `facebook/dino-vitb16` | ViT | 83.4 MB |
+| `facebook/dino-vits16` | ViT | 21.6 MB |
+| `facebook/dinov2-base` | DINOv2 | 82.8 MB |
+| `facebook/dinov2-large` | DINOv2 | 291.4 MB |
+| `facebook/dinov2-small` | DINOv2 | 21.4 MB |
+| `google/vit-base-patch16-224-in21k` | ViT | 83.4 MB |
+| `microsoft/rad-dino` | DINOv2 | 84.4 MB |
+
+### Feature Extraction (Text)
+
+| Model | Architecture | Size |
+|-------|-------------|------|
+| `laion/CLIP-ViT-B-32-laion2B-s34B-b79K` | CLIP | 85.4 MB |
+| `openai/clip-vit-base-patch16` | CLIP | 85.5 MB |
+| `openai/clip-vit-base-patch32` | CLIP | 85.5 MB |
+| `sentence-transformers/all-MiniLM-L6-v2` | BERT | 33.2 MB |
+| `sentence-transformers/all-mpnet-base-v2` | MPNet | 133.4 MB |
+| `sentence-transformers/multi-qa-mpnet-base-dot-v1` | MPNet | 133.4 MB |
+
+### Sentence Similarity
+
+| Model | Architecture | Size |
+|-------|-------------|------|
+| `BAAI/bge-large-en-v1.5` | BERT | 351.8 MB |
+| `BAAI/bge-small-en-v1.5` | BERT | 43.9 MB |
+| `sentence-transformers/all-MiniLM-L6-v2` | BERT | 33.4 MB |
+| `sentence-transformers/all-mpnet-base-v2` | MPNet | 134.0 MB |
+| `sentence-transformers/multi-qa-mpnet-base-dot-v1` | MPNet | 134.0 MB |
+| `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` | BERT | 204.7 MB |
+| `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` | XLM-RoBERTa | 450.3 MB |
+
+### Fill-Mask
+
+| Model | Architecture | Size |
+|-------|-------------|------|
+| `FacebookAI/roberta-base` | RoBERTa | 194.7 MB |
+| `FacebookAI/xlm-roberta-base` | XLM-RoBERTa | 634.4 MB |
+| `distilbert/distilbert-base-uncased` | DistilBERT | 109.5 MB |
+| `google-bert/bert-base-multilingual-cased` | BERT | 346.5 MB |
+| `google-bert/bert-base-multilingual-uncased` | BERT | 316.4 MB |
+| `google-bert/bert-base-uncased` | BERT | 150.5 MB |
+| `sentence-transformers/all-mpnet-base-v2` | MPNet | 156.5 MB |
+| `sentence-transformers/multi-qa-mpnet-base-dot-v1` | MPNet | 156.5 MB |
+
+### Text Classification
+
+| Model | Architecture | Size |
+|-------|-------------|------|
+| `cardiffnlp/twitter-roberta-base-sentiment-latest` | RoBERTa | 157.7 MB |
+| `cross-encoder/ms-marco-MiniLM-L4-v2` | BERT | 29.9 MB |
+| `cross-encoder/ms-marco-MiniLM-L6-v2` | BERT | 33.4 MB |
+| `distilbert/distilbert-base-uncased-finetuned-sst-2-english` | DistilBERT | 87.0 MB |
+
+### Token Classification
+
+| Model | Architecture | Size |
+|-------|-------------|------|
+| `Isotonic/distilbert_finetuned_ai4privacy_v2` | DistilBERT | 86.6 MB |
+| `Jean-Baptiste/camembert-ner-with-dates` | CamemBERT | 130.4 MB |
+| `kredor/punctuate-all` | XLM-RoBERTa | 449.7 MB |
+| `w11wo/indonesian-roberta-base-posp-tagger` | RoBERTa | 157.2 MB |
+
+### Question Answering
+
+| Model | Architecture | Size |
+|-------|-------------|------|
+| `ahotrod/electra_large_discriminator_squad2_512` | Electra | 350.9 MB |
+| `deepset/bert-large-uncased-whole-word-masking-squad2` | BERT | 350.9 MB |
+| `deepset/roberta-base-squad2` | RoBERTa | 157.2 MB |
+| `deepset/tinyroberta-squad2` | RoBERTa | 116.2 MB |
+| `distilbert/distilbert-base-cased-distilled-squad` | DistilBERT | 84.2 MB |
+| `distilbert/distilbert-base-uncased-distilled-squad` | DistilBERT | 86.5 MB |
+| `monologg/koelectra-small-v2-distilled-korquad-384` | Electra | 17.7 MB |
+
+### Zero-Shot Classification
+
+| Model | Architecture | Size |
+|-------|-------------|------|
+| `lxyuan/distilbert-base-multilingual-cased-sentiments-student` | DistilBERT | 217.5 MB |
+
+### Zero-Shot Image Classification
+
+| Model | Architecture | Size |
+|-------|-------------|------|
+| `laion/CLIP-ViT-B-32-laion2B-s34B-b79K` | CLIP | 170.1 MB |
 
-### Text Classification & NLU
-
-| Model | Architecture | EPs |
-|-------|-------------|------------|
-| `bert-base-uncased` | BERT | All EPs |
-| `FacebookAI/roberta-base` | RoBERTa | All EPs |
-| `FacebookAI/xlm-roberta-base` | XLM-RoBERTa | All EPs |
-
-### Feature Extraction & Embeddings
-
-| Model | Architecture | EPs |
-|-------|-------------|------------|
-| `BAAI/bge-base-en-v1.5` | BERT | All EPs |
-| `BAAI/bge-small-en-v1.5` | BERT | All EPs |
-| `sentence-transformers/all-MiniLM-L6-v2` | BERT | All EPs |
-
-### Vision-Language
-
-| Model | Architecture | EPs |
-|-------|-------------|------------|
-| `openai/clip-vit-base-patch32` | CLIP | All EPs |
-| `openai/clip-vit-large-patch14` | CLIP | All EPs |
-
-### Segmentation
+### Object Detection
 
-| Model | Architecture | EPs |
-|-------|-------------|------------|
-| `nvidia/segformer-b0-finetuned-ade-512-512` | Segformer | All EPs |
-| `nvidia/segformer-b1-finetuned-cityscapes-1024-1024` | Segformer | All EPs |
+| Model | Architecture | Size |
+|-------|-------------|------|
+| `hustvl/yolos-small` | YOLOS | 38.1 MB |
+| `valentinafeve/yolos-fashionpedia` | YOLOS | 38.1 MB |
 
-### Object Detection
+### Depth Estimation
 
-| Model | Architecture | EPs |
-|-------|-------------|------------|
-| `microsoft/table-transformer-detection` | Table-Transformer | All EPs |
+| Model | Architecture | Size |
+|-------|-------------|------|
+| `Intel/dpt-hybrid-midas` | DPT | 117.9 MB |
 
 ---
 

From 2f383b2f8c0d89ad6082bd6399e495bd751c76c9 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Wed, 10 Jun 2026 18:54:26 +0800
Subject: [PATCH 118/143] docs: remove size column from supported-models

---
 docs/reference/supported-models.md | 160 ++++++++++++++---------------
 1 file changed, 80 insertions(+), 80 deletions(-)

diff --git a/docs/reference/supported-models.md b/docs/reference/supported-models.md
index 78a144f81..28d0ca9ae 100644
--- a/docs/reference/supported-models.md
+++ b/docs/reference/supported-models.md
@@ -75,119 +75,119 @@ testing. Use `winml catalog` to browse the full list interactively.
 
 ### Image Classification
 
-| Model | Architecture | Size |
-|-------|-------------|------|
-| `AdamCodd/vit-base-nsfw-detector` | ViT | 83.4 MB |
-| `Falconsai/nsfw_image_detection` | ViT | 82.8 MB |
-| `amunchet/rorshark-vit-base` | ViT | 82.8 MB |
-| `apple/mobilevit-small` | MobileViT | 6.1 MB |
-| `dima806/fairface_age_image_detection` | ViT | 82.8 MB |
-| `google/vit-base-patch16-224` | ViT | 83.6 MB |
-| `microsoft/resnet-18` | ResNet | 11.2 MB |
-| `rizvandwiki/gender-classification` | ViT | 82.8 MB |
+| Model | Architecture |
+|-------|-------------|
+| `AdamCodd/vit-base-nsfw-detector` | ViT |
+| `Falconsai/nsfw_image_detection` | ViT |
+| `amunchet/rorshark-vit-base` | ViT |
+| `apple/mobilevit-small` | MobileViT |
+| `dima806/fairface_age_image_detection` | ViT |
+| `google/vit-base-patch16-224` | ViT |
+| `microsoft/resnet-18` | ResNet |
+| `rizvandwiki/gender-classification` | ViT |
 
 ### Image Feature Extraction
 
-| Model | Architecture | Size |
-|-------|-------------|------|
-| `facebook/dino-vitb16` | ViT | 83.4 MB |
-| `facebook/dino-vits16` | ViT | 21.6 MB |
-| `facebook/dinov2-base` | DINOv2 | 82.8 MB |
-| `facebook/dinov2-large` | DINOv2 | 291.4 MB |
-| `facebook/dinov2-small` | DINOv2 | 21.4 MB |
-| `google/vit-base-patch16-224-in21k` | ViT | 83.4 MB |
-| `microsoft/rad-dino` | DINOv2 | 84.4 MB |
+| Model | Architecture |
+|-------|-------------|
+| `facebook/dino-vitb16` | ViT |
+| `facebook/dino-vits16` | ViT |
+| `facebook/dinov2-base` | DINOv2 |
+| `facebook/dinov2-large` | DINOv2 |
+| `facebook/dinov2-small` | DINOv2 |
+| `google/vit-base-patch16-224-in21k` | ViT |
+| `microsoft/rad-dino` | DINOv2 |
 
 ### Feature Extraction (Text)
 
-| Model | Architecture | Size |
-|-------|-------------|------|
-| `laion/CLIP-ViT-B-32-laion2B-s34B-b79K` | CLIP | 85.4 MB |
-| `openai/clip-vit-base-patch16` | CLIP | 85.5 MB |
-| `openai/clip-vit-base-patch32` | CLIP | 85.5 MB |
-| `sentence-transformers/all-MiniLM-L6-v2` | BERT | 33.2 MB |
-| `sentence-transformers/all-mpnet-base-v2` | MPNet | 133.4 MB |
-| `sentence-transformers/multi-qa-mpnet-base-dot-v1` | MPNet | 133.4 MB |
+| Model | Architecture |
+|-------|-------------|
+| `laion/CLIP-ViT-B-32-laion2B-s34B-b79K` | CLIP |
+| `openai/clip-vit-base-patch16` | CLIP |
+| `openai/clip-vit-base-patch32` | CLIP |
+| `sentence-transformers/all-MiniLM-L6-v2` | BERT |
+| `sentence-transformers/all-mpnet-base-v2` | MPNet |
+| `sentence-transformers/multi-qa-mpnet-base-dot-v1` | MPNet |
 
 ### Sentence Similarity
 
-| Model | Architecture | Size |
-|-------|-------------|------|
-| `BAAI/bge-large-en-v1.5` | BERT | 351.8 MB |
-| `BAAI/bge-small-en-v1.5` | BERT | 43.9 MB |
-| `sentence-transformers/all-MiniLM-L6-v2` | BERT | 33.4 MB |
-| `sentence-transformers/all-mpnet-base-v2` | MPNet | 134.0 MB |
-| `sentence-transformers/multi-qa-mpnet-base-dot-v1` | MPNet | 134.0 MB |
-| `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` | BERT | 204.7 MB |
-| `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` | XLM-RoBERTa | 450.3 MB |
+| Model | Architecture |
+|-------|-------------|
+| `BAAI/bge-large-en-v1.5` | BERT |
+| `BAAI/bge-small-en-v1.5` | BERT |
+| `sentence-transformers/all-MiniLM-L6-v2` | BERT |
+| `sentence-transformers/all-mpnet-base-v2` | MPNet |
+| `sentence-transformers/multi-qa-mpnet-base-dot-v1` | MPNet |
+| `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` | BERT |
+| `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` | XLM-RoBERTa |
 
 ### Fill-Mask
 
-| Model | Architecture | Size |
-|-------|-------------|------|
-| `FacebookAI/roberta-base` | RoBERTa | 194.7 MB |
-| `FacebookAI/xlm-roberta-base` | XLM-RoBERTa | 634.4 MB |
-| `distilbert/distilbert-base-uncased` | DistilBERT | 109.5 MB |
-| `google-bert/bert-base-multilingual-cased` | BERT | 346.5 MB |
-| `google-bert/bert-base-multilingual-uncased` | BERT | 316.4 MB |
-| `google-bert/bert-base-uncased` | BERT | 150.5 MB |
-| `sentence-transformers/all-mpnet-base-v2` | MPNet | 156.5 MB |
-| `sentence-transformers/multi-qa-mpnet-base-dot-v1` | MPNet | 156.5 MB |
+| Model | Architecture |
+|-------|-------------|
+| `FacebookAI/roberta-base` | RoBERTa |
+| `FacebookAI/xlm-roberta-base` | XLM-RoBERTa |
+| `distilbert/distilbert-base-uncased` | DistilBERT |
+| `google-bert/bert-base-multilingual-cased` | BERT |
+| `google-bert/bert-base-multilingual-uncased` | BERT |
+| `google-bert/bert-base-uncased` | BERT |
+| `sentence-transformers/all-mpnet-base-v2` | MPNet |
+| `sentence-transformers/multi-qa-mpnet-base-dot-v1` | MPNet |
 
 ### Text Classification
 
-| Model | Architecture | Size |
-|-------|-------------|------|
-| `cardiffnlp/twitter-roberta-base-sentiment-latest` | RoBERTa | 157.7 MB |
-| `cross-encoder/ms-marco-MiniLM-L4-v2` | BERT | 29.9 MB |
-| `cross-encoder/ms-marco-MiniLM-L6-v2` | BERT | 33.4 MB |
-| `distilbert/distilbert-base-uncased-finetuned-sst-2-english` | DistilBERT | 87.0 MB |
+| Model | Architecture |
+|-------|-------------|
+| `cardiffnlp/twitter-roberta-base-sentiment-latest` | RoBERTa |
+| `cross-encoder/ms-marco-MiniLM-L4-v2` | BERT |
+| `cross-encoder/ms-marco-MiniLM-L6-v2` | BERT |
+| `distilbert/distilbert-base-uncased-finetuned-sst-2-english` | DistilBERT |
 
 ### Token Classification
 
-| Model | Architecture | Size |
-|-------|-------------|------|
-| `Isotonic/distilbert_finetuned_ai4privacy_v2` | DistilBERT | 86.6 MB |
-| `Jean-Baptiste/camembert-ner-with-dates` | CamemBERT | 130.4 MB |
-| `kredor/punctuate-all` | XLM-RoBERTa | 449.7 MB |
-| `w11wo/indonesian-roberta-base-posp-tagger` | RoBERTa | 157.2 MB |
+| Model | Architecture |
+|-------|-------------|
+| `Isotonic/distilbert_finetuned_ai4privacy_v2` | DistilBERT |
+| `Jean-Baptiste/camembert-ner-with-dates` | CamemBERT |
+| `kredor/punctuate-all` | XLM-RoBERTa |
+| `w11wo/indonesian-roberta-base-posp-tagger` | RoBERTa |
 
 ### Question Answering
 
-| Model | Architecture | Size |
-|-------|-------------|------|
-| `ahotrod/electra_large_discriminator_squad2_512` | Electra | 350.9 MB |
-| `deepset/bert-large-uncased-whole-word-masking-squad2` | BERT | 350.9 MB |
-| `deepset/roberta-base-squad2` | RoBERTa | 157.2 MB |
-| `deepset/tinyroberta-squad2` | RoBERTa | 116.2 MB |
-| `distilbert/distilbert-base-cased-distilled-squad` | DistilBERT | 84.2 MB |
-| `distilbert/distilbert-base-uncased-distilled-squad` | DistilBERT | 86.5 MB |
-| `monologg/koelectra-small-v2-distilled-korquad-384` | Electra | 17.7 MB |
+| Model | Architecture |
+|-------|-------------|
+| `ahotrod/electra_large_discriminator_squad2_512` | Electra |
+| `deepset/bert-large-uncased-whole-word-masking-squad2` | BERT |
+| `deepset/roberta-base-squad2` | RoBERTa |
+| `deepset/tinyroberta-squad2` | RoBERTa |
+| `distilbert/distilbert-base-cased-distilled-squad` | DistilBERT |
+| `distilbert/distilbert-base-uncased-distilled-squad` | DistilBERT |
+| `monologg/koelectra-small-v2-distilled-korquad-384` | Electra |
 
 ### Zero-Shot Classification
 
-| Model | Architecture | Size |
-|-------|-------------|------|
-| `lxyuan/distilbert-base-multilingual-cased-sentiments-student` | DistilBERT | 217.5 MB |
+| Model | Architecture |
+|-------|-------------|
+| `lxyuan/distilbert-base-multilingual-cased-sentiments-student` | DistilBERT |
 
 ### Zero-Shot Image Classification
 
-| Model | Architecture | Size |
-|-------|-------------|------|
-| `laion/CLIP-ViT-B-32-laion2B-s34B-b79K` | CLIP | 170.1 MB |
+| Model | Architecture |
+|-------|-------------|
+| `laion/CLIP-ViT-B-32-laion2B-s34B-b79K` | CLIP |
 
 ### Object Detection
 
-| Model | Architecture | Size |
-|-------|-------------|------|
-| `hustvl/yolos-small` | YOLOS | 38.1 MB |
-| `valentinafeve/yolos-fashionpedia` | YOLOS | 38.1 MB |
+| Model | Architecture |
+|-------|-------------|
+| `hustvl/yolos-small` | YOLOS |
+| `valentinafeve/yolos-fashionpedia` | YOLOS |
 
 ### Depth Estimation
 
-| Model | Architecture | Size |
-|-------|-------------|------|
-| `Intel/dpt-hybrid-midas` | DPT | 117.9 MB |
+| Model | Architecture |
+|-------|-------------|
+| `Intel/dpt-hybrid-midas` | DPT |
 
 ---
 

From 375d4656b938644eec70ff40ab4f0364a17d4af4 Mon Sep 17 00:00:00 2001
From: Brenda Bai <yiba@microsoft.com>
Date: Wed, 10 Jun 2026 19:04:57 +0800
Subject: [PATCH 119/143] docs: add UI quickstart and update index landing page

---
 docs/getting-started/ui-quickstart.md | 13 +++++++++++++
 docs/index.md                         |  7 +++----
 2 files changed, 16 insertions(+), 4 deletions(-)
 create mode 100644 docs/getting-started/ui-quickstart.md

diff --git a/docs/getting-started/ui-quickstart.md b/docs/getting-started/ui-quickstart.md
new file mode 100644
index 000000000..e8ac68d6c
--- /dev/null
+++ b/docs/getting-started/ui-quickstart.md
@@ -0,0 +1,13 @@
+# Try Windows ML CLI with a UI
+
+If you prefer a graphical interface, you can use the **Foundry Toolkit** extension for VS Code to run Windows ML CLI model conversion without typing commands.
+
+## Quick reference
+
+1. **Install [Visual Studio Code](https://code.visualstudio.com/)**
+2. **Install the Foundry Toolkit extension** — search for `Foundry Toolkit` in the VS Code Extensions view
+3. **Open the Model Conversion tool** — in the Foundry Toolkit panel, select **Model Conversion**
+4. **Choose your model** — pick a model from Hugging Face, provide a local path, or select from the built-in model catalog filtered by Windows ML CLI
+5. **Run the build** — the extension invokes Windows ML CLI and streams the output to the VS Code terminal
+
+For a full walkthrough, see [Build with Windows ML CLI (Preview)](https://code.visualstudio.com/docs/intelligentapps/modelconversion#_build-with-windows-ml-cli-preview) in the VS Code documentation.
diff --git a/docs/index.md b/docs/index.md
index 0484a1841..6136d7927 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -6,16 +6,15 @@ Purpose-built for Windows hardware diversity, the CLI handles conversion, graph
 
 ## What you can do
 
-- **Build once, run across devices and EPs.** Compose your own workflow from primitive commands (`export`, `analyze`, `optimize`, `quantize`, `compile`), or hand a config to the built-in pipeline. Same portable ONNX, two complementary paths — with a repeatable and traceable workflow.
+- **Build once, run across hardwares.** Compose your own workflow from primitive commands (`export`, `analyze`, `optimize`, `quantize`, `compile`), or use an auto-generated config with `winml build` — both produce portable models that run across hardware.
 - **Drill into the details.** Deep insights into operator compatibility, shape mismatches, graph optimizations, and EP-aware tuning at any stage of the pipeline.
 - **AI-ready.** CLI-driven tools with built-in skills, friendly to work with mainstream agents.
 
 ## What you get out of the box
 
 - **All Windows ML EPs supported.** Every [supported execution provider](concepts/eps-and-devices.md#eps-winml-cli-supports) is available behind the same commands.
-- **Curated model catalog.** A verified set of models that run across all Windows ML EPs — a reliable starting point.
-- **Repeatable and traceable workflow.** Configs are auto-generated — no hand-crafting required. Every pipeline run records inputs, outputs, and decisions at each stage.
-- **Bring your own ONNX.** Not only for converting from PyTorch — bring an existing ONNX model to get operator-compatibility insights and optimize it based on the analysis.
+- **Curated model catalog.** A [verified set of models](reference/supported-models.md) that run across all Windows ML EPs — a reliable starting point.
+- **Bring your own ONNX.** Not only for converting from PyTorch — bring an [existing ONNX model](tutorials/build-from-onnx.md) to get operator-compatibility insights and optimize it based on the analysis.
 
 ## Where to start
 

From cb1215cdeb4249a92d01c52ef489481c1e075c3a Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Wed, 10 Jun 2026 19:11:26 +0800
Subject: [PATCH 120/143] docs: remove End-to-End Tour page and update
 references

---
 docs/getting-started/end-to-end.md   | 208 ---------------------------
 docs/getting-started/installation.md |   4 +-
 docs/samples/convnext-primitives.md  |   2 +-
 docs/tutorials/npu-convnext.md       |   2 +-
 mkdocs.yml                           |   1 -
 5 files changed, 4 insertions(+), 213 deletions(-)
 delete mode 100644 docs/getting-started/end-to-end.md

diff --git a/docs/getting-started/end-to-end.md b/docs/getting-started/end-to-end.md
deleted file mode 100644
index 2b1d36fa0..000000000
--- a/docs/getting-started/end-to-end.md
+++ /dev/null
@@ -1,208 +0,0 @@
-# End-to-End Tour
-
-This page walks the full winml-cli pipeline using `--device auto`. The CLI
-resolves to the best available device on your machine — NPU first, then GPU,
-then CPU — so the four commands below are identical regardless of whether you
-have a Copilot+ PC with a Qualcomm NPU, a DirectML-capable GPU, or a plain
-laptop with no accelerator at all. You do not need to think about device flags
-after Step 0.
-
-The vehicle for this tour is `facebook/convnext-tiny-224`, a compact image
-classifier whose operator mix exercises every stage of the pipeline: export,
-optimize, quantize, and compile. Estimated time is 15–25 minutes, most of
-which is the Hugging Face model download and the compile stage. At the end you
-will have a compiled ONNX artifact targeted at your hardware and a real latency
-reading from that device.
-
-## Prerequisites
-
-- Windows 11 24H2 (required for NPU; earlier versions work for CPU/GPU)
-- winml-cli installed (see [Installation](installation.md))
-
-!!! note "NPU users only"
-    To target an NPU you also need:
-
-    - A device with an NPU (e.g., Qualcomm Snapdragon X, Intel Core Ultra)
-
-    Everything else on this page works without it.
-
-## Step 0: See what your machine has
-
-```bash
-uv run winml sys --list-device --list-ep
-```
-
-This lists every hardware device detected and the execution providers (EPs)
-that can target each one. When you pass `--device auto` in the steps below,
-winml-cli resolves that to the highest-priority device shown here: NPU first,
-then GPU, then CPU.
-
-=== "Copilot+ PC (NPU available)"
-
-    ```text
-    Available Devices (priority order)
-      #1  NPU   Qualcomm(R) AI Accelerator
-                 Driver: 31.0.0.6978 | Manufacturer: Qualcomm Technologies, Inc.
-      #2  GPU   NVIDIA GeForce RTX 4060 Laptop GPU
-                 Driver: 31.0.15.5107 | Manufacturer: NVIDIA
-      #3  CPU   Snapdragon X Elite - X1E-80-100 - Oryon
-                 Cores: 12 | Threads: 12 | Architecture: ARM64
-
-    Available Execution Providers
-      QNNExecutionProvider              -> NPU/GPU
-      DmlExecutionProvider              -> GPU
-      CPUExecutionProvider              -> CPU
-    ```
-
-=== "Regular Windows laptop (no NPU)"
-
-    ```text
-    Available Devices (priority order)
-      #1  GPU   Intel(R) Iris(R) Xe Graphics
-                 Driver: 31.0.101.5382 | Manufacturer: Intel Corporation
-      #2  CPU   12th Gen Intel(R) Core(TM) i7-1260P
-                 Cores: 12 | Threads: 16 | Architecture: x86_64
-
-    Available Execution Providers
-      DmlExecutionProvider              -> GPU
-      CPUExecutionProvider              -> CPU
-    ```
-
-## Step 1: Generate the build config
-
-```bash
-uv run winml config -m facebook/convnext-tiny-224 --device auto -o convnext_config.json
-```
-
-`winml config` queries Hugging Face, auto-detects the task and model type, and
-produces a `WinMLBuildConfig` JSON. Passing `--device auto` tells the config
-generator to resolve the target device at generation time — it inspects your
-hardware and writes the winning device (NPU, GPU, or CPU) together with
-matching precision and compile settings into `convnext_config.json`. You can
-open the file to see exactly what was picked before committing to a full build.
-
-!!! tip "Config as CI/CD artifact"
-    The generated `convnext_config.json` is a self-contained, reproducible pipeline specification. Check it into version control and use it in CI/CD pipelines (`winml build -c convnext_config.json -m ... -o ...`) to guarantee identical model processing across machines and runs. Set `"auto": false` in the config for fully deterministic builds (disables the autoconf discovery loop). See [Why version a config](../concepts/config-and-build.md#why-version-a-config) for details.
-
-For a field-by-field explanation of every section in the generated JSON and how
-the `quant` and `compile` blocks interact, see
-[Config and build](../concepts/config-and-build.md).
-
-## Step 2: Run the build
-
-```bash
-uv run winml build -c convnext_config.json -m facebook/convnext-tiny-224 -o convnext_out/
-```
-
-This single command runs all four pipeline stages in sequence — export,
-optimize, quantize, and compile — reading the device and precision settings
-recorded in `convnext_config.json`. The compile stage targets whichever device
-the config captured: it calls the QNN backend and embeds a pre-compiled Hexagon
-binary on NPU, or it compiles a DirectML graph on GPU, or it produces a
-standard optimized ONNX for CPU. All intermediate artifacts land in
-`convnext_out/`, so you can inspect or reuse any stage independently.
-
-You can also pass `--no-quant` or `--no-compile` to stop the pipeline early,
-or `--rebuild` to force re-running even when cached artifacts exist. For a
-deeper look at how each stage works, see
-[Concepts → How winml-cli works](../concepts/how-it-works.md) and
-[Config and Build](../concepts/config-and-build.md).
-
-## Step 3: Benchmark on your device
-
-```bash
-uv run winml perf -m convnext_out/<artifact>.onnx --device auto --iterations 50 --monitor
-```
-
-Replace `<artifact>` with the filename written to `convnext_out/` by the build.
-For NPU builds the compiled artifact is named `model.onnx` in the output
-directory (the `_npu_ctx.onnx` suffix applies only when the compile stage
-produces an EPContext file, which requires `enable_ep_context=True` in the
-compile config). You can check the directory listing or read the compiled
-artifact path from the build output to get the exact name.
-
-=== "NPU (QNN)"
-
-    ```text
-    Device:      npu
-    Precision:   auto
-    Task:        image-classification
-    Iterations:  50 (+ 10 warmup)
-    Batch Size:  1
-
-    Latency (ms)
-      Avg    P50    P90    P95    P99    Min    Max    Std
-     3.87   3.82   4.21   4.38   4.71   3.51   5.04   0.21
-
-    Throughput: 258.14 samples/sec
-
-    Results saved to: model_perf.json
-    ```
-
-=== "GPU (DirectML)"
-
-    ```text
-    Device:      gpu
-    Precision:   auto
-    Task:        image-classification
-    Iterations:  50 (+ 10 warmup)
-    Batch Size:  1
-
-    Latency (ms)
-      Avg    P50    P90    P95    P99    Min    Max    Std
-    12.43  12.18  13.74  14.11  15.02  11.27  16.55   0.89
-
-    Throughput: 80.45 samples/sec
-    ```
-
-## See also
-
-- [Config Schema](../reference/index.md) — full field-by-field config reference
-- [Output Layout](../reference/output-layout.md) — what each output file contains
-- [How winml-cli Works](../concepts/how-it-works.md) — pipeline overview
-
-=== "CPU"
-
-    ```text
-    Device:      cpu
-    Precision:   auto
-    Task:        image-classification
-    Iterations:  50 (+ 10 warmup)
-    Batch Size:  1
-
-    Latency (ms)
-      Avg    P50    P90    P95    P99    Min    Max    Std
-    48.31  47.85  52.14  53.77  57.40  44.62  61.23   2.94
-
-    Throughput: 20.70 samples/sec
-    ```
-
-The `--monitor` flag opens a live chart of device utilization while the
-benchmark runs — most meaningful on NPU or GPU where it confirms the workload
-actually hit the accelerator rather than falling back to CPU. After the run
-finishes, a JSON file named `{model_slug}_perf.json` is written to the current
-directory; you can load it programmatically to compare results across runs or
-across machines.
-
-## Cross-device comparison
-
-Each artifact produced by `winml build` is compiled for the specific device
-recorded in the config — a QNN EPContext binary will not execute on DirectML,
-and vice versa. If you want to measure NPU vs. GPU vs. CPU latency on the same
-model and the same machine you need to generate a separate config and artifact
-for each EP. The
-[ConvNeXt — Primitives Walkthrough](../samples/convnext-primitives.md) sample
-does exactly that: it builds a separate compiled artifact for each execution
-provider and benchmarks them side by side so you can compare the numbers
-directly.
-
-## Where to go next
-
-- [ConvNeXt on NPU tutorial](../tutorials/npu-convnext.md) — full primitives
-  walkthrough plus the `winml build` one-shot wrapper, going deeper than this
-  page on NPU-specific tuning
-- [ConvNeXt — Primitives Walkthrough sample](../samples/convnext-primitives.md)
-  — CPU/GPU/NPU comparison on the same model built with explicit per-device
-  configs
-- [Concepts → How winml-cli works](../concepts/how-it-works.md) — what each
-  stage of the build pipeline does and how they chain together
diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md
index 7d96b875d..68c35f494 100644
--- a/docs/getting-started/installation.md
+++ b/docs/getting-started/installation.md
@@ -11,7 +11,7 @@
 | Version control | `git` |
 
 !!! note "No NPU?"
-    You can follow most of these docs without NPU hardware. All winml-cli commands accept `--device auto` and fall back to CPU or DirectML automatically. The end-to-end tutorial documents an explicit CPU fallback path.
+    You can follow most of these docs without NPU hardware. All winml-cli commands accept `--device auto` and fall back to CPU or DirectML automatically. The tutorials document explicit CPU fallback paths.
 
 ## Install
 
@@ -70,5 +70,5 @@ This command enumerates available compute devices and execution providers on you
 ## Next steps
 
 - **[Quickstart](quickstart.md)** — export your first model in 5 minutes.
-- **[End-to-End Tour](end-to-end.md)** — full pipeline targeting whatever hardware you have (NPU / GPU / CPU).
+- **[End-to-End Tour](quickstart.md)** — full pipeline targeting whatever hardware you have (NPU / GPU / CPU).
 - **[How winml-cli Works](../concepts/how-it-works.md)** — the mental model.
diff --git a/docs/samples/convnext-primitives.md b/docs/samples/convnext-primitives.md
index 5cb0f2ea9..7a030404d 100644
--- a/docs/samples/convnext-primitives.md
+++ b/docs/samples/convnext-primitives.md
@@ -3,7 +3,7 @@
 !!! info "Pick the right ConvNeXt page"
     - **This sample** — primitives on CPU, GPU (DirectML), and NPU (QNN) side-by-side. Best when you want to compare devices.
     - **[ConvNeXt on NPU](../tutorials/npu-convnext.md)** — the canonical NPU production tutorial with both QNN and OpenVINO, plus the `winml build` one-shot.
-    - **[End-to-End Tour](../getting-started/end-to-end.md)** — short Getting Started tour.
+    - **[Quickstart](../getting-started/quickstart.md)** — short Getting Started tour.
 
 ConvNeXt Tiny is a compact convolutional image classifier trained on ImageNet-1k. At roughly 28 million parameters it is small enough to export and quantize in minutes on a developer laptop, yet representative enough that the latency and accuracy numbers you observe reflect real-world deployment trade-offs. Its straightforward architecture — no attention mechanisms, no dynamic control flow — makes it an ideal first model for learning the winml-cli pipeline.
 
diff --git a/docs/tutorials/npu-convnext.md b/docs/tutorials/npu-convnext.md
index b59d288fe..4cb1e0d4d 100644
--- a/docs/tutorials/npu-convnext.md
+++ b/docs/tutorials/npu-convnext.md
@@ -5,7 +5,7 @@
 
     - **This tutorial** — the canonical deep-dive: full pipeline with both QNN and OpenVINO NPU backends, plus the `winml build` one-shot. Start here if you want to ship to NPU.
     - **[ConvNeXt — Primitives Walkthrough](../samples/convnext-primitives.md)** — a CPU vs GPU vs NPU comparison using the primitive commands. Start here if you want to compare devices on the same model.
-    - **[End-to-End Tour](../getting-started/end-to-end.md)** — the short Getting Started introduction. Start here for a 15-minute taste.
+    - **[End-to-End Tour](../getting-started/quickstart.md)** — the short Getting Started introduction. Start here for a 15-minute taste.
 
 This tutorial walks you through the complete journey from a pretrained Hugging Face model — `facebook/convnext-tiny-224` — to a quantized, compiled artifact running on an NPU. By the end you will have benchmarked the model on your device and measured real inference latency. Nothing is skipped, and every command produces a file you can inspect or reuse.
 
diff --git a/mkdocs.yml b/mkdocs.yml
index ab7fbc39f..7b2518e90 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -82,7 +82,6 @@ nav:
   - Getting Started:
       - Installation: getting-started/installation.md
       - Quickstart: getting-started/quickstart.md
-      - End-to-End Tour: getting-started/end-to-end.md
       - Use with AI Agent: getting-started/agent-skill.md
   - Concepts:
       - Fundamentals:

From e40ad9c41b2adbed6c9bb26c78f68c411105c1c5 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Wed, 10 Jun 2026 19:12:41 +0800
Subject: [PATCH 121/143] docs: add UI Quickstart to nav

---
 mkdocs.yml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mkdocs.yml b/mkdocs.yml
index 7b2518e90..6a30bea1e 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -82,6 +82,7 @@ nav:
   - Getting Started:
       - Installation: getting-started/installation.md
       - Quickstart: getting-started/quickstart.md
+      - UI Quickstart: getting-started/ui-quickstart.md
       - Use with AI Agent: getting-started/agent-skill.md
   - Concepts:
       - Fundamentals:

From 6c8f2c3a591b8f2e7b93c2d3227bcfe775413f2c Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Wed, 10 Jun 2026 19:24:14 +0800
Subject: [PATCH 122/143] docs: remove ConvNeXt primitives page and fix all
 references

---
 docs/concepts/primitives-and-pipeline.md |   2 +-
 docs/getting-started/quickstart.md       |   3 +-
 docs/index.md                            |   2 +-
 docs/samples/bert-config-build.md        |   2 +-
 docs/samples/clip-composite.md           |   2 +-
 docs/samples/convnext-primitives.md      | 175 -----------------------
 docs/tutorials/npu-convnext.md           |   6 +-
 mkdocs.yml                               |   1 -
 8 files changed, 7 insertions(+), 186 deletions(-)
 delete mode 100644 docs/samples/convnext-primitives.md

diff --git a/docs/concepts/primitives-and-pipeline.md b/docs/concepts/primitives-and-pipeline.md
index ffff606e1..664555e6b 100644
--- a/docs/concepts/primitives-and-pipeline.md
+++ b/docs/concepts/primitives-and-pipeline.md
@@ -105,5 +105,5 @@ tune fusion flags and calibration — and then encode the final settings into a
 - [Config and build](config-and-build.md) — generating and versioning a
   `WinMLBuildConfig`
 - [winml build command reference](../commands/build.md)
-- [ConvNeXT primitives sample](../samples/convnext-primitives.md) — worked example
+- [ConvNeXt on NPU tutorial](../tutorials/npu-convnext.md) — worked example
   using primitive commands end-to-end
diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md
index 7b33d98d2..4a8e5ec8f 100644
--- a/docs/getting-started/quickstart.md
+++ b/docs/getting-started/quickstart.md
@@ -75,7 +75,7 @@ uv run winml perf -m resnet_out/model.onnx --device auto --iterations 50 --monit
 ## What's next
 
 - **[How winml-cli Works](../concepts/how-it-works.md)** — understand what each command does under the hood.
-- **[ConvNeXt primitives sample](../samples/convnext-primitives.md)** — see every pipeline stage in detail with a representative model.
+- **[BERT sample](../samples/bert-config-build.md)** — see the config + build + perf workflow in detail with a representative model.
 
 ## See also
 
@@ -83,4 +83,3 @@ uv run winml perf -m resnet_out/model.onnx --device auto --iterations 50 --monit
 - [`winml inspect`](../commands/inspect.md)
 - [`winml perf`](../commands/perf.md)
 - [`winml sys`](../commands/sys.md)
-
diff --git a/docs/index.md b/docs/index.md
index 6136d7927..fbddfd2f1 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -25,7 +25,7 @@ Purpose-built for Windows hardware diversity, the CLI handles conversion, graph
 
 - **[How winml-cli Works](concepts/how-it-works.md)** — the pipeline from a PyTorch model to an EP-compiled artifact.
 - **[Commands](commands/overview.md)** — reference for all 12 `winml` subcommands.
-- **[Samples](samples/convnext-primitives.md)** — end-to-end walkthroughs for ConvNeXt, BERT, and CLIP.
+- **[Samples](samples/bert-config-build.md)** — walkthroughs for BERT and CLIP.
 
 ## License
 
diff --git a/docs/samples/bert-config-build.md b/docs/samples/bert-config-build.md
index 3b391e323..da5a5ad09 100644
--- a/docs/samples/bert-config-build.md
+++ b/docs/samples/bert-config-build.md
@@ -2,7 +2,7 @@
 
 BERT (`bert-base-uncased`) is a canonical text model that exercises every stage of the winml-cli pipeline: it has multiple input tensors, benefits from graph fusion (GeLU, LayerNorm, MatMul+Add), and produces quantizable activations that run well on NPU. That combination makes it a useful reference point for teams deploying transformer encoders on Windows.
 
-This sample walks through the production-style workflow: generate a reusable `WinMLBuildConfig` JSON file with `winml config`, run the full export → optimize → quantize → compile pipeline in one shot with `winml build`, and measure the result with `winml perf`. If you want to understand each pipeline stage individually before running the all-in-one command, read the [ConvNeXt primitives sample](convnext-primitives.md) first.
+This sample walks through the production-style workflow: generate a reusable `WinMLBuildConfig` JSON file with `winml config`, run the full export → optimize → quantize → compile pipeline in one shot with `winml build`, and measure the result with `winml perf`. If you want to understand each pipeline stage individually before running the all-in-one command, read the [ConvNeXt on NPU tutorial](../tutorials/npu-convnext.md) first.
 
 ## Prerequisites
 
diff --git a/docs/samples/clip-composite.md b/docs/samples/clip-composite.md
index 04b12e610..e48dd951f 100644
--- a/docs/samples/clip-composite.md
+++ b/docs/samples/clip-composite.md
@@ -156,6 +156,6 @@ The same composite model pattern is used for:
 ## See also
 
 - [BERT — Config + Build + Perf](bert-config-build.md) — single-model workflow
-- [ConvNeXt — Primitive commands](convnext-primitives.md) — step-by-step pipeline
+- [ConvNeXt on NPU](../tutorials/npu-convnext.md) — step-by-step pipeline
 - [Supported Models](../reference/supported-models.md) — full list of validated architectures
 - [Config and build](../concepts/config-and-build.md) — concept overview
diff --git a/docs/samples/convnext-primitives.md b/docs/samples/convnext-primitives.md
deleted file mode 100644
index 7a030404d..000000000
--- a/docs/samples/convnext-primitives.md
+++ /dev/null
@@ -1,175 +0,0 @@
-# ConvNeXt — Primitives Walkthrough
-
-!!! info "Pick the right ConvNeXt page"
-    - **This sample** — primitives on CPU, GPU (DirectML), and NPU (QNN) side-by-side. Best when you want to compare devices.
-    - **[ConvNeXt on NPU](../tutorials/npu-convnext.md)** — the canonical NPU production tutorial with both QNN and OpenVINO, plus the `winml build` one-shot.
-    - **[Quickstart](../getting-started/quickstart.md)** — short Getting Started tour.
-
-ConvNeXt Tiny is a compact convolutional image classifier trained on ImageNet-1k. At roughly 28 million parameters it is small enough to export and quantize in minutes on a developer laptop, yet representative enough that the latency and accuracy numbers you observe reflect real-world deployment trade-offs. Its straightforward architecture — no attention mechanisms, no dynamic control flow — makes it an ideal first model for learning the winml-cli pipeline.
-
-This walkthrough drives the full pipeline using the primitive commands directly: `winml inspect`, `winml config`, `winml export`, `winml quantize`, `winml compile`, `winml perf`, and `winml eval`. Running the steps individually rather than through `winml build` exposes what each command does and how its output feeds the next stage. The walkthrough covers three execution providers: CPU, GPU (DirectML), and NPU (Qualcomm QNN).
-
-## Prerequisites
-
-- winml-cli installed and `winml` available on your PATH — see [Installation](../getting-started/installation.md).
-- Internet access so HuggingFace Hub can download the model weights on first run.
-
-## Step 1: Inspect the model
-
-Before touching weights, confirm that winml-cli recognises the model and knows which task, loader class, and exporter to use.
-
-```bash
-winml inspect -m facebook/convnext-tiny-224
-```
-
-```text
-+------------------------- facebook/convnext-tiny-224 --------------------------+
-| Task          image-classification                                             |
-| Model Class   ConvNextForImageClassification                                   |
-| Exporter      OptimumExporter                                                  |
-| WinML Class   WinMLImageClassificationModel                                    |
-| Status        Supported                                                        |
-+-------------------------------------------------------------------------------+
-```
-
-!!! note "What we just did"
-    `winml inspect` fetched only the model's `config.json` from HuggingFace Hub — no weights — and confirmed that `facebook/convnext-tiny-224` maps to a supported task (`image-classification`), a known model class, and a compatible ONNX exporter.
-
-## Step 2: Generate a config (optional)
-
-```bash
-winml config -m facebook/convnext-tiny-224 -o convnext_config.json
-```
-
-Generating a config file is optional when running the primitives individually, but it is good practice: the JSON captures the auto-detected loader, export, quantization, and compile settings in one reproducible artifact. You can check it into source control, diff it against future versions of the model, or hand-edit individual fields before passing it to `winml build`. For a full description of every field, see [Config and build](../concepts/config-and-build.md).
-
-## Step 3: Export to ONNX
-
-Download the model weights and convert the PyTorch graph to a portable ONNX file.
-
-```bash
-winml export -m facebook/convnext-tiny-224 -o convnext.onnx
-```
-
-```text
-Model: facebook/convnext-tiny-224
-Output: convnext.onnx
-
-Starting HTP export...
-  Detected task: image-classification
-
-Success! Model exported to: convnext.onnx
-```
-
-!!! note "Hierarchy metadata"
-    By default `winml export` embeds `hierarchy_tag` metadata in each ONNX node, recording which PyTorch module the node originated from. This lets downstream tools like `winml perf --module` and `winml analyze` reason about operator groups rather than flat graph positions. To skip the metadata and produce a clean ONNX file, add `--no-hierarchy`. For more detail see [Load and export](../concepts/load-and-export.md).
-
-## Step 4: Quantize
-
-Insert QDQ (Quantize/Dequantize) nodes using 32 calibration samples drawn from the task-default dataset.
-
-```bash
-winml quantize -m convnext.onnx -o convnext_int8.onnx --precision int8 --samples 32
-```
-
-```text
-Calibrating: 32 samples [minmax]
-Inserting QDQ nodes...
-Saved: convnext_int8.onnx
-```
-
-!!! note "Calibration"
-    Static quantization needs representative inputs to estimate each tensor's value range before baking scale and zero-point constants into the QDQ nodes. The `--samples` flag controls how many calibration inputs are used; 32 is a reasonable starting point for vision classifiers. If you see accuracy regression after quantization, try increasing `--samples` or switching to `--method entropy`. See [Quantization & QDQ](../concepts/quantization.md) for the full trade-off discussion.
-
-## Step 5: Compile for each EP
-
-Compilation pre-bakes an EP-specific binary cache into the ONNX graph so the runtime can skip per-session JIT compilation. The examples below use the default `ort` compiler backend, which uses ONNX Runtime's built-in compiler.
-
-=== "CPU"
-
-    ```bash
-    winml compile -m convnext_int8.onnx --output-dir . --device cpu
-    ```
-
-=== "GPU"
-
-    ```bash
-    winml compile -m convnext_int8.onnx --output-dir . --device gpu
-    ```
-
-=== "NPU (ORT, default)"
-
-    ```bash
-    winml compile -m convnext_int8.onnx --output-dir . --device npu
-    ```
-
-!!! note "NPU compiler backend"
-    The default `--compiler ort` backend uses ONNX Runtime's built-in compilation. For a full explanation of how EPs relate to device targets see [ONNX & Execution Providers](../concepts/eps-and-devices.md).
-
-Only the NPU invocation writes a new compiled artifact — `convnext_int8_npu_ctx.onnx` — which contains an EPContext node embedding the pre-compiled binary. CPU and GPU compile with `enable_ep_context=False` by default: the compile step validates the model against the target EP but does not produce a new file. For CPU and GPU perf benchmarks (Step 6), use the quantized `convnext_int8.onnx` directly.
-
-## Step 6: Benchmark
-
-Measure latency and throughput on each device. Pass the compiled ONNX directly so the benchmark uses the pre-compiled artifact.
-
-```bash
-winml perf -m convnext_int8.onnx --device cpu --iterations 200
-```
-
-```text
-Device:      cpu
-Precision:   auto
-Task:        image-classification
-Iterations:  200 (+ 10 warmup)
-Batch Size:  1
-
-Latency (ms)
-  Avg    P50    P90    P95    P99    Min    Max    Std
- 8.41   8.35   9.02   9.31  10.14   7.88  12.63   0.48
-
-Throughput: 118.91 samples/sec
-```
-
-```bash
-winml perf -m convnext_int8.onnx --device gpu --iterations 200
-winml perf -m convnext_int8_npu_ctx.onnx --device npu --iterations 200
-```
-
-The NPU variant typically delivers the lowest latency and highest power efficiency on Qualcomm Snapdragon hardware. Use the JSON output written by `--output` to compare runs programmatically.
-
-## Step 7: Evaluate
-
-Measure top-1 accuracy on 100 samples from the ImageNet-1k validation split. When passing an ONNX file, supply `--model-id` so the command knows which preprocessor and label vocabulary to use.
-
-```bash
-winml eval -m convnext_int8.onnx --model-id facebook/convnext-tiny-224 \
-    --dataset imagenet-1k --split validation --samples 100 --device cpu
-```
-
-```text
-Task:     image-classification
-Dataset:  imagenet-1k (validation, 100 samples)
-Device:   cpu
-
-Accuracy: 81.00%
-
-Results saved to: convnext_int8_eval.json
-```
-
-To compare quantized accuracy against the floating-point baseline, run the same command with `convnext.onnx` and compare the two JSON outputs.
-
-## What you learned
-
-- `winml inspect` checks task detection and exporter compatibility from the model's `config.json` alone — no weight download needed.
-- `winml config` captures the full pipeline configuration as a reproducible JSON file.
-- `winml export` converts the PyTorch model to a portable ONNX graph and embeds hierarchy metadata for downstream analysis.
-- `winml quantize` inserts QDQ nodes using calibration data; `--precision int8` and `--samples` control the precision and calibration budget.
-- `winml compile` pre-bakes an EP-specific binary cache for NPU (producing `convnext_int8_npu_ctx.onnx`); CPU and GPU compile steps validate EP compatibility but produce no new artifact — use the quantized `convnext_int8.onnx` for those devices.
-- `winml perf` and `winml eval` consume the final artifact without modifying it — benchmark first, then validate accuracy before shipping.
-
-## See also
-
-- [BERT — Config + Build + Perf](bert-config-build.md) — the same pipeline driven through `winml build` with a config file
-- [How winml-cli Works](../concepts/how-it-works.md) — pipeline overview and stage descriptions
-- [Quantization & QDQ](../concepts/quantization.md) — calibration methods and accuracy trade-offs
-- [ONNX & Execution Providers](../concepts/eps-and-devices.md) — EP selection and device flags
diff --git a/docs/tutorials/npu-convnext.md b/docs/tutorials/npu-convnext.md
index 4cb1e0d4d..ed461d36c 100644
--- a/docs/tutorials/npu-convnext.md
+++ b/docs/tutorials/npu-convnext.md
@@ -1,11 +1,10 @@
 # ConvNeXt on NPU
 
 !!! info "Pick the right ConvNeXt page"
-    Three pages use ConvNeXt as their vehicle, each with a different teaching purpose:
+    Two pages use ConvNeXt as their vehicle:
 
     - **This tutorial** — the canonical deep-dive: full pipeline with both QNN and OpenVINO NPU backends, plus the `winml build` one-shot. Start here if you want to ship to NPU.
-    - **[ConvNeXt — Primitives Walkthrough](../samples/convnext-primitives.md)** — a CPU vs GPU vs NPU comparison using the primitive commands. Start here if you want to compare devices on the same model.
-    - **[End-to-End Tour](../getting-started/quickstart.md)** — the short Getting Started introduction. Start here for a 15-minute taste.
+    - **[Quickstart](../getting-started/quickstart.md)** — the short Getting Started introduction. Start here for a 15-minute taste.
 
 This tutorial walks you through the complete journey from a pretrained Hugging Face model — `facebook/convnext-tiny-224` — to a quantized, compiled artifact running on an NPU. By the end you will have benchmarked the model on your device and measured real inference latency. Nothing is skipped, and every command produces a file you can inspect or reuse.
 
@@ -278,7 +277,6 @@ The result should match what you saw in Step 8, confirming that the `winml build
 
 - [Concepts → How winml-cli works](../concepts/how-it-works.md) — the full mental model for the pipeline
 - [Concepts → Compile and EPContext](../concepts/compile-and-epcontext.md) — understanding the compiled artifact format
-- [Samples → ConvNeXt primitives walkthrough](../samples/convnext-primitives.md) — a side-by-side CPU vs. GPU vs. NPU device comparison using the same model
 - [Commands → Overview](../commands/overview.md) — quick reference for every flag on every command
 
 ## See also
diff --git a/mkdocs.yml b/mkdocs.yml
index 6a30bea1e..bbcdfa772 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -118,7 +118,6 @@ nav:
           - perf: commands/perf.md
           - eval: commands/eval.md
   - Samples:
-      - ConvNeXt — Primitives Walkthrough: samples/convnext-primitives.md
       - BERT — Config + Build + Perf: samples/bert-config-build.md
       - CLIP — Composite Models: samples/clip-composite.md
   - Tutorials:

From 570f9b498e02ef379c762002df4f7e46d5128b24 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Wed, 10 Jun 2026 19:35:44 +0800
Subject: [PATCH 123/143] docs: rename ConvNeXt tutorial, remove site logo icon

---
 docs/concepts/primitives-and-pipeline.md | 2 +-
 docs/samples/bert-config-build.md        | 2 +-
 docs/samples/clip-composite.md           | 2 +-
 docs/tutorials/build-from-onnx.md        | 4 ++--
 docs/tutorials/index.md                  | 2 +-
 docs/tutorials/npu-convnext.md           | 2 +-
 mkdocs.yml                               | 3 ++-
 7 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/docs/concepts/primitives-and-pipeline.md b/docs/concepts/primitives-and-pipeline.md
index 664555e6b..f9d820d3a 100644
--- a/docs/concepts/primitives-and-pipeline.md
+++ b/docs/concepts/primitives-and-pipeline.md
@@ -105,5 +105,5 @@ tune fusion flags and calibration — and then encode the final settings into a
 - [Config and build](config-and-build.md) — generating and versioning a
   `WinMLBuildConfig`
 - [winml build command reference](../commands/build.md)
-- [ConvNeXt on NPU tutorial](../tutorials/npu-convnext.md) — worked example
+- [Hugging Face Model to NPU tutorial](../tutorials/npu-convnext.md) — worked example
   using primitive commands end-to-end
diff --git a/docs/samples/bert-config-build.md b/docs/samples/bert-config-build.md
index da5a5ad09..d65d23328 100644
--- a/docs/samples/bert-config-build.md
+++ b/docs/samples/bert-config-build.md
@@ -2,7 +2,7 @@
 
 BERT (`bert-base-uncased`) is a canonical text model that exercises every stage of the winml-cli pipeline: it has multiple input tensors, benefits from graph fusion (GeLU, LayerNorm, MatMul+Add), and produces quantizable activations that run well on NPU. That combination makes it a useful reference point for teams deploying transformer encoders on Windows.
 
-This sample walks through the production-style workflow: generate a reusable `WinMLBuildConfig` JSON file with `winml config`, run the full export → optimize → quantize → compile pipeline in one shot with `winml build`, and measure the result with `winml perf`. If you want to understand each pipeline stage individually before running the all-in-one command, read the [ConvNeXt on NPU tutorial](../tutorials/npu-convnext.md) first.
+This sample walks through the production-style workflow: generate a reusable `WinMLBuildConfig` JSON file with `winml config`, run the full export → optimize → quantize → compile pipeline in one shot with `winml build`, and measure the result with `winml perf`. If you want to understand each pipeline stage individually before running the all-in-one command, read the [Hugging Face Model to NPU tutorial](../tutorials/npu-convnext.md) first.
 
 ## Prerequisites
 
diff --git a/docs/samples/clip-composite.md b/docs/samples/clip-composite.md
index e48dd951f..4f09d833a 100644
--- a/docs/samples/clip-composite.md
+++ b/docs/samples/clip-composite.md
@@ -156,6 +156,6 @@ The same composite model pattern is used for:
 ## See also
 
 - [BERT — Config + Build + Perf](bert-config-build.md) — single-model workflow
-- [ConvNeXt on NPU](../tutorials/npu-convnext.md) — step-by-step pipeline
+- [Hugging Face Model to NPU](../tutorials/npu-convnext.md) — step-by-step pipeline
 - [Supported Models](../reference/supported-models.md) — full list of validated architectures
 - [Config and build](../concepts/config-and-build.md) — concept overview
diff --git a/docs/tutorials/build-from-onnx.md b/docs/tutorials/build-from-onnx.md
index 69d181032..7dbe6fc42 100644
--- a/docs/tutorials/build-from-onnx.md
+++ b/docs/tutorials/build-from-onnx.md
@@ -2,7 +2,7 @@
 
 This tutorial walks you through the complete workflow for optimizing, analyzing, and deploying an ONNX model you already have — whether you exported it yourself (`torch.onnx.export`, ONNX Runtime tools), received it from a teammate, or downloaded it from the ONNX Model Zoo.
 
-Unlike the [ConvNeXt on NPU](npu-convnext.md) tutorial which starts from a HuggingFace model ID, this tutorial assumes you already have a `.onnx` file on disk and want to make it run faster on your target hardware.
+Unlike the [Hugging Face Model to NPU](npu-convnext.md) tutorial which starts from a HuggingFace model ID, this tutorial assumes you already have a `.onnx` file on disk and want to make it run faster on your target hardware.
 
 The tutorial is split into two sections. Section A walks through the analyze → optimize → re-analyze loop using primitive commands, teaching you how the optimization feedback cycle works. Section B shows how `winml build` automates that same loop in a single command, optionally targeting NPU with quantization.
 
@@ -261,7 +261,7 @@ print(f"Final model: {result.final_onnx_path}")
 
 ## Where to go next
 
-- [ConvNeXt on NPU](npu-convnext.md) — the same pipeline starting from HuggingFace (includes export stage)
+- [Hugging Face Model to NPU](npu-convnext.md) — the same pipeline starting from HuggingFace (includes export stage)
 - [Output Layout](../reference/output-layout.md) — what each output file contains and the `analyze_result.json` schema
 - [Concepts → Analyze and optimize](../concepts/analyze-and-optimize.md) — how the convergence loop works internally
 - [Build Config Schema](../reference/index.md) — customize quantization, compilation, and optimization settings
diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md
index 7f4d23713..cae936b5d 100644
--- a/docs/tutorials/index.md
+++ b/docs/tutorials/index.md
@@ -6,7 +6,7 @@ Tutorials are linear, prescriptive, end-to-end walkthroughs that guide you throu
 
 | Tutorial | What you'll build | Hardware |
 |---|---|---|
-| [ConvNeXt on NPU](npu-convnext.md) | A quantized ConvNeXt image classifier compiled for Snapdragon NPU (with CPU/DirectML fallback) | Copilot+PC NPU primary; CPU works as fallback |
+| [Hugging Face Model to NPU](npu-convnext.md) | A quantized ConvNeXt image classifier compiled for Snapdragon NPU (with CPU/DirectML fallback) | Copilot+PC NPU primary; CPU works as fallback |
 | [Bring Your Own ONNX Model](build-from-onnx.md) | Optimize and deploy an ONNX file you already have, using the analyze → optimize → re-analyze feedback loop | Any (CPU, NPU, GPU) |
 
 More tutorials are coming, covering additional model families, execution providers, and deployment scenarios. Check back as the `winml-cli` documentation expands.
diff --git a/docs/tutorials/npu-convnext.md b/docs/tutorials/npu-convnext.md
index ed461d36c..4c037cf3a 100644
--- a/docs/tutorials/npu-convnext.md
+++ b/docs/tutorials/npu-convnext.md
@@ -1,4 +1,4 @@
-# ConvNeXt on NPU
+# Hugging Face Model to NPU
 
 !!! info "Pick the right ConvNeXt page"
     Two pages use ConvNeXt as their vehicle:
diff --git a/mkdocs.yml b/mkdocs.yml
index bbcdfa772..b62d86296 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -19,6 +19,7 @@ extra:
 
 theme:
   name: material
+  logo: ""
   features:
     - navigation.instant
     - navigation.tracking
@@ -122,7 +123,7 @@ nav:
       - CLIP — Composite Models: samples/clip-composite.md
   - Tutorials:
       - Overview: tutorials/index.md
-      - ConvNeXt on NPU: tutorials/npu-convnext.md
+      - Hugging Face Model to NPU: tutorials/npu-convnext.md
       - Bring Your Own ONNX Model: tutorials/build-from-onnx.md
   - Reference:
       - Config Schema: reference/index.md

From bcdd42121df3386c70c031b643ab00bdfff2b804 Mon Sep 17 00:00:00 2001
From: Brenda Bai <yiba@microsoft.com>
Date: Wed, 10 Jun 2026 19:44:20 +0800
Subject: [PATCH 124/143] docs: expand What you learned section in BERT sample

---
 docs/samples/bert-config-build.md | 24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/docs/samples/bert-config-build.md b/docs/samples/bert-config-build.md
index d65d23328..e3b25c6e3 100644
--- a/docs/samples/bert-config-build.md
+++ b/docs/samples/bert-config-build.md
@@ -7,7 +7,6 @@ This sample walks through the production-style workflow: generate a reusable `Wi
 ## Prerequisites
 
 - winml-cli installed and `winml` on your PATH.
-- A network connection to download `bert-base-uncased` weights from HuggingFace on first run.
 - A target device (NPU or GPU recommended; CPU also works).
 
 ## Step 1: Generate a build config
@@ -22,33 +21,28 @@ This writes a `WinMLBuildConfig` JSON file to `bert_config.json`. The file captu
 {
   "loader": {
     "task": "text-classification",
+    "model_class": "AutoModelForSequenceClassification",
     "model_type": "bert"
   },
   "export": {
     "opset_version": 17,
     "batch_size": 1
+    .. // truncated: input_tensors, output_tensors
   },
-  "optim": {
-    "gelu_fusion": true,
-    "layer_norm_fusion": true,
-    "matmul_add_fusion": true
+   "optim": {
+    "clamp_constant_values": true
   },
   "quant": {
     "mode": "qdq",
     "weight_type": "uint8",
-    "activation_type": "uint8",
+    "activation_type": "uint16",
     "samples": 10,
     "calibration_method": "minmax",
     "task": "text-classification",
     "model_name": "bert-base-uncased"
     ... // truncated: per_channel, symmetric, distribution, ...
   },
-  "compile": {
-    "execution_provider": "qnn",
-    "enable_ep_context": true,
-    "compiler": "ort"
-    ... // truncated: provider_options, embed_context, validate, ...
-  }
+  "compile": null
 }
 ```
 
@@ -116,10 +110,10 @@ winml config -m bert-base-uncased -t text-classification --precision fp16 -o ber
 
 Alternatively, edit `bert_config.json` directly: set `quant.weight_type` and `quant.activation_type` to `"int8"` or `"uint16"`, or set `quant` to `null` to skip quantization entirely.
 
-**Disable a stage at build time.** You can suppress a stage for a single run without touching the config file using the `--no-quant` or `--no-compile` flags:
+**Disable a stage at build time.** You can suppress a stage for a single run without touching the config file using the `--no-quant` flags:
 
 ```bash
-winml build -c bert_config.json -m bert-base-uncased --output-dir bert_out/ --no-quant
+winml build -c bert_config.json -m bert-base-uncased --output-dir bert_out/ --no-quant 
 ```
 
 This is useful for measuring the fp32 baseline before committing to a quantized build. The `quant` section in `bert_config.json` is unchanged; the flag only affects this invocation. See [Config and build](../concepts/config-and-build.md) for the full list of configurable fields.
@@ -129,9 +123,9 @@ This is useful for measuring the fp32 baseline before committing to a quantized
 - `winml config` generates a complete, version-controllable `WinMLBuildConfig` JSON from a HuggingFace model ID in one command.
 - `winml build` orchestrates the full export → optimize → quantize → compile pipeline from a single config file and model ID.
 - The autoconf loop inside the optimize stage adjusts graph fusion flags automatically to maximize EP compatibility.
-- JSON fields (`quant`, `compile`) and CLI flags (`--no-quant`, `--no-compile`) are interchangeable ways to skip stages; CLI flags win for one-off experiments without modifying the file.
 - `winml perf` gives a latency and throughput baseline on the built artifact in seconds.
 
+
 ## See also
 
 - [winml config](../commands/config.md)

From 621ff23fd97292f8875ea91a9c4365feff4e2b0b Mon Sep 17 00:00:00 2001
From: Brenda Bai <yiba@microsoft.com>
Date: Wed, 10 Jun 2026 19:54:51 +0800
Subject: [PATCH 125/143] docs: add repo access link to index and tutorials
 pages

---
 docs/index.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/index.md b/docs/index.md
index fbddfd2f1..6f639b662 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -27,6 +27,10 @@ Purpose-built for Windows hardware diversity, the CLI handles conversion, graph
 - **[Commands](commands/overview.md)** — reference for all 12 `winml` subcommands.
 - **[Samples](samples/bert-config-build.md)** — walkthroughs for BERT and CLIP.
 
+## Repository access
+
+To request access to the Windows ML CLI repository, visit [aka.ms/winml-cli](https://aka.ms/winml-cli).
+
 ## License
 
 MIT. See [LICENSE](https://github.com/microsoft/winml-cli/blob/main/LICENSE.txt).

From e4ecb6c7b859b4c73c4b1a599736425c69c1acb7 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Wed, 10 Jun 2026 19:56:10 +0800
Subject: [PATCH 126/143] docs: rename site to Windows ML CLI, hide logo icon

---
 docs/stylesheets/extra.css | 3 +++
 mkdocs.yml                 | 6 ++++--
 2 files changed, 7 insertions(+), 2 deletions(-)
 create mode 100644 docs/stylesheets/extra.css

diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css
new file mode 100644
index 000000000..3f386a8da
--- /dev/null
+++ b/docs/stylesheets/extra.css
@@ -0,0 +1,3 @@
+.md-header__button.md-logo {
+  display: none;
+}
diff --git a/mkdocs.yml b/mkdocs.yml
index b62d86296..f96d83619 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -1,4 +1,4 @@
-site_name: winml-cli
+site_name: Windows ML CLI
 site_description: A CLI toolkit to build portable, performant, and high-quality models for Windows ML.
 site_url: https://microsoft.github.io/winml-cli/
 repo_url: https://github.com/microsoft/winml-cli
@@ -17,9 +17,11 @@ extra:
     provider: mike
     default: latest
 
+extra_css:
+  - stylesheets/extra.css
+
 theme:
   name: material
-  logo: ""
   features:
     - navigation.instant
     - navigation.tracking

From 5484ab641f97d78a4491222be372cd71ec9bb4fd Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 11 Jun 2026 08:51:28 +0800
Subject: [PATCH 127/143] docs: expand hierarchy tagging section in
 load-and-export

---
 docs/concepts/load-and-export.md | 45 +++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/docs/concepts/load-and-export.md b/docs/concepts/load-and-export.md
index fa72e5931..20b6183d1 100644
--- a/docs/concepts/load-and-export.md
+++ b/docs/concepts/load-and-export.md
@@ -18,7 +18,50 @@ Some community models host custom Python code in their repositories. The loader
 
 `winml export` converts the loaded model to ONNX. The conversion uses TorchScript tracing by default, which follows actual execution paths and tends to produce compact, inference-oriented graphs. A `--dynamo` flag exists for the PyTorch 2.x dynamo exporter; however, **Note:** the `--dynamo` flag is reserved for the PyTorch 2.x dynamo exporter but is **not yet functional** in the current release — passing it logs a warning and the flag is ignored.
 
-By default the exporter runs an eight-step process that includes hierarchy tracing and tag injection. Every ONNX node carries a `winml.hierarchy.tag` metadata entry recording the PyTorch module path it came from (e.g. `/BertModel/BertEncoder/BertLayer.3/BertAttention`), plus a companion `winml.hierarchy.depth` integer. The model itself also carries `winml.io.inputs` and `winml.io.outputs` JSON metadata describing the I/O tensor specs. Together these power per-module benchmarking with `winml perf --module`, inspector views with `winml inspect --hierarchy`, and optimizer scoping.
+By default the exporter runs an eight-step process that includes hierarchy tracing and tag injection. The result is an ONNX file enriched with structural metadata that powers downstream features such as per-module benchmarking, inspector views, and optimizer scoping.
+
+### Hierarchy tagging in detail
+
+During export the HTP (Hierarchy-preserving Tags Protocol) exporter attaches two pieces of information to every ONNX graph node via `node.metadata_props`:
+
+| Key | Value | Example |
+|-----|-------|---------|
+| `winml.hierarchy.tag` | Full module path the node originated from | `/BertModel/BertEncoder/BertLayer.3/BertAttention` |
+| `winml.hierarchy.depth` | Number of path segments (integer as string) | `4` |
+
+**How tags are built.** The exporter registers forward hooks on each module in the model. When a module executes, a pre-hook pushes its class name onto a tag stack; the post-hook pops it. This produces hierarchical paths that mirror the PyTorch module tree. Only modules that are actually executed during tracing receive tags — unused modules are excluded. For example, a typical BERT-tiny model has 48 registered modules but only 18 are reached during a forward pass.
+
+**Node-to-module mapping.** After the ONNX graph is produced by `torch.onnx.export`, a 4-priority system assigns each ONNX node to the closest matching module:
+
+1. **Direct match** — the node's scope name maps exactly to a traced module.
+2. **Parent match** — walk up the scope hierarchy until a traced module is found.
+3. **Operation fallback** (optional) — find the most similar scope by common prefix.
+4. **Root fallback** — unmatched nodes receive the model root tag (e.g. `/BertModel`).
+
+This guarantees 100 % tag coverage: every node in the graph carries a non-empty tag.
+
+### Graph-level metadata
+
+Beyond per-node tags, the exporter also writes model-level metadata properties:
+
+| Key | Content |
+|-----|---------|
+| `winml.io.inputs` | JSON array of `InputTensorSpec` — name, shape, dtype, and optional `value_range` |
+| `winml.io.outputs` | JSON array of `OutputTensorSpec` — name, shape, dtype |
+
+These I/O specs enable tools like `winml perf` to generate correct dummy inputs for benchmarking and `winml inspect` to display tensor shapes without loading the model into a runtime.
+
+### Sidecar metadata file
+
+Alongside the `.onnx` file, the exporter writes a `*_htp_metadata.json` sidecar containing the full hierarchy mapping, tagging coverage statistics, tracing execution summary, and input/output specs in a single queryable JSON document.
+
+### Features that depend on tags
+
+- **`winml inspect --hierarchy`** — reconstructs the module tree from tags and displays it as a Rich tree in the terminal.
+- **`winml perf --module <ClassName>`** — isolates a submodule (e.g. `BertAttention`) and benchmarks it independently.
+- **Optimizer scoping** — the optimizer can target specific hierarchy subtrees.
+
+### Disabling tags
 
 If you need a clean, standard-compliant ONNX without custom metadata — to hand off to a third-party tool, for example — pass `--no-hierarchy`. (The old `--clean-onnx` spelling remains as a deprecated hidden alias.) The graph behaviour is unchanged, but hierarchy-dependent features will not work against that file.
 

From dfad05f5e7735862a132abb6567e527c70340e0d Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 11 Jun 2026 09:04:24 +0800
Subject: [PATCH 128/143] docs: add concrete tag examples, mermaid diagram, and
 real export data

---
 docs/concepts/load-and-export.md | 67 ++++++++++++++++++++++++++++----
 1 file changed, 59 insertions(+), 8 deletions(-)

diff --git a/docs/concepts/load-and-export.md b/docs/concepts/load-and-export.md
index 20b6183d1..109e0bb2b 100644
--- a/docs/concepts/load-and-export.md
+++ b/docs/concepts/load-and-export.md
@@ -26,17 +26,61 @@ During export the HTP (Hierarchy-preserving Tags Protocol) exporter attaches two
 
 | Key | Value | Example |
 |-----|-------|---------|
-| `winml.hierarchy.tag` | Full module path the node originated from | `/BertModel/BertEncoder/BertLayer.3/BertAttention` |
+| `winml.hierarchy.tag` | Full module path the node originated from | `/BertModel/BertEncoder/BertLayer.0/BertAttention` |
 | `winml.hierarchy.depth` | Number of path segments (integer as string) | `4` |
 
-**How tags are built.** The exporter registers forward hooks on each module in the model. When a module executes, a pre-hook pushes its class name onto a tag stack; the post-hook pops it. This produces hierarchical paths that mirror the PyTorch module tree. Only modules that are actually executed during tracing receive tags — unused modules are excluded. For example, a typical BERT-tiny model has 48 registered modules but only 18 are reached during a forward pass.
+#### How tags are built
 
-**Node-to-module mapping.** After the ONNX graph is produced by `torch.onnx.export`, a 4-priority system assigns each ONNX node to the closest matching module:
+The exporter registers PyTorch forward hooks on each module. When a module executes, a pre-hook pushes its class name onto a tag stack; the post-hook pops it. This produces hierarchical paths that mirror the PyTorch module tree:
 
-1. **Direct match** — the node's scope name maps exactly to a traced module.
-2. **Parent match** — walk up the scope hierarchy until a traced module is found.
-3. **Operation fallback** (optional) — find the most similar scope by common prefix.
-4. **Root fallback** — unmatched nodes receive the model root tag (e.g. `/BertModel`).
+```mermaid
+flowchart LR
+    A[Register hooks] --> B[Run forward pass]
+    B --> C[Pre-hook pushes tag]
+    C --> D[Child modules execute]
+    D --> E[Post-hook pops tag]
+    E --> F[Tag stack → path]
+```
+
+Only modules that are actually executed during tracing receive tags — unused modules are excluded. For example, `prajjwal1/bert-tiny` has 48 registered modules but only 18 are reached during a forward pass.
+
+#### Concrete example: BERT-tiny
+
+Running `winml export -m prajjwal1/bert-tiny -o model.onnx -v` produces the following hierarchy tree (18 traced modules, 132 ONNX nodes, 100 % coverage):
+
+```
+BertModel (132 nodes)
+├── BertEmbeddings: embeddings (7 nodes)
+├── BertEncoder: encoder (106 nodes)
+│   ├── BertLayer: encoder.layer.0 (53 nodes)
+│   │   ├── BertAttention: encoder.layer.0.attention (39 nodes)
+│   │   │   ├── BertSelfOutput: encoder.layer.0.attention.output (4 nodes)
+│   │   │   └── BertSdpaSelfAttention: encoder.layer.0.attention.self (35 nodes)
+│   │   ├── BertIntermediate: encoder.layer.0.intermediate (10 nodes)
+│   │   │   └── GELUActivation: encoder.layer.0.intermediate.intermediate_act_fn (8 nodes)
+│   │   └── BertOutput: encoder.layer.0.output (4 nodes)
+│   └── BertLayer: encoder.layer.1 (53 nodes)
+│       └── ... (same structure)
+└── BertPooler: pooler (0 nodes)
+```
+
+Each ONNX node gets its tag from the module it belongs to. Here are a few examples from the actual exported model:
+
+| ONNX node name | Assigned tag |
+|---------------|--------------|
+| `/embeddings/word_embeddings/Gather` | `/BertModel/BertEmbeddings` |
+| `/encoder/layer.0/attention/self/query/MatMul` | `/BertModel/BertEncoder/BertLayer.0/BertAttention/BertSdpaSelfAttention` |
+| `/encoder/layer.0/intermediate/intermediate_act_fn/Mul` | `/BertModel/BertEncoder/BertLayer.0/BertIntermediate/GELUActivation` |
+| `/Unsqueeze` (no scope) | `/BertModel` (root fallback) |
+
+#### Node-to-module mapping
+
+After the ONNX graph is produced by `torch.onnx.export`, a 4-priority system assigns each ONNX node to the closest matching module:
+
+1. **Direct match** (61 %) — the node's scope name maps exactly to a traced module.
+2. **Parent match** (24 %) — walk up the scope hierarchy until a traced module is found.
+3. **Operation fallback** (optional, off by default) — find the most similar scope by common prefix.
+4. **Root fallback** (14 %) — unmatched nodes receive the model root tag (e.g. `/BertModel`).
 
 This guarantees 100 % tag coverage: every node in the graph carries a non-empty tag.
 
@@ -53,7 +97,14 @@ These I/O specs enable tools like `winml perf` to generate correct dummy inputs
 
 ### Sidecar metadata file
 
-Alongside the `.onnx` file, the exporter writes a `*_htp_metadata.json` sidecar containing the full hierarchy mapping, tagging coverage statistics, tracing execution summary, and input/output specs in a single queryable JSON document.
+Alongside the `.onnx` file, the exporter writes a `*_htp_metadata.json` sidecar containing:
+
+- **`nodes`** — complete mapping of every ONNX node name → hierarchy tag
+- **`modules`** — traced module information (class name, tag, execution order)
+- **`statistics`** — export time, node counts, coverage percentage
+- **`outputs`** — I/O tensor specifications
+
+Use `--with-report` to additionally generate a human-readable markdown report (`*_htp_export_report.md`).
 
 ### Features that depend on tags
 

From 6d4f4d4214a9360648108f8c7173e982dfaa39b3 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 11 Jun 2026 09:15:42 +0800
Subject: [PATCH 129/143] docs: fix inaccuracies in load-and-export tagging
 section

---
 docs/concepts/load-and-export.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/docs/concepts/load-and-export.md b/docs/concepts/load-and-export.md
index 109e0bb2b..fd432ab77 100644
--- a/docs/concepts/load-and-export.md
+++ b/docs/concepts/load-and-export.md
@@ -108,7 +108,7 @@ Use `--with-report` to additionally generate a human-readable markdown report (`
 
 ### Features that depend on tags
 
-- **`winml inspect --hierarchy`** — reconstructs the module tree from tags and displays it as a Rich tree in the terminal.
+- **`winml inspect --hierarchy`** — traces the model with random weights and displays the resulting module tree in the terminal. This is a lightweight preview of what tags will look like after a full export.
 - **`winml perf --module <ClassName>`** — isolates a submodule (e.g. `BertAttention`) and benchmarks it independently.
 - **Optimizer scoping** — the optimizer can target specific hierarchy subtrees.
 
@@ -116,8 +116,6 @@ Use `--with-report` to additionally generate a human-readable markdown report (`
 
 If you need a clean, standard-compliant ONNX without custom metadata — to hand off to a third-party tool, for example — pass `--no-hierarchy`. (The old `--clean-onnx` spelling remains as a deprecated hidden alias.) The graph behaviour is unchanged, but hierarchy-dependent features will not work against that file.
 
-Use `--with-report` to generate companion markdown and JSON reports alongside the output.
-
 ## Where it goes wrong
 
 Most export failures fall into three categories.

From 191a8e9ed380c3fe0c226b29c8150da53169971b Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 11 Jun 2026 09:47:05 +0800
Subject: [PATCH 130/143] docs: move Load and export before Primitives in nav

---
 mkdocs.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mkdocs.yml b/mkdocs.yml
index f96d83619..c6c21debe 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -95,8 +95,8 @@ nav:
           - Datatype and Quantization: concepts/quantization.md
           - EP and Device: concepts/eps-and-devices.md
       - WinML CLI:
-          - Primitives and pipeline: concepts/primitives-and-pipeline.md
           - Load and export: concepts/load-and-export.md
+          - Primitives and pipeline: concepts/primitives-and-pipeline.md
           - Analyze and optimize: concepts/analyze-and-optimize.md
           - Compile and EPContext: concepts/compile-and-epcontext.md
           - Perf and monitoring: concepts/perf-and-monitoring.md

From cbe63fcd6e493530d36030f0ef9f4e6da36bdac5 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 11 Jun 2026 09:52:13 +0800
Subject: [PATCH 131/143] docs: remove unimplemented optimizer scoping claim

---
 docs/concepts/load-and-export.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/concepts/load-and-export.md b/docs/concepts/load-and-export.md
index fd432ab77..b195adcd6 100644
--- a/docs/concepts/load-and-export.md
+++ b/docs/concepts/load-and-export.md
@@ -110,7 +110,6 @@ Use `--with-report` to additionally generate a human-readable markdown report (`
 
 - **`winml inspect --hierarchy`** — traces the model with random weights and displays the resulting module tree in the terminal. This is a lightweight preview of what tags will look like after a full export.
 - **`winml perf --module <ClassName>`** — isolates a submodule (e.g. `BertAttention`) and benchmarks it independently.
-- **Optimizer scoping** — the optimizer can target specific hierarchy subtrees.
 
 ### Disabling tags
 

From facf95e50ccbd4a491f117c7d7dc42169d5a4f4f Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 11 Jun 2026 10:50:48 +0800
Subject: [PATCH 132/143] docs: enrich perf-and-monitoring with real output,
 flag table, JSON example

---
 docs/concepts/perf-and-monitoring.md | 95 ++++++++++++++++++++++++----
 1 file changed, 82 insertions(+), 13 deletions(-)

diff --git a/docs/concepts/perf-and-monitoring.md b/docs/concepts/perf-and-monitoring.md
index f507e179a..02049c4d5 100644
--- a/docs/concepts/perf-and-monitoring.md
+++ b/docs/concepts/perf-and-monitoring.md
@@ -6,37 +6,106 @@ Because `winml perf` accepts both HuggingFace model IDs and local `.onnx` files,
 
 ## What perf measures
 
-At its core, `winml perf` runs a configurable number of inference iterations and reports latency statistics: p50, p90, and mean latency in milliseconds, plus throughput in inferences per second. Warmup iterations (controlled by `--warmup`, defaulting to 10) are excluded from the statistics so that JIT and cache effects do not skew the numbers.
-
-You can control the run length with `--iterations` and the input shape with `--batch-size` or a `--shape-config` JSON file for models with dynamic axes. The `--device` flag selects the target EP — `cpu`, `gpu`, `npu`, or `auto` (default) — allowing you to collect numbers on each target with the same command and compare them directly. For fine-grained EP control, `--ep` lets you name a specific provider such as `qnn` or `dml`.
-
-The results are written to a JSON file at `~/.cache/winml/perf/<slug>/<timestamp>.json` (or a custom path via `--output`) so they can be archived and compared across builds.
+At its core, `winml perf` runs a configurable number of inference iterations and reports latency statistics. Here is a real example benchmarking `bert-tiny` on CPU:
+
+```
+$ winml perf -m bert-tiny.onnx --device cpu --iterations 50 --warmup 5
+
+Device:      cpu / CPUExecutionProvider
+Model Precision:   fp32
+Inputs:      input_ids            [1, 512]               int32
+             attention_mask       [1, 512]               int32
+             token_type_ids       [1, 512]               int32
+Outputs:     last_hidden_state    [1, 512, 128]
+
+Latency (ms)
+┏━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┓
+┃  Avg ┃  P50 ┃  P90 ┃  P95 ┃  P99 ┃  Min ┃  Max ┃  Std ┃
+┡━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━┩
+│ 5.53 │ 5.40 │ 6.55 │ 6.87 │ 7.65 │ 4.89 │ 7.65 │ 0.58 │
+└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
+  Warmup: 14.14 ms avg (first 5 iterations)
+
+Throughput: 180.72 samples/sec
+```
+
+Key parameters:
+
+| Flag | Purpose | Default |
+|------|---------|---------|
+| `--iterations` | Number of benchmark iterations | 100 |
+| `--warmup` | Warmup iterations excluded from statistics | 10 |
+| `--batch-size` | Batch size for input generation | 1 |
+| `-d, --device` | Target device: `auto`, `cpu`, `gpu`, `npu` | `auto` |
+| `--ep` | Specific execution provider (e.g. `qnn`, `dml`, `openvino`) | auto-resolved from device |
+| `--precision` | Precision mode: `auto`, `fp32`, `fp16`, `int8`, `int16`, or `w{x}a{y}` | `auto` |
+| `--quantize/--no-quantize` | Include quantization during model build | `--quantize` |
+| `--skip-build/--no-skip-build` | Skip the build pipeline for ONNX inputs | `--skip-build` |
+
+### Output format
+
+Add `-f json` to emit structured JSON to stdout, suitable for CI pipelines or automated comparisons:
+
+```json
+{
+  "benchmark_info": {
+    "model_id": "bert-tiny.onnx",
+    "device": "cpu",
+    "ep": "CPUExecutionProvider",
+    "iterations": 50,
+    "warmup": 5,
+    "batch_size": 1
+  },
+  "latency_ms": {
+    "avg": 5.53, "p50": 5.40, "p90": 6.55,
+    "p95": 6.87, "p99": 7.65, "min": 4.89, "max": 7.65
+  },
+  "throughput": { "samples_per_sec": 180.72 },
+  "raw_samples_ms": [5.12, 5.40, ...]
+}
+```
+
+Results are also saved automatically to `~/.cache/winml/perf/<model_slug>/<timestamp>.json` for later comparison. Override the path with `--output`.
 
 ## Live monitoring
 
 Latency numbers alone do not tell you whether the hardware is actually being used. A slow NPU inference could mean the model is running on the NPU and hitting a memory bottleneck, or it could mean the EP silently fell back to CPU and is not using the NPU at all.
 
-The `--monitor` flag adds a live terminal chart that streams hardware utilisation for whichever device is being benchmarked. The chart updates in place during the iteration loop so you can see whether utilisation is sustained, bursty, or absent. This is particularly useful when commissioning a new model on QNN or DirectML hardware, where EP fallback can be hard to detect from latency numbers alone. If the chart stays near zero while the benchmark runs, the model is not executing on the expected device.
+The `--monitor` flag adds a live terminal chart (powered by plotext + Rich Live) that streams hardware utilisation for whichever device is being benchmarked. The chart auto-refreshes in a background thread so you can see whether utilisation is sustained, bursty, or absent. This is particularly useful when commissioning a new model on QNN or DirectML hardware, where EP fallback can be hard to detect from latency numbers alone. If the chart stays near zero while the benchmark runs, the model is not executing on the expected device.
+
+```
+winml perf -m model.onnx --device npu --monitor
+```
 
 `--monitor` has no effect on the measured latency statistics — it is a passive observer.
 
 ## Per-operator tracing
 
-When end-to-end latency is higher than expected, per-operator tracing lets you find the operators that are responsible. This capability is available via a hidden `--op-tracing` flag (not shown in `--help`) intended for advanced diagnostics. Two levels are available:
+When end-to-end latency is higher than expected, per-operator tracing lets you find the operators that are responsible. This capability is available via a hidden `--op-tracing` flag (not shown in `--help`) and requires `onnxruntime-qnn` to be installed.
 
-`--op-tracing basic` collects cumulative time per operator type and reports a ranked list. This is usually enough to identify whether, say, a sequence of Attention nodes or a large MatMul is dominating the runtime.
+Two levels are available:
 
-`--op-tracing detail` goes further, collecting timing for every individual operator node in the graph. This is useful when the same operator type appears in different parts of the model with very different costs — for instance, early-layer convolutions versus late-layer convolutions in a ResNet-style architecture.
+- **`--op-tracing basic`** — collects cumulative time per operator type and reports a ranked list. Usually enough to identify whether a sequence of Attention nodes or a large MatMul is dominating the runtime.
+- **`--op-tracing detail`** — collects timing for every individual operator node in the graph. Useful when the same operator type appears in different parts of the model with very different costs.
 
-If tracing is unavailable, `winml-cli` will tell you at startup rather than silently running without tracing.
+```
+winml perf -m model.onnx --op-tracing basic
+```
+
+!!! note
+    Op-tracing currently works only with the QNN execution provider. Running it on CPU or DML will produce an error indicating the requirement.
 
 ## Per-module benchmarking
 
-Large Transformer-family models contain many repeated module instances — attention blocks, feed-forward layers, encoder stages. When you want to understand the cost of one type of block rather than the full network, `--module <substring>` isolates and benchmarks matching modules from the HuggingFace model hierarchy.
+Large Transformer-family models contain many repeated module instances — attention blocks, feed-forward layers, encoder stages. When you want to understand the cost of one type of block rather than the full network, `--module <ClassName>` isolates and benchmarks matching modules from the HuggingFace model hierarchy.
+
+```
+winml perf -m bert-base-uncased --module BertAttention
+```
 
-`winml perf -m bert-base-uncased --module BertAttention`, for example, builds and benchmarks each `BertAttention` instance separately and reports per-instance statistics. This is faster to iterate on than benchmarking the full model when you are tuning a specific layer, and it makes the attribution of latency to architectural decisions much clearer.
+This builds and benchmarks each `BertAttention` instance separately and reports per-instance statistics. The `--module` argument must be a **class name** (e.g. `BertAttention`), not a dotted module path (e.g. not `encoder.layer.0.attention`).
 
-The module hierarchy that `--module` navigates is built at export time: every ONNX node carries a `winml.hierarchy.tag` metadata entry recording the PyTorch module path it came from. `winml perf --module` matches against those tags, builds a separate ONNX for each match, and benchmarks them in isolation. See `winml inspect --hierarchy` to view the tree for an exported model, or [Load and export](load-and-export.md) for how the metadata is written.
+The module hierarchy that `--module` navigates is built at export time: every ONNX node carries a `winml.hierarchy.tag` metadata entry recording the PyTorch module path it came from. `winml perf --module` matches against those tags, builds a separate ONNX for each match, and benchmarks them in isolation. See [Load and export](load-and-export.md) for how the metadata is written.
 
 ## See also
 

From 94b6ac0c34422e079a0a782df3f82a25be42347d Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 11 Jun 2026 11:01:41 +0800
Subject: [PATCH 133/143] docs: fix perf output table rendering

---
 docs/concepts/perf-and-monitoring.md | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/docs/concepts/perf-and-monitoring.md b/docs/concepts/perf-and-monitoring.md
index 02049c4d5..906c3d394 100644
--- a/docs/concepts/perf-and-monitoring.md
+++ b/docs/concepts/perf-and-monitoring.md
@@ -12,20 +12,21 @@ At its core, `winml perf` runs a configurable number of inference iterations and
 $ winml perf -m bert-tiny.onnx --device cpu --iterations 50 --warmup 5
 
 Device:      cpu / CPUExecutionProvider
-Model Precision:   fp32
-Inputs:      input_ids            [1, 512]               int32
-             attention_mask       [1, 512]               int32
-             token_type_ids       [1, 512]               int32
+Precision:   fp32
+Inputs:      input_ids            [1, 512]    int32
+             attention_mask       [1, 512]    int32
+             token_type_ids       [1, 512]    int32
 Outputs:     last_hidden_state    [1, 512, 128]
+```
+
+Output latency table:
 
-Latency (ms)
-┏━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┓
-┃  Avg ┃  P50 ┃  P90 ┃  P95 ┃  P99 ┃  Min ┃  Max ┃  Std ┃
-┡━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━┩
-│ 5.53 │ 5.40 │ 6.55 │ 6.87 │ 7.65 │ 4.89 │ 7.65 │ 0.58 │
-└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
-  Warmup: 14.14 ms avg (first 5 iterations)
+| Avg | P50 | P90 | P95 | P99 | Min | Max | Std |
+|-----|-----|-----|-----|-----|-----|-----|-----|
+| 5.53 | 5.40 | 6.55 | 6.87 | 7.65 | 4.89 | 7.65 | 0.58 |
 
+```
+Warmup: 14.14 ms avg (first 5 iterations)
 Throughput: 180.72 samples/sec
 ```
 

From 29c8da64209a88d8b9f66f50d7981c131efe2be3 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 11 Jun 2026 11:04:08 +0800
Subject: [PATCH 134/143] docs: remove per-operator tracing section (not ready)

---
 docs/concepts/perf-and-monitoring.md | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/docs/concepts/perf-and-monitoring.md b/docs/concepts/perf-and-monitoring.md
index 906c3d394..a7de45962 100644
--- a/docs/concepts/perf-and-monitoring.md
+++ b/docs/concepts/perf-and-monitoring.md
@@ -80,22 +80,6 @@ winml perf -m model.onnx --device npu --monitor
 
 `--monitor` has no effect on the measured latency statistics — it is a passive observer.
 
-## Per-operator tracing
-
-When end-to-end latency is higher than expected, per-operator tracing lets you find the operators that are responsible. This capability is available via a hidden `--op-tracing` flag (not shown in `--help`) and requires `onnxruntime-qnn` to be installed.
-
-Two levels are available:
-
-- **`--op-tracing basic`** — collects cumulative time per operator type and reports a ranked list. Usually enough to identify whether a sequence of Attention nodes or a large MatMul is dominating the runtime.
-- **`--op-tracing detail`** — collects timing for every individual operator node in the graph. Useful when the same operator type appears in different parts of the model with very different costs.
-
-```
-winml perf -m model.onnx --op-tracing basic
-```
-
-!!! note
-    Op-tracing currently works only with the QNN execution provider. Running it on CPU or DML will produce an error indicating the requirement.
-
 ## Per-module benchmarking
 
 Large Transformer-family models contain many repeated module instances — attention blocks, feed-forward layers, encoder stages. When you want to understand the cost of one type of block rather than the full network, `--module <ClassName>` isolates and benchmarks matching modules from the HuggingFace model hierarchy.

From 6090cf87fe37d2eda738faab518ebbc31389b1be Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 11 Jun 2026 11:13:56 +0800
Subject: [PATCH 135/143] docs: add memory measurement details to perf
 monitoring section

---
 docs/concepts/perf-and-monitoring.md | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/docs/concepts/perf-and-monitoring.md b/docs/concepts/perf-and-monitoring.md
index a7de45962..d968b8b04 100644
--- a/docs/concepts/perf-and-monitoring.md
+++ b/docs/concepts/perf-and-monitoring.md
@@ -68,7 +68,7 @@ Add `-f json` to emit structured JSON to stdout, suitable for CI pipelines or au
 
 Results are also saved automatically to `~/.cache/winml/perf/<model_slug>/<timestamp>.json` for later comparison. Override the path with `--output`.
 
-## Live monitoring
+## Live monitoring and memory measurement
 
 Latency numbers alone do not tell you whether the hardware is actually being used. A slow NPU inference could mean the model is running on the NPU and hitting a memory bottleneck, or it could mean the EP silently fell back to CPU and is not using the NPU at all.
 
@@ -78,6 +78,27 @@ The `--monitor` flag adds a live terminal chart (powered by plotext + Rich Live)
 winml perf -m model.onnx --device npu --monitor
 ```
 
+### Collected metrics
+
+When `--monitor` is active, the following metrics are sampled throughout the benchmark run and reported at the end:
+
+| Category | Metrics |
+|----------|---------|
+| **Device (NPU/GPU)** | Mean and peak utilization %, running time |
+| **Device memory** | Peak dedicated (local) MB, peak shared MB |
+| **CPU** | Mean and peak utilization % |
+| **RAM** | Current used MB, peak used MB |
+
+Example output (NPU device):
+
+```
+Hardware (during benchmark)
+  NPU: 87.3% avg, 100.0% peak  |  CPU: 12.1% avg  |  Mem: 1842 MB
+  Device Mem: 245/0 MB (local/shared)
+```
+
+These metrics are also included in the JSON results file under `hw_monitor`, enabling automated tracking of memory usage across model revisions.
+
 `--monitor` has no effect on the measured latency statistics — it is a passive observer.
 
 ## Per-module benchmarking

From e692180b63df0f9affc3fec7d92f22d575834c66 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 11 Jun 2026 11:17:51 +0800
Subject: [PATCH 136/143] docs(perf): separate live monitoring and memory
 metrics sections

---
 docs/concepts/perf-and-monitoring.md | 35 +++++++++++++++++++---------
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/docs/concepts/perf-and-monitoring.md b/docs/concepts/perf-and-monitoring.md
index d968b8b04..c4ea23eee 100644
--- a/docs/concepts/perf-and-monitoring.md
+++ b/docs/concepts/perf-and-monitoring.md
@@ -68,7 +68,7 @@ Add `-f json` to emit structured JSON to stdout, suitable for CI pipelines or au
 
 Results are also saved automatically to `~/.cache/winml/perf/<model_slug>/<timestamp>.json` for later comparison. Override the path with `--output`.
 
-## Live monitoring and memory measurement
+## Live monitoring
 
 Latency numbers alone do not tell you whether the hardware is actually being used. A slow NPU inference could mean the model is running on the NPU and hitting a memory bottleneck, or it could mean the EP silently fell back to CPU and is not using the NPU at all.
 
@@ -78,16 +78,19 @@ The `--monitor` flag adds a live terminal chart (powered by plotext + Rich Live)
 winml perf -m model.onnx --device npu --monitor
 ```
 
-### Collected metrics
+`--monitor` has no effect on the measured latency statistics — it is a passive observer.
+
+## Memory and resource metrics
 
-When `--monitor` is active, the following metrics are sampled throughout the benchmark run and reported at the end:
+When `--monitor` is active, hardware metrics are sampled throughout the benchmark and reported at the end. These metrics help answer questions like "how much device memory does this model need?" and "is the model memory-bound?".
 
-| Category | Metrics |
-|----------|---------|
-| **Device (NPU/GPU)** | Mean and peak utilization %, running time |
-| **Device memory** | Peak dedicated (local) MB, peak shared MB |
-| **CPU** | Mean and peak utilization % |
-| **RAM** | Current used MB, peak used MB |
+| Category | Metrics | Description |
+|----------|---------|-------------|
+| **Device memory (local)** | Peak dedicated MB | VRAM or on-device memory exclusively allocated to the inference workload |
+| **Device memory (shared)** | Peak shared MB | System memory shared with the device (common on integrated GPUs and NPUs) |
+| **RAM** | Used MB, peak used MB | Process-level system memory consumption |
+| **CPU** | Mean %, peak % | CPU utilisation during the benchmark window |
+| **Device utilisation** | Mean %, peak % | NPU or GPU engine utilisation (hardware-reported) |
 
 Example output (NPU device):
 
@@ -97,9 +100,19 @@ Hardware (during benchmark)
   Device Mem: 245/0 MB (local/shared)
 ```
 
-These metrics are also included in the JSON results file under `hw_monitor`, enabling automated tracking of memory usage across model revisions.
+In JSON output (`-f json`), these metrics appear under the `hw_monitor` key:
 
-`--monitor` has no effect on the measured latency statistics — it is a passive observer.
+```json
+"hw_monitor": {
+  "device_kind": "npu",
+  "device_memory": { "local_peak_mb": 245, "shared_peak_mb": 0 },
+  "cpu": { "mean_pct": 12.1, "peak_pct": 34.5 },
+  "ram": { "used_mb": 1842, "peak_used_mb": 1910 },
+  "npu": { "mean_pct": 87.3, "peak_pct": 100.0 }
+}
+```
+
+This makes it straightforward to track memory consumption across model revisions or compare devices programmatically.
 
 ## Per-module benchmarking
 

From ddbdf0483204485a1cbeec6935fb2ed0d327f9cd Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 11 Jun 2026 11:29:38 +0800
Subject: [PATCH 137/143] docs(perf): fix hw_monitor JSON to match actual
 output

---
 docs/concepts/perf-and-monitoring.md | 33 ++++++++++++++++++----------
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/docs/concepts/perf-and-monitoring.md b/docs/concepts/perf-and-monitoring.md
index c4ea23eee..e8a30a863 100644
--- a/docs/concepts/perf-and-monitoring.md
+++ b/docs/concepts/perf-and-monitoring.md
@@ -86,13 +86,20 @@ When `--monitor` is active, hardware metrics are sampled throughout the benchmar
 
 | Category | Metrics | Description |
 |----------|---------|-------------|
-| **Device memory (local)** | Peak dedicated MB | VRAM or on-device memory exclusively allocated to the inference workload |
-| **Device memory (shared)** | Peak shared MB | System memory shared with the device (common on integrated GPUs and NPUs) |
-| **RAM** | Used MB, peak used MB | Process-level system memory consumption |
-| **CPU** | Mean %, peak % | CPU utilisation during the benchmark window |
-| **Device utilisation** | Mean %, peak % | NPU or GPU engine utilisation (hardware-reported) |
+| **Device memory (local)** | `local_peak_mb` | VRAM or on-device memory exclusively allocated to the inference workload |
+| **Device memory (shared)** | `shared_peak_mb` | System memory shared with the device (common on integrated GPUs and NPUs) |
+| **RAM** | `used_mb`, `peak_mb` | Process-level system memory consumption |
+| **CPU** | `mean_pct`, `peak_pct` | CPU utilisation during the benchmark window |
+| **Device utilisation** | `mean_pct`, `peak_pct` | NPU or GPU engine utilisation (only present when `device_kind` is `npu` or `gpu`) |
 
-Example output (NPU device):
+Example terminal output (CPU device):
+
+```
+Hardware (during benchmark)
+  CPU: 8.3% avg  |  Mem: 644 MB
+```
+
+When running on NPU or GPU, device utilisation and device memory are also shown:
 
 ```
 Hardware (during benchmark)
@@ -104,14 +111,18 @@ In JSON output (`-f json`), these metrics appear under the `hw_monitor` key:
 
 ```json
 "hw_monitor": {
-  "device_kind": "npu",
-  "device_memory": { "local_peak_mb": 245, "shared_peak_mb": 0 },
-  "cpu": { "mean_pct": 12.1, "peak_pct": 34.5 },
-  "ram": { "used_mb": 1842, "peak_used_mb": 1910 },
-  "npu": { "mean_pct": 87.3, "peak_pct": 100.0 }
+  "monitor": "HWMonitor",
+  "device_kind": null,
+  "adapter_luid": null,
+  "cpu": { "mean_pct": 15.8, "peak_pct": 16.71, "sample_count": 2 },
+  "ram": { "used_mb": 640.21, "peak_mb": 640.21 },
+  "device_memory": { "local_peak_mb": 0.0, "shared_peak_mb": 0.0 },
+  "running_time_ns": 0
 }
 ```
 
+When a hardware accelerator is active, `device_kind` will be `"npu"` or `"gpu"`, and an additional key (e.g. `"npu"`) appears with the device utilisation percentages. The `running_time_ns` field reports the GPU/NPU engine running time as reported by the driver.
+
 This makes it straightforward to track memory consumption across model revisions or compare devices programmatically.
 
 ## Per-module benchmarking

From e43d8c3058f44eb06559abece6e83204f001cf6e Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 11 Jun 2026 11:33:05 +0800
Subject: [PATCH 138/143] docs(perf): add per-device metrics breakdown
 (CPU/GPU/NPU)

---
 docs/concepts/perf-and-monitoring.md | 43 +++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/docs/concepts/perf-and-monitoring.md b/docs/concepts/perf-and-monitoring.md
index e8a30a863..9a508e7c3 100644
--- a/docs/concepts/perf-and-monitoring.md
+++ b/docs/concepts/perf-and-monitoring.md
@@ -84,22 +84,31 @@ winml perf -m model.onnx --device npu --monitor
 
 When `--monitor` is active, hardware metrics are sampled throughout the benchmark and reported at the end. These metrics help answer questions like "how much device memory does this model need?" and "is the model memory-bound?".
 
-| Category | Metrics | Description |
-|----------|---------|-------------|
-| **Device memory (local)** | `local_peak_mb` | VRAM or on-device memory exclusively allocated to the inference workload |
-| **Device memory (shared)** | `shared_peak_mb` | System memory shared with the device (common on integrated GPUs and NPUs) |
-| **RAM** | `used_mb`, `peak_mb` | Process-level system memory consumption |
-| **CPU** | `mean_pct`, `peak_pct` | CPU utilisation during the benchmark window |
-| **Device utilisation** | `mean_pct`, `peak_pct` | NPU or GPU engine utilisation (only present when `device_kind` is `npu` or `gpu`) |
+The metrics collected depend on the target device:
 
-Example terminal output (CPU device):
+| Metric | CPU | GPU | NPU |
+|--------|:---:|:---:|:---:|
+| CPU utilisation (mean/peak %) | ✓ | ✓ | ✓ |
+| RAM (used MB, peak MB) | ✓ | ✓ | ✓ |
+| Device utilisation (mean/peak %) | — | ✓ | ✓ |
+| Device memory local (peak MB) | — | ✓ | ✓ |
+| Device memory shared (peak MB) | — | ✓ | ✓ |
+| Engine running time (ns) | — | ✓ | ✓ |
+
+- **CPU**: Only system-level metrics (CPU %, RAM) are reported since there is no separate device memory.
+- **GPU**: Reports GPU engine utilisation plus dedicated VRAM (`local_peak_mb`) and shared system memory (`shared_peak_mb`) allocated by the GPU driver.
+- **NPU**: Same structure as GPU. NPU adapters register as Windows GPU Engine devices, so utilisation and memory are read via the same PDH counters. `local_peak_mb` represents on-chip memory; `shared_peak_mb` is system memory shared with the NPU.
+
+### Terminal output
+
+CPU device:
 
 ```
 Hardware (during benchmark)
   CPU: 8.3% avg  |  Mem: 644 MB
 ```
 
-When running on NPU or GPU, device utilisation and device memory are also shown:
+NPU or GPU device:
 
 ```
 Hardware (during benchmark)
@@ -107,6 +116,8 @@ Hardware (during benchmark)
   Device Mem: 245/0 MB (local/shared)
 ```
 
+### JSON structure
+
 In JSON output (`-f json`), these metrics appear under the `hw_monitor` key:
 
 ```json
@@ -121,7 +132,19 @@ In JSON output (`-f json`), these metrics appear under the `hw_monitor` key:
 }
 ```
 
-When a hardware accelerator is active, `device_kind` will be `"npu"` or `"gpu"`, and an additional key (e.g. `"npu"`) appears with the device utilisation percentages. The `running_time_ns` field reports the GPU/NPU engine running time as reported by the driver.
+When a hardware accelerator is active, `device_kind` will be `"npu"` or `"gpu"`, and an additional key (e.g. `"npu"`) appears with device utilisation:
+
+```json
+"hw_monitor": {
+  "device_kind": "npu",
+  "adapter_luid": "0x0000abcd12340000",
+  "cpu": { "mean_pct": 12.1, "peak_pct": 34.5, "sample_count": 50 },
+  "ram": { "used_mb": 1842.0, "peak_mb": 1910.0 },
+  "device_memory": { "local_peak_mb": 245.0, "shared_peak_mb": 0.0 },
+  "npu": { "mean_pct": 87.3, "peak_pct": 100.0, "sample_count": 50 },
+  "running_time_ns": 4820000000
+}
+```
 
 This makes it straightforward to track memory consumption across model revisions or compare devices programmatically.
 

From 8f69373a7681781684d1ba5319b68b67db8dcb2e Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 11 Jun 2026 11:45:23 +0800
Subject: [PATCH 139/143] docs(perf): fix inaccuracies found during review

- JSON key 'avg' -> 'mean' (matches actual output)
- Add missing JSON fields: task, precision, timestamp, std, warmup_mean, batches_per_sec
- Fix terminal label 'Precision' -> 'Model Precision'
- Add missing 'Task:' line in terminal example
- Remove false claim about --module using ONNX hierarchy tags
  (it uses torchinfo to discover PyTorch submodules, not ONNX metadata)
- Remove 'per-operator timings' from intro (op-tracing not ready)
---
 docs/concepts/perf-and-monitoring.md | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/docs/concepts/perf-and-monitoring.md b/docs/concepts/perf-and-monitoring.md
index 9a508e7c3..f95106dd2 100644
--- a/docs/concepts/perf-and-monitoring.md
+++ b/docs/concepts/perf-and-monitoring.md
@@ -1,6 +1,6 @@
 # Perf and monitoring
 
-Knowing that a model produces correct outputs is necessary but not sufficient for a production deployment. You also need to know how fast it runs, how consistently it runs, and where the time goes when it does not run fast enough. `winml perf` is the primary tool in `winml-cli` for answering those questions. It synthesises end-to-end latency numbers, per-operator timings, and live hardware utilisation into a single benchmarking workflow.
+Knowing that a model produces correct outputs is necessary but not sufficient for a production deployment. You also need to know how fast it runs, how consistently it runs, and where the time goes when it does not run fast enough. `winml perf` is the primary tool in `winml-cli` for answering those questions. It synthesises end-to-end latency numbers and live hardware utilisation into a single benchmarking workflow.
 
 Because `winml perf` accepts both HuggingFace model IDs and local `.onnx` files, you can benchmark at any stage of the development cycle — from a freshly exported float model through to a compiled, quantized production artifact.
 
@@ -12,7 +12,8 @@ At its core, `winml perf` runs a configurable number of inference iterations and
 $ winml perf -m bert-tiny.onnx --device cpu --iterations 50 --warmup 5
 
 Device:      cpu / CPUExecutionProvider
-Precision:   fp32
+Task:        auto (auto-detected)
+Model Precision:   fp32
 Inputs:      input_ids            [1, 512]    int32
              attention_mask       [1, 512]    int32
              token_type_ids       [1, 512]    int32
@@ -51,17 +52,21 @@ Add `-f json` to emit structured JSON to stdout, suitable for CI pipelines or au
 {
   "benchmark_info": {
     "model_id": "bert-tiny.onnx",
+    "task": "auto-detected",
     "device": "cpu",
     "ep": "CPUExecutionProvider",
+    "precision": "auto",
     "iterations": 50,
     "warmup": 5,
-    "batch_size": 1
+    "batch_size": 1,
+    "timestamp": "2026-06-11T03:27:24+00:00"
   },
   "latency_ms": {
-    "avg": 5.53, "p50": 5.40, "p90": 6.55,
-    "p95": 6.87, "p99": 7.65, "min": 4.89, "max": 7.65
+    "mean": 5.53, "p50": 5.40, "p90": 6.55,
+    "p95": 6.87, "p99": 7.65, "min": 4.89, "max": 7.65,
+    "std": 0.58, "warmup_mean": 14.14
   },
-  "throughput": { "samples_per_sec": 180.72 },
+  "throughput": { "samples_per_sec": 180.72, "batches_per_sec": 180.72 },
   "raw_samples_ms": [5.12, 5.40, ...]
 }
 ```
@@ -158,7 +163,7 @@ winml perf -m bert-base-uncased --module BertAttention
 
 This builds and benchmarks each `BertAttention` instance separately and reports per-instance statistics. The `--module` argument must be a **class name** (e.g. `BertAttention`), not a dotted module path (e.g. not `encoder.layer.0.attention`).
 
-The module hierarchy that `--module` navigates is built at export time: every ONNX node carries a `winml.hierarchy.tag` metadata entry recording the PyTorch module path it came from. `winml perf --module` matches against those tags, builds a separate ONNX for each match, and benchmarks them in isolation. See [Load and export](load-and-export.md) for how the metadata is written.
+Internally, `--module` uses `torchinfo` to discover all submodule instances matching the given class name in the HuggingFace model. For each match it generates a separate build config, exports an isolated ONNX file, and benchmarks it independently. This requires a HuggingFace model ID (not a local `.onnx` file) because it needs access to the PyTorch module tree.
 
 ## See also
 

From c29a69d2b3a89cd8d2964ba9a8982ad9977a3866 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Thu, 11 Jun 2026 11:49:26 +0800
Subject: [PATCH 140/143] docs(perf): address review findings

- Add model_info block to JSON example (always emitted)
- Soften --monitor 'no effect' to acknowledge small system overhead
- Change 'not executing' to 'strong signal to investigate'
- Add 'monitor' field to NPU JSON example
- Fix 'on-chip memory' -> 'dedicated adapter memory'
- Note that JSON always includes device_memory even for CPU (zeroed)
---
 docs/concepts/perf-and-monitoring.md | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/docs/concepts/perf-and-monitoring.md b/docs/concepts/perf-and-monitoring.md
index f95106dd2..235018e0f 100644
--- a/docs/concepts/perf-and-monitoring.md
+++ b/docs/concepts/perf-and-monitoring.md
@@ -61,6 +61,13 @@ Add `-f json` to emit structured JSON to stdout, suitable for CI pipelines or au
     "batch_size": 1,
     "timestamp": "2026-06-11T03:27:24+00:00"
   },
+  "model_info": {
+    "input_names": ["input_ids", "attention_mask", "token_type_ids"],
+    "input_shapes": [[1, 512], [1, 512], [1, 512]],
+    "input_types": ["int32", "int32", "int32"],
+    "output_names": ["last_hidden_state"],
+    "output_shapes": [[1, 512, 128]]
+  },
   "latency_ms": {
     "mean": 5.53, "p50": 5.40, "p90": 6.55,
     "p95": 6.87, "p99": 7.65, "min": 4.89, "max": 7.65,
@@ -77,13 +84,13 @@ Results are also saved automatically to `~/.cache/winml/perf/<model_slug>/<times
 
 Latency numbers alone do not tell you whether the hardware is actually being used. A slow NPU inference could mean the model is running on the NPU and hitting a memory bottleneck, or it could mean the EP silently fell back to CPU and is not using the NPU at all.
 
-The `--monitor` flag adds a live terminal chart (powered by plotext + Rich Live) that streams hardware utilisation for whichever device is being benchmarked. The chart auto-refreshes in a background thread so you can see whether utilisation is sustained, bursty, or absent. This is particularly useful when commissioning a new model on QNN or DirectML hardware, where EP fallback can be hard to detect from latency numbers alone. If the chart stays near zero while the benchmark runs, the model is not executing on the expected device.
+The `--monitor` flag adds a live terminal chart (powered by plotext + Rich Live) that streams hardware utilisation for whichever device is being benchmarked. The chart updates once per iteration so you can see whether utilisation is sustained, bursty, or absent. This is particularly useful when commissioning a new model on QNN or DirectML hardware, where EP fallback can be hard to detect from latency numbers alone. If the chart stays near zero while the benchmark runs, it is a strong signal that the model may not be executing on the expected device — investigate further with EP-specific tools.
 
 ```
 winml perf -m model.onnx --device npu --monitor
 ```
 
-`--monitor` has no effect on the measured latency statistics — it is a passive observer.
+Display updates are not included in the timed inference call, but monitoring may introduce small system overhead from background PDH polling.
 
 ## Memory and resource metrics
 
@@ -100,9 +107,9 @@ The metrics collected depend on the target device:
 | Device memory shared (peak MB) | — | ✓ | ✓ |
 | Engine running time (ns) | — | ✓ | ✓ |
 
-- **CPU**: Only system-level metrics (CPU %, RAM) are reported since there is no separate device memory.
+- **CPU**: Only system-level metrics (CPU %, RAM) are shown in terminal output. In JSON, `device_memory` and `running_time_ns` are still present but will be zero.
 - **GPU**: Reports GPU engine utilisation plus dedicated VRAM (`local_peak_mb`) and shared system memory (`shared_peak_mb`) allocated by the GPU driver.
-- **NPU**: Same structure as GPU. NPU adapters register as Windows GPU Engine devices, so utilisation and memory are read via the same PDH counters. `local_peak_mb` represents on-chip memory; `shared_peak_mb` is system memory shared with the NPU.
+- **NPU**: Same structure as GPU. NPU adapters register as Windows GPU Engine devices, so utilisation and memory are read via the same PDH counters. `local_peak_mb` represents dedicated adapter memory; `shared_peak_mb` is system memory shared with the NPU.
 
 ### Terminal output
 
@@ -141,6 +148,7 @@ When a hardware accelerator is active, `device_kind` will be `"npu"` or `"gpu"`,
 
 ```json
 "hw_monitor": {
+  "monitor": "HWMonitor",
   "device_kind": "npu",
   "adapter_luid": "0x0000abcd12340000",
   "cpu": { "mean_pct": 12.1, "peak_pct": 34.5, "sample_count": 50 },

From 559cd77b2ffda688d8368c04ee467ec1540f42eb Mon Sep 17 00:00:00 2001
From: Zhenchao Ni <zhenni@microsoft.com>
Date: Thu, 11 Jun 2026 15:00:22 +0800
Subject: [PATCH 141/143] Fix docs for eval, compile and quantize (#874)

Fix docs for eval, compile and quantize
---
 docs/commands/compile.md  |  7 ++--
 docs/commands/eval.md     | 80 ++++++++++++++++++++++++++++++---------
 docs/commands/quantize.md | 50 +++++++++++++-----------
 3 files changed, 93 insertions(+), 44 deletions(-)

diff --git a/docs/commands/compile.md b/docs/commands/compile.md
index ca69a5267..d5e516971 100644
--- a/docs/commands/compile.md
+++ b/docs/commands/compile.md
@@ -19,10 +19,11 @@ $ winml compile [options]
 | Flag | Short | Type | Default | Description |
 |---|---|---|---|---|
 | `--model` | `-m` | path | *(required unless `--list`)* | Input ONNX model file. |
+| `--output` | `-o` | path | — | Output file path (e.g., `model_compiled.onnx`). Takes precedence over `--output-dir`. |
 | `--output-dir` | | path | same dir as input | Directory to write compiled output artifacts. |
 | `--device` | `-d` | choice | `auto` | Target device: `auto`, `npu`, `gpu`, or `cpu`. |
 | `--ep` | | choice | `None` | Force a specific execution provider, overriding device-to-provider mapping. Choices: `cpu`, `cuda`, `dml`, `migraphx`, `openvino`, `qnn`, `tensorrt`, `vitisai`. |
-| `--no-validate` | | flag | `false` | Skip validation of the compiled model after compilation. |
+| `--validate` / `--no-validate` | | flag | `--validate` | Run a post-compilation validation pass on the target hardware. Enabled by default; pass `--no-validate` to skip when the target hardware or driver is unavailable. |
 | `--compiler` | | choice | `ort` | Compiler backend: `ort` (ONNX Runtime) or `qairt` (Qualcomm AI Runtime Tools). |
 | `--qnn-sdk-root` | | path | `None` | Path to the QNN SDK root directory. |
 | `--embed/--no-embed` | | flag | `false` | Embed the EP context blob inside the ONNX file instead of writing a separate `.bin` file. |
@@ -73,8 +74,8 @@ winml compile -m bert-base-uncased_qdq.onnx --embed
 ```
 
 ```bash
-# Compile for GPU using the MIGraphX execution provider
-winml compile -m microsoft_resnet50.onnx --device gpu --ep migraphx
+# Compile for GPU using the OpenVINO execution provider
+winml compile -m microsoft_resnet50.onnx --device gpu --ep openvino
 ```
 
 ## Common pitfalls
diff --git a/docs/commands/eval.md b/docs/commands/eval.md
index 9615ad102..f31a5df28 100644
--- a/docs/commands/eval.md
+++ b/docs/commands/eval.md
@@ -18,18 +18,24 @@ $ winml eval [options]
 |---|---|---|---|---|
 | `--model` | `-m` | `TEXT` | — | HuggingFace model ID, or path to a local `.onnx` file. Required (unless `--model-id` is provided directly). |
 | `--model-id` | | `TEXT` | — | HuggingFace model ID used for preprocessor and config resolution when `-m` points to an `.onnx` file. Required when `-m` is an ONNX file. |
-| `--dataset` | | `TEXT` | task default | HuggingFace dataset path (e.g., `imagenet-1k`, `glue`). If omitted, a default dataset is selected based on the task. |
-| `--dataset-name` | | `TEXT` | — | Dataset configuration name for multi-config datasets (e.g., `mrpc` within `glue`). |
-| `--task` | | `TEXT` | auto-detected | Task name (e.g., `image-classification`). Auto-detected from `--model-id` when not provided. |
-| `--device` | | `auto\|cpu\|gpu\|npu` | `auto` | Device to run inference on during evaluation. `auto` selects the best available device. |
+| `--task` | | `TEXT` | auto-detected | Task name (e.g., `image-classification`). Auto-detected from `--model-id` when not provided. Required when `-m` is an ONNX file and the task cannot be inferred. |
+| `--precision` | | `TEXT` | `auto` | Precision used when building the model from a HuggingFace ID. One of `auto`, `fp32`, `fp16`, `int8`, `int16`, or a mixed `w{x}a{y}` spec (e.g., `w8a16`). `fp16`/`fp32` skip quantization. **Ignored** when `-m` is a pre-built `.onnx` file — the precision is already baked in. |
+| `--device` | | choice | `auto` | Target device. Choices: `auto`, `npu`, `gpu`, `cpu`. `auto` selects the best available device. Combined with `--precision`, this drives the build when `-m` is a HuggingFace ID. |
+| `--ep` / `--execution-provider` | | `TEXT` | — | Target ONNX Runtime execution provider when finer control than `--device` is needed. Full names (e.g., `QNNExecutionProvider`, `OpenVINOExecutionProvider`, `VitisAIExecutionProvider`) and aliases (`qnn`, `ov`/`openvino`, `vitis`/`vitisai`) are accepted. |
+| `--dataset` | | `TEXT` | task default | HuggingFace dataset path (e.g., `imagenet-1k`, `nyu-mll/glue`). If omitted, a default dataset is selected based on the task. |
+| `--dataset-name` | | `TEXT` | — | Dataset configuration name for multi-config datasets. |
+| `--dataset-revision` | | `TEXT` | — | Git revision (branch, tag, or commit) of the dataset to load. Use `refs/convert/parquet` for HF datasets that are only served via the parquet mirror. |
+| `--dataset-script` | | `TEXT` | — | Path to a Python script that builds the evaluation dataset locally. Requires `--trust-remote-code`. |
+| `--trust-remote-code / --no-trust-remote-code` | | flag | `false` | Allow executing custom code from model repositories or dataset scripts. Required with `--dataset-script`. Use only with trusted sources. |
 | `--samples` | | `INTEGER` | `100` | Number of dataset samples to evaluate. |
 | `--split` | | `TEXT` | `validation` | Dataset split to use (e.g., `validation`, `test`, `train`). |
 | `--shuffle / --no-shuffle` | | flag | `shuffle` | Shuffle the dataset before sampling. Disable with `--no-shuffle` for reproducible sample ordering. |
-| `--streaming/--no-streaming` | | flag | `false` | Stream the dataset from the Hub instead of downloading the full split. Useful for large datasets. |
+| `--streaming / --no-streaming` | | flag | `false` | Stream the dataset from the Hub instead of downloading the full split. Useful for large datasets. |
 | `--column` | | `TEXT` (multiple) | — | Column mapping as `key=value` pairs (e.g., `--column input_column=image`). Can be specified multiple times. |
-| `--label-mapping` | | `PATH` | — | Path to a JSON file mapping label names to integer IDs: `{"label_name": id}`. |
+| `--label-mapping` | | `PATH` | — | Path to a JSON file mapping dataset label names to the integer class IDs the model emits: `{"label_name": id}`. |
 | `--output` | `-o` | `PATH` | — | Output JSON file path for the evaluation results. |
 | `--schema` | | flag | `false` | Print the expected dataset schema for the given `--task` and exit. Does not run evaluation. |
+| `--mode` | | `onnx\|compare` | `onnx` | Evaluation mode. `onnx` evaluates the ONNX candidate on a dataset. `compare` runs the ONNX candidate and the HuggingFace reference on identical random inputs and reports per-tensor similarity metrics — no dataset required. |
 
 ## How it works
 
@@ -45,7 +51,7 @@ $ winml eval -m microsoft/resnet-50
 
 ```text
 Task:     image-classification
-Dataset:  imagenet-1k (validation, 100 samples)
+Dataset:  timm/mini-imagenet (test, 100 samples)
 Device:   auto
 
 Accuracy: 76.00%
@@ -56,37 +62,75 @@ Results saved to: microsoft_resnet-50_eval.json
 Evaluate a pre-exported ONNX file, providing the source model ID for preprocessing:
 
 ```bash
-$ winml eval -m model.onnx --model-id microsoft/resnet-50 --dataset imagenet-1k
+$ winml eval -m model.onnx --model-id microsoft/resnet-50 --dataset timm/mini-imagenet
 ```
 
 Evaluate a BERT model on the MRPC paraphrase task with column remapping:
 
 ```bash
-$ winml eval -m bert-base-uncased --dataset glue --dataset-name mrpc \
-    --column input_column=sentence1 --samples 500
+$ winml eval -m Intel/bert-base-uncased-mrpc --dataset nyu-mll/glue --dataset-name mrpc --column input_column=sentence1 --column second_input_column=sentence2 --samples 500
 ```
 
-Check what dataset columns are expected before running, then evaluate on the NPU:
+Check what dataset columns are expected before running, then remap them to match your dataset:
 
 ```bash
-$ winml eval --schema --task image-classification
-$ winml eval -m facebook/convnext-tiny-224 --device npu --samples 200 --split test
+$ winml eval --schema --task text-classification
 ```
 
-Evaluate with a custom label mapping file and save results:
+```text
+Input schema for text-classification models
+==================================================
+
+--column option schema
+
+Evaluating needs a dataset with the following columns:
+  input_column
+      input text (default: text)
+  label_column
+      class label (ClassLabel or integer) (default: label)
+  second_input_column
+      second text for sentence-pair tasks (optional) (default: None)
+
+Override any default with --column:
+  --column input_column=<your_text_column>
+  --column label_column=<your_label_column>
+  --column second_input_column=<your_pair_column>
+```
+
+The GLUE SST-2 dataset uses `sentence` instead of the default `text` column, so remap it with a single `--column` override:
+
+```bash
+$ winml eval -m distilbert/distilbert-base-uncased-finetuned-sst-2-english --dataset nyu-mll/glue --dataset-name sst2 --column input_column=sentence --samples 500
+```
+
+Evaluate against a custom dataset whose label names differ from the model's class IDs. The `--label-mapping` flag points to a JSON file whose **keys are the label name strings as they appear in the dataset** and whose **values are the integer class IDs the model emits**. For example, ResNet-50 outputs ImageNet-1k class IDs (`0`–`999`), so if your custom dataset uses readable strings like `"tabby cat"` or `"golden retriever"`, `labels.json` translates each dataset label to the corresponding ImageNet ID the model predicts:
+
+```json
+{
+  "tabby cat": 281,
+  "Egyptian cat": 285,
+  "golden retriever": 207
+}
+```
+
+```bash
+$ winml eval -m microsoft/resnet-50 --dataset my-org/my-pets-dataset --label-mapping labels.json -o results/resnet_eval.json
+```
+
+Evaluate a composite model from pre-exported ONNX files. Some tasks (e.g., `image-to-text`, encoder-decoder, dual-encoder) split the model across multiple ONNX files, one per role. Pass `-m` once per role as `<role>=<path>.onnx` and supply `--model-id` so the preprocessor and tokenizer can be resolved. Run `winml eval --schema --task image-to-text` to see the expected roles for a task:
 
 ```bash
-$ winml eval -m model.onnx --model-id microsoft/resnet-50 \
-    --label-mapping labels.json -o results/resnet_eval.json
+$ winml eval -m encoder=encoder.onnx -m decoder=decoder.onnx --model-id microsoft/trocr-base-printed
 ```
 
 ## Common pitfalls
 
 - **ONNX file without `--model-id` fails.** When `-m` is a `.onnx` path, `--model-id` is mandatory. Without it the command cannot resolve the preprocessor or label vocabulary and will exit with a usage error.
-- **Default dataset requires Hub credentials for gated datasets.** Some task defaults (e.g., `imagenet-1k`) require a HuggingFace account with accepted terms of use. Log in with `huggingface-cli login` before running eval on gated data.
+- **The task-default dataset may not match every model.** A default dataset cannot fit every model. Classification and detection models in particular need a dataset whose label space and domain match what the model was trained on — using the default may produce misleadingly low scores, missing-label errors, or a dataset-schema error. Always pass `--dataset` (and `--label-mapping` if needed) when evaluating a model whose label space or domain differs from the task default.
+- **Some dataset requires Hub credentials for gated datasets.** Some datasets (e.g., `imagenet-1k`) require a HuggingFace account with accepted terms of use. Log in with `huggingface-cli login` before running eval on gated data.
 - **`--shuffle` is on by default.** The random 100-sample slice changes between runs unless you pass `--no-shuffle`. Use `--no-shuffle` when comparing two model variants to ensure they see identical samples.
 - **`--streaming` skips the local cache.** Streaming mode avoids downloading the full split but prevents random shuffling on large datasets. For reproducible evaluation, download the split once and omit `--streaming`.
-- **Column names vary across dataset versions.** If the evaluator raises a missing-column error, run `winml eval --schema --task <task>` to inspect the expected schema and use `--column` to remap dataset field names to the expected names.
+- **Column names vary across datasets.** If the evaluator raises a missing-column error, run `winml eval --schema --task <task>` to inspect the expected schema and use `--column` to remap dataset field names to the expected names.
 
 ## See also
 
diff --git a/docs/commands/quantize.md b/docs/commands/quantize.md
index 046723b0e..94fe42563 100644
--- a/docs/commands/quantize.md
+++ b/docs/commands/quantize.md
@@ -21,11 +21,13 @@ $ winml quantize [options]
 |---|---|---|---|---|
 | `--model` | `-m` | path | *(required)* | Input ONNX model file. |
 | `--output` | `-o` | path | `{input}_qdq.onnx` | Output path for the quantized model. |
+| `--task` | | string | — | Task name (e.g., `image-classification`, `text-classification`) used to select a task-appropriate calibration dataset. Pair with `--model-name` so the dataset is preprocessed exactly the way the model expects. Without `--task`, calibration falls back to synthetic random data. |
+| `--model-name` | | string | — | HuggingFace model ID (e.g., `microsoft/resnet-50`) used to load the matching preprocessor/tokenizer for calibration. Only used when `--task` is provided. |
 | `--precision` | `-p` | string | `None` | Precision shorthand: `int8`, `int16`, or mixed-precision like `w8a16`. Overridden by explicit `--weight-type` / `--activation-type`. |
 | `--samples` | | integer | `10` | Number of calibration samples used to compute quantization ranges. |
 | `--method` | | choice | `minmax` | Calibration algorithm: `minmax`, `entropy`, or `percentile`. |
-| `--weight-type` | | choice | `None` | Per-tensor type for weights: `uint8`, `int8`, `uint16`, or `int16`. Overrides `--precision`. |
-| `--activation-type` | | choice | `None` | Per-tensor type for activations: `uint8`, `int8`, `uint16`, or `int16`. Overrides `--precision`. |
+| `--weight-type` | | choice | `uint8` | Per-tensor type for weights: `uint8`, `int8`, `uint16`, or `int16`. Overrides `--precision`. When unset, the effective type comes from `--precision`, or `uint8` if neither is set. |
+| `--activation-type` | | choice | `uint8` | Per-tensor type for activations: `uint8`, `int8`, `uint16`, or `int16`. Overrides `--precision`. When unset, the effective type comes from `--precision`, or `uint8` if neither is set. |
 | `--per-channel/--no-per-channel` | | flag | `false` | Apply per-channel (rather than per-tensor) quantization to weight tensors. |
 | `--symmetric/--no-symmetric` | | flag | `false` | Use symmetric quantization (zero-point fixed at 0). |
 | `--help` | `-h` | flag | | Show this message and exit. |
@@ -42,6 +44,16 @@ Precision can be set at a coarse level with `--precision` or tuned per tensor
 type with `--weight-type` and `--activation-type`; explicit type flags always
 override `--precision`.
 
+Calibration data is selected from `--task` and `--model-name`. For a supported
+task, a built-in default calibration dataset is loaded and preprocessed through
+the model's own tokenizer or image processor, so the calibration tensors match
+what the model will see at inference time. For an unsupported task — or when
+`--task` is omitted entirely — calibration falls back to synthetic random data
+synthesized from the ONNX input specification. Random-data calibration is fast
+and always works, but the resulting scales are typically less accurate than
+dataset-driven calibration, so always provide `--task` and `--model-name` when
+the model task is supported.
+
 ## Examples
 
 ```bash
@@ -65,6 +77,11 @@ QDQ nodes inserted: 53
 Total time: 4.31s
 ```
 
+```bash
+# Task-aware calibration: real samples preprocessed through the model's own image processor
+winml quantize -m resnet50.onnx --task image-classification --model-name microsoft/resnet-50 --samples 128
+```
+
 ```bash
 # int8 precision shorthand (equivalent to --weight-type int8 --activation-type int8)
 winml quantize -m resnet50.onnx -p int8
@@ -72,16 +89,12 @@ winml quantize -m resnet50.onnx -p int8
 
 ```bash
 # Mixed-precision: int8 weights, uint16 activations with entropy calibration
-winml quantize -m bert-base-uncased.onnx \
-  --weight-type int8 --activation-type uint16 \
-  --method entropy --samples 64
+winml quantize -m bert-base-uncased.onnx --weight-type int8 --activation-type uint16 --method entropy --samples 64
 ```
 
 ```bash
 # Per-channel symmetric quantization to a specific output path
-winml quantize -m facebook_convnext.onnx \
-  -o facebook_convnext_qdq.onnx \
-  --per-channel --symmetric --samples 32
+winml quantize -m facebook_convnext.onnx -o facebook_convnext_qdq.onnx --per-channel --symmetric --samples 32
 ```
 
 ```bash
@@ -91,21 +104,12 @@ winml quantize -m bert-base-uncased.onnx --precision int16
 
 ## Common pitfalls
 
-- **`--weight-type` / `--activation-type` silently override `--precision`.**
-  If you pass both, the explicit type flags win. Omit `--precision` when
-  setting types explicitly to avoid confusion.
-- **Low sample counts can hurt accuracy.** The default of 10 samples is
-  sufficient for quick testing, but production models typically need 64–256
-  representative samples for good calibration.
-- **`--per-channel` increases model size.** Per-channel quantization stores a
-  separate scale and zero-point per output channel; this can noticeably inflate
-  the model file size compared to per-tensor mode.
-- **Output defaults to `{stem}_qdq.onnx` in the same directory as input.**
-  Always pass `-o` when writing to a specific location to avoid accidentally
-  overwriting or cluttering the source directory.
-- **Quantizing an already-quantized model (one containing QDQ nodes) is
-  unsupported and will produce incorrect results.** Use `winml compile
-  --no-quant` instead if the model already contains QDQ nodes.
+- **Calibration uses synthetic random data by default.** Without `--task` and `--model-name`, scales and zero-points are computed from random tensors synthesized from the ONNX input specification — the model never sees realistic activations, so accuracy after quantization can degrade noticeably. Always pass `--task` and `--model-name` for supported tasks (e.g., `--task image-classification --model-name microsoft/resnet-50`) so calibration runs on real samples preprocessed through the model's own tokenizer or image processor.
+- **`--weight-type` / `--activation-type` silently override `--precision`.** If you pass both, the explicit type flags win. Omit `--precision` when setting types explicitly to avoid confusion.
+- **Low sample counts can hurt accuracy.** The default of 10 samples is sufficient for quick testing, but production models typically need 64–256 representative samples for good calibration.
+- **`--per-channel` increases model size.** Per-channel quantization stores a separate scale and zero-point per output channel; this can noticeably inflate the model file size compared to per-tensor mode.
+- **Output defaults to `{stem}_qdq.onnx` in the same directory as input.** Always pass `-o` when writing to a specific location to avoid accidentally overwriting or cluttering the source directory.
+- **Quantizing an already-quantized model (one containing QDQ nodes) is unsupported and will produce incorrect results.** Use `winml compile --no-quant` instead if the model already contains QDQ nodes.
 
 ## See also
 

From 0426ec88482efa77a4a482a870a3c1656e78a79d Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 12 Jun 2026 08:31:39 +0800
Subject: [PATCH 142/143] docs: fix --ep choices in compile.md, clarify
 quantize defaults

- compile: remove invalid 'cuda' and 'tensorrt' from --ep list, add correct aliases
- quantize: --weight-type/--activation-type default is resolved (not hardcoded uint8)
---
 docs/commands/compile.md  | 2 +-
 docs/commands/quantize.md | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/commands/compile.md b/docs/commands/compile.md
index d5e516971..a3d4e8f46 100644
--- a/docs/commands/compile.md
+++ b/docs/commands/compile.md
@@ -22,7 +22,7 @@ $ winml compile [options]
 | `--output` | `-o` | path | — | Output file path (e.g., `model_compiled.onnx`). Takes precedence over `--output-dir`. |
 | `--output-dir` | | path | same dir as input | Directory to write compiled output artifacts. |
 | `--device` | `-d` | choice | `auto` | Target device: `auto`, `npu`, `gpu`, or `cpu`. |
-| `--ep` | | choice | `None` | Force a specific execution provider, overriding device-to-provider mapping. Choices: `cpu`, `cuda`, `dml`, `migraphx`, `openvino`, `qnn`, `tensorrt`, `vitisai`. |
+| `--ep` | | `TEXT` | — | Force a specific execution provider, overriding device-to-provider mapping. Accepts full names (e.g., `QNNExecutionProvider`) or aliases (`qnn`, `dml`, `openvino`, `vitisai`, `migraphx`, `cpu`, `nvtensorrtrtx`). |
 | `--validate` / `--no-validate` | | flag | `--validate` | Run a post-compilation validation pass on the target hardware. Enabled by default; pass `--no-validate` to skip when the target hardware or driver is unavailable. |
 | `--compiler` | | choice | `ort` | Compiler backend: `ort` (ONNX Runtime) or `qairt` (Qualcomm AI Runtime Tools). |
 | `--qnn-sdk-root` | | path | `None` | Path to the QNN SDK root directory. |
diff --git a/docs/commands/quantize.md b/docs/commands/quantize.md
index 94fe42563..51128a046 100644
--- a/docs/commands/quantize.md
+++ b/docs/commands/quantize.md
@@ -26,8 +26,8 @@ $ winml quantize [options]
 | `--precision` | `-p` | string | `None` | Precision shorthand: `int8`, `int16`, or mixed-precision like `w8a16`. Overridden by explicit `--weight-type` / `--activation-type`. |
 | `--samples` | | integer | `10` | Number of calibration samples used to compute quantization ranges. |
 | `--method` | | choice | `minmax` | Calibration algorithm: `minmax`, `entropy`, or `percentile`. |
-| `--weight-type` | | choice | `uint8` | Per-tensor type for weights: `uint8`, `int8`, `uint16`, or `int16`. Overrides `--precision`. When unset, the effective type comes from `--precision`, or `uint8` if neither is set. |
-| `--activation-type` | | choice | `uint8` | Per-tensor type for activations: `uint8`, `int8`, `uint16`, or `int16`. Overrides `--precision`. When unset, the effective type comes from `--precision`, or `uint8` if neither is set. |
+| `--weight-type` | | choice | — | Per-tensor type for weights: `uint8`, `int8`, `uint16`, or `int16`. Overrides `--precision`. When unset, defaults to `uint8` (or the type implied by `--precision`). |
+| `--activation-type` | | choice | — | Per-tensor type for activations: `uint8`, `int8`, `uint16`, or `int16`. Overrides `--precision`. When unset, defaults to `uint8` (or the type implied by `--precision`). |
 | `--per-channel/--no-per-channel` | | flag | `false` | Apply per-channel (rather than per-tensor) quantization to weight tensors. |
 | `--symmetric/--no-symmetric` | | flag | `false` | Use symmetric quantization (zero-point fixed at 0). |
 | `--help` | `-h` | flag | | Show this message and exit. |

From 9e961e8cf68af7e2744ccfbc145d3de609e39e47 Mon Sep 17 00:00:00 2001
From: Qiong Wu <qiowu@microsoft.com>
Date: Fri, 12 Jun 2026 15:09:41 +0800
Subject: [PATCH 143/143] docs: address PR review comments

- sys.md: fix EP mapping (QNN -> NPU/GPU, not just NPU)
- CONTRIBUTING.md: remove Linux/macOS unzip comment (Windows-only project)
- docs/contributing.md: sync with CONTRIBUTING.md
- .pre-commit-config.yaml: remove unnecessary --unsafe arg from check-yaml
- .gitignore: add comment explaining docs/versions.json
---
 .gitignore              | 2 ++
 .pre-commit-config.yaml | 1 -
 CONTRIBUTING.md         | 3 ---
 docs/commands/sys.md    | 2 +-
 docs/contributing.md    | 3 ---
 5 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/.gitignore b/.gitignore
index 0dc405bae..2184e9f72 100644
--- a/.gitignore
+++ b/.gitignore
@@ -264,4 +264,6 @@ specs/
 
 # Runtime check rule artifacts (hosted in external repo)
 src/winml/modelkit/analyze/rules/runtime_check_rules/**/*.parquet
+
+# Generated by mike (docs versioning)
 docs/versions.json
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index ade35422b..d189b0585 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -6,7 +6,6 @@ repos:
       - id: trailing-whitespace
         args: [--markdown-linebreak-ext=md]
       - id: check-yaml
-        args: [--unsafe]
 
   - repo: https://github.com/Lucas-C/pre-commit-hooks
     rev: v1.5.5
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 312f01a57..0602c3003 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -30,10 +30,7 @@ For external contributors, download from a GitHub release:
 
 ```bash
 gh release download <tag> --repo microsoft/winml-cli --pattern 'rules-v*.zip' --dir .
-# Windows:
 Expand-Archive -Path .\rules-v*.zip -DestinationPath src\winml\modelkit\analyze\rules\runtime_check_rules -Force
-# Linux/macOS:
-# unzip -o rules-v*.zip -d src/winml/modelkit/analyze/rules/runtime_check_rules
 ```
 
 ## Coding conventions and standards
diff --git a/docs/commands/sys.md b/docs/commands/sys.md
index 167692846..1ec4dacc7 100644
--- a/docs/commands/sys.md
+++ b/docs/commands/sys.md
@@ -69,7 +69,7 @@ Available Devices (priority order)
   #3  CPU   Snapdragon(R) X Elite
 
 Available Execution Providers
-  QNNExecutionProvider           -> NPU
+  QNNExecutionProvider           -> NPU/GPU
   DmlExecutionProvider           -> GPU
   CPUExecutionProvider           -> CPU
 ```
diff --git a/docs/contributing.md b/docs/contributing.md
index b76f8e934..21e159916 100644
--- a/docs/contributing.md
+++ b/docs/contributing.md
@@ -13,10 +13,7 @@ uv run pre-commit install
 
 # Download runtime check rules (required for `winml analyze`)
 gh release download <tag> --repo microsoft/winml-cli --pattern 'rules-v*.zip' --dir .
-# Windows:
 Expand-Archive -Path .\rules-v*.zip -DestinationPath src\winml\modelkit\analyze\rules\runtime_check_rules -Force
-# Linux/macOS:
-# unzip -o rules-v*.zip -d src/winml/modelkit/analyze/rules/runtime_check_rules
 
 # Run tests
 uv run pytest tests/ -m "not e2e and not npu and not gpu"