99 changes: 99 additions & 0 deletions scripts/e2e_eval/README.md
@@ -91,6 +91,105 @@ uv run python scripts/e2e_eval/run_eval.py --retry-failed
| `--verbose` | off | Print stderr for failed models |
| `--continue` | off | Skip models with existing results |
| `--retry-failed [TYPE ...]` | — | Re-run failed models (implies `--continue`) |
| `--build-only` | off | Build with `--no-compile`, writing each stage's ONNX (no EP needed). Loops the EP matrix when `--ep`/`--device` omitted |

#### `--build-only` — Generate per-stage models (no EP required)

`--build-only` runs the config and build phases with `--no-compile` and writes the ONNX
file for each stage: `export.onnx`, `optimized.onnx`, and `quantized.onnx`. Because
compilation is skipped, it needs **no execution-provider hardware** and runs on any CPU
machine. The perf and accuracy phases are skipped.

When `--ep`/`--device` are **omitted**, every model is built once per EP in the
build-only matrix, each into a `<ep>_<device>/` subdir:

| Label | EP | Device |
|---|---|---|
| `qnn_npu` | qnn | npu |
| `qnn_gpu` | qnn | gpu |
| `ov_cpu` | openvino | cpu |
| `ov_npu` | openvino | npu |
| `ov_gpu` | openvino | gpu |
| `mlas_cpu` | cpu (MLAS) | cpu |
| `dml_gpu` | dml | gpu |
| `vitisai_npu` | vitisai | npu |

Precision per combo follows the eval policy: NPU combos default to `w8a16`, CPU and GPU
combos omit the precision flag (winml auto), and native-quant EPs (VitisAI) are built
unquantized (`--no-quant`). When `--ep` or `--device` is pinned, a single build is
written directly into `<output-dir>/models/<slug>/` instead of a per-EP subdir.

```bash
# Build all EP-matrix variants for P0 models (8 builds per model)
uv run python scripts/e2e_eval/run_eval.py --build-only --priority P0

# Pin a single EP/device (no matrix; writes directly to model dir)
uv run python scripts/e2e_eval/run_eval.py --build-only --hf-model microsoft/resnet-50 --ep qnn --device npu
```
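
The matrix and precision policy described above can be written down as a small lookup. This is a minimal sketch, not the code in `run_eval.py`: `Combo`, `BUILD_ONLY_MATRIX`, `NATIVE_QUANT_EPS`, and `precision_for` are hypothetical names, and the return values `"auto"` and `"unquantized"` are illustrative labels rather than real flag values.

```python
from typing import NamedTuple


class Combo(NamedTuple):
    label: str
    ep: str
    device: str


# The eight combos from the build-only matrix table.
BUILD_ONLY_MATRIX = [
    Combo("qnn_npu", "qnn", "npu"),
    Combo("qnn_gpu", "qnn", "gpu"),
    Combo("ov_cpu", "openvino", "cpu"),
    Combo("ov_npu", "openvino", "npu"),
    Combo("ov_gpu", "openvino", "gpu"),
    Combo("mlas_cpu", "cpu", "cpu"),
    Combo("dml_gpu", "dml", "gpu"),
    Combo("vitisai_npu", "vitisai", "npu"),
]

# Assumption: VitisAI is the only native-quant EP in the matrix.
NATIVE_QUANT_EPS = {"vitisai"}


def precision_for(combo: Combo) -> str:
    """Return an illustrative precision label for one combo."""
    if combo.ep in NATIVE_QUANT_EPS:
        return "unquantized"  # corresponds to building with --no-quant
    if combo.device == "npu":
        return "w8a16"  # NPU default
    return "auto"  # CPU/GPU: precision flag omitted


if __name__ == "__main__":
    for combo in BUILD_ONLY_MATRIX:
        print(f"{combo.label}: {precision_for(combo)}")
```

The native-quant check comes first because `vitisai_npu` is also an NPU combo and would otherwise pick up the `w8a16` default.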

Composite models (multiple sub-components) are built into per-component subdirectories
under each EP subdir.

**Export dedup**: the `export.onnx` stage is EP/device-independent, so it is identical
across all matrix combos. It is stored once under `<model_dir>/_shared/export.onnx`
and removed from each `<ep>_<device>/` subdir, keeping only one copy on disk.
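
As a rough illustration of the dedup step, the sketch below moves the first `export.onnx` it finds into `_shared/` and deletes the remaining per-combo copies. `dedup_export` is a hypothetical helper, not the script's actual function; it trusts the statement above that all copies are identical (it does not compare them) and ignores composite-model subdirectories.

```python
import shutil
from pathlib import Path


def dedup_export(model_dir: Path) -> None:
    """Keep one export.onnx under _shared/ and drop the per-combo copies."""
    shared = model_dir / "_shared" / "export.onnx"
    # Materialize the listing first so creating _shared/ does not affect iteration.
    combo_dirs = sorted(p for p in model_dir.iterdir() if p.is_dir())
    for combo_dir in combo_dirs:
        if combo_dir.name == "_shared":
            continue
        export = combo_dir / "export.onnx"
        if not export.is_file():
            continue
        if not shared.exists():
            # First copy found becomes the shared one.
            shared.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(export), str(shared))
        else:
            # Later copies are redundant and can be removed.
            export.unlink()
```

Other stage files such as `optimized.onnx` and `quantized.onnx` stay in their `<ep>_<device>/` subdirs, since they depend on the combo.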

#### Streaming upload to the Azure Artifacts feed (`--upload`)

Running the full matrix over many models fills the local disk quickly. `--upload`
publishes each model's artifacts to the **`Modelkit`** Azure Artifacts feed
(as a Universal Package) as soon as its combos are built, then deletes the local copy,
so peak disk usage stays at roughly one model's matrix.

- **Auth**: uses `az login` (Entra ID), no PAT. The script checks that the
  `azure-devops` az extension is installed (adding it automatically if missing) and
  that you are logged in; if you are not, it aborts rather than silently filling the disk.
- **Package**: one package, `winml-cli-models`, with **one version per model**, named
  `0.0.0-<run-stamp>-<model-slug>`. The run-stamp is a date (`YYYYMMDD`, default today),
  e.g. `0.0.0-20260609-microsoft-resnet-50-image-classification`. The `0.0.0-` core
  keeps the version valid SemVer 2.0; the stamp and slug form the pre-release segment.
  The shared run-stamp prefix groups a batch together.
- **Manifest**: a `build_only_uploads.json` manifest (version → run-stamp → combos →
  status) is written to the output dir; it drives `--continue`.
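
To make the version scheme concrete, here is a minimal sketch that assembles the version string and checks it against the regular expression suggested by the SemVer 2.0.0 specification. `make_version` is a hypothetical helper, and it assumes the model slug has already been reduced to letters, digits, and hyphens; how the script derives the slug is not shown here.

```python
import re
from datetime import date
from typing import Optional

# Regular expression suggested by the SemVer 2.0.0 specification.
SEMVER_RE = re.compile(
    r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$"
)


def make_version(model_slug: str, run_stamp: Optional[str] = None) -> str:
    """Build `0.0.0-<run-stamp>-<model-slug>` and check that it is valid SemVer."""
    stamp = run_stamp or date.today().strftime("%Y%m%d")
    version = f"0.0.0-{stamp}-{model_slug}"
    if not SEMVER_RE.match(version):
        raise ValueError(f"not a valid SemVer 2.0 version: {version!r}")
    return version


if __name__ == "__main__":
    print(make_version("microsoft-resnet-50-image-classification", "20260609"))
```

Because the stamp and slug are joined by hyphens rather than dots, they form a single alphanumeric pre-release identifier, so the rule against leading zeros in numeric identifiers does not apply to the date.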

```bash
# Build the matrix and stream each model to the feed, deleting locals
uv run python scripts/e2e_eval/run_eval.py --build-only --upload --priority P0

# Resume an interrupted batch: same run-stamp + --continue skips models already
# uploaded (per the manifest) without rebuilding them.
uv run python scripts/e2e_eval/run_eval.py --build-only --upload --continue \
--run-stamp 20260609 --priority P0

# --upload-skip-existing: if the feed already has a version (e.g. manifest lost),
# treat the publish conflict as done and delete the local copy.
uv run python scripts/e2e_eval/run_eval.py --build-only --upload --upload-skip-existing

# Upload but keep local copies (debug)
uv run python scripts/e2e_eval/run_eval.py --build-only --upload --keep-local
```
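
The resume behaviour can be pictured as a lookup against the manifest. The sketch below is illustrative only: the real layout of `build_only_uploads.json` is not documented here, so it assumes a JSON object keyed by version whose entries carry hypothetical `run_stamp` and `status` fields, with `"uploaded"` as the assumed success value. `load_uploaded` is likewise a made-up name.

```python
import json
from pathlib import Path


def load_uploaded(manifest_path: Path, run_stamp: str) -> set[str]:
    """Return the versions recorded as uploaded for the given run-stamp."""
    if not manifest_path.is_file():
        return set()  # no manifest yet: nothing to skip
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    return {
        version
        for version, entry in manifest.items()
        # Field names and the "uploaded" value are assumptions for illustration.
        if entry.get("run_stamp") == run_stamp and entry.get("status") == "uploaded"
    }
```

A resumed run would then skip any model whose version is in the returned set and build and upload the rest.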

To download a single file from one model's package later, pass `--file-filter`:

```bash
az artifacts universal download \
--organization https://dev.azure.com/microsoft --project windows.ai.toolkit \
--scope project --feed Modelkit --name winml-cli-models \
--version 0.0.0-20260609-microsoft-resnet-50-image-classification \
--path ./out --file-filter 'qnn_npu/quantized.onnx'
```

| Upload flag | Default | Description |
|---|---|---|
| `--upload` | off | Publish each model dir to the feed, then delete it locally |
| `--run-stamp` | today (`YYYYMMDD`) | Version prefix; pass the same stamp + `--continue` to resume |
| `--continue` | off | Skip models already uploaded for this run-stamp (no rebuild) |
| `--feed` | `Modelkit` | Azure Artifacts feed name |
| `--feed-org` | `https://dev.azure.com/microsoft` | Azure DevOps org URL |
| `--feed-project` | `windows.ai.toolkit` | Project for the project-scoped feed |
| `--package-name` | `winml-cli-models` | Universal Package name |
| `--keep-local` | off | Upload but do not delete the local dir |
| `--upload-skip-existing` | off | Treat an existing feed version as done (feed-based resume) |
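
For scripted downloads, the `az artifacts universal download` call shown earlier can be assembled in Python. This is a sketch under assumptions: `download_args` and `download` are hypothetical helpers that are not part of `run_eval.py`, and their defaults simply mirror the feed defaults in the table above.

```python
import shutil
import subprocess


def download_args(
    version: str,
    file_filter: str,
    out_dir: str = "./out",
    *,
    org: str = "https://dev.azure.com/microsoft",
    project: str = "windows.ai.toolkit",
    feed: str = "Modelkit",
    package: str = "winml-cli-models",
) -> list[str]:
    """Build the argument list for `az artifacts universal download`."""
    return [
        "artifacts", "universal", "download",
        "--organization", org,
        "--project", project,
        "--scope", "project",
        "--feed", feed,
        "--name", package,
        "--version", version,
        "--path", out_dir,
        "--file-filter", file_filter,
    ]


def download(version: str, file_filter: str, out_dir: str = "./out") -> None:
    """Run the download; requires `az login` and the azure-devops extension."""
    az = shutil.which("az")  # resolves az.cmd on Windows as well
    if az is None:
        raise RuntimeError("Azure CLI (az) not found on PATH")
    subprocess.run([az, *download_args(version, file_filter, out_dir)], check=True)
```

For example, `download("0.0.0-20260609-microsoft-resnet-50-image-classification", "qnn_npu/quantized.onnx")` would fetch only the quantized QNN NPU model into `./out`.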

### `generate_report.py` — Regenerate Reports
