Skip to content

feat(e2e_eval): add --build-only mode with per-EP matrix, export dedup, and Azure Artifacts upload#845

Open
KayMKM wants to merge 6 commits into
mainfrom
yuesu/build-only-ep-matrix
Open

feat(e2e_eval): add --build-only mode with per-EP matrix, export dedup, and Azure Artifacts upload#845
KayMKM wants to merge 6 commits into
mainfrom
yuesu/build-only-ep-matrix

Conversation

@KayMKM

@KayMKM KayMKM commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds a --build-only mode to scripts/e2e_eval/run_eval.py that generates ModelKit pipeline artifacts (export → optimize → quantize, no compile) across the full EP matrix, without requiring EP hardware, and optionally streams them to the Modelkit Azure Artifacts feed while bounding local disk usage.

Motivation: we need a way to mass-produce per-EP model artifacts for ~200 models on a single (CPU) box and distribute them, but (a) the build normally needs the target EP installed, and (b) writing every stage for every (model × EP) fills the disk fast.

What's included

1. --build-only mode + per-EP matrix

  • Runs winml config + winml build --no-compile per model; perf/accuracy are skipped.
  • When --ep/--device are omitted, builds the eval EP matrix into <model_dir>/<ep>_<device>/ subdirs: qnn_npu, qnn_gpu, ov_cpu, ov_npu, ov_gpu, mlas_cpu, dml_gpu, vitisai_npu. Pinning --ep/--device writes a single build directly into <model_dir>.
  • Precision per combo reuses the existing eval policy (NPU → w8a16, CPU/GPU → auto, native-quant EPs → --no-quant).

2. Cross-EP / cross-host builds (core fix in build.py)

  • _run_optimize_stage called resolve_device(device, ep=ep) purely to pick a progress-bar key, which raised when the target EP wasn't installed locally — blocking offline generation of (e.g.) QNN/VitisAI artifacts on a CPU box.
  • Now: when the build won't compile (config.compile is None), the missing-EP lookup soft-fails and falls back to the requested device (optimize/quantize only need the EP's static rule data, not a registered EP). When compile will run, it still fails fast. No behavior change for normal compiling builds.

3. Export dedup (disk saver)

  • The export.onnx stage is EP/device-independent, so all 8 combos produce an identical export. After each combo builds, its export is hash-compared against a per-model canonical: the first is moved to <model_dir>/_shared/, later identical ones are deleted — one export copy instead of 8 (export is the largest, full-precision artifact).

4. --upload: stream to Azure Artifacts feed, then delete locally

  • After a model's combos are built, publishes the whole model dir to the Modelkit feed as a Universal Package, then deletes the local copy — peak disk stays at ~one model's matrix.
  • Auth via az login (Entra ID), no PAT. The azure-devops extension is ensured and login verified up front; if not ready, the run aborts (so disk isn't silently filled).
  • Package: single name winml-cli-models, one version per model: 0.0.0-<run-stamp>-<model-slug> (valid SemVer 2.0; the 0.0.0- core keeps it legal, the date stamp + slug are the pre-release segment). The shared run-stamp groups a batch.
  • Resume: --continue + the same --run-stamp skips already-uploaded models without rebuilding them. Already-uploaded models are detected two ways: a local build_only_uploads.json manifest, and a query against the feed itself at startup (versions matching 0.0.0-<run-stamp>-*). Because the manifest is only written after a successful upload, a fresh --output-dir would otherwise start empty and rebuild everything; seeding from the feed makes it authoritative for what's published, so resume works regardless of local state. A feed-query failure falls back to local-manifest-only behavior.
  • Extra flags: --run-stamp, --keep-local, --upload-skip-existing, --feed/--feed-org/--feed-project/--package-name.

Usage

# Build the full EP matrix for P0 models, stream to the feed, delete locals
uv run python scripts/e2e_eval/run_eval.py --build-only --upload --priority P0

# Resume an interrupted batch (same run-stamp; a fresh output dir is fine)
uv run python scripts/e2e_eval/run_eval.py --build-only --upload --continue \
  --run-stamp 20260609 --priority P0

# Local only (no feed), or pin a single EP/device
uv run python scripts/e2e_eval/run_eval.py --build-only --hf-model microsoft/resnet-50

Download a specific model's specific file later:

az artifacts universal download \
  --organization https://dev.azure.com/microsoft --project windows.ai.toolkit \
  --scope project --feed Modelkit --name winml-cli-models \
  --version 0.0.0-20260609-microsoft-resnet-50-image-classification \
  --path ./out --file-filter 'qnn_npu/model.onnx*'

Notes / scope

  • src/winml/modelkit/commands/build.py change is gated on config.compile is None, so it only affects no-compile builds; compiling builds are unchanged and still fail fast on a missing EP.
  • --continue resumes from the feed (versions matching 0.0.0-<run-stamp>-*) and the local build_only_uploads.json manifest, so a resumed run only needs the same --run-stamp — it no longer has to reuse the same --output-dir. The feed query uses two &-free REST GETs (list packages → resolve the UPack package GUID → list versions) because az resolves to az.cmd and cmd.exe splits query strings on &, dropping every parameter after the first.
  • Verified manually end-to-end on a CPU host: full 8-EP matrix build, export dedup, publish to Modelkit, and --file-filter download of individual model.onnx files.
  • The feed: https://dev.azure.com/microsoft/windows.ai.toolkit/_artifacts/feed/Modelkit/UPack/winml-cli-models/overview/0.0.0-20260609-prajjwal1-bert-tiny-text-classification

KayMKM added 3 commits June 5, 2026 16:44
Add a --build-only mode to run_eval.py that runs config + build with
--no-compile, writing each pipeline stage's ONNX (export/optimize/quantize)
without requiring execution-provider hardware. Perf and accuracy are skipped.

When --ep/--device are omitted, every model is built once per EP in the
build-only matrix (qnn npu/gpu, openvino cpu/npu/gpu, mlas, dml, vitisai)
into <model_dir>/<ep>_<device>/ subdirs. When either is pinned, a single
build is written directly into the model dir. Precision per combo reuses
the existing _resolve_precision policy (NPU w8a16, CPU/GPU auto, native-quant
EPs unquantized).

Reuses the existing _run_build via a build_only flag (-o <dir> --no-compile
instead of --use-cache).
Two bugs surfaced when running `run_eval.py --build-only` against the EP matrix on a CPU-only host:

1. Every combo for the 'no native EP' subset (mlas/dml/openvino) was reported as `[FAIL @ complete]` even though export/optimize/quantize/model.onnx all landed correctly. `_run_build` was funnelling build-only results through `_extract_onnx_path`, which scans stdout for a `Final artifact:` marker that `winml build --no-compile` never prints, and falls back to the global cache which build-only doesn't populate (`-o <dir>` writes elsewhere). In build-only mode there is no downstream consumer of the path, so trust exit-code 0 directly and record `build_out` to keep the per-component bookkeeping balanced.

2. QNN/VitisAI combos failed at the optimize stage with `Requested EP 'qnn' is not available on this system`. `_run_optimize_stage` calls `resolve_device(device, ep=ep)` purely to pick the right `has_rule_data_for_ep` key for the progress bar, but that helper raises when the EP isn't installed locally -- even when the rest of the pipeline (export + optimize + quantize) runs on CPU and the EP is only needed at compile time. Soft-fail the lookup *only when* `config.compile is None` (i.e. `--no-compile` or a config that explicitly opts out); otherwise re-raise so configs that will compile still fail fast here instead of deep inside the compile stage.

Also moves `--clean-cache` from per-combo to per-model in `_run_build_only`: combos for the same model share the same HF download, so clearing between combos forced N redundant re-downloads of the same weights.
…facts feed

Running --build-only over the 8-EP matrix for many models fills local disk.
Two additions keep disk bounded:

1. Export dedup: the export.onnx stage is EP/device-independent, so every
   combo produces an identical export. After each combo builds, its export is
   hash-compared against a per-model canonical: the first is moved to
   <model_dir>/_shared/, later identical ones are deleted. One export copy on
   disk instead of 8.

2. --upload: after a model's combos are built, publish the model dir to the
   Modelkit Azure Artifacts feed as a Universal Package version, then delete it
   locally. Auth via az login (no PAT); the azure-devops extension is ensured
   and login verified up front (aborts otherwise so disk isn't silently filled).
   Version is 0.0.0-<run-stamp>-<model-slug> (valid SemVer 2.0; date stamp
   groups a batch). --continue + --run-stamp resume an interrupted batch from
   the build_only_uploads.json manifest without rebuilding uploaded models;
   --keep-local, --upload-skip-existing, and feed/package args round it out.
@KayMKM KayMKM requested a review from a team as a code owner June 9, 2026 07:57
@KayMKM KayMKM changed the title Yuesu/build only ep matrix feat(e2e_eval): add --build-only mode with per-EP matrix, export dedup, and Azure Artifacts upload Jun 9, 2026

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the full diff. Overall structure is solid and docstrings are thorough. Found a potential data-loss path (false-positive conflict detection triggers local directory deletion), a silent no-op for --continue in the non-upload build-only path, and a few other correctness/usability issues — see inline comments.

Comment thread scripts/e2e_eval/run_eval.py Outdated
with p.open("rb") as fh:
for chunk in iter(lambda: fh.read(1 << 20), b""):
h.update(chunk)
except OSError:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_hash_files: OSError swallowed with a fixed sentinel — potential dedup false-positive

When a file cannot be opened, the hash is updated with the literal bytes b"<unreadable>" regardless of which file failed or why. Two distinct files that both happen to be unreadable will produce the same hash, which could cause the duplicate-detection path to delete an artifact it never actually verified was identical.

Safer options:

  • Re-raise the OSError so the caller can skip dedup for that combo.
  • Include the filename in the sentinel so two different unreadable files don't collide: h.update(p.name.encode() + b":unreadable").

return "0.0.0-" + _slugify_version("-".join(parts))


def _is_publish_conflict(proc: dict) -> bool:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_is_publish_conflict: substring match on "conflict" is too broad — data-loss risk

Checking for the substring conflict in combined stdout+stderr will produce a false positive any time the Azure CLI emits that word in an unrelated context (e.g. "merge conflict", "address conflict", "proxy conflict"). A false positive here triggers exists-skipped status and deletes the local model directory via shutil.rmtree, making this a potential data-loss path.

Consider narrowing to the specific HTTP 409 indicators or the Azure DevOps error code: StatusCode=409, HttpStatusCode: 409, or PACKAGEVERSIONEXISTS. These are far less likely to appear in unrelated messages.


use_matrix = args.ep is None and args.device == "auto"
# Single combo uses an empty label (no subdir); matrix uses (label, ep, device).
combos = list(_BUILD_ONLY_EP_MATRIX) if use_matrix else [("", args.ep, args.device)]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pinned-EP-only case: _resolve_precision receives device="auto" — w8a16 default silently dropped

When the user sets --ep but omits --device (default is "auto"), use_matrix is False (because args.ep is not None) so combos = [("", args.ep, "auto")]. Then _resolve_precision("auto", ...) returns None since "auto" != "npu", and the config is generated without a precision flag.

This is inconsistent with the matrix path: entry ("qnn_npu", "qnn", "npu") correctly applies w8a16 because device is explicitly npu. Consider resolving "auto" to a concrete device per-EP before calling _resolve_precision, or documenting that precision auto-detection is intentionally delegated to winml config in this pinned path.

# Resume: skip models already uploaded for this run-stamp (manifest is
# the source of truth; failed/unrecorded models are (re)built). Skipping
# here avoids the expensive 8-EP rebuild, not just the upload.
if args.upload and args.continue_run:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--continue is silently a no-op in --build-only without --upload

The resume skip is gated on if args.upload and args.continue_run, so --build-only --continue without --upload does nothing — all models are rebuilt from scratch. There is no filesystem-based resume (e.g. checking whether model_dir already contains the expected combo subdirs).

Note: the pre-filtering of entries by --continue in main() lives inside the if args.list_json: branch, and the eval-path filtering is after the if args.build_only: return, so neither runs in build-only mode.

A user resuming an interrupted non-upload batch with --build-only --continue will silently rebuild everything. Consider either:

  1. Adding a local-disk skip (check if combo subdir already exists and has artifacts), or
  2. Raising an error/warning when --continue is combined with --build-only but not --upload.

parser.add_argument(
"--upload-skip-existing",
dest="upload_skip_existing",
action="store_true",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misleading --upload-skip-existing help text — the build is NOT skipped

The current text says "skip building+uploading a model whose package version already exists in the feed", but this flag only affects the publish step. By the time a conflict is detected, the model has already been fully built across all EP combos. The flag merely prevents treating the conflict as a failure.

The flag that actually skips the build is --continue (manifest-based resume). Suggested rewrite:

With --upload: treat a version-already-exists publish conflict as success and delete the local copy. To skip rebuilding entirely for already-uploaded models, use --continue.

Comment thread src/winml/modelkit/commands/build.py Outdated
_resolved_device, _ = _resolve_device(device=device or "auto", ep=ep)
try:
_resolved_device, _ = _resolve_device(device=device or "auto", ep=ep)
except ValueError:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

except ValueError is overly broad — may swallow unrelated errors

The intent (per the comment) is to catch "EP not installed on this host", but _resolve_device could also raise ValueError for invalid device names, malformed input, or other programming errors. With config.compile is None, all such errors are silently swallowed and _resolved_device falls back to device or "".

If _resolve_device raises a typed exception for hardware-unavailability (or exposes a separate is_device_available check), prefer that. Otherwise, consider at least checking str(e) for expected keywords (e.g. "not available", "not installed") before swallowing, and re-raising for unexpected messages.

KayMKM added 2 commits June 10, 2026 12:03
… versions

The --continue skip logic only consulted the local build_only_uploads.json manifest, which is written after each successful upload. A fresh --output-dir (e.g. a gitignored temp dir) starts empty, so models already published to the Azure Artifacts feed under the same run-stamp were rebuilt and re-uploaded instead of being skipped.

Seed the in-memory manifest from the feed at startup: query the feed REST API for versions matching 0.0.0-<run-stamp>-* and mark them as uploaded so the existing skip check honors them. The feed is now authoritative for what's published, regardless of local state. Querying is best-effort -- a failure falls back to local-manifest-only behavior.

Use two ampersand-free REST GETs (list packages -> resolve UPack package GUID -> list versions) because az resolves to az.cmd and cmd.exe splits query strings on '&', dropping every parameter after the first.
- _hash_files: stop hashing unreadable files to a fixed sentinel; propagate
  OSError and have _dedup_export keep the export in place instead of risking
  deletion of an artifact never verified identical.
- _is_publish_conflict: narrow detection to specific version-exists / HTTP 409
  markers (drop bare 'conflict'/'409') so an unrelated message can't trigger
  exists-skipped and rmtree the local model dir.
- build.py _run_optimize_stage: narrow the no-compile EP fallback to only
  swallow EP-not-available ValueErrors; re-raise malformed device/EP names.
- Warn when --continue is used with --build-only but without --upload (no
  local-disk resume exists, so everything is rebuilt).
- Document that the pinned-EP auto-device path delegates precision to winml
  config's auto-detection.
- Fix misleading --upload-skip-existing help: it does not skip the build.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants