feat(e2e_eval): add --build-only mode with per-EP matrix, export dedup, and Azure Artifacts upload#845
feat(e2e_eval): add --build-only mode with per-EP matrix, export dedup, and Azure Artifacts upload#845KayMKM wants to merge 6 commits into
Conversation
Add a --build-only mode to run_eval.py that runs config + build with --no-compile, writing each pipeline stage's ONNX (export/optimize/quantize) without requiring execution-provider hardware. Perf and accuracy are skipped. When --ep/--device are omitted, every model is built once per EP in the build-only matrix (qnn npu/gpu, openvino cpu/npu/gpu, mlas, dml, vitisai) into <model_dir>/<ep>_<device>/ subdirs. When either is pinned, a single build is written directly into the model dir. Precision per combo reuses the existing _resolve_precision policy (NPU w8a16, CPU/GPU auto, native-quant EPs unquantized). Reuses the existing _run_build via a build_only flag (-o <dir> --no-compile instead of --use-cache).
Two bugs surfaced when running `run_eval.py --build-only` against the EP matrix on a CPU-only host: 1. Every combo for the 'no native EP' subset (mlas/dml/openvino) was reported as `[FAIL @ complete]` even though export/optimize/quantize/model.onnx all landed correctly. `_run_build` was funnelling build-only results through `_extract_onnx_path`, which scans stdout for a `Final artifact:` marker that `winml build --no-compile` never prints, and falls back to the global cache which build-only doesn't populate (`-o <dir>` writes elsewhere). In build-only mode there is no downstream consumer of the path, so trust exit-code 0 directly and record `build_out` to keep the per-component bookkeeping balanced. 2. QNN/VitisAI combos failed at the optimize stage with `Requested EP 'qnn' is not available on this system`. `_run_optimize_stage` calls `resolve_device(device, ep=ep)` purely to pick the right `has_rule_data_for_ep` key for the progress bar, but that helper raises when the EP isn't installed locally -- even when the rest of the pipeline (export + optimize + quantize) runs on CPU and the EP is only needed at compile time. Soft-fail the lookup *only when* `config.compile is None` (i.e. `--no-compile` or a config that explicitly opts out); otherwise re-raise so configs that will compile still fail fast here instead of deep inside the compile stage. Also moves `--clean-cache` from per-combo to per-model in `_run_build_only`: combos for the same model share the same HF download, so clearing between combos forced N redundant re-downloads of the same weights.
…facts feed Running --build-only over the 8-EP matrix for many models fills local disk. Two additions keep disk bounded: 1. Export dedup: the export.onnx stage is EP/device-independent, so every combo produces an identical export. After each combo builds, its export is hash-compared against a per-model canonical: the first is moved to <model_dir>/_shared/, later identical ones are deleted. One export copy on disk instead of 8. 2. --upload: after a model's combos are built, publish the model dir to the Modelkit Azure Artifacts feed as a Universal Package version, then delete it locally. Auth via az login (no PAT); the azure-devops extension is ensured and login verified up front (aborts otherwise so disk isn't silently filled). Version is 0.0.0-<run-stamp>-<model-slug> (valid SemVer 2.0; date stamp groups a batch). --continue + --run-stamp resume an interrupted batch from the build_only_uploads.json manifest without rebuilding uploaded models; --keep-local, --upload-skip-existing, and feed/package args round it out.
DingmaomaoBJTU
left a comment
There was a problem hiding this comment.
Reviewed the full diff. Overall structure is solid and docstrings are thorough. Found a potential data-loss path (false-positive conflict detection triggers local directory deletion), a silent no-op for --continue in the non-upload build-only path, and a few other correctness/usability issues — see inline comments.
| with p.open("rb") as fh: | ||
| for chunk in iter(lambda: fh.read(1 << 20), b""): | ||
| h.update(chunk) | ||
| except OSError: |
There was a problem hiding this comment.
_hash_files: OSError swallowed with a fixed sentinel — potential dedup false-positive
When a file cannot be opened, the hash is updated with the literal bytes b"<unreadable>" regardless of which file failed or why. Two distinct files that both happen to be unreadable will produce the same hash, which could cause the duplicate-detection path to delete an artifact it never actually verified was identical.
Safer options:
- Re-raise the
OSErrorso the caller can skip dedup for that combo. - Include the filename in the sentinel so two different unreadable files don't collide:
h.update(p.name.encode() + b":unreadable").
| return "0.0.0-" + _slugify_version("-".join(parts)) | ||
|
|
||
|
|
||
| def _is_publish_conflict(proc: dict) -> bool: |
There was a problem hiding this comment.
_is_publish_conflict: substring match on "conflict" is too broad — data-loss risk
Checking for the substring conflict in combined stdout+stderr will produce a false positive any time the Azure CLI emits that word in an unrelated context (e.g. "merge conflict", "address conflict", "proxy conflict"). A false positive here triggers exists-skipped status and deletes the local model directory via shutil.rmtree, making this a potential data-loss path.
Consider narrowing to the specific HTTP 409 indicators or the Azure DevOps error code: StatusCode=409, HttpStatusCode: 409, or PACKAGEVERSIONEXISTS. These are far less likely to appear in unrelated messages.
|
|
||
| use_matrix = args.ep is None and args.device == "auto" | ||
| # Single combo uses an empty label (no subdir); matrix uses (label, ep, device). | ||
| combos = list(_BUILD_ONLY_EP_MATRIX) if use_matrix else [("", args.ep, args.device)] |
There was a problem hiding this comment.
Pinned-EP-only case: _resolve_precision receives device="auto" — w8a16 default silently dropped
When the user sets --ep but omits --device (default is "auto"), use_matrix is False (because args.ep is not None) so combos = [("", args.ep, "auto")]. Then _resolve_precision("auto", ...) returns None since "auto" != "npu", and the config is generated without a precision flag.
This is inconsistent with the matrix path: entry ("qnn_npu", "qnn", "npu") correctly applies w8a16 because device is explicitly npu. Consider resolving "auto" to a concrete device per-EP before calling _resolve_precision, or documenting that precision auto-detection is intentionally delegated to winml config in this pinned path.
| # Resume: skip models already uploaded for this run-stamp (manifest is | ||
| # the source of truth; failed/unrecorded models are (re)built). Skipping | ||
| # here avoids the expensive 8-EP rebuild, not just the upload. | ||
| if args.upload and args.continue_run: |
There was a problem hiding this comment.
--continue is silently a no-op in --build-only without --upload
The resume skip is gated on if args.upload and args.continue_run, so --build-only --continue without --upload does nothing — all models are rebuilt from scratch. There is no filesystem-based resume (e.g. checking whether model_dir already contains the expected combo subdirs).
Note: the pre-filtering of entries by --continue in main() lives inside the if args.list_json: branch, and the eval-path filtering is after the if args.build_only: return, so neither runs in build-only mode.
A user resuming an interrupted non-upload batch with --build-only --continue will silently rebuild everything. Consider either:
- Adding a local-disk skip (check if combo subdir already exists and has artifacts), or
- Raising an error/warning when
--continueis combined with--build-onlybut not--upload.
| parser.add_argument( | ||
| "--upload-skip-existing", | ||
| dest="upload_skip_existing", | ||
| action="store_true", |
There was a problem hiding this comment.
Misleading --upload-skip-existing help text — the build is NOT skipped
The current text says "skip building+uploading a model whose package version already exists in the feed", but this flag only affects the publish step. By the time a conflict is detected, the model has already been fully built across all EP combos. The flag merely prevents treating the conflict as a failure.
The flag that actually skips the build is --continue (manifest-based resume). Suggested rewrite:
With
--upload: treat a version-already-exists publish conflict as success and delete the local copy. To skip rebuilding entirely for already-uploaded models, use--continue.
| _resolved_device, _ = _resolve_device(device=device or "auto", ep=ep) | ||
| try: | ||
| _resolved_device, _ = _resolve_device(device=device or "auto", ep=ep) | ||
| except ValueError: |
There was a problem hiding this comment.
except ValueError is overly broad — may swallow unrelated errors
The intent (per the comment) is to catch "EP not installed on this host", but _resolve_device could also raise ValueError for invalid device names, malformed input, or other programming errors. With config.compile is None, all such errors are silently swallowed and _resolved_device falls back to device or "".
If _resolve_device raises a typed exception for hardware-unavailability (or exposes a separate is_device_available check), prefer that. Otherwise, consider at least checking str(e) for expected keywords (e.g. "not available", "not installed") before swallowing, and re-raising for unexpected messages.
… versions The --continue skip logic only consulted the local build_only_uploads.json manifest, which is written after each successful upload. A fresh --output-dir (e.g. a gitignored temp dir) starts empty, so models already published to the Azure Artifacts feed under the same run-stamp were rebuilt and re-uploaded instead of being skipped. Seed the in-memory manifest from the feed at startup: query the feed REST API for versions matching 0.0.0-<run-stamp>-* and mark them as uploaded so the existing skip check honors them. The feed is now authoritative for what's published, regardless of local state. Querying is best-effort -- a failure falls back to local-manifest-only behavior. Use two ampersand-free REST GETs (list packages -> resolve UPack package GUID -> list versions) because az resolves to az.cmd and cmd.exe splits query strings on '&', dropping every parameter after the first.
- _hash_files: stop hashing unreadable files to a fixed sentinel; propagate OSError and have _dedup_export keep the export in place instead of risking deletion of an artifact never verified identical. - _is_publish_conflict: narrow detection to specific version-exists / HTTP 409 markers (drop bare 'conflict'/'409') so an unrelated message can't trigger exists-skipped and rmtree the local model dir. - build.py _run_optimize_stage: narrow the no-compile EP fallback to only swallow EP-not-available ValueErrors; re-raise malformed device/EP names. - Warn when --continue is used with --build-only but without --upload (no local-disk resume exists, so everything is rebuilt). - Document that the pinned-EP auto-device path delegates precision to winml config's auto-detection. - Fix misleading --upload-skip-existing help: it does not skip the build.
Summary
Adds a
--build-onlymode toscripts/e2e_eval/run_eval.pythat generates ModelKit pipeline artifacts (export → optimize → quantize, no compile) across the full EP matrix, without requiring EP hardware, and optionally streams them to theModelkitAzure Artifacts feed while bounding local disk usage.Motivation: we need a way to mass-produce per-EP model artifacts for ~200 models on a single (CPU) box and distribute them, but (a) the build normally needs the target EP installed, and (b) writing every stage for every (model × EP) fills the disk fast.
What's included
1.
--build-onlymode + per-EP matrixwinml config+winml build --no-compileper model; perf/accuracy are skipped.--ep/--deviceare omitted, builds the eval EP matrix into<model_dir>/<ep>_<device>/subdirs:qnn_npu,qnn_gpu,ov_cpu,ov_npu,ov_gpu,mlas_cpu,dml_gpu,vitisai_npu. Pinning--ep/--devicewrites a single build directly into<model_dir>.w8a16, CPU/GPU → auto, native-quant EPs →--no-quant).2. Cross-EP / cross-host builds (core fix in
build.py)_run_optimize_stagecalledresolve_device(device, ep=ep)purely to pick a progress-bar key, which raised when the target EP wasn't installed locally — blocking offline generation of (e.g.) QNN/VitisAI artifacts on a CPU box.config.compile is None), the missing-EP lookup soft-fails and falls back to the requested device (optimize/quantize only need the EP's static rule data, not a registered EP). When compile will run, it still fails fast. No behavior change for normal compiling builds.3. Export dedup (disk saver)
export.onnxstage is EP/device-independent, so all 8 combos produce an identical export. After each combo builds, its export is hash-compared against a per-model canonical: the first is moved to<model_dir>/_shared/, later identical ones are deleted — one export copy instead of 8 (export is the largest, full-precision artifact).4.
--upload: stream to Azure Artifacts feed, then delete locallyModelkitfeed as a Universal Package, then deletes the local copy — peak disk stays at ~one model's matrix.az login(Entra ID), no PAT. Theazure-devopsextension is ensured and login verified up front; if not ready, the run aborts (so disk isn't silently filled).winml-cli-models, one version per model:0.0.0-<run-stamp>-<model-slug>(valid SemVer 2.0; the0.0.0-core keeps it legal, the date stamp + slug are the pre-release segment). The shared run-stamp groups a batch.--continue+ the same--run-stampskips already-uploaded models without rebuilding them. Already-uploaded models are detected two ways: a localbuild_only_uploads.jsonmanifest, and a query against the feed itself at startup (versions matching0.0.0-<run-stamp>-*). Because the manifest is only written after a successful upload, a fresh--output-dirwould otherwise start empty and rebuild everything; seeding from the feed makes it authoritative for what's published, so resume works regardless of local state. A feed-query failure falls back to local-manifest-only behavior.--run-stamp,--keep-local,--upload-skip-existing,--feed/--feed-org/--feed-project/--package-name.Usage
Download a specific model's specific file later:
az artifacts universal download \ --organization https://dev.azure.com/microsoft --project windows.ai.toolkit \ --scope project --feed Modelkit --name winml-cli-models \ --version 0.0.0-20260609-microsoft-resnet-50-image-classification \ --path ./out --file-filter 'qnn_npu/model.onnx*'Notes / scope
src/winml/modelkit/commands/build.pychange is gated onconfig.compile is None, so it only affects no-compile builds; compiling builds are unchanged and still fail fast on a missing EP.--continueresumes from the feed (versions matching0.0.0-<run-stamp>-*) and the localbuild_only_uploads.jsonmanifest, so a resumed run only needs the same--run-stamp— it no longer has to reuse the same--output-dir. The feed query uses two&-free REST GETs (list packages → resolve the UPack package GUID → list versions) becauseazresolves toaz.cmdand cmd.exe splits query strings on&, dropping every parameter after the first.Modelkit, and--file-filterdownload of individualmodel.onnxfiles.