Problem
There is no official winml CLI path to produce an FP16 ONNX model. This blocks the autoconfig skill from testing FP16 as an optimization hypothesis for QNN GPU and DML targets via a reproducible CLI path.
Proposed design
Unified --precision flag on winml build and winml export
--precision already exists on winml perf, winml eval, and winml config; build and export should accept it too, for consistency:
winml build -m facebook/convnext-tiny-224 --device gpu --precision fp16 -o out/
winml export -m facebook/convnext-tiny-224 -o model.onnx --precision fp16
winml build --precision fp16 should choose the implementation path based on model source:
- HuggingFace source → export-time conversion (torch_dtype=float16 when loading the PyTorch model): cleanest, native FP16
- Pre-exported ONNX → optimize-stage post-export cast (see below)
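The source-based dispatch could look roughly like the sketch below. The helper name `choose_fp16_path`, the returned labels, and the rule "a `.onnx` suffix means pre-exported" are all assumptions made for illustration; they are not existing winml code.

```python
from pathlib import Path


def choose_fp16_path(model_ref: str) -> str:
    """Pick the FP16 implementation path from the model source (hypothetical helper)."""
    # Assumption: a pre-exported model is recognised by its .onnx suffix;
    # anything else is treated as a HuggingFace model id.
    if Path(model_ref).suffix.lower() == ".onnx":
        return "post-export-cast"
    return "export-time"


if __name__ == "__main__":
    for ref in ("facebook/convnext-tiny-224", "model.onnx"):
        print(f"{ref} -> {choose_fp16_path(ref)}")
```

A real implementation would probably also need to handle local directories containing PyTorch checkpoints, which this sketch does not attempt.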
--enable-fp16-conversion on winml optimize
Consistent with the existing --enable-X / --disable-X pattern in winml optimize. Declarative, not implementation-specific:
# Full FP16 — inputs + weights + activations (QNN GPU / DML use case)
winml optimize -m model.onnx -o model_fp16.onnx --enable-fp16-conversion
# Keep model I/O as FP32 (CPU-safe fallback)
winml optimize -m model.onnx -o model_fp16.onnx --enable-fp16-conversion --fp16-keep-io-types
# Keep precision-sensitive ops (e.g. LayerNormalization, Softmax) in FP32; names are ONNX op types
winml optimize -m model.onnx -o model_fp16.onnx --enable-fp16-conversion --fp16-op-block-list LayerNormalization,Softmax
Internally backed by onnxruntime.transformers.float16.convert_float_to_float16 (ORT built-in, no new dependencies).
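A minimal sketch of how the proposed flags might map onto the ORT call. `convert_float_to_float16` and its `keep_io_types` / `op_block_list` parameters are the ORT API; the helper names `fp16_kwargs` and `convert_to_fp16` are hypothetical, and the assumption is that `--fp16-op-block-list` takes comma-separated ONNX op types.

```python
from typing import Any, Dict, Optional


def fp16_kwargs(keep_io_types: bool = False,
                op_block_list: Optional[str] = None) -> Dict[str, Any]:
    """Translate the proposed CLI flags into convert_float_to_float16 kwargs."""
    kwargs: Dict[str, Any] = {"keep_io_types": keep_io_types}
    if op_block_list:
        # Comma-separated ONNX op types, e.g. "LayerNormalization,Softmax".
        kwargs["op_block_list"] = [
            op.strip() for op in op_block_list.split(",") if op.strip()
        ]
    return kwargs


def convert_to_fp16(src: str, dst: str, **flags: Any) -> None:
    """Load an ONNX file, cast it to FP16, and save the result."""
    # Imported lazily so the flag mapping above works without ORT installed.
    import onnx
    from onnxruntime.transformers.float16 import convert_float_to_float16

    model = onnx.load(src)
    fp16_model = convert_float_to_float16(model, **fp16_kwargs(**flags))
    onnx.save(fp16_model, dst)
```

One design point to settle: as far as I can tell, passing `op_block_list` replaces ORT's default block list rather than extending it, so the CLI may want to merge the user's list with the defaults.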
Implementation priority
| Phase | Work | Effort |
|---|---|---|
| P0 | winml optimize --enable-fp16-conversion (post-export cast via ORT transformers) | Low |
| P1 | winml build/export --precision fp16 wired to P0 for ONNX inputs | Low (after P0) |
| P2 | winml build --precision fp16 using export-time torch_dtype=float16 for HF sources | Medium |
Empirical motivation
Tested on ConvNext (facebook/convnext-tiny-224), QNN GPU (Adreno X1-85):
| Version | p50 | p90 | std |
|---|---|---|---|
| FP32 baseline | 17.7 ms | 19.7 ms | 1.0 ms |
| FP16 (post-export cast) | 8.8 ms | 32 ms | 9 ms |
FP16 p50 is 2× faster, making it the primary optimization lever for Adreno GPU. Without this CLI path, autoconfig must mark the FP16 hypothesis as SKIPPED — CLI gap.
Note: the high FP16 p90 (32 ms, with 9 ms std) is a separate issue, likely related to DVFS on the Adreno GPU; it is tracked in #865 (--ep-option htp_performance_mode).
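For reference, the ratios behind these numbers can be reproduced from the table values. This is plain arithmetic on the reported figures; the `ratio` helper is introduced here only for illustration.

```python
def ratio(baseline_ms: float, candidate_ms: float) -> float:
    """Return baseline / candidate; values above 1 mean the candidate is faster."""
    return baseline_ms / candidate_ms


# Latencies in milliseconds, copied from the table above.
fp32 = {"p50": 17.7, "p90": 19.7}
fp16 = {"p50": 8.8, "p90": 32.0}

p50_speedup = ratio(fp32["p50"], fp16["p50"])
p90_speedup = ratio(fp32["p90"], fp16["p90"])

print(f"p50 speedup: {p50_speedup:.2f}x")  # 2.01x
print(f"p90 speedup: {p90_speedup:.2f}x")  # 0.62x, i.e. FP16 is slower at p90
```

So the FP16 win holds at the median, while tail latency currently regresses.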
Acceptance criteria (P0)
- winml optimize -m model.onnx --enable-fp16-conversion produces a valid FP16 ONNX model
- --fp16-keep-io-types flag preserves FP32 I/O
- Model structure is preserved (only dtypes change); node count difference ≤ 2 (Cast nodes at I/O boundary)
- Operation is recorded in optimize output metadata for reproducibility
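The structure criterion could be checked along these lines. This sketch reads "node count difference ≤ 2" as "at most two added Cast nodes, all other op types unchanged", which is an assumption about the intended test; `structure_preserved` and `op_types` are hypothetical helper names.

```python
from collections import Counter
from typing import Iterable, List


def structure_preserved(before_ops: Iterable[str], after_ops: Iterable[str],
                        max_extra_casts: int = 2) -> bool:
    """Return True if conversion added at most max_extra_casts Cast nodes
    and left the count of every other op type unchanged."""
    before, after = Counter(before_ops), Counter(after_ops)
    extra_casts = after["Cast"] - before["Cast"]
    # Compare everything except Cast nodes.
    before.pop("Cast", None)
    after.pop("Cast", None)
    return before == after and 0 <= extra_casts <= max_extra_casts


def op_types(path: str) -> List[str]:
    """Read the op type of every graph node from an ONNX file."""
    import onnx  # lazy import: only needed when reading real models

    return [node.op_type for node in onnx.load(path).graph.node]
```

Usage would be `structure_preserved(op_types("model.onnx"), op_types("model_fp16.onnx"))`. Models with several inputs and outputs, or with a non-empty op block list, may legitimately gain more than two Cast nodes, so the threshold is left as a parameter.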