feat: add --enable-fp16-conversion to winml optimize and --precision to winml build/export #867

@DingmaomaoBJTU

Problem

There is no official winml CLI path to produce an FP16 ONNX model. This blocks the autoconfig skill from testing FP16 as an optimization hypothesis for QNN GPU and DML targets via a reproducible CLI path.

Proposed design

Unified --precision flag on winml build and winml export

--precision already exists on winml perf, winml eval, and winml config; build and export should be consistent:

winml build  -m facebook/convnext-tiny-224 --device gpu --precision fp16 -o out/
winml export -m facebook/convnext-tiny-224 -o model.onnx --precision fp16

winml build --precision fp16 should choose the implementation path based on model source:

  • HuggingFace source → export-time conversion (load the PyTorch model with torch_dtype=float16); the cleanest path, producing native FP16
  • Pre-exported ONNX → optimize-stage post-export cast (see below)
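The source-based dispatch described in the two bullets above could look roughly like the sketch below. The helper name, the return labels, and the ".onnx suffix" heuristic are all assumptions made for illustration; the issue does not specify how winml classifies a model source.

```python
from pathlib import Path


def choose_fp16_path(model: str) -> str:
    """Pick how --precision fp16 would be realized for a given -m value.

    Hypothetical helper: treating "ends with .onnx" as "pre-exported ONNX"
    is an assumption, not winml's actual source detection.
    """
    if Path(model).suffix.lower() == ".onnx":
        # Pre-exported ONNX: there is no PyTorch model to reload,
        # so fall back to the optimize-stage post-export cast.
        return "optimize-stage-cast"
    # Everything else is treated as a HuggingFace model id and
    # would be loaded with torch_dtype=float16 at export time.
    return "export-time-fp16"


print(choose_fp16_path("facebook/convnext-tiny-224"))  # export-time-fp16
print(choose_fp16_path("out/model.onnx"))              # optimize-stage-cast
```

Keeping the decision in one small function would let winml build and winml export share the same rule.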

--enable-fp16-conversion on winml optimize

Consistent with the existing --enable-X / --disable-X pattern in winml optimize. Declarative, not implementation-specific:

# Full FP16 — inputs + weights + activations (QNN GPU / DML use case)
winml optimize -m model.onnx -o model_fp16.onnx --enable-fp16-conversion

# Keep model I/O as FP32 (CPU-safe fallback)
winml optimize -m model.onnx -o model_fp16.onnx --enable-fp16-conversion --fp16-keep-io-types

# Keep precision-sensitive ops (e.g. LayerNormalization, Softmax) in FP32; names must match ONNX op types
winml optimize -m model.onnx -o model_fp16.onnx --enable-fp16-conversion --fp16-op-block-list LayerNormalization,Softmax

Internally backed by onnxruntime.transformers.float16.convert_float_to_float16 (ORT built-in, no new dependencies).
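As a rough illustration of how the proposed flags might map onto the ORT converter, here is a minimal sketch. The flag names are the ones proposed in this issue and do not exist yet; `build_parser`, `fp16_kwargs`, and `convert` are hypothetical helpers. The `keep_io_types` and `op_block_list` keyword arguments are existing parameters of `convert_float_to_float16`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flags as proposed in this issue (not yet implemented in winml).
    p = argparse.ArgumentParser(prog="winml optimize")
    p.add_argument("-m", dest="model", required=True)
    p.add_argument("-o", dest="output", required=True)
    p.add_argument("--enable-fp16-conversion", action="store_true")
    p.add_argument("--fp16-keep-io-types", action="store_true")
    p.add_argument(
        "--fp16-op-block-list",
        default=None,
        # "A,B" -> ["A", "B"]; entries must be ONNX op type names.
        type=lambda s: [op.strip() for op in s.split(",") if op.strip()],
    )
    return p


def fp16_kwargs(args: argparse.Namespace) -> dict:
    """Translate CLI flags into convert_float_to_float16 keyword arguments."""
    return {
        "keep_io_types": args.fp16_keep_io_types,
        # None lets ORT use its own default block list.
        "op_block_list": args.fp16_op_block_list,
    }


def convert(args: argparse.Namespace) -> None:
    # Imported lazily so flag parsing works without onnx/onnxruntime installed.
    import onnx
    from onnxruntime.transformers.float16 import convert_float_to_float16

    model = onnx.load(args.model)
    model_fp16 = convert_float_to_float16(model, **fp16_kwargs(args))
    onnx.save(model_fp16, args.output)


args = build_parser().parse_args(
    ["-m", "model.onnx", "-o", "model_fp16.onnx",
     "--enable-fp16-conversion",
     "--fp16-op-block-list", "LayerNormalization,Softmax"]
)
print(fp16_kwargs(args))
# {'keep_io_types': False, 'op_block_list': ['LayerNormalization', 'Softmax']}
```

One design point to settle: passing an explicit `op_block_list` replaces ORT's default block list rather than extending it, so the CLI may want to merge the user's list with the defaults.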

Implementation priority

| Phase | Work | Effort |
|-------|------|--------|
| P0 | winml optimize --enable-fp16-conversion (post-export cast via ORT transformers) | Low |
| P1 | winml build/export --precision fp16 wired to P0 for ONNX inputs | Low (after P0) |
| P2 | winml build --precision fp16 using export-time torch_dtype=float16 for HF sources | Medium |

Empirical motivation

Tested on ConvNext (facebook/convnext-tiny-224), QNN GPU (Adreno X1-85):

| Version | p50 | p90 | std |
|---------|-----|-----|-----|
| FP32 baseline | 17.7 ms | 19.7 ms | 1.0 ms |
| FP16 (post-export cast) | 8.8 ms | 32 ms | 9 ms |

FP16 p50 is 2× faster, making it the primary optimization lever for Adreno GPU. Without this CLI path, autoconfig must mark the FP16 hypothesis as SKIPPED — CLI gap.

Note: the high FP16 p90 (32 ms, with a 9 ms standard deviation) is a separate issue, likely related to DVFS on the Adreno GPU; it is tracked in #865 (--ep-option htp_performance_mode).
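For reference, the ratios behind these claims can be recomputed from the measured latencies. This is plain arithmetic on the numbers reported above; only the variable names are introduced here.

```python
# Latencies in milliseconds, copied from the measurements above.
fp32 = {"p50": 17.7, "p90": 19.7}
fp16 = {"p50": 8.8, "p90": 32.0}

# Ratio > 1 means FP16 is faster at that percentile; < 1 means slower.
speedup = {k: round(fp32[k] / fp16[k], 2) for k in ("p50", "p90")}
print(speedup)  # {'p50': 2.01, 'p90': 0.62}
```

The median roughly halves, while the tail gets worse, which is why the p90 behavior is tracked separately.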

Acceptance criteria (P0)

  • winml optimize -m model.onnx --enable-fp16-conversion produces a valid FP16 ONNX
  • --fp16-keep-io-types flag preserves FP32 I/O
  • Model structure is preserved (only dtypes change); node count difference ≤ 2 (Cast nodes at I/O boundary)
  • Operation is recorded in optimize output metadata for reproducibility
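The structure-preservation criterion could be checked with something like the sketch below. The function name and the example op lists are hypothetical. With the `onnx` package, the op-type list of a loaded model can be read as `[n.op_type for n in model.graph.node]`.

```python
from collections import Counter


def structure_preserved(ops_before, ops_after, max_extra_casts=2):
    """True if the only change is at most `max_extra_casts` added Cast nodes."""
    before, after = Counter(ops_before), Counter(ops_after)
    added = after - before    # op types that occur more often after conversion
    removed = before - after  # op types that occur less often after conversion
    if removed:
        return False
    extra_casts = added.pop("Cast", 0)
    # Nothing but Cast may have been added, and only a bounded number of them.
    return not added and extra_casts <= max_extra_casts


# Illustrative op lists, not taken from a real model.
fp32_ops = ["Conv", "Relu", "Gemm"]
fp16_ops = ["Cast", "Conv", "Relu", "Gemm", "Cast"]
print(structure_preserved(fp32_ops, fp16_ops))  # True
```

A model with several inputs or outputs converted with --fp16-keep-io-types may need more than two boundary Cast nodes, so the threshold is left as a parameter here.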

Metadata

    Labels: P2 (Medium: minor bug or non-critical improvement), feature scale (feature scale work item), triaged (issue has been triaged)
