feat: add --enable-fp16-conversion to winml optimize and --precision to winml build/export #867

## Problem

There is no official `winml` CLI path to produce an FP16 ONNX model. This blocks the `autoconfig` skill from testing FP16 as an optimization hypothesis for QNN GPU and DML targets via a reproducible CLI path.

## Proposed design

### Unified `--precision` flag on `winml build` and `winml export`

`--precision` already exists on `winml perf`, `winml eval`, and `winml config`; `build` and `export` should be consistent with them:

```bash
winml build  -m facebook/convnext-tiny-224 --device gpu --precision fp16 -o out/
winml export -m facebook/convnext-tiny-224 -o model.onnx --precision fp16
```

`winml build --precision fp16` should choose the implementation path based on model source:
- **HuggingFace source** → export-time (`torch_dtype=float16` when loading PyTorch model) — cleanest, native FP16
- **Pre-exported ONNX** → optimize-stage post-export cast (see below)
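
The source-based dispatch described above could look roughly like the sketch below. The function name `choose_fp16_path`, the returned labels, and the `.onnx` suffix heuristic are all hypothetical, chosen only for illustration; the real `winml build` may detect the model source differently.

```python
from pathlib import Path


def choose_fp16_path(model_source: str) -> str:
    """Pick an FP16 implementation path from the model source (illustrative only).

    Assumption: a source ending in ".onnx" is a pre-exported ONNX file,
    and anything else is treated as a HuggingFace model id.
    """
    if Path(model_source).suffix.lower() == ".onnx":
        # Pre-exported ONNX: cast after export, in the optimize stage.
        return "optimize-stage-cast"
    # HuggingFace source: load the PyTorch model in float16 at export time.
    return "export-time-dtype"


if __name__ == "__main__":
    for source in ("facebook/convnext-tiny-224", "model.onnx"):
        print(source, "->", choose_fp16_path(source))
```

Keeping the decision in one small function would let `build` and `export` share it, so both commands resolve `--precision fp16` the same way.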

### `--enable-fp16-conversion` on `winml optimize`

This follows the existing `--enable-X / --disable-X` pattern in `winml optimize` and is declarative rather than implementation-specific:

```bash
# Full FP16 — inputs + weights + activations (QNN GPU / DML use case)
winml optimize -m model.onnx -o model_fp16.onnx --enable-fp16-conversion

# Keep model I/O as FP32 (CPU-safe fallback)
winml optimize -m model.onnx -o model_fp16.onnx --enable-fp16-conversion --fp16-keep-io-types

# Keep precision-sensitive ops in FP32 (values are ONNX op types, e.g. LayerNormalization, Softmax)
winml optimize -m model.onnx -o model_fp16.onnx --enable-fp16-conversion --fp16-op-block-list LayerNormalization,Softmax
```

Internally backed by `onnxruntime.transformers.float16.convert_float_to_float16` (ORT built-in, no new dependencies).
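
As a rough sketch of how the proposed flags might map onto that ORT call: the parser below and the helper names (`build_parser`, `fp16_kwargs`, `convert`) are hypothetical and not existing `winml` code. Only `convert_float_to_float16` and its `keep_io_types` and `op_block_list` parameters come from ORT.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the flags proposed in this issue."""
    parser = argparse.ArgumentParser(prog="winml optimize")
    parser.add_argument("-m", "--model", required=True)
    parser.add_argument("-o", "--output", required=True)
    parser.add_argument("--enable-fp16-conversion", action="store_true")
    parser.add_argument("--fp16-keep-io-types", action="store_true")
    parser.add_argument("--fp16-op-block-list", default="")
    return parser


def fp16_kwargs(args: argparse.Namespace) -> dict:
    """Translate the CLI flags into keyword arguments for the ORT converter."""
    ops = [op.strip() for op in args.fp16_op_block_list.split(",") if op.strip()]
    # None lets ORT fall back to its default block list; an explicit list
    # is passed through as given.
    return {"keep_io_types": args.fp16_keep_io_types, "op_block_list": ops or None}


def convert(args: argparse.Namespace) -> None:
    """Run the post-export cast (needs the onnx and onnxruntime packages)."""
    import onnx
    from onnxruntime.transformers.float16 import convert_float_to_float16

    model = onnx.load(args.model)
    fp16_model = convert_float_to_float16(model, **fp16_kwargs(args))
    onnx.save(fp16_model, args.output)


if __name__ == "__main__":
    cli_args = build_parser().parse_args()
    if cli_args.enable_fp16_conversion:
        convert(cli_args)
```

One design point worth settling in review: in current ORT versions an explicit `op_block_list` appears to replace the converter's default block list rather than extend it, so the CLI may want to merge the two.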

## Implementation priority

| Phase | Work | Effort |
|---|---|---|
| **P0** | `winml optimize --enable-fp16-conversion` (post-export cast via ORT transformers) | Low |
| **P1** | `winml build/export --precision fp16` wired to P0 for ONNX inputs | Low (after P0) |
| **P2** | `winml build --precision fp16` using export-time `torch_dtype=float16` for HF sources | Medium |

## Empirical motivation

Tested on ConvNeXt (`facebook/convnext-tiny-224`), QNN GPU (Adreno X1-85):

| Version | p50 | p90 | std |
|---|---|---|---|
| FP32 baseline | 17.7ms | 19.7ms | 1.0ms |
| FP16 (post-export cast) | **8.8ms** | 32ms | 9ms |

FP16 p50 is about 2× faster (8.8ms vs 17.7ms), making it the primary optimization lever for Adreno GPU. Without this CLI path, `autoconfig` must mark the FP16 hypothesis as `SKIPPED — CLI gap`.

Note: the high FP16 p90 (32ms) and std (9ms) are a separate issue, likely related to DVFS on the Adreno GPU, tracked in #865 (`--ep-option htp_performance_mode`).

## Acceptance criteria (P0)
- `winml optimize -m model.onnx --enable-fp16-conversion` produces a valid FP16 ONNX model
- `--fp16-keep-io-types` flag preserves FP32 I/O
- Model structure is preserved (only dtypes change); node count difference ≤ 2 (Cast nodes at I/O boundary)
- Operation is recorded in optimize output metadata for reproducibility
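
The structure-preservation criterion could be checked with something like the sketch below. The helper names `op_types` and `structure_preserved` are hypothetical, and it assumes that comparing op-type counts is an acceptable proxy for "only dtypes change".

```python
from collections import Counter


def op_types(model_path: str) -> list:
    """List the op type of every node in an ONNX model (needs the onnx package)."""
    import onnx

    return [node.op_type for node in onnx.load(model_path).graph.node]


def structure_preserved(fp32_ops, fp16_ops, max_extra_nodes=2) -> bool:
    """True if the FP16 graph equals the FP32 graph plus a few Cast nodes."""
    before, after = Counter(fp32_ops), Counter(fp16_ops)
    removed = before - after  # op types that lost nodes
    added = after - before    # op types that gained nodes
    only_casts_added = set(added) <= {"Cast"}
    return not removed and only_casts_added and sum(added.values()) <= max_extra_nodes


if __name__ == "__main__":
    # Toy op lists standing in for op_types("model.onnx") and
    # op_types("model_fp16.onnx").
    fp32 = ["Conv", "Relu", "Gemm"]
    fp16 = ["Cast", "Conv", "Relu", "Gemm", "Cast"]
    print(structure_preserved(fp32, fp16))
```

A test for P0 could call `op_types` on the input and output files and assert `structure_preserved` on the two lists.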
