Background
winml build --device gpu correctly sets quant=null via _patch_device() — normal CLI paths never produce a QDQ quantized model for GPU. This issue was discovered during research experiments that manually bypassed that protection.
However, winml perf and winml eval accept any ONNX as input. A user who:
- brings their own pre-quantized W8A8 QDQ ONNX, OR
- hand-crafts a config.json bypassing the GPU quant guard
...will encounter an infinite hang with no output, no error, and no timeout.
Root cause
The QNN GPU EP cannot handle QDQ INT8 graphs (Conv/Gemm/LayerNorm patterns) and hangs silently in graph compilation rather than returning an error. This is ultimately a QNN SDK / ORT behavior, but winml perf and winml eval can add a defensive check to protect the user experience.
Proposed enhancement
In winml perf and winml eval, before session creation, check:
if is_qdq_model(model_path) and ep == "qnn" and device == "gpu":
raise CliError with clear guidance:
- "QNN GPU EP does not support INT8 QDQ models"
- "Use FP32 (default) or FP16 once #867 (--enable-fp16-conversion) is available"
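A minimal sketch of the guard described above, to make the intended control flow concrete. The function name check_qnn_gpu_qdq, its boolean model_is_qdq parameter, and the local CliError class are placeholders for illustration; the CLI's real error type and call site may look different.

```python
class CliError(Exception):
    """Stand-in for the CLI's error type (assumption; the real class may differ)."""


def check_qnn_gpu_qdq(model_is_qdq: bool, ep: str, device: str) -> None:
    """Fail fast before session creation for the known-hanging combination.

    model_is_qdq is expected to come from an is_qdq_model()-style check.
    """
    if model_is_qdq and ep.lower() == "qnn" and device.lower() == "gpu":
        raise CliError(
            "QNN GPU EP does not support INT8 QDQ models. "
            "Use FP32 (default) or FP16 once #867 "
            "(--enable-fp16-conversion) is available."
        )


if __name__ == "__main__":
    # Allowed combinations return silently.
    check_qnn_gpu_qdq(model_is_qdq=False, ep="qnn", device="gpu")
    check_qnn_gpu_qdq(model_is_qdq=True, ep="qnn", device="cpu")

    # The blocked combination raises with the guidance text.
    try:
        check_qnn_gpu_qdq(model_is_qdq=True, ep="qnn", device="gpu")
    except CliError as err:
        print(f"error: {err}")
```

Taking the QDQ result as a plain boolean keeps the guard independent of how the model is inspected, so it can be called from both winml perf and winml eval right before the session is created.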
The is_qdq_model() check can inspect the ONNX graph for QuantizeLinear + DequantizeLinear node pairs.
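One possible shape for that check, as a sketch rather than a definitive implementation. It assumes the onnx Python package is available and only looks at the top-level graph; has_qdq_ops is a hypothetical helper introduced here. It reports a model as QDQ when both op types are present, without verifying that the nodes are actually paired or that the quantized type is INT8, so a stricter version may be needed.

```python
from typing import Iterable

# Op types that mark a QDQ-format quantized graph.
QDQ_OPS = ("QuantizeLinear", "DequantizeLinear")


def has_qdq_ops(op_types: Iterable[str]) -> bool:
    """Return True if both QuantizeLinear and DequantizeLinear appear."""
    seen = set(op_types)
    return all(op in seen for op in QDQ_OPS)


def is_qdq_model(model_path: str) -> bool:
    """Inspect the top-level ONNX graph for QDQ ops.

    Assumptions: subgraphs (If/Loop bodies) are not traversed, and
    external weight files are skipped because only node op types matter.
    """
    import onnx  # imported lazily so the helper above has no dependency

    model = onnx.load(model_path, load_external_data=False)
    return has_qdq_ops(node.op_type for node in model.graph.node)
```

Skipping external data keeps the check cheap for large models, since only the graph structure is read.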
Why enhancement, not bug
winml build already has the correct design (_patch_device() prevents QDQ configs for GPU). This is a defensive UX improvement for edge cases, not a fix for a broken code path.
Priority
Low — normal workflow is already protected. Fast-fail would prevent confusion for researchers and power users.
See also
- --ep-option for runtime EP flags
- --enable-fp16-conversion (the correct GPU optimization path)