Add CUDA-Windows quantization support to Parakeet README #18895
seyeong-han wants to merge 1 commit into main from
Conversation
Document that the CUDA-Windows backend supports the `4w`/`8w`/`8da4w`/`8da8w` quantization configs (without the `tile_packed_to_4d` packing format), and add an export example showing bf16 + 4w linear + 8w embedding quantization for CUDA-Windows. Note: `tile_packed_to_4d` is not supported for cuda-windows because the cross-compilation path keeps tensors on CPU during export, and the `_convert_weight_to_int4pack` op only has a CUDA kernel.
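The constraint above comes down to per-device kernel registration: PyTorch dispatches each op by (operator, device), and `aten::_convert_weight_to_int4pack` has only a CUDA kernel, while the cuda-windows cross-compilation path exports with tensors still on CPU. A minimal illustrative sketch (plain Python, not ExecuTorch or PyTorch code; the registry and names here are invented for illustration):

```python
# Toy model of per-device op dispatch, mimicking why tile_packed_to_4d
# packing fails on the cuda-windows cross-compilation path.

# Kernels are registered per (op, device). The int4 packing op only has
# a CUDA kernel, matching the situation described in the PR.
KERNELS = {
    ("_convert_weight_to_int4pack", "cuda"): lambda w: f"packed({w})",
    # Note: no ("_convert_weight_to_int4pack", "cpu") entry exists.
}

def dispatch(op: str, device: str, weight: str) -> str:
    """Look up and run the kernel registered for (op, device)."""
    kernel = KERNELS.get((op, device))
    if kernel is None:
        raise NotImplementedError(f"{op} has no kernel for device '{device}'")
    return kernel(weight)

# Native CUDA export: weights live on the GPU, so packing succeeds.
print(dispatch("_convert_weight_to_int4pack", "cuda", "w"))

# cuda-windows cross-compilation: weights stay on CPU, so packing fails.
try:
    dispatch("_convert_weight_to_int4pack", "cpu", "w")
except NotImplementedError as err:
    print(err)
```

This is why the README documents `4w`/`8w`/`8da4w`/`8da8w` (which do not require the CUDA-only packing op at export time) but excludes `tile_packed_to_4d` for cuda-windows.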
Document that the `cuda-windows` backend supports the `4w`/`8w`/`8da4w`/`8da8w` quantization configs for the Parakeet TDT model export, and add an export example.

Changes

- Document `CUDA-Windows` alongside `CUDA` as a supported backend for `4w`, `8w`, `8da4w`, and `8da8w` quantization
- Add an export example (`--dtype bf16 --qlinear_encoder 4w --qlinear 4w --qembedding 8w`)
- Note that `--qlinear_packing_format tile_packed_to_4d` is not supported for `cuda-windows`, because the cross-compilation path keeps tensors on CPU during export and `aten::_convert_weight_to_int4pack` only has a CUDA kernel

Benchmark Results (RTX 5080, ~20s audio)