
Add CUDA-Windows quantization support to Parakeet README#18895

Open
seyeong-han wants to merge 1 commit into main from
remote/younghan/parakeet-cuda-windows-quantization
Conversation

@seyeong-han
Contributor

Document that the cuda-windows backend supports 4w/8w/8da4w/8da8w quantization configs for the Parakeet TDT model export, and add an export example.

Changes

  • Updated the Quantization Configs table to list CUDA-Windows alongside CUDA as a supported backend for 4w, 8w, 8da4w, and 8da8w quantization
  • Added a new "Example: 4-bit Weight Quantization for CUDA-Windows" section with the export command (--dtype bf16 --qlinear_encoder 4w --qlinear 4w --qembedding 8w)
  • Added a note that --qlinear_packing_format tile_packed_to_4d is not supported for cuda-windows because the cross-compilation path keeps tensors on CPU during export and aten::_convert_weight_to_int4pack only has a CUDA kernel
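Putting the pieces of the change list together, the flags can be combined into a single export invocation. This is a sketch only: the entry-point name below is a placeholder, since the PR description does not quote the full command; only the quantization flags (`--dtype bf16 --qlinear_encoder 4w --qlinear 4w --qembedding 8w`) come from the PR.

```shell
# Sketch of the documented 4-bit weight quantization export for CUDA-Windows.
# "export_parakeet.py" is a PLACEHOLDER -- substitute the actual Parakeet TDT
# export script named in the README.
python export_parakeet.py \
  --dtype bf16 \
  --qlinear_encoder 4w \
  --qlinear 4w \
  --qembedding 8w
# Note: do NOT pass --qlinear_packing_format tile_packed_to_4d here; the
# cuda-windows cross-compilation path keeps tensors on CPU during export,
# and aten::_convert_weight_to_int4pack only has a CUDA kernel.
```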

Benchmark Results (RTX 5080, ~20s audio)

| Metric                 | Non-Quantized (fp32) | Quantized (bf16 + 4w/8w) | Improvement  |
| ---------------------- | -------------------- | ------------------------ | ------------ |
| Model size             | 2,445 MB             | 763 MB                   | 3.2× smaller |
| Prefill speed          | 1,301 tok/s          | 3,218 tok/s              | 2.5× faster  |
| Decode speed           | 1,325 tok/s          | 1,545 tok/s              | 1.2× faster  |
| Model load time        | 4.1 s                | 1.6 s                    | 2.5× faster  |
| Time to first token    | 202 ms               | 86 ms                    | 2.4× faster  |
| Transcription accuracy | Baseline             | Identical words          | Tie          |

Document that CUDA-Windows backend supports 4w/8w/8da4w/8da8w quantization
configs (without tile_packed_to_4d packing format). Add export example
showing bf16 + 4w linear + 8w embedding quantization for CUDA-Windows.

Note: tile_packed_to_4d is not supported for cuda-windows because the
cross-compilation path keeps tensors on CPU during export, and the
_convert_weight_to_int4pack op only has a CUDA kernel.
@seyeong-han seyeong-han requested a review from Gasoonjia April 14, 2026 21:28
@seyeong-han seyeong-han requested a review from lucylq as a code owner April 14, 2026 21:28
@pytorch-bot

pytorch-bot bot commented Apr 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18895

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 7f25285 with merge base 38c5ca3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 14, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.
