AVX-512 micro optimizations for non-ASCII input #128

Merged
hkratz merged 1 commit into main from avx512-opt
Jun 15, 2026


@hkratz hkratz commented Jun 15, 2026

Contributor

This PR contains several AVX-512 micro optimizations for non-ASCII input:

  1. Since we already require VBMI for our AVX-512 implementation, we can use VPERMB for dynamic shuffles. The advantage is that VPERMB ignores the upper two bits of each index byte and selects one of elements 0-63 of the lookup table based solely on the lower 6 bits. With a 16-byte LUT repeated 4 times, bits 4 and 5 only choose between identical copies, so we do not need to mask out the upper nibble before shuffling. This allows us to get rid of two VPANDQ masking operations which we would otherwise need because AVX-512 (just like AVX2 and SSE) has no 8-bit shifts: the narrowest shr operates on 16-bit lanes, which pollutes the upper nibble of every other byte. In the same way we can also get rid of one other VPANDQ masking operation.
  2. LLVM's optimizer already fused four bit-manipulation operations into two VPTERNLOG operations in the hot loop. Doing the fusion explicitly allows us to do one more fusion, getting rid of another VPANDQ.
  3. Because of this fusion we also need one fewer register-register move.
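
To make the first item concrete, here is a minimal scalar sketch of the index handling. It models one byte of VPSHUFB and VPERMB in plain Rust rather than using intrinsics, so it runs on any CPU. The helper names (`pshufb_byte`, `vpermb_byte`, `shr4_epi16`, `repeat4`) and the table contents are hypothetical and are not taken from this PR's code.

```rust
/// Scalar model of one VPSHUFB byte lookup: if bit 7 of the index is set the
/// result is zero, otherwise the low 4 bits select from the 16-byte table.
fn pshufb_byte(lut: &[u8; 16], idx: u8) -> u8 {
    if idx & 0x80 != 0 { 0 } else { lut[(idx & 0x0F) as usize] }
}

/// Scalar model of one VPERMB byte lookup on a 64-byte table:
/// only the low 6 bits of the index are used, bits 6 and 7 are ignored.
fn vpermb_byte(table: &[u8; 64], idx: u8) -> u8 {
    table[(idx & 0x3F) as usize]
}

/// Emulates a 16-bit logical shift right by 4 on one little-endian byte pair.
/// The low nibble of `hi` leaks into the upper nibble of the low result byte.
fn shr4_epi16(lo: u8, hi: u8) -> (u8, u8) {
    let lane = u16::from_le_bytes([lo, hi]) >> 4;
    let [new_lo, new_hi] = lane.to_le_bytes();
    (new_lo, new_hi)
}

/// Repeats a 16-byte LUT four times to fill a 64-byte table.
fn repeat4(lut: &[u8; 16]) -> [u8; 64] {
    let mut table = [0u8; 64];
    for (i, slot) in table.iter_mut().enumerate() {
        *slot = lut[i % 16];
    }
    table
}

fn main() {
    // Hypothetical LUT contents, chosen only so that entries are distinguishable.
    let lut: [u8; 16] = core::array::from_fn(|i| 0xA0 + i as u8);
    let table = repeat4(&lut);

    let (lo, hi) = (0xC3u8, 0x8Fu8);
    let (idx, _) = shr4_epi16(lo, hi); // polluted index, no masking applied

    // The wanted index is the high nibble of `lo`.
    let wanted = lut[(lo >> 4) as usize];

    // VPSHUFB sees bit 7 set in the polluted index and returns zero,
    // which is why a masking AND is needed on that path.
    assert_eq!(pshufb_byte(&lut, idx), 0);
    // VPERMB on the repeated table returns the wanted entry without masking.
    assert_eq!(vpermb_byte(&table, idx), wanted);
    println!("polluted index {idx:#04x} still selects {wanted:#04x} via VPERMB");
}
```

The pollution only ever reaches bits 4 to 7 of the index. VPERMB drops bits 6 and 7, and bits 4 and 5 pick one of four identical copies, so the lookup result depends only on the low nibble.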

So all in all this optimization removes four VPANDQ operations and one reg-reg move at the cost of one additional VPTERNLOG.
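
The trade of several two-input operations for one VPTERNLOG can be illustrated with a scalar model of the instruction's semantics. The sketch below is an assumption-laden illustration, not the PR's code: the function `ternlog` and the two example expressions are hypothetical, since the PR text does not show which expressions are fused. The `imm8` derivation from the constants `0xF0`, `0xCC` and `0xAA` is the usual way to build the truth table for a three-input bitwise function.

```rust
/// Scalar model of a ternary-logic operation on one 64-bit lane.
/// For each bit position the three input bits form a 3-bit index
/// (a = bit 2, b = bit 1, c = bit 0) into the 8-entry truth table `imm8`.
fn ternlog(a: u64, b: u64, c: u64, imm8: u8) -> u64 {
    let mut out = 0u64;
    for bit in 0..64u32 {
        let idx = ((((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)) as u32;
        out |= (((imm8 >> idx) & 1) as u64) << bit;
    }
    out
}

// Truth-table columns for the three inputs: evaluating a bitwise expression
// on these constants yields the imm8 that encodes the same expression.
const A: u8 = 0xF0;
const B: u8 = 0xCC;
const C: u8 = 0xAA;

fn main() {
    let (a, b, c) = (
        0x0123_4567_89AB_CDEFu64,
        0xFEDC_BA98_7654_3210u64,
        0x0F0F_F0F0_00FF_FF00u64,
    );

    // Hypothetical example 1: an AND followed by an OR becomes one operation.
    let and_or = (A & B) | C;
    assert_eq!(ternlog(a, b, c, and_or), (a & b) | c);

    // Hypothetical example 2: an XOR followed by a masking AND becomes one
    // operation, which is the kind of fusion that removes a separate AND.
    let xor_and = (A ^ B) & C;
    assert_eq!(ternlog(a, b, c, xor_and), (a ^ b) & c);

    println!("imm8 for (a & b) | c: {and_or:#04x}, for (a ^ b) & c: {xor_and:#04x}");
}
```

Any bitwise function of three inputs fits in a single truth table, so folding a trailing mask into an existing two-input operation costs no extra instruction as long as only three distinct inputs are involved.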

Overall that improves AVX-512 performance on non-ASCII input by up to 20%.

This required some minor refactoring so that we can replace check_special_cases and check_multibyte_lengths with specialized implementations on AVX-512.

@hkratz hkratz merged commit 641d57f into main Jun 15, 2026
50 checks passed