Implement fallback to smaller vector size for swizzle_dyn #433
This PR adds a fallback implementation so that e.g. u8x64::swizzle_dyn can be reasonably efficient even when only compiled with 128-bit SSSE3.
A "downgraded" swizzle_dyn op on N lanes emits 4 swizzle_dyn ops on N/2 lanes. If the optimizer can deduce that all index values are less than N/2, it can drop the swizzles that would only ever produce zeros, so the result will generally be more efficient.
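The downgrade described above can be modelled in scalar Rust. This is only a sketch of the idea, not the PR's actual code: `swizzle_dyn_ref` and `swizzle32_via_16` are hypothetical names, and the model assumes the portable `swizzle_dyn` behaviour where out-of-range indices select 0.

```rust
/// Scalar model of `swizzle_dyn` (hypothetical helper, not from the PR):
/// in-range indices select a table byte, out-of-range indices select 0.
fn swizzle_dyn_ref<const N: usize>(table: [u8; N], idxs: [u8; N]) -> [u8; N] {
    let mut out = [0u8; N];
    for i in 0..N {
        let j = idxs[i] as usize;
        if j < N {
            out[i] = table[j];
        }
    }
    out
}

/// 32-lane swizzle built from four 16-lane swizzles (hypothetical sketch).
fn swizzle32_via_16(table: [u8; 32], idxs: [u8; 32]) -> [u8; 32] {
    // Split the table into its low and high halves.
    let (mut t_lo, mut t_hi) = ([0u8; 16], [0u8; 16]);
    t_lo.copy_from_slice(&table[..16]);
    t_hi.copy_from_slice(&table[16..]);

    let mut out = [0u8; 32];
    for half in 0..2 {
        let mut idx = [0u8; 16];
        idx.copy_from_slice(&idxs[half * 16..half * 16 + 16]);
        // Rebase indices for the high table half. Indices below 16 wrap
        // around to values >= 240, which are out of range and select 0.
        let idx_hi = idx.map(|i| i.wrapping_sub(16));

        // Two half-width swizzles per output half, four in total.
        let from_lo = swizzle_dyn_ref(t_lo, idx);
        let from_hi = swizzle_dyn_ref(t_hi, idx_hi);

        // At most one of the two results is non-zero per lane, so OR merges them.
        for i in 0..16 {
            out[half * 16 + i] = from_lo[i] | from_hi[i];
        }
    }
    out
}
```

If every index is known to be below 16, `from_hi` is always all zeros, which is the part an optimizer could remove in the vectorized version.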
For example, u8x64::swizzle_dyn will emit only 4 pshufb instructions on SSSE3, instead of 16 in the general case, if the optimizer can prove that index values are always <16 (this is generally achieved by preceding the swizzle with a 0xf mask).

Additionally, for non-power-of-two N values, this PR adds a fallback implementation which zero-extends to the next power-of-two size.
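The zero-extension fallback for non-power-of-two sizes can be modelled the same way. Again this is a scalar sketch under assumptions, not the PR's code: `swizzle_dyn_ref` and `swizzle_dyn_padded` are hypothetical names, and the padded width `P` is passed explicitly by the caller rather than computed from `N`.

```rust
/// Scalar model of `swizzle_dyn` (hypothetical helper, not from the PR):
/// in-range indices select a table byte, out-of-range indices select 0.
fn swizzle_dyn_ref<const N: usize>(table: [u8; N], idxs: [u8; N]) -> [u8; N] {
    let mut out = [0u8; N];
    for i in 0..N {
        let j = idxs[i] as usize;
        if j < N {
            out[i] = table[j];
        }
    }
    out
}

/// N-lane swizzle performed at a wider power-of-two width P (hypothetical sketch).
fn swizzle_dyn_padded<const N: usize, const P: usize>(
    table: [u8; N],
    idxs: [u8; N],
) -> [u8; N] {
    assert!(P.is_power_of_two() && P >= N);

    // Zero-extend both operands to P lanes.
    let mut wide_table = [0u8; P];
    let mut wide_idxs = [0u8; P];
    wide_table[..N].copy_from_slice(&table);
    wide_idxs[..N].copy_from_slice(&idxs);

    // Indices in N..P hit the zero padding and indices >= P are out of range,
    // so both cases yield 0, matching an N-lane swizzle.
    let wide = swizzle_dyn_ref(wide_table, wide_idxs);

    // Discard the padding lanes.
    let mut out = [0u8; N];
    out.copy_from_slice(&wide[..N]);
    out
}
```

For a 12-lane vector, `swizzle_dyn_padded::<12, 16>(table, idxs)` gives the same result as a direct 12-lane swizzle, because the padded table bytes are zero.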
Benchmarks
Below are benchmark results for the following code, on 5 target-cpu levels:
- x86-64: baseline, no vectorized shuffles
- x86-64-v2: ssse3, adds pshufb on u8x16
- x86-64-v3: avx2, adds vpshufb on u8x32 (vpshufb is not a true extension of pshufb to 256 bits; it is more like 2 pshufb ops run in parallel, each with 4-bit indices into its own lane)
- x86-64-v4: avx512, adds vpshufb on u8x64 (again really just 4x pshufb in parallel)
- icelake-server: avx512vbmi, adds vpermb (this is a true 256/512-bit shuffle with 5/6-bit indices)

N.B. the code used to benchmark includes #431, and removes the src/masks/bitmask.rs avx512f mask implementation to work around the problem discussed here.

The main thing to look for is the performance of swizzle_dyn_64* on non-vbmi targets (v2, v3, v4), and the performance of swizzle_dyn_32* on x86-64-v2. With the previous implementation, sizes without native vector instructions fall back to the scalar implementation, while in the PR version they fall back to a lower vector size, for ~4-10x the performance of the scalar version.

Benchmark data for old version
-Ctarget-cpu=x86-64
-Ctarget-cpu=x86-64-v2
-Ctarget-cpu=x86-64-v3
-Ctarget-cpu=x86-64-v4
-Ctarget-cpu=icelake-server (avx512vbmi)

Benchmark data for new version
-Ctarget-cpu=x86-64
-Ctarget-cpu=x86-64-v2
-Ctarget-cpu=x86-64-v3
-Ctarget-cpu=x86-64-v4
-Ctarget-cpu=icelake-server (avx512vbmi)