Added STATIC_BMI2 for compile time detection of BMI2 on MSVC, when enabled various intrinsics are used #2258
Do you have any benchmark results?
Profiler showed some of these not being inlined on MSVC
Hi @Cyan4973, I used the fullbench program to test the performance impact of the changes. I also used the VS profiler, which showed a few functions not being inlined, so I applied a force inline to a few hot functions in bitstream.h. The BitScanForward/Reverse changes were faster, along with _bzhi_u64, but using _bextr_u64 in that context didn't work out, so I removed it. Here is the before and after of running fullbench. Before: After:
Excellent, thank you @Niadb! I'll have a more detailed look at this patch later today.
I've been trying to replicate these results. I'm unable to detect any consistent compression speed improvement; if there is one, it's well within noise level. Still, it's a net positive, so it qualifies. Just out of curiosity, I'm wondering if I'm running these speed tests correctly.
Hi @Cyan4973, that is about the same as what I saw: the compression speed difference was very small, basically irrelevant. I was more interested in decompression speed, as that is what I mostly use zstd for. 1767.6/1727.9 is only about 2.2%, so in line with what you reported. It varies somewhat each time I run fullbench, but is generally in the 2-3% range. I think you basically had the same flags as me. Intel vs AMD might make some difference; I'm not sure what CPU you have. For comparison, here is clang on the same CPU as above. It is very far ahead of MSVC in every metric. *** Zstandard speed analyzer 1.4.5 64-bits, by Yann Collet (Jul 28 2020) ***
This adds a STATIC_BMI2 define to compiler.h, which is enabled on MSVC when the compiler target has AVX2 enabled.
It works like DYNAMIC_BMI2, but the decision is made at compile time.
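For readers who want to picture the detection rule, here is a minimal sketch of what such a define could look like. It is not the exact text of the patch. The `_M_X64` check and the helper `sketch_static_bmi2_enabled` are assumptions added for illustration; the only part taken from the description is that MSVC plus an AVX2 target (`/arch:AVX2`, which makes MSVC define `__AVX2__`) turns the switch on.

```c
/* Sketch only: mirrors the rule described above, not the patch's exact text. */
#ifndef STATIC_BMI2
#  if defined(_MSC_VER) && defined(_M_X64) && defined(__AVX2__)
     /* MSVC has no BMI2-specific switch, so /arch:AVX2 (which defines
        __AVX2__) is used as the stand-in signal. _M_X64 is an assumption
        added here because the 64-bit intrinsics need an x64 target. */
#    define STATIC_BMI2 1
#  else
#    define STATIC_BMI2 0
#  endif
#endif

/* Hypothetical helper so the compile-time decision can be inspected. */
static int sketch_static_bmi2_enabled(void)
{
    return STATIC_BMI2;
}
```

On any compiler other than MSVC this sketch evaluates to 0, which matches the statement that STATIC_BMI2 is only turned on with MSVC.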
Replaces uses of BitScanForward/Reverse with _tzcnt_u64/_lzcnt_u64 (and their 32-bit variants) for better code gen.
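As a rough illustration of that swap, the sketch below shows 32-bit "highest set bit" and "count trailing zeros" helpers that pick `_lzcnt_u32`/`_tzcnt_u32` when a compile-time BMI2 switch is on and fall back to `_BitScanReverse`/`_BitScanForward` otherwise. The names `SKETCH_STATIC_BMI2`, `sketch_highbit32` and `sketch_ctz32` are hypothetical, and the GCC/Clang and plain-loop branches are only there so the sketch builds outside MSVC; the 64-bit variants would follow the same shape.

```c
#include <assert.h>
#include <stdint.h>

#if defined(_MSC_VER)
#  include <intrin.h>
#  if defined(__AVX2__)
#    include <immintrin.h>
#    define SKETCH_STATIC_BMI2 1   /* hypothetical stand-in for STATIC_BMI2 */
#  endif
#endif
#ifndef SKETCH_STATIC_BMI2
#  define SKETCH_STATIC_BMI2 0
#endif

/* Index (0..31) of the highest set bit. val must be non-zero. */
static unsigned sketch_highbit32(uint32_t val)
{
    assert(val != 0);
#if SKETCH_STATIC_BMI2
    return 31u - (unsigned)_lzcnt_u32(val);      /* leading-zero count */
#elif defined(_MSC_VER)
    unsigned long idx;
    _BitScanReverse(&idx, val);                  /* older MSVC path */
    return (unsigned)idx;
#elif defined(__GNUC__)
    return 31u - (unsigned)__builtin_clz(val);
#else
    unsigned r = 0;
    while (val >>= 1) r++;                       /* portable fallback */
    return r;
#endif
}

/* Number of trailing zero bits. val must be non-zero. */
static unsigned sketch_ctz32(uint32_t val)
{
    assert(val != 0);
#if SKETCH_STATIC_BMI2
    return (unsigned)_tzcnt_u32(val);            /* trailing-zero count */
#elif defined(_MSC_VER)
    unsigned long idx;
    _BitScanForward(&idx, val);
    return (unsigned)idx;
#elif defined(__GNUC__)
    return (unsigned)__builtin_ctz(val);
#else
    unsigned r = 0;
    while ((val & 1u) == 0) { val >>= 1; r++; }
    return r;
#endif
}
```

Both helpers require a non-zero input, since the bit-scan fallbacks leave the index undefined for zero.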
Other changes when STATIC_BMI2 is on:
BIT_getLowerBits uses _bzhi_u64
BIT_getMiddleBits uses _bextr_u64
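To make the two items above concrete, here is a hedged sketch of how lower-bits and middle-bits helpers could choose between the BMI intrinsics and plain shift-and-mask code. The function names, the `SKETCH_STATIC_BMI2` macro and the computed masks are assumptions for illustration; the real functions in bitstream.h may differ in details such as how the mask is obtained.

```c
#include <assert.h>
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64) && defined(__AVX2__)
#  include <immintrin.h>
#  define SKETCH_STATIC_BMI2 1   /* hypothetical stand-in for STATIC_BMI2 */
#else
#  define SKETCH_STATIC_BMI2 0
#endif

/* Keep only the lowest nbBits bits of bits (nbBits < 64). */
static uint64_t sketch_get_lower_bits(uint64_t bits, unsigned nbBits)
{
    assert(nbBits < 64);
#if SKETCH_STATIC_BMI2
    return _bzhi_u64(bits, nbBits);              /* zero bits from nbBits up */
#else
    return bits & (((uint64_t)1 << nbBits) - 1); /* shift-and-mask fallback */
#endif
}

/* Extract nbBits bits starting at bit position start (both < 64). */
static uint64_t sketch_get_middle_bits(uint64_t bits, unsigned start,
                                       unsigned nbBits)
{
    assert(start < 64);
    assert(nbBits < 64);
#if SKETCH_STATIC_BMI2
    return _bextr_u64(bits, start, nbBits);      /* bit-field extract */
#else
    return (bits >> start) & (((uint64_t)1 << nbBits) - 1);
#endif
}
```

For example, `sketch_get_middle_bits(0xABCD, 4, 8)` yields `0xBC` on either path: the value is shifted right by 4 bits and the low 8 bits are kept.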
STATIC_BMI2 is only turned on with MSVC.
MSVC is terrible at recognizing patterns that map to instructions such as _bextr_u64; Clang is much better, so this may not be necessary there.
In BIT_getLowerBits, Clang is smart enough to use _bzhi_u64 when -march=skylake is enabled (which includes AVX2 support); MSVC fails to do so when /arch:AVX2 is on.