Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added STATIC_BMI2 for compile time detection of BMI2 on MSVC, when enabled various intrinsics are used #2258

Merged
merged 3 commits into from
Aug 4, 2020

Conversation

Niadb
Copy link
Contributor

@Niadb Niadb commented Jul 28, 2020

This adds STATIC_BMI2 define to compiler.h, which is enabled on MSVC when compiler target has AVX2 enabled

Like DYNAMIC_BMI2 but at compile time.

Replaces uses of BitScanForward/Reverse with _lzcnt_u64/_tzcnt_u64 and 32 variants for better code gen.

Other changes when STATIC_BMI2 is on:
BIT_getLowerBits uses _bzhi_u64
BIT_getMiddleBits uses _bextr_u64

STATIC_BMI2 is only turned on with MSVC

MSVC is terrible at detecting patterns such as _bextr_u64, clang is much better so maybe not necessary there.

In BIT_getLowerBits Clang is smart enough to use _bzhi_u64 when -march=skylake is enabled(AVX2 support), MSVC fails to do so when /arch:AVX2 is on.

@Cyan4973
Copy link
Contributor

Do you have some benchmark results ?
Or perf / IACA results ?

Profiler showed some of these not being inlined on MSVC
@Niadb
Copy link
Contributor Author

Niadb commented Jul 28, 2020

Hi @Cyan4973 I used the fullbench program to test the changes in regards to performance.

I also used the VS profiler, it showed a few functions not being inlined, so I applied a force inline to a few hot functions in bitstream.h

The BitScanForward/Reverse changes were faster along with _bzhi_u64, but using _bextr_u64 in that context didn't work out, so I removed it.

Here is the before and after of running fullbench.
Compiled with VS2019(latest), CPU is Ryzen 9 3900x, running Windows 10
AVX2 was enabled in build

Before:
*** Zstandard speed analyzer 1.4.5 64-bits, by Yann Collet (Jul 28 2020) ***
 Sample 10000000 bytes :
 1#compress                     :   520.1 MB/s  ( 3154550)
 2#decompress                   :  1727.9 MB/s  (10000000)
11#compressContinue             :   519.6 MB/s  ( 3154550)
12#compressContinue_extDict     :   512.6 MB/s  ( 3154887)
13#decompressContinue           :  1724.4 MB/s  (10000000)
31#decodeLiteralsBlock          :  2534.8 MB/s  (   37031)
32#decodeSeqHeaders             : 51772.6 MB/s  (      73)
41#compressStream               :   421.9 MB/s  ( 3154678)
42#decompressStream             :  1726.2 MB/s  (10000000)
43#compressStream_freshCCtx     :   421.9 MB/s  ( 3154678)
51#compress_generic, continue   :   422.2 MB/s  ( 3154678)
52#compress_generic, end        :   519.8 MB/s  ( 3154550)
61#compress_generic, -T2, contin:   421.3 MB/s  ( 3154678)
62#compress_generic, -T2, end   :   519.1 MB/s  ( 3154550)

After:
*** Zstandard speed analyzer 1.4.5 64-bits, by Yann Collet (Jul 28 2020) ***
 Sample 10000000 bytes :
 1#compress                     :   526.9 MB/s  ( 3154550)
 2#decompress                   :  1767.6 MB/s  (10000000)
11#compressContinue             :   527.1 MB/s  ( 3154550)
12#compressContinue_extDict     :   520.9 MB/s  ( 3154887)
13#decompressContinue           :  1767.5 MB/s  (10000000)
31#decodeLiteralsBlock          :  2522.1 MB/s  (   37031)
32#decodeSeqHeaders             : 50173.2 MB/s  (      73)
41#compressStream               :   419.7 MB/s  ( 3154678)
42#decompressStream             :  1755.4 MB/s  (10000000)
43#compressStream_freshCCtx     :   420.7 MB/s  ( 3154678)
51#compress_generic, continue   :   422.7 MB/s  ( 3154678)
52#compress_generic, end        :   524.6 MB/s  ( 3154550)
61#compress_generic, -T2, contin:   418.1 MB/s  ( 3154678)
62#compress_generic, -T2, end   :   526.1 MB/s  ( 3154550)

@Cyan4973
Copy link
Contributor

Excellent, thank you @Niadb !
We are aware that Visual Studio performance is not as good as gcc or clang, so this patch is welcomed.
And yes, we also noticed that bextr is not an automatic win.

I'll have a more detailed look at this patch later today.

@Cyan4973 Cyan4973 self-requested a review July 28, 2020 17:33
@Cyan4973 Cyan4973 self-assigned this Jul 29, 2020
@Cyan4973
Copy link
Contributor

Cyan4973 commented Jul 30, 2020

I've been trying to replicate these results.
So far, results are slightly less good than anticipated from published measurements,
though broadly in line.

I'm unable to detect any consistent compression speed improvement. If there is, it's well within noise level.
On the decompression side though, there is a measurable speed improvement, just not a great one. The average is about ~3%, relatively consistent on synthetic data, but varies a lot on real datasets, with results in the [0-3%] range, generally lower.

Still, it's a net positive, so it qualifies,

Just for curiosity, I'm wondering if I do these speed tests correctly.
Apart from the usual speed optimization flags, I'm adding /Ob2 for "inline more often", and /arch:AVX2 in the expectation that it also enables BMI2. That's about it. @Niadb , are there other settings that matter, and could influence measurements ?

@Niadb
Copy link
Contributor Author

Niadb commented Aug 4, 2020

hi @Cyan4973

That is about the same as what I saw, the compression speed diff was very small, basically irrelevant.

I was more interested in decompression speed, as that is what I mostly use zstd for.

1767.6/1727.9 is only about 2.2%, so in line with what you reported, it varies some each time I run fullbench, but is generally in the 2-3% range.

I think you basically had the same flags as me. Intel vs AMD might make some difference, not sure what CPU you have.

For comparison here is clang on the same CPU as the above ^, it is very far ahead of MSVC in every metric

*** Zstandard speed analyzer 1.4.5 64-bits, by Yann Collet (Jul 28 2020) ***
Sample 10000000 bytes :
1#compress : 646.0 MB/s ( 3154550)
2#decompress : 2131.9 MB/s (10000000)
11#compressContinue : 659.2 MB/s ( 3154550)
12#compressContinue_extDict : 624.6 MB/s ( 3154887)
13#decompressContinue : 2021.5 MB/s (10000000)
31#decodeLiteralsBlock : 3057.5 MB/s ( 37031)
32#decodeSeqHeaders : 63629.0 MB/s ( 73)
41#compressStream : 478.1 MB/s ( 3154678)
42#decompressStream : 2046.9 MB/s (10000000)
43#compressStream_freshCCtx : 482.7 MB/s ( 3154678)
51#compress_generic, continue : 481.7 MB/s ( 3154678)
52#compress_generic, end : 643.3 MB/s ( 3154550)
61#compress_generic, -T2, contin: 486.5 MB/s ( 3154678)
62#compress_generic, -T2, end : 662.8 MB/s ( 3154550)

@Cyan4973 Cyan4973 merged commit 38e3854 into facebook:dev Aug 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants