Added STATIC_BMI2 for compile time detection of BMI2 on MSVC, when enabled various intrinsics are used #2258

Niadb · 2020-07-28T09:02:52Z

This adds a STATIC_BMI2 define to compiler.h, which is enabled on MSVC when the compiler target has AVX2 enabled.

It works like DYNAMIC_BMI2, but the decision is made at compile time.
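A minimal sketch of what such a compile-time switch could look like, assuming MSVC's `/arch:AVX2` (which defines `__AVX2__`) is used as a proxy for BMI2 availability, as the description implies. The exact conditions in compiler.h may differ, and `sketch_static_bmi2_enabled` is a hypothetical helper added only to make the result observable.

```c
/* Sketch only: the conditions are an assumption based on the PR description,
 * not a copy of compiler.h. */
#ifndef STATIC_BMI2
#  if defined(_MSC_VER) && defined(_M_X64) && defined(__AVX2__)
#    define STATIC_BMI2 1   /* MSVC x64 build with /arch:AVX2 */
#  else
#    define STATIC_BMI2 0   /* every other compiler or target */
#  endif
#endif

/* Hypothetical helper: reports which path was selected at compile time. */
static int sketch_static_bmi2_enabled(void)
{
    return STATIC_BMI2;
}
```

Because the value is fixed by the preprocessor, code guarded by `#if STATIC_BMI2` carries no runtime dispatch cost, unlike a dynamic CPU check.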

Replaces uses of BitScanForward/Reverse with _tzcnt_u64/_lzcnt_u64 (and their 32-bit variants) for better code generation.
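As an illustration of the substitution, here is a hedged sketch of two bit-scan helpers that use `_tzcnt_u64` / `_lzcnt_u64` on an MSVC AVX2 build and fall back to other methods elsewhere. The function names and the `SKETCH_USE_TZCNT` macro are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#if defined(_MSC_VER)
#  include <intrin.h>
#  if defined(_M_X64) && defined(__AVX2__)
#    include <immintrin.h>
#    define SKETCH_USE_TZCNT 1   /* hypothetical switch for this sketch */
#  endif
#endif

/* Index of the lowest set bit; val must be non-zero. */
static unsigned sketch_count_trailing_zeros64(uint64_t val)
{
    assert(val != 0);
#if defined(SKETCH_USE_TZCNT)
    return (unsigned)_tzcnt_u64(val);
#elif defined(_MSC_VER) && defined(_M_X64)
    unsigned long idx;
    _BitScanForward64(&idx, val);
    return (unsigned)idx;
#elif defined(__GNUC__)
    return (unsigned)__builtin_ctzll(val);
#else
    unsigned n = 0;
    while (((val >> n) & 1) == 0) n++;   /* portable bit-by-bit scan */
    return n;
#endif
}

/* Index of the highest set bit; val must be non-zero. */
static unsigned sketch_highbit64(uint64_t val)
{
    assert(val != 0);
#if defined(SKETCH_USE_TZCNT)
    return 63u - (unsigned)_lzcnt_u64(val);
#elif defined(_MSC_VER) && defined(_M_X64)
    unsigned long idx;
    _BitScanReverse64(&idx, val);
    return (unsigned)idx;
#elif defined(__GNUC__)
    return 63u - (unsigned)__builtin_clzll(val);
#else
    unsigned n = 63;
    while (((val >> n) & 1) == 0) n--;   /* portable scan from the top bit */
    return n;
#endif
}
```

Note that a leading-zero count has to be subtracted from 63 to give the same bit index that a reverse bit scan returns.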

Other changes when STATIC_BMI2 is on:
BIT_getLowerBits uses _bzhi_u64
BIT_getMiddleBits uses _bextr_u64
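A rough sketch of the lower-bits case, assuming a helper that keeps the low `nbBits` bits of a 64-bit value. The name `sketch_get_lower_bits` and the `SKETCH_STATIC_BMI2` macro are hypothetical stand-ins, and the real BIT_getLowerBits may use a different fallback.

```c
#include <assert.h>
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64) && defined(__AVX2__)
#  include <immintrin.h>
#  define SKETCH_STATIC_BMI2 1   /* stand-in for STATIC_BMI2 */
#else
#  define SKETCH_STATIC_BMI2 0
#endif

/* Keep the low nbBits bits of value; nbBits must be in [0, 63]. */
static uint64_t sketch_get_lower_bits(uint64_t value, unsigned nbBits)
{
    assert(nbBits < 64);
#if SKETCH_STATIC_BMI2
    /* BMI2: zero all bits from position nbBits upward, in one instruction. */
    return _bzhi_u64(value, nbBits);
#else
    /* Portable fallback: build the mask with a shift and a subtract. */
    return value & ((1ULL << nbBits) - 1);
#endif
}
```

Both paths return the same result for the stated range; the intrinsic just spares the compiler from having to recognize the mask pattern.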

STATIC_BMI2 is only turned on with MSVC.

MSVC is terrible at detecting patterns such as _bextr_u64; clang is much better, so it may not be necessary there.

In BIT_getLowerBits, Clang is smart enough to use _bzhi_u64 when -march=skylake is enabled (AVX2 support); MSVC fails to do so when /arch:AVX2 is on.

Cyan4973 · 2020-07-28T15:22:06Z

Do you have some benchmark results?
Or perf / IACA results?

Niadb · 2020-07-28T17:27:30Z

Hi @Cyan4973, I used the fullbench program to measure the performance impact of the changes.

I also used the VS profiler. It showed a few functions not being inlined, so I applied force inline to a few hot functions in bitstream.h.
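For readers unfamiliar with the technique, forcing inlining usually means wrapping a compiler-specific keyword in a macro. The sketch below uses a hypothetical macro name and a hypothetical helper; it is not the code from bitstream.h.

```c
#include <stdint.h>

/* Hypothetical macro name, for illustration only. */
#if defined(_MSC_VER)
#  define SKETCH_FORCE_INLINE static __forceinline
#elif defined(__GNUC__)
#  define SKETCH_FORCE_INLINE static inline __attribute__((always_inline))
#else
#  define SKETCH_FORCE_INLINE static inline
#endif

/* A small hot helper of the kind a bitstream reader calls very often.
 * Extracts nbBits bits starting at bit `start`; both must be in [0, 63]. */
SKETCH_FORCE_INLINE uint64_t sketch_peek_bits(uint64_t container,
                                              unsigned start,
                                              unsigned nbBits)
{
    return (container >> start) & ((1ULL << nbBits) - 1);
}
```

Whether forcing inlining helps depends on the call site, so a profiler run before and after is the safest way to confirm it.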

The BitScanForward/Reverse changes were faster along with _bzhi_u64, but using _bextr_u64 in that context didn't work out, so I removed it.

Here are the fullbench results before and after.
Compiled with VS2019 (latest), CPU is a Ryzen 9 3900X, running Windows 10.
AVX2 was enabled in the build.

Before:
*** Zstandard speed analyzer 1.4.5 64-bits, by Yann Collet (Jul 28 2020) ***
Sample 10000000 bytes :
1#compress : 520.1 MB/s ( 3154550)
2#decompress : 1727.9 MB/s (10000000)
11#compressContinue : 519.6 MB/s ( 3154550)
12#compressContinue_extDict : 512.6 MB/s ( 3154887)
13#decompressContinue : 1724.4 MB/s (10000000)
31#decodeLiteralsBlock : 2534.8 MB/s ( 37031)
32#decodeSeqHeaders : 51772.6 MB/s ( 73)
41#compressStream : 421.9 MB/s ( 3154678)
42#decompressStream : 1726.2 MB/s (10000000)
43#compressStream_freshCCtx : 421.9 MB/s ( 3154678)
51#compress_generic, continue : 422.2 MB/s ( 3154678)
52#compress_generic, end : 519.8 MB/s ( 3154550)
61#compress_generic, -T2, contin: 421.3 MB/s ( 3154678)
62#compress_generic, -T2, end : 519.1 MB/s ( 3154550)

After:
*** Zstandard speed analyzer 1.4.5 64-bits, by Yann Collet (Jul 28 2020) ***
Sample 10000000 bytes :
1#compress : 526.9 MB/s ( 3154550)
2#decompress : 1767.6 MB/s (10000000)
11#compressContinue : 527.1 MB/s ( 3154550)
12#compressContinue_extDict : 520.9 MB/s ( 3154887)
13#decompressContinue : 1767.5 MB/s (10000000)
31#decodeLiteralsBlock : 2522.1 MB/s ( 37031)
32#decodeSeqHeaders : 50173.2 MB/s ( 73)
41#compressStream : 419.7 MB/s ( 3154678)
42#decompressStream : 1755.4 MB/s (10000000)
43#compressStream_freshCCtx : 420.7 MB/s ( 3154678)
51#compress_generic, continue : 422.7 MB/s ( 3154678)
52#compress_generic, end : 524.6 MB/s ( 3154550)
61#compress_generic, -T2, contin: 418.1 MB/s ( 3154678)
62#compress_generic, -T2, end : 526.1 MB/s ( 3154550)

Cyan4973 · 2020-07-28T17:32:54Z

Excellent, thank you @Niadb!
We are aware that Visual Studio performance is not as good as gcc or clang, so this patch is welcome.
And yes, we also noticed that bextr is not an automatic win.

I'll have a more detailed look at this patch later today.

Cyan4973 · 2020-07-30T17:40:44Z

I've been trying to replicate these results.
So far, results are slightly less good than anticipated from published measurements,
though broadly in line.

I'm unable to detect any consistent compression speed improvement. If there is one, it's well within noise level.
On the decompression side though, there is a measurable speed improvement, just not a great one. The average is about 3%, relatively consistent on synthetic data, but it varies a lot on real datasets, with results in the [0-3%] range, generally lower.

Still, it's a net positive, so it qualifies.

Just out of curiosity, I'm wondering if I do these speed tests correctly.
Apart from the usual speed optimization flags, I'm adding /Ob2 for "inline more often", and /arch:AVX2 in the expectation that it also enables BMI2. That's about it. @Niadb, are there other settings that matter and could influence measurements?

Niadb · 2020-08-04T16:18:53Z

Hi @Cyan4973,

That is about the same as what I saw. The compression speed difference was very small, basically irrelevant.

I was more interested in decompression speed, as that is what I mostly use zstd for.

1767.6/1727.9 is only about a 2.3% gain, so in line with what you reported. It varies some each time I run fullbench, but is generally in the 2-3% range.
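The ratio above can be checked with a tiny helper. This is only a sketch of the arithmetic, with a hypothetical function name; the inputs are the `decompress` rows from the before and after fullbench runs.

```c
/* Relative speed gain in percent, given throughput before and after a change.
 * `before` must be non-zero. */
static double sketch_relative_gain_percent(double before, double after)
{
    return (after / before - 1.0) * 100.0;
}
```

Calling `sketch_relative_gain_percent(1727.9, 1767.6)` gives a value a little under 2.3, which sits inside the 2-3% range mentioned above.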

I think you basically had the same flags as me. Intel vs AMD might make some difference; I'm not sure what CPU you have.

For comparison, here is clang on the same CPU as above. It is very far ahead of MSVC in every metric.

*** Zstandard speed analyzer 1.4.5 64-bits, by Yann Collet (Jul 28 2020) ***
Sample 10000000 bytes :
1#compress : 646.0 MB/s ( 3154550)
2#decompress : 2131.9 MB/s (10000000)
11#compressContinue : 659.2 MB/s ( 3154550)
12#compressContinue_extDict : 624.6 MB/s ( 3154887)
13#decompressContinue : 2021.5 MB/s (10000000)
31#decodeLiteralsBlock : 3057.5 MB/s ( 37031)
32#decodeSeqHeaders : 63629.0 MB/s ( 73)
41#compressStream : 478.1 MB/s ( 3154678)
42#decompressStream : 2046.9 MB/s (10000000)
43#compressStream_freshCCtx : 482.7 MB/s ( 3154678)
51#compress_generic, continue : 481.7 MB/s ( 3154678)
52#compress_generic, end : 643.3 MB/s ( 3154550)
61#compress_generic, -T2, contin: 486.5 MB/s ( 3154678)
62#compress_generic, -T2, end : 662.8 MB/s ( 3154550)

Niadb added 2 commits July 28, 2020 02:52

Add files via upload

493fd40

Add files via upload

216a63d

facebook-github-bot added the CLA Signed label Jul 28, 2020

Cyan4973 added the optimization label Jul 28, 2020

Update bitstream.h

a8ebc14

Profiler showed some of these not being inlined on MSVC

Cyan4973 self-requested a review July 28, 2020 17:33

Cyan4973 self-assigned this Jul 29, 2020

Cyan4973 approved these changes Aug 4, 2020

Cyan4973 merged commit 38e3854 into facebook:dev Aug 4, 2020
