Couple tweaks to improve decompression speed with clang PGO compilation #3576
Conversation
Inlining `BIT_reloadDStream` provided a >3% decompression speed improvement for the clang PGO-optimized zstd binary, measured on the Silesia corpus at compression level 1. The win comes from improved register allocation, which leads to fewer spills and reloads. Take a look at this comparison of profile-annotated hot assembly before and after this change: https://www.diffchecker.com/UjDGIyLz/. The diff is a bit messy, but notice three fewer moves after inlining. In general, LLVM's register allocator works better when it can see more code. For example, when the register allocator sees a call instruction, it partitions the registers into caller-saved and callee-saved sets, and it is not free to do whatever it wants with all the registers in the current function. Inlining the callee lets the register allocator access all registers and use them more flexibly.
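For illustration, here is a minimal sketch of the pattern (not the exact zstd change; `BitStream` and `reloadStream` are hypothetical stand-ins for the real bitstream types): an `always_inline` attribute forces the helper's body into each call site, so the register allocator sees the whole hot path at once instead of stopping at a call boundary.

```c
#include <stddef.h>

/* On GCC/Clang, always_inline forces the function body into the caller,
 * removing the call boundary that would otherwise split registers into
 * caller-saved and callee-saved sets. */
#if defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE static inline __attribute__((always_inline))
#else
#  define FORCE_INLINE static inline
#endif

typedef struct {
    const unsigned char* ptr;   /* next byte to read */
    size_t bitsConsumed;        /* bits used from the current container */
} BitStream;

/* Hypothetical refill helper standing in for BIT_reloadDStream:
 * once inlined, its loads and updates can share the caller's registers. */
FORCE_INLINE void reloadStream(BitStream* bs) {
    while (bs->bitsConsumed >= 8) {   /* placeholder refill logic */
        bs->bitsConsumed -= 8;
        bs->ptr += 1;
    }
}
```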
Looking at the `__builtin_expect` in `ZSTD_decodeSequence`:

```c
{   size_t offset;
#if defined(__clang__)
    if (LIKELY(ofBits > 1)) {
#else
    if (ofBits > 1) {
#endif
        ZSTD_STATIC_ASSERT(ZSTD_lo_isLongOffset == 1);
```

From the profile-annotated assembly, the probability of `ofBits > 1` is about 75% (101k counts out of 135k counts). This is much lower than the ~99% likelihood recommended for using `__builtin_expect`. As a result, clang moved the else block further away, which hurts cache locality. Removing this `__builtin_expect`, along with two others in `ZSTD_decodeSequence`, gave better performance when PGO is enabled. I suggest removing these branch hints and relying on PGO instead, which leverages runtime profiles from the actual workload to calculate branch probabilities.
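For context, `LIKELY`/`UNLIKELY` are conventionally thin wrappers over `__builtin_expect` (zstd defines similar macros; the exact definitions below are a sketch):

```c
/* The compiler lays out the expected path as the fall-through and moves
 * the unexpected path out of line. The hint only pays off when it is
 * right nearly all the time (~99%); at ~75%, the out-of-line path is
 * taken often enough that the worse code layout hurts cache locality. */
#define LIKELY(x)   (__builtin_expect(!!(x), 1))
#define UNLIKELY(x) (__builtin_expect(!!(x), 0))
```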
The explanations look good to me. Now, a couple of observations:
Gotcha. Here are some numbers for patch 1: clang 15 ThinLTO + PGO (average of 15 runs), and gcc 11 (average of 15 runs). When I measured patch 2 alone there was a ~2% win, but it disappeared once combined with patch 1. They likely both influence the register allocator to spill less in hot code, which is the main driver here. But we expect removing the branch hints and relying on PGO to give better results in production, because the branch probabilities will be specialized to the workload.
Not going to block the patch. We checked, and PGO for us gives around 2.8%. The second commit is neutral. Thanks!
@Cyan4973 I don't have write permission. Could you help merge this?
Don't worry, it will be merged.
- Inline `BIT_reloadDStream` to improve register allocation down the pipeline.
- Remove `__builtin_expect`s from `ZSTD_decodeSequence` to lean on clang PGO profile data instead.

Please see the individual commits for some analysis details.