Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Couple tweaks to improve decompression speed with clang PGO compilation #3576

Merged
merged 2 commits into from
Mar 31, 2023

Conversation

zhuhan0
Copy link
Contributor

@zhuhan0 zhuhan0 commented Mar 28, 2023

  • Adding inline hint to BIT_reloadDStream to improve register allocation down the pipeline.
  • Remove clang-only __builtin_expects from ZSTD_decodeSequence to lean on clang PGO profile data instead.

Please see the individual commits for some analysis details.

zhuhan0 added 2 commits March 28, 2023 15:36
Inlining `BIT_reloadDStream` provided >3% decompression speed improvement for
clang PGO-optimized zstd binary, measured using the Silesia corpus with
compression level 1. The win comes from improved register allocation which leads
to fewer spills and reloads. Take a look at this comparison of
profile-annotated hot assembly before and after this change:
https://www.diffchecker.com/UjDGIyLz/. The diff is a bit messy, but notice three
fewer moves after inlining.

In general LLVM's register allocator works better when it can see more code. For
example, when the register allocator sees a call instruction, it partitions the
registers into caller registers and callee registers, and it is not free to do
whatever it wants with all the registers for the current function. Inlining the
callee lets the register allocation access all registers and use them more
flexsibly.
Looking at the __builtin_expect in ZSTD_decodeSequence:

{   size_t offset;
    #if defined(__clang__)
 if (LIKELY(ofBits > 1)) {
    #else
 if (ofBits > 1) {
    #endif
 ZSTD_STATIC_ASSERT(ZSTD_lo_isLongOffset == 1);

From profile-annotated assembly, the probability of ofBits > 1 is about 75%
(101k counts out of 135k counts). This is much smaller than the recommended
likelihood to use __builtin_expect which is 99%. As a result, clang moved the
else block further away which hurts cache locality. Removing this
__built_expect along with two others in ZSTD_decodeSequence gave better
performance when PGO is enabled. I suggest to remove these branch hints and
rely on PGO which leverages runtime profiles from actual workload to calculate
branch probability instead.
@Cyan4973
Copy link
Contributor

The explanations look good to me, now a couple of observation :

  • For patch 1 : I note this is considered desirable for clang with pgo. But what's the impact for gcc without pgo ?
    Some benchmarking would be welcome.

  • For patch 2 : I note that the removed branch was recently added by @danlark1 . I'm curious about his own perspective on this proposed modification.

@zhuhan0
Copy link
Contributor Author

zhuhan0 commented Mar 28, 2023

The explanations look good to me, now a couple of observation :

  • For patch 1 : I note this is considered desirable for clang with pgo. But what's the impact for gcc without pgo ?
    Some benchmarking would be welcome.
  • For patch 2 : I note that the removed branch was recently added by @danlark1 . I'm curious about his own perspective on this proposed modification.

Gotcha. Here're some numbers for patch 1.

clang 15 thinLTO + PGO (15 runs average)
698.83 => 721.87 (+3.3%)

gcc 11 (15 runs average)
728.17 => 728.95 (neutral)

When I measured patch 2 alone there was ~2% win but it disappeared after combined with patch 1. They likely both influence the register allocator to spill less hot code, which is the main driver here. But we expect removing branch hints and relying on PGO will give better result in production because the branch probabilities will be specialized based on the workload.

@danlark1
Copy link
Contributor

Not going to block the patch, we checked and PGO for us gives around 2.8%. Second commit is neutral

Thanks!

@Cyan4973 Cyan4973 self-assigned this Mar 29, 2023
@zhuhan0
Copy link
Contributor Author

zhuhan0 commented Mar 30, 2023

@Cyan4973 I don't have write permission. Could you help merge this?

@Cyan4973
Copy link
Contributor

Don't worry, it will be merged

@Cyan4973 Cyan4973 merged commit bb7fbd5 into facebook:dev Mar 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants