Couple tweaks to improve decompression speed with clang PGO compilation #3576
Conversation
Inlining `BIT_reloadDStream` provided a >3% decompression speed improvement for the clang PGO-optimized zstd binary, measured on the Silesia corpus at compression level 1. The win comes from improved register allocation, which leads to fewer spills and reloads. Take a look at this comparison of profile-annotated hot assembly before and after this change: https://www.diffchecker.com/UjDGIyLz/. The diff is a bit messy, but notice three fewer moves after inlining. In general, LLVM's register allocator works better when it can see more code. For example, when the register allocator sees a call instruction, it partitions the registers into caller-saved and callee-saved sets, and it is not free to do whatever it wants with all the registers in the current function. Inlining the callee lets the register allocator access all registers and use them more flexibly.
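For illustration, here is a minimal sketch of the pattern (not the exact zstd change; `BitStream` and `reloadStream` are hypothetical stand-ins for the real bitstream types): an `always_inline` attribute forces the helper's body into each call site, so the register allocator sees the whole hot path at once instead of stopping at a call boundary.

```c
#include <stddef.h>

/* On GCC/Clang, always_inline forces the function body into the caller,
 * removing the call boundary that would otherwise split registers into
 * caller-saved and callee-saved sets. */
#if defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE static inline __attribute__((always_inline))
#else
#  define FORCE_INLINE static inline
#endif

typedef struct {
    const unsigned char* ptr;   /* next byte to read */
    size_t bitsConsumed;        /* bits used from the current container */
} BitStream;

/* Hypothetical refill helper standing in for BIT_reloadDStream:
 * once inlined, its loads and updates can share the caller's registers. */
FORCE_INLINE void reloadStream(BitStream* bs) {
    while (bs->bitsConsumed >= 8) {   /* placeholder refill logic */
        bs->bitsConsumed -= 8;
        bs->ptr += 1;
    }
}
```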
Looking at the `__builtin_expect` in `ZSTD_decodeSequence`:

```c
{   size_t offset;
#if defined(__clang__)
    if (LIKELY(ofBits > 1)) {
#else
    if (ofBits > 1) {
#endif
        ZSTD_STATIC_ASSERT(ZSTD_lo_isLongOffset == 1);
```

From the profile-annotated assembly, the probability of `ofBits > 1` is about 75% (101k counts out of 135k counts). This is much lower than the ~99% likelihood recommended for using `__builtin_expect`. As a result, clang moved the else block further away, which hurts cache locality. Removing this `__builtin_expect`, along with two others in `ZSTD_decodeSequence`, gave better performance when PGO is enabled. I suggest removing these branch hints and relying on PGO instead, which leverages runtime profiles from the actual workload to calculate branch probabilities.
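For context, `LIKELY`/`UNLIKELY` are conventionally thin wrappers over `__builtin_expect` (zstd defines similar macros; the exact definitions below are a sketch):

```c
/* The compiler lays out the expected path as the fall-through and moves
 * the unexpected path out of line. The hint only pays off when it is
 * right nearly all the time (~99%); at ~75%, the out-of-line path is
 * taken often enough that the worse code layout hurts cache locality. */
#define LIKELY(x)   (__builtin_expect(!!(x), 1))
#define UNLIKELY(x) (__builtin_expect(!!(x), 0))
```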
The explanations look good to me. Now, a couple of observations:
Gotcha. Here are some numbers for patch 1: clang 15 ThinLTO + PGO (average of 15 runs), and gcc 11 (average of 15 runs). When I measured patch 2 alone there was a ~2% win, but it disappeared once combined with patch 1. They likely both influence the register allocator to spill less in hot code, which is the main driver here. But we expect removing the branch hints and relying on PGO to give better results in production, because the branch probabilities will be specialized to the workload.
Not going to block the patch. We checked, and PGO for us gives around 2.8%. The second commit is neutral. Thanks!
@Cyan4973 I don't have write permission. Could you help merge this?
Don't worry, it will be merged.
- Inline `BIT_reloadDStream` to improve register allocation down the pipeline.
- Remove `__builtin_expect`s from `ZSTD_decodeSequence` to lean on clang PGO profile data instead.

Please see the individual commits for some analysis details.