Consider using LTO + PGO + Bolt #140

Open
zamazan4ik opened this issue Dec 22, 2022 · 8 comments


@zamazan4ik

Hi!

YDB does not currently support building with more advanced optimization techniques such as PGO and BOLT. These tools are seeing increasing adoption in the community as a way to optimize programs further, so there is a good chance of gaining additional performance almost "for free".

I suggest at least experimenting with an LTO + PGO + BOLT pipeline (or any combination of these) and testing whether it improves the project's performance. If it does, it would be great to ship prebuilt binaries with these optimizations applied out of the box. It would also help users if the build scripts had integrated functionality for tuning their own binaries to their own workloads.

Also, there are some caveats to consider like:

  • Increased build times
  • BOLT may still be unstable (or even broken) on some architectures

Links:

@zamazan4ik
Author

zamazan4ik commented Mar 25, 2023

I did some performance experiments on my local machine.

My setup:

  • OS: Fedora 37
  • Linux kernel: 6.2.7
  • Compiler: clang-15 from Fedora packages (clang 15.0.7); I've patched a few sources to support this compiler
  • Hardware: Ryzen 9 5900X, 32 GiB RAM, SSD

For benchmarking and profile generation, I used the KqpLoad actor (https://ydb.tech/en/docs/development/load-actors-kqp), which I ran multiple times for 300 seconds each (all other parameters at their defaults). The YDB setup is local with RAM storage, as described here: https://ydb.tech/en/docs/getting_started/self_hosted/ydb_local but with my own ydbd binaries.

I did the following things:

  • Built the usual release build and benchmarked it
  • Built an instrumented build, ran the same benchmark on it, and then recompiled with the generated profiles using Clang PGO
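
For readers unfamiliar with instrumentation-based PGO, the cycle described in the second step can be sketched roughly as follows. This is a minimal illustration, not the actual YDB build: it assumes Clang's `-fprofile-generate` and `-fprofile-use` flags and the `llvm-profdata merge` tool, and it uses a single hypothetical source file (`main.cpp`), hypothetical output paths, and a hypothetical helper name (`pgo_commands`). A real ydbd build would pass the same flags through its build system. The helper only assembles the command lines; it does not run them.

```python
import shlex


def pgo_commands(source, binary, profile_dir, raw_profiles, compiler="clang++"):
    """Return the three command lines of an instrumentation-based PGO cycle.

    All paths are illustrative placeholders, not YDB's real layout.
    """
    merged = f"{binary}.profdata"
    # 1. Instrumented build: running it writes *.profraw files into profile_dir.
    instrumented = [compiler, "-O2", f"-fprofile-generate={profile_dir}",
                    source, "-o", f"{binary}.instrumented"]
    # 2. Merge the raw profiles collected while running the benchmark.
    merge = ["llvm-profdata", "merge", f"-output={merged}", *raw_profiles]
    # 3. Optimized build: same release flags plus the merged profile.
    optimized = [compiler, "-O2", f"-fprofile-use={merged}",
                 source, "-o", binary]
    return instrumented, merge, optimized


if __name__ == "__main__":
    steps = pgo_commands("main.cpp", "app", "profiles",
                         ["profiles/run1.profraw", "profiles/run2.profraw"])
    for step in steps:
        print(shlex.join(step))
```

The benchmark run between steps 1 and 2 is what makes the profile representative, so the workload used there should resemble the one the final binary will serve.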

The results are the following:

  • Usual release build: 28k TPS
  • PGO-optimized build with the same release flags: 35k TPS

I also tried to apply BOLT, but perf2bolt consumes more than 32 GiB of RAM for the ydbd binary, so it was OOM-killed :(

Additional notes regarding PGO via instrumentation: while generating profiles with the instrumented ydbd binary via KqpLoadActor, I hit a strange error, possibly due to hardcoded deadlines - see here: https://github.com/ydb-platform/ydb/blob/main/ydb/core/load_test/kqp.cpp#L332 Since instrumented binaries are much slower, some deadlines need to be adjusted. For my local benchmarking, I simply commented out these deadlines, and the profile was generated successfully. It would probably be better to be able to configure the timeout externally, without modifying the code.
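
To illustrate the idea of an externally configurable timeout, here is a small language-agnostic sketch written in Python. It is not YDB code: the environment variable name `LOAD_TEST_DEADLINE_SECONDS`, the default value, and the function name are all hypothetical, and the actual load actor is implemented in C++ with its own configuration mechanisms.

```python
import os

# Hypothetical default; not taken from the YDB sources.
DEFAULT_DEADLINE_SECONDS = 30.0


def load_deadline_seconds(env=os.environ, name="LOAD_TEST_DEADLINE_SECONDS"):
    """Read a deadline from the environment, falling back to the default.

    A slow instrumented binary could then be run with a larger value
    without touching the source code.
    """
    raw = env.get(name)
    if raw is None:
        return DEFAULT_DEADLINE_SECONDS
    value = float(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError(f"{name} must be positive, got {raw!r}")
    return value


if __name__ == "__main__":
    print(load_deadline_seconds({}))
    print(load_deadline_seconds({"LOAD_TEST_DEADLINE_SECONDS": "300"}))
```

Keeping a default means regular release builds behave as before, while profile-collection runs can relax the limit.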

@zamazan4ik
Author

zamazan4ik commented Mar 27, 2023

Well, I managed to run BOLT with some "magic" options (details are here: llvm/llvm-project#61711).

As expected, BOLT didn't provide a significant performance boost after PGO - but still, I see measurable improvements:

  • PGO: 35k TPS
  • PGO + Bolt: 37k TPS
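
For reference, the relative gains implied by the throughput figures reported in this thread (28k, 35k and 37k TPS) can be worked out as below. The inputs are the rounded numbers quoted above, so the percentages are approximate; the function name is just an illustrative helper.

```python
def relative_gain(baseline_tps, optimized_tps):
    """Return the throughput improvement over the baseline, in percent."""
    return (optimized_tps - baseline_tps) / baseline_tps * 100.0


if __name__ == "__main__":
    release, pgo, pgo_bolt = 28_000, 35_000, 37_000
    print(f"PGO vs release:        {relative_gain(release, pgo):.1f}%")       # 25.0%
    print(f"PGO + BOLT vs PGO:     {relative_gain(pgo, pgo_bolt):.1f}%")      # 5.7%
    print(f"PGO + BOLT vs release: {relative_gain(release, pgo_bolt):.1f}%")  # 32.1%
```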

I think Propeller (an alternative approach from Google, similar to BOLT) could deliver roughly the same numbers. I tried to test YDB with Propeller, but it requires the latest Clang compiler from the main branch, and YDB has a bunch of compilation errors with it - and right now I lack the motivation to fix them. Maybe one day I will test it too :)

@eivanov89
Member

Hi Alexander Zaitsev, thank you very much for sharing this excellent idea and running the initial experiments. One of our engineers has confirmed your results and is working further on the integration details. We will be back soon, once we have collected more data and understand how best to use this.

@zamazan4ik
Author

@eivanov89 do you have any updates regarding PGO? If you have confirmed the results and find them useful, I suggest adding a note about tuning YDB with PGO to the YDB documentation. Here are examples from other projects of what this documentation could look like:

Having this kind of information in the official documentation makes optimization opportunities more visible to the end users and maintainers.

@eivanov89
Member

Hi @zamazan4ik, sorry for the delay. We have some issues with our internal tools and build. We hope to solve them soon, though. If we fail, we will consider applying this to the GitHub build only.

@zamazan4ik
Author

But if fail, we will consider applying this to github build only.

Understood. If you confirm the results above, I suggest adding a note about PGO to the YDB documentation. So the users who build YDB binaries on their own will be able to estimate performance benefits from PGO on YDB and optimize their YDB builds too.

@eivanov89
Member

So the users who build YDB binaries on their own will be able to estimate performance benefits from PGO on YDB and optimize their YDB builds too.

The tests that we have both used to evaluate PGO are too narrow, imho. We're going to try YCSB and TPC-C to check whether real benchmarks benefit in the same way as the microbenchmarks we have used so far.

@HUGODAI

HUGODAI commented Nov 15, 2024

Using LTO + PGO greatly improved MySQL performance. Applying BOLT on top of that did not yield any further improvement. Is this result in line with expectations?
