Use thinlto and pgo for x86_64 windows release packaging #71067

Merged: 4 commits into llvm:main on Nov 14, 2023

Conversation

@zmodem (Collaborator) commented Nov 2, 2023

Applying this to 17.0.4 makes the installer 5% smaller (298 MB vs. 315 MB) and the toolchain 20% faster (as measured by building clang).

@zmodem (Collaborator, Author) commented Nov 2, 2023

+cc @pbo-linaro

This has been on the to-do list for a long time.

Why not x86 also? We can't do bootstrapping thinlto because the linker will run out of memory. I think it's okay to focus on x86_64 for performance. Maybe we can even stop building the 32-bit binaries eventually.

@ms178 commented Nov 2, 2023

Why stop with LTO+PGO? BOLTing Clang/LLD should yield another sizeable improvement.

Added my recent BOLT stats for Clang/LLD-18 (on Linux):
LLVM-18-BOLT-231026_stats.txt

@pbo-linaro (Contributor) commented Nov 3, 2023

That would be a nice addition @zmodem.
I know it's a boring request, but could you evaluate the benefits of PGO alone separately, and then of adding ThinLTO?
The same applies if we add BOLT too.
Ideally, build times should be checked too.

Regarding the external source file used for profiling (pgo_training-1.ii), I don't know LLVM policy, but maybe it would be better to host it on our side, so we're sure it does not disappear one of these days.
By the way, what is the difference with pgo_training-2.ii, which is also available? If you have links explaining those files, it would be nice to add them as a comment.

Finally, the aarch64 build could benefit from that too, but that can be done later.

@zmodem (Collaborator, Author) commented Nov 6, 2023

> Why stop with LTO+PGO? BOLTing Clang/LLD should yield another sizeable improvement.

My understanding was that BOLT for PE/COFF binaries isn't production ready.

> Regarding the external source file used for profiling (pgo_training-1.ii), I don't know LLVM policy, but maybe it would be better to host it on our side, so we're sure it does not disappear one of these days. By the way, what is the difference with pgo_training-2.ii, which is also available? If you have links explaining those files, it would be nice to add them as a comment.

Thinking about this some more, using pgo_training-1.ii is convenient for Chromium, but may not be right for LLVM. As you say, we should probably store it somewhere in the project, and at 11 MB we can't just put it in Git. Additionally, it may need updating now and then.

I've pushed a new commit to train by building Clang's Sema.cpp instead. It requires running CMake one more time, but I think that's okay. The performance seems to be the same.

(To answer your questions anyway, the origin of pgo_training-1.ii is documented here: https://source.chromium.org/chromium/chromium/src/+/main:tools/clang/scripts/build.py;l=1044 The story of the -2 version is here: https://bugs.chromium.org/p/chromium/issues/detail?id=984067)
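For anyone following along, the two-stage flow described above looks roughly like the sketch below. This is illustrative Python rather than the actual build_llvm_package.py code: the directory names, the clangSema training target, and the use of the LLVM_BUILD_INSTRUMENTED / LLVM_PROFDATA_FILE CMake options are my assumptions about the setup, not quotes from the script.

```python
# Rough sketch of a two-stage PGO build of clang (not the real packaging script).
import glob
import os
import subprocess

def run(*cmd):
    print(" ".join(cmd))
    subprocess.check_call(list(cmd))

# Stage 1: build an instrumented clang that emits raw profiles when it runs.
run("cmake", "-G", "Ninja", "-S", "llvm", "-B", "build-instrumented",
    "-DCMAKE_BUILD_TYPE=Release",
    "-DLLVM_ENABLE_PROJECTS=clang",
    "-DLLVM_BUILD_INSTRUMENTED=IR")
run("ninja", "-C", "build-instrumented", "clang")

# Training: configure a second tree with the instrumented clang and compile
# something representative (the PR trains on Clang's Sema.cpp; building the
# whole clangSema target here is a simplification).
instrumented = os.path.abspath("build-instrumented/bin/clang")
run("cmake", "-G", "Ninja", "-S", "llvm", "-B", "build-train",
    "-DCMAKE_BUILD_TYPE=Release",
    "-DLLVM_ENABLE_PROJECTS=clang",
    "-DCMAKE_C_COMPILER=" + instrumented,
    "-DCMAKE_CXX_COMPILER=" + instrumented + "++")
run("ninja", "-C", "build-train", "clangSema")

# Merge the raw profiles (written into the instrumented build's profiles/
# directory by default) and feed them into the final optimized build.
raw_profiles = glob.glob("build-instrumented/profiles/*.profraw")
run("llvm-profdata", "merge", "-o", "clang.profdata", *raw_profiles)
run("cmake", "-G", "Ninja", "-S", "llvm", "-B", "build-final",
    "-DCMAKE_BUILD_TYPE=Release",
    "-DLLVM_ENABLE_PROJECTS=clang",
    "-DLLVM_PROFDATA_FILE=" + os.path.abspath("clang.profdata"))
run("ninja", "-C", "build-final")
```

The extra training configure/build step in the middle is what "running CMake one more time" refers to.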

> I know it's a boring request, but could you evaluate the benefits of PGO alone separately, and then of adding ThinLTO? [..] Ideally, build times should be checked too.

Build times on my workstation:

1. Baseline (using the checked-in version of build_llvm_package.py): 1h 30m
2. PGO only: 1h 40m (+11%)
3. PGO + ThinLTO: 3h 44m (+149%)

Performance, using the clang produced above to do a release build of clang (metrics are best-of-two):

1. Baseline: 4m 55s
2. PGO only: 3m 49s (-22%)
3. PGO + ThinLTO: 3m 38s (-26%)

I was surprised that PGO helped so much by itself. Given that ThinLTO builds are so much slower to produce, I'm tempted to say we should just go with PGO for now. What do you think?

> Finally, the aarch64 build could benefit from that too, but that can be done later.

Yes, and hopefully we can refactor some of the code to be shared at that point.

@ms178 commented Nov 6, 2023

> > Why stop with LTO+PGO? BOLTing Clang/LLD should yield another sizeable improvement.

> My understanding was that BOLT for PE/COFF binaries isn't production ready.

@maksfb or @aaupov might have some insights to share about BOLT's PE/COFF binary support, or might be interested in a BOLT-optimized Windows release package.

@pbo-linaro (Contributor) commented

Thanks for taking the time to check the different configurations.

Using LLVM source code for the profiling data is a good solution IMHO.
For any reasonably complicated project using "most" C++ features, it should produce similar data.

The price for ThinLTO is surprisingly high (in memory usage too, I guess), and compared to PGO alone, the added value is pretty low in this case. Considering the numbers, I agree with you that we should only do PGO, as it's almost "free".

If we go with the PGO-only version, it would be nice to directly factor this out for all architectures (the do_build_libxml function provides an example).

@aaupov (Contributor) commented Nov 7, 2023

There's no BOLT support for PE/COFF, nor any plans on our end.

@zmodem (Collaborator, Author) commented Nov 7, 2023

> The price for ThinLTO is surprisingly high (in memory usage too, I guess), and compared to PGO alone, the added value is pretty low in this case. Considering the numbers, I agree with you that we should only do PGO, as it's almost "free".

Part of the high cost is that it applies not just to clang, lld, etc., but all the binaries, including the ones only used for testing.

> If we go with the PGO-only version, it would be nice to directly factor this out for all architectures (the do_build_libxml function provides an example).

Agreed, but can we do it in a follow-up when applying this to more targets? (For example, the PGO part should work for x86 too.)

@pbo-linaro (Contributor) commented

Oh yes, I missed that. So memory usage might be more of a problem when linking all the binaries with ThinLTO. Better not to enable this for now.

No problem doing it in another PR. Indeed, all the architectures (x86, x64, arm64) could benefit from PGO.

@mstorsjo (Member) commented Nov 7, 2023

A quick side question regarding PGO (while listening in on this topic): how portable is the profile information across builds? If I e.g. build Clang on Linux, then do some profiling with that, I can use that to do a new Clang build for Linux with PGO. But can I use the same profile information for PGO when I, still on Linux, cross-compile Clang to run on Windows?

@zmodem (Collaborator, Author) commented Nov 14, 2023

Apologies for the late reply; I got distracted by other things.

> how portable is the profile information across builds?

It can work across targets, but e.g. if the function names from the profile and the current build don't match because of mangling, those functions will not benefit from the profile info.

@zmodem (Collaborator, Author) commented Nov 14, 2023

I extracted the profile generation into a separate "function". It doesn't work on 32-bit x86 because the linker runs out of memory, but it would be interesting to hear whether it works for arm64!

@zmodem merged commit 060af26 into llvm:main on Nov 14, 2023 (2 of 3 checks passed)
@mstorsjo (Member) commented

> > how portable is the profile information across builds?

> It can work across targets, but e.g. if the function names from the profile and the current build don't match because of mangling, those functions will not benefit from the profile info.

Ah, thanks! That makes sense.

So between Linux and MinGW on x86_64, arm, and aarch64, the names should mostly match. (I guess the exception is cases where types are defined differently, and those types are visible in function signatures.) Cases where functions take C++ standard library types probably won't match. Linux and MSVC builds probably wouldn't match at all, as the C++ symbol name mangling is different.

One interesting case is where the mangling mostly matches, but one target has an extra symbol prefix while the other one doesn't - like Windows/i386, and macOS on all architectures. It would be neat if it were possible to apply a transformation to the profile information, to add/remove such a prefix, to make it match the other target better. CC @ZequanWu - how feasible would this be?
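(As a purely illustrative sketch of the idea, not an existing tool or a supported interface: one could round-trip the profile through llvm-profdata's text format and rewrite the names there. The parsing below, which treats the first non-comment, non-numeric line of each record as the mangled function name, is a heuristic assumption about that format.)

```python
# Hypothetical prefix rewrite for an instrumentation profile (illustration
# only). Assumes llvm-profdata can round-trip a profile through its text
# format and that each record starts with the mangled function name.
import subprocess

def add_underscore_prefix(in_profdata, out_profdata):
    text_tmp = "prof.proftext"
    rewritten_tmp = "prof.rewritten.proftext"

    # Dump the binary profile to text form.
    subprocess.check_call(["llvm-profdata", "merge", "--text",
                           "-o", text_tmp, in_profdata])

    out_lines = []
    with open(text_tmp) as f:
        for line in f:
            s = line.strip()
            looks_like_name = (s
                               and not s.startswith("#")  # annotation comments
                               and not s.startswith(":")  # e.g. the ":ir" header
                               and not s.isdigit())       # hashes / counter values
            if looks_like_name:
                # Local symbols may appear as "file.cpp:symbol"; only prefix
                # the symbol part (heuristic).
                head, sep, name = s.rpartition(":")
                out_lines.append(head + sep + "_" + name + "\n")
            else:
                out_lines.append(line)

    with open(rewritten_tmp, "w") as f:
        f.writelines(out_lines)

    # Re-merge the rewritten text profile back into binary form.
    subprocess.check_call(["llvm-profdata", "merge",
                           "-o", out_profdata, rewritten_tmp])
```

Stripping a prefix instead would be the same loop with the underscore removed rather than added.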

@zmodem (Collaborator, Author) commented Nov 15, 2023

I used the new version of the script to package the 17.0.5 binaries, so if someone wants to try the PGO-built toolchain, it's here: https://github.com/llvm/llvm-project/releases/download/llvmorg-17.0.5/LLVM-17.0.5-win64.exe

@zmodem deleted the thinlto_packaging branch on November 15, 2023
@ZequanWu (Contributor) commented

> One interesting case is where the mangling mostly matches, but one target has an extra symbol prefix while the other one doesn't - like Windows/i386, and macOS on all architectures. It would be neat if it were possible to apply a transformation to the profile information, to add/remove such a prefix, to make it match the other target better. CC @ZequanWu - how feasible would this be?

IIUC, you want to remove such a mangled-name prefix when using a profile from x64 but targeting Windows i386. I'm not sure about this; @xur-llvm might know. BTW, what's the extra symbol prefix? Is it prefixed to all symbols on Windows i386?

@mstorsjo (Member) commented

> IIUC, you want to remove such a mangled-name prefix when using a profile from x64 but targeting Windows i386.

My primary use case would be to profile on x86_64 Linux, and apply that profile when I cross-build binaries for Windows i386. (The symbol names should mostly match without other tricks when cross-building binaries for the other Windows architectures.)

The same mechanism would also help when profiling on x86_64 Windows and using the profile for a Windows i386 build, or when profiling on Linux but using the profile on macOS, or vice versa.

> I'm not sure about this; @xur-llvm might know. BTW, what's the extra symbol prefix? Is it prefixed to all symbols on Windows i386?

Almost, yes. On macOS, essentially all extern symbols have an extra underscore at the start. On Windows i386, all cdecl symbols have an extra underscore at the start as well. MSVC C++ ABI symbols don't have such an extra underscore on i386 though. But the Itanium ABI/name mangling used in MinGW should be essentially the same as on other Itanium platforms, modulo an extra leading underscore. And the same goes for Linux<->macOS.
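To make that concrete, here is how the same two symbols would look on the targets being compared, assuming standard Itanium mangling (the mangled names are standard; the snippet itself is only an illustration):

```python
# Illustration of the leading-underscore difference described above.
# MSVC C++ mangling is entirely different and is not shown.
itanium = "_Z3fooi"          # int foo(int) on Linux / MinGW x86_64 and aarch64
prefixed = "_" + itanium     # "__Z3fooi" on macOS and on MinGW i386
c_symbol = "bar"             # extern "C" bar() on Linux / MinGW x86_64
c_prefixed = "_" + c_symbol  # "_bar" on macOS and Windows i386
print(prefixed, c_prefixed)
```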
