SCons: Make lto=auto prefer ThinLTO over full LTO for LLVM targets #96785

Merged
merged 1 commit into from
Jan 9, 2025

Conversation

akien-mga
Member

@akien-mga akien-mga commented Sep 10, 2024

Edit: Changed the scope of this PR to only impact targets for which we already used LLVM's full LTO, and change those to ThinLTO, to speed up builds significantly.


This speeds up build time considerably for these platforms compared to
using lto=full, which is sadly single-threaded with LLVM, unlike GCC.

Changes to default behavior of lto=auto (i.e. production=yes):

  • Linux: Prefer ThinLTO for LLVM
  • Web: Prefer ThinLTO
  • Windows: Prefer ThinLTO for llvm-mingw

The following LLVM targets don't use LTO by default currently, which
needs to be assessed further (gains from LLVM LTO on performance need
to be weighed against the potential size increase from heavy inlining):

  • Android
  • iOS
  • macOS
  • Windows clang-cl
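
As a rough illustration of the defaults listed above, the selection could be sketched as follows. This is not the code from this PR: `resolve_lto_auto`, its parameters, and the toolchain labels are hypothetical names chosen for the example, and the fallback for GCC is an assumption rather than something stated in the description.

```python
# Hypothetical sketch of the lto=auto defaults described above.
# Names and toolchain labels are illustrative, not Godot's actual SCons code.

# LLVM targets that previously used full LTO and now prefer ThinLTO.
THIN_LTO_TARGETS = {
    ("linux", "llvm"),
    ("web", "emscripten"),
    ("windows", "llvm-mingw"),
}

# LLVM targets that do not enable LTO by default (to be assessed further).
NO_LTO_TARGETS = {
    ("android", "llvm"),
    ("ios", "llvm"),
    ("macos", "llvm"),
    ("windows", "clang-cl"),
}


def resolve_lto_auto(platform, toolchain, lto="auto", production=True):
    """Return the concrete LTO mode ("none", "thin" or "full")."""
    if lto != "auto":
        return lto  # An explicit lto=none/thin/full always wins.
    if not production:
        return "none"  # Assumption: auto only enables LTO for production builds.
    key = (platform, toolchain)
    if key in THIN_LTO_TARGETS:
        return "thin"
    if key in NO_LTO_TARGETS:
        return "none"
    if toolchain == "gcc":
        return "full"  # Assumption: GCC keeps full LTO, which is multi-threaded there.
    return "none"


if __name__ == "__main__":
    for target in [("linux", "llvm"), ("linux", "gcc"), ("web", "emscripten"), ("macos", "llvm")]:
        print(target, "->", resolve_lto_auto(*target))
```

Keeping the per-target choices in two small sets makes it easy to move a platform from "no LTO" to "thin" later if benchmarks justify it.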

Needs heavy testing and comparison of builds with and without LTO (thin/full) for the affected platforms.

We should benchmark and document once and for all the impact of LTO on build time, build size, and performance for each platform, so we can default to the optimal configuration out of the box.

@akien-mga
Member Author

akien-mga commented Sep 10, 2024

Did some tests for the Web export templates, for now evaluating only build time and build size.

Initial findings are that enabling LTO (both thin and full) increases binary size noticeably (between +2.6% and +16% for the .wasm, depending on target and LTO mode).
ThinLTO is pretty fast, while full LTO more than doubles build time.

I haven't evaluated performance for now, but all my builds are available here if someone wants to benchmark them (not sure how to do this easily on the web). They're built from this PR, which has the same base commit as 4.4.dev2 (97ef3c8), so you can use the 4.4.dev2 editor to export with those templates.

https://downloads.tuxfamily.org/godotengine/testing/4.4-dev2-lto-comparison-web.zip


Web

$ emcc -v
emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.62 (34c1aa36052b1882058f22aa1916437ba0872690)
clang version 19.0.0git (https:/github.com/llvm/llvm-project c00ada070207979f092be9046a02fcfff8b9f9ce)
$ inxi -CS
System:
  Host: fedora Kernel: 6.10.8-200.1.copr.fc40.x86_64 arch: x86_64 bits: 64
  Desktop: KDE Plasma v: 6.1.4 Distro: Fedora Linux 40 (KDE Plasma)
CPU:
  Info: 8-core model: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics bits: 64
    type: MT MCP cache: L2: 8 MiB
  Speed (MHz): avg: 717 min/max: 400/5137 cores: 1: 542 2: 706 3: 400 4: 599
    5: 874 6: 447 7: 779 8: 495 9: 447 10: 889 11: 1287 12: 681 13: 999 14: 817
    15: 400 16: 1112

template_release

lto=none

scons p=web verbose=yes target=template_release production=yes lto=none:

[Time elapsed: 00:04:43.60]

33774773 godot.web.template_release.wasm32.wasm
8032489 godot.web.template_release.wasm32.zip

lto=thin

scons p=web verbose=yes target=template_release production=yes lto=thin:

[Time elapsed: 00:05:18.45]

36016994 godot.web.template_release.wasm32.wasm
8396460 godot.web.template_release.wasm32.zip

Size impact: +6.64%
Time impact: +12.35%

lto=full

scons p=web verbose=yes target=template_release production=yes lto=full:

[Time elapsed: 00:10:47.08]

34667428 godot.web.template_release.wasm32.wasm
8261894 godot.web.template_release.wasm32.zip

Size impact: +2.64%
Time impact: +128.22%

template_debug

lto=none

scons p=web verbose=yes target=template_debug production=yes lto=none:

[Time elapsed: 00:06:10.06]

36556150 godot.web.template_debug.wasm32.wasm
8827897 godot.web.template_debug.wasm32.zip

lto=thin

scons p=web verbose=yes target=template_debug production=yes lto=thin:

[Time elapsed: 00:06:21.69]

42375870 godot.web.template_debug.wasm32.wasm
9378595 godot.web.template_debug.wasm32.zip

Size impact: +15.92%
Time impact: +3.16%

lto=full

scons p=web verbose=yes target=template_debug production=yes lto=full:

[Time elapsed: 00:12:59.24]

41470059 godot.web.template_debug.wasm32.wasm
9213546 godot.web.template_debug.wasm32.zip

Size impact: +13.44%
Time impact: +110.59%
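
For anyone who wants to re-check or extend the size figures, the percentages above can be reproduced with a small helper. This is only a sketch: `pct_change` is a name made up for this example, and the byte counts are copied from the template_release listings above.

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of `value` versus `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# .wasm sizes in bytes for the Web template_release builds listed above.
wasm_release = {"none": 33774773, "thin": 36016994, "full": 34667428}

for mode in ("thin", "full"):
    impact = pct_change(wasm_release["none"], wasm_release[mode])
    print(f"lto={mode}: {impact:+.2f}%")
# Prints:
# lto=thin: +6.64%
# lto=full: +2.64%
```

The same helper works for the .zip sizes or for comparing thin against full directly by passing the full LTO size as the baseline.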

@akien-mga
Member Author

akien-mga commented Sep 10, 2024

Tested for Android, and similarly found that LLVM LTO (both thin and full) seems to significantly increase the binary size.

Likewise, I haven't evaluated the performance impact yet. I'm also uploading my builds if someone wants to do some benchmarking (apk and zip only to minimize the total download size).

https://downloads.tuxfamily.org/godotengine/testing/4.4-dev2-lto-comparison-android.zip

Android

template_release

lto=none

scons p=android verbose=yes target=template_release production=yes lto=none

[Time elapsed: 00:03:13.65]

23052683 android_release.apk
21429208 android_source.zip
21520355 godot-lib.template_release.aar
61942576 libgodot.android.template_release.arm64.so

lto=thin

scons p=android verbose=yes target=template_release production=yes lto=thin

[Time elapsed: 00:03:58.01]

26353343 android_release.apk
24729109 android_source.zip
24821015 godot-lib.template_release.aar
73214136 libgodot.android.template_release.arm64.so

Size impact (apk): +14.32%
Size impact (aar): +15.34%
Time impact: +22.90%

lto=full

scons p=android verbose=yes target=template_release production=yes lto=full

[Time elapsed: 00:13:03.85]

26127215 android_release.apk
24517319 android_source.zip
24594889 godot-lib.template_release.aar
72010648 libgodot.android.template_release.arm64.so

Size impact (apk): +13.34%
Size impact (aar): +14.29%
Time impact: +304.78%

template_debug

lto=none

scons p=android verbose=yes target=template_debug production=yes lto=none

[Time elapsed: 00:03:24.44]

27025684 android_debug.apk
22539674 android_source.zip
22668382 godot-lib.template_debug.aar
67420976 libgodot.android.template_debug.arm64.so

lto=thin

scons p=android verbose=yes target=template_debug production=yes lto=thin

[Time elapsed: 00:04:28.97]

31233392 android_debug.apk
26403585 android_source.zip
26529454 godot-lib.template_debug.aar
84032512 libgodot.android.template_debug.arm64.so

Size impact (apk): +15.57%
Size impact (aar): +17.03%
Time impact: +31.56%

lto=full

scons p=android verbose=yes target=template_debug production=yes lto=full

[Time elapsed: 00:15:28.26]

31036164 android_debug.apk
26258954 android_source.zip
26378082 godot-lib.template_debug.aar
83105936 libgodot.android.template_debug.arm64.so

Size impact (apk): +14.84%
Size impact (aar): +16.37%
Time impact: +354.05%
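
The time impact figures can be derived directly from the `[Time elapsed: ...]` lines that SCons prints in these logs. A minimal sketch, assuming the `HH:MM:SS.ss` layout shown above (`elapsed_seconds` is a hypothetical helper name):

```python
import re

# Matches lines such as "[Time elapsed: 00:03:13.65]".
_ELAPSED = re.compile(r"\[Time elapsed: (\d+):(\d+):(\d+(?:\.\d+)?)\]")


def elapsed_seconds(line: str) -> float:
    """Convert a "[Time elapsed: HH:MM:SS.ss]" line into seconds."""
    match = _ELAPSED.search(line)
    if match is None:
        raise ValueError(f"no elapsed time found in: {line!r}")
    hours, minutes, seconds = match.groups()
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Android template_release: lto=none versus lto=full, from the logs above.
baseline = elapsed_seconds("[Time elapsed: 00:03:13.65]")
full_lto = elapsed_seconds("[Time elapsed: 00:13:03.85]")
print(f"Time impact: {(full_lto - baseline) / baseline * 100:+.2f}%")
# Prints: Time impact: +304.78%
```

Note that these are single wall-clock runs on one machine, so small differences between configurations fall within noise.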

@adamscott
Member

I thought that using -O2 instead of -Os could help for the LTO (for whatever reason), but no. It's even worse, as expected.

@akien-mga
Member Author

Did some more compilation tests for Linux, with both GCC 14.2.1 and LLVM 18.1.6.

It seems like our common argument that LTO improves not only performance but also binary size holds true for GCC builds (-9% on release template with GCC full LTO), but is totally wrong for LLVM (+14.5% for ThinLTO and +11.5% for full LTO with LLVM, with atrocious build times for the latter).

Again haven't checked performance numbers.
Linux builds (from Fedora 40, might not be compatible with older distros) available on:

https://downloads.tuxfamily.org/godotengine/testing/4.4-dev2-lto-comparison-linux.zip

Linux (GCC)

$ gcc --version
gcc (GCC) 14.2.1 20240801 (Red Hat 14.2.1-1)

template_release

lto=none

scons p=linux verbose=yes target=template_release production=yes lto=none

[Time elapsed: 00:05:03.43]

74118880 godot.linuxbsd.template_release.x86_64

lto=full

scons p=linux verbose=yes target=template_release production=yes lto=full

[Time elapsed: 00:05:02.65]

67309248 godot.linuxbsd.template_release.x86_64

Size impact: -9.19%
Time impact: -0.25%

template_debug

lto=none

scons p=linux verbose=yes target=template_debug production=yes lto=none

[Time elapsed: 00:05:02.00]

72275712 godot.linuxbsd.template_debug.x86_64

lto=full

scons p=linux verbose=yes target=template_debug production=yes lto=full

[Time elapsed: 00:05:21.60]

67694368 godot.linuxbsd.template_debug.x86_64

Size impact: -6.34%
Time impact: +6.49%

Linux (LLVM)

$ clang --version
clang version 18.1.6 (Fedora 18.1.6-3.fc40)

template_release

lto=none

scons p=linux verbose=yes target=template_release production=yes use_llvm=yes lto=none

[Time elapsed: 00:04:28.10]

68099496 godot.linuxbsd.template_release.x86_64.llvm

lto=thin

scons p=linux verbose=yes target=template_release production=yes use_llvm=yes lto=thin

[Time elapsed: 00:05:06.13]

77995208 godot.linuxbsd.template_release.x86_64.llvm

Size impact: +14.53%
Time impact: +14.18%

lto=full

scons p=linux verbose=yes target=template_release production=yes use_llvm=yes lto=full

[Time elapsed: 00:13:33.94]

75940456 godot.linuxbsd.template_release.x86_64.llvm

Size impact: +11.51%
Time impact: +203.60%

(Forgot to do template_debug builds too, but the results should be consistent with template_release and other LLVM platforms tested earlier.)

@Riteo
Contributor

Riteo commented Sep 11, 2024

After some searching online, it seems possible that the extra size comes from inlining. From what I'm reading, the whole point of LTO is inlining across compilation units, on top of more accurate dead code elimination and the like, so extra size is to be expected.

It looks like there are some extra settings that might be useful, pointed out by this link: https://discourse.llvm.org/t/clang-lld-thin-lto-footprint-and-run-time-performance-outperformed-by-gcc-ld/78997

I haven't read it fully but it seems very relevant to this PR. Probably the easiest change would be to pass -Oz instead of -Os:

One other thing worth noting is that Clang -Os isn’t as aggressive at size optimization as GCC -Os. Clang does have an -Oz option that may do more size optimization at the expense of performance.

It looks like LLVM is way more focused on performance, which might explain why even -Os seems to inline more or anyways not optimize for size as much as GCC.

The link above mentions a lot of other things we could try. It also mentions an interesting feature called "remarks", which apparently makes the compiler tell us why it hasn't optimized something: https://llvm.org/docs/Remarks.html

I have no idea how it works, but it might be insightful if there's an easy way to parse the resulting file (the wiki mentions some tools like opt-viewer, but I have no idea how they're supposed to work).
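
To give an idea of what parsing such a file could look like, here is a minimal sketch that counts remarks per kind and pass. It assumes the YAML layout described in the LLVM Remarks documentation (a `--- !Kind` header followed by a `Pass:` field), as written for example by Clang's `-fsave-optimization-record`. The sample text is made up for illustration and `tally_remarks` is a hypothetical helper, not an existing tool.

```python
from collections import Counter

# Made-up sample in the YAML remark layout; not output from a Godot build.
SAMPLE = """\
--- !Missed
Pass:            inline
Name:            NoDefinition
Function:        foo
...
--- !Passed
Pass:            inline
Name:            Inlined
Function:        bar
...
--- !Missed
Pass:            inline
Name:            TooCostly
Function:        baz
...
"""


def tally_remarks(text: str) -> Counter:
    """Count remarks per (kind, pass) without needing a YAML library."""
    counts = Counter()
    kind = None
    for line in text.splitlines():
        if line.startswith("--- !"):
            kind = line[len("--- !"):].strip()  # e.g. "Missed" or "Passed"
        elif line.startswith("Pass:") and kind is not None:
            pass_name = line.split(":", 1)[1].strip()
            counts[(kind, pass_name)] += 1
            kind = None  # Only count the first Pass field of each remark.
    return counts


for (kind, pass_name), count in sorted(tally_remarks(SAMPLE).items()):
    print(f"{kind:8} {pass_name:10} {count}")
```

Comparing the number of passed inline remarks between a build with and without LTO could help confirm whether inlining explains the size increase.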

@akien-mga if it isn't too annoying, could you also share a verbose sample of the flags passed to the compiler/linker for each build test? I think that might help to get an idea of what's already done and what we could do to improve output size.

@akien-mga akien-mga force-pushed the scons-lto-use-thinlto-llvm branch from bbe5449 to d4655b1 Compare September 11, 2024 11:22
@akien-mga akien-mga changed the title SCons: Make lto=auto enable/prefer ThinLTO for LLVM targets SCons: Make lto=auto prefer ThinLTO over full LTO for LLVM targets Sep 11, 2024
@akien-mga
Member Author

Based on findings so far, I updated this PR to only change the platforms which currently used lto=full on LLVM targets for production=yes, to now use lto=thin.

For our official builds, this means it only affects:

  • Web builds will now use thin instead of full LTO. Based on my numbers above, that does mean a size increase of +3.89% (wasm) / +1.63% (zip), but a reduction in build time by half (likely more on the official buildsystem which builds with -j64, so is even more bottlenecked by the slow single threaded linking of full LTO on LLVM).
  • Windows arm64 builds (using llvm-mingw) will now use thin instead of full LTO. This might also increase size slightly but should reduce build times very significantly (adding arm64 Windows builds with full LTO added almost 2h of build times to official builds).

To evaluate the actual gains of various LTO configurations, and see if we should start using LTO for Android/macOS/Windows clang-cl (and maybe iOS, but this caused slow linking in Xcode; could maybe be re-assessed with ThinLTO), I'll open a new issue where I'll share my metrics again (and @Riteo can add the research they wrote here).

@bruvzg
Member

bruvzg commented Sep 11, 2024

I have not tested LTO on current master for macOS/iOS (will do in a few hours), but last time I did, the pattern was the same as for other Clang builds: a size increase for both thin and full LTO, and a huge time difference (and memory usage for full LTO).

@akien-mga
Member Author

I opened #96851 to continue the in-depth review of the different configuration options for each target.

This PR in the meantime just switches LLVM full LTO to thin LTO for the targets that currently use LTO.

Member

@Calinou Calinou left a comment

Makes sense to me. The time and memory usage required for full LTO can be prohibitive for casual builders who just use production=yes to create their own binaries, so ThinLTO is a better default.

@akien-mga akien-mga force-pushed the scons-lto-use-thinlto-llvm branch from c814952 to 26db0bb Compare January 9, 2025 12:04
@akien-mga akien-mga requested a review from a team as a code owner January 9, 2025 12:04
@akien-mga akien-mga merged commit fcc9e3a into godotengine:master Jan 9, 2025
20 checks passed
@akien-mga akien-mga deleted the scons-lto-use-thinlto-llvm branch January 9, 2025 12:53