SCons: Make `lto=auto` prefer ThinLTO over full LTO for LLVM targets #96785

akien-mga · 2024-09-10T10:33:00Z

Edit: Changed the scope of this PR to only impact targets for which we already used LLVM's full LTO, and change those to ThinLTO, to speed up builds significantly.

This speeds up build time considerably for these platforms compared to
using lto=full, which is sadly single-threaded with LLVM, unlike GCC.

Changes to default behavior of lto=auto (i.e. production=yes):

Linux: Prefer ThinLTO for LLVM
Web: Prefer ThinLTO
Windows: Prefer ThinLTO for llvm-mingw

The following LLVM targets don't use LTO by default currently, which
needs to be assessed further (gains from LLVM LTO on performance need
to be weighed against the potential size increase from heavy inlining):

Android
iOS
macOS
Windows clang-cl
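
The defaults listed above can be sketched as a small lookup. This is only an illustration of the described behavior, not Godot's actual SCons code: the function name `resolve_auto_lto`, the toolchain labels, and the `production=no` fallback are all assumptions made for the example.

```python
# Hypothetical sketch of the lto=auto defaults described in this PR.
# Names and toolchain labels are illustrative, not taken from the Godot source.

# LLVM targets that previously defaulted to full LTO and now prefer ThinLTO.
THIN_LTO_TARGETS = {
    ("linux", "llvm"),
    ("web", "emscripten"),
    ("windows", "llvm-mingw"),
}

# LLVM targets that do not use LTO by default (pending further assessment).
NO_LTO_TARGETS = {
    ("android", "llvm"),
    ("ios", "llvm"),
    ("macos", "llvm"),
    ("windows", "clang-cl"),
}


def resolve_auto_lto(platform: str, toolchain: str, production: bool) -> str:
    """Return the LTO mode lto=auto would select for a described target."""
    if not production:
        # Assumption: lto=auto only enables LTO when production=yes.
        return "none"
    key = (platform, toolchain)
    if key in THIN_LTO_TARGETS:
        return "thin"
    if key in NO_LTO_TARGETS:
        return "none"
    # Combinations such as GCC builds are not covered by the lists above.
    raise ValueError(f"default not described for {key}")


print(resolve_auto_lto("web", "emscripten", production=True))
print(resolve_auto_lto("android", "llvm", production=True))
```

Raising on unknown combinations keeps the sketch from guessing at defaults the description does not state.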

Needs heavy testing and comparison of builds with and without LTO (thin/full) for the affected platforms.

We should benchmark and document once and for all the impact of LTO on build time, build size, and performance for each platform, so we can default to the optimal configuration out of the box.

akien-mga · 2024-09-10T14:01:05Z

Did some tests for the Web export templates, for now evaluating only build time and build size.

Initial findings are that enabling LTO (both thin and full) significantly increases binary size.
ThinLTO is pretty fast, while full LTO more than doubles build time.

I haven't evaluated performance for now, but all my builds are available here if someone wants to benchmark them (not sure how to do this easily on the web). They're built from this PR, which has the same base commit as 4.4.dev2 (97ef3c8), so you can use the 4.4.dev2 editor to export with those templates.

https://downloads.tuxfamily.org/godotengine/testing/4.4-dev2-lto-comparison-web.zip

Web

$ emcc -v
emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.62 (34c1aa36052b1882058f22aa1916437ba0872690)
clang version 19.0.0git (https:/github.com/llvm/llvm-project c00ada070207979f092be9046a02fcfff8b9f9ce)

$ inxi -CS
System:
  Host: fedora Kernel: 6.10.8-200.1.copr.fc40.x86_64 arch: x86_64 bits: 64
  Desktop: KDE Plasma v: 6.1.4 Distro: Fedora Linux 40 (KDE Plasma)
CPU:
  Info: 8-core model: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics bits: 64
    type: MT MCP cache: L2: 8 MiB
  Speed (MHz): avg: 717 min/max: 400/5137 cores: 1: 542 2: 706 3: 400 4: 599
    5: 874 6: 447 7: 779 8: 495 9: 447 10: 889 11: 1287 12: 681 13: 999 14: 817
    15: 400 16: 1112

template_release

`lto=none`

scons p=web verbose=yes target=template_release production=yes lto=none:

[Time elapsed: 00:04:43.60]

33774773 godot.web.template_release.wasm32.wasm
8032489 godot.web.template_release.wasm32.zip

`lto=thin`

scons p=web verbose=yes target=template_release production=yes lto=thin:

[Time elapsed: 00:05:18.45]

36016994 godot.web.template_release.wasm32.wasm
8396460 godot.web.template_release.wasm32.zip

Size impact: +6.64%
Time impact: +12.35%

`lto=full`

scons p=web verbose=yes target=template_release production=yes lto=full:

[Time elapsed: 00:10:47.08]

34667428 godot.web.template_release.wasm32.wasm
8261894 godot.web.template_release.wasm32.zip

Size impact: +2.64%
Time impact: +128.22%
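
For reference, the "Size impact" figures are the relative change against the `lto=none` build. A minimal sketch using the template_release `.wasm` sizes listed above (the helper name `size_impact` is just for illustration):

```python
def size_impact(baseline_bytes: int, new_bytes: int) -> float:
    """Relative size change in percent, compared to the baseline build."""
    return (new_bytes - baseline_bytes) / baseline_bytes * 100.0


# godot.web.template_release.wasm32.wasm sizes from the builds above.
wasm_sizes = {"none": 33774773, "thin": 36016994, "full": 34667428}

for mode in ("thin", "full"):
    impact = size_impact(wasm_sizes["none"], wasm_sizes[mode])
    print(f"lto={mode}: {impact:+.2f}%")
# lto=thin: +6.64%
# lto=full: +2.64%
```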

template_debug

`lto=none`

scons p=web verbose=yes target=template_debug production=yes lto=none:

[Time elapsed: 00:06:10.06]

36556150 godot.web.template_debug.wasm32.wasm
8827897 godot.web.template_debug.wasm32.zip

`lto=thin`

scons p=web verbose=yes target=template_debug production=yes lto=thin:

[Time elapsed: 00:06:21.69]

42375870 godot.web.template_debug.wasm32.wasm
9378595 godot.web.template_debug.wasm32.zip

Size impact: +15.92%
Time impact: +3.16%

`lto=full`

scons p=web verbose=yes target=template_debug production=yes lto=full:

[Time elapsed: 00:12:59.24]

41470059 godot.web.template_debug.wasm32.wasm
9213546 godot.web.template_debug.wasm32.zip

Size impact: +13.44%
Time impact: +110.59%

akien-mga · 2024-09-10T15:44:27Z

Tested for Android, and similarly found that LLVM LTO (both thin and full) seems to significantly increase the binary size.

Likewise, I haven't evaluated the performance impact yet. I'm also uploading my builds if someone wants to do some benchmarking (apk and zip only to minimize the total download size).

https://downloads.tuxfamily.org/godotengine/testing/4.4-dev2-lto-comparison-android.zip

Android

template_release

`lto=none`

scons p=android verbose=yes target=template_release production=yes lto=none

[Time elapsed: 00:03:13.65]

23052683 android_release.apk
21429208 android_source.zip
21520355 godot-lib.template_release.aar
61942576 libgodot.android.template_release.arm64.so

`lto=thin`

scons p=android verbose=yes target=template_release production=yes lto=thin

[Time elapsed: 00:03:58.01]

26353343 Sep 10 16:22 android_release.apk
24729109 Sep 10 16:22 android_source.zip
24821015 Sep 10 16:22 godot-lib.template_release.aar
73214136 Sep 10 16:22 libgodot.android.template_release.arm64.so

Size impact (apk): +14.32%
Size impact (aar): +15.34%
Time impact: +22.90%

`lto=full`

scons p=android verbose=yes target=template_release production=yes lto=full

[Time elapsed: 00:13:03.85]

26127215 Sep 10 16:50 android_release.apk
24517319 Sep 10 16:50 android_source.zip
24594889 Sep 10 16:50 godot-lib.template_release.aar
72010648 Sep 10 16:51 libgodot.android.template_release.arm64.so

Size impact (apk): +13.34%
Size impact (aar): +14.29%
Time impact: +304.78%

template_debug

`lto=none`

scons p=android verbose=yes target=template_debug production=yes lto=none

[Time elapsed: 00:03:24.44]

27025684 android_debug.apk
22539674 android_source.zip
22668382 godot-lib.template_debug.aar
67420976 libgodot.android.template_debug.arm64.so

`lto=thin`

scons p=android verbose=yes target=template_debug production=yes lto=thin

[Time elapsed: 00:04:28.97]

31233392 android_debug.apk
26403585 android_source.zip
26529454 godot-lib.template_debug.aar
84032512 libgodot.android.template_debug.arm64.so

Size impact (apk): +15.57%
Size impact (aar): +17.03%
Time impact: +31.56%

`lto=full`

scons p=android verbose=yes target=template_debug production=yes lto=full

[Time elapsed: 00:15:28.26]

31036164 android_debug.apk
26258954 android_source.zip
26378082 godot-lib.template_debug.aar
83105936 libgodot.android.template_debug.arm64.so

Size impact (apk): +14.84%
Size impact (aar): +16.37%
Time impact: +354.05%
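
The "Time impact" figures can be reproduced from the `[Time elapsed: ...]` lines printed by SCons. A small sketch, assuming the `HH:MM:SS.ss` format shown above (the helper names are illustrative), applied to the Android template_debug timings:

```python
import re


def parse_elapsed(line: str) -> float:
    """Convert a '[Time elapsed: HH:MM:SS.ss]' line to seconds."""
    match = re.search(r"(\d+):(\d+):(\d+(?:\.\d+)?)", line)
    if match is None:
        raise ValueError(f"no elapsed time found in {line!r}")
    hours, minutes, seconds = match.groups()
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def time_impact(baseline_s: float, new_s: float) -> float:
    """Relative build time change in percent, compared to the baseline."""
    return (new_s - baseline_s) / baseline_s * 100.0


# Android template_debug timings from the builds above.
none_s = parse_elapsed("[Time elapsed: 00:03:24.44]")
thin_s = parse_elapsed("[Time elapsed: 00:04:28.97]")
full_s = parse_elapsed("[Time elapsed: 00:15:28.26]")

print(f"lto=thin: {time_impact(none_s, thin_s):+.2f}%")
print(f"lto=full: {time_impact(none_s, full_s):+.2f}%")
# lto=thin: +31.56%
# lto=full: +354.05%
```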

adamscott · 2024-09-10T20:01:20Z

I thought that using -O2 instead of -Os could help for the LTO (for whatever reason), but no. It's even worse, as expected.

akien-mga · 2024-09-10T20:58:01Z

Did some more compilation tests for Linux, with both GCC 14.2.1 and LLVM 18.1.6.

It seems like our common argument that LTO improves not only performance but also binary size holds true for GCC builds (-9% on release template with GCC full LTO), but is totally wrong for LLVM (+14.5% for ThinLTO and +11.5% for full LTO with LLVM, with atrocious build times for the latter).

Again haven't checked performance numbers.
Linux builds (from Fedora 40, might not be compatible with older distros) available on:

https://downloads.tuxfamily.org/godotengine/testing/4.4-dev2-lto-comparison-linux.zip

Linux (GCC)

$ gcc --version
gcc (GCC) 14.2.1 20240801 (Red Hat 14.2.1-1)

template_release

`lto=none`

scons p=linux verbose=yes target=template_release production=yes lto=none

[Time elapsed: 00:05:03.43]

74118880 godot.linuxbsd.template_release.x86_64

`lto=full`

scons p=linux verbose=yes target=template_release production=yes lto=full

[Time elapsed: 00:05:02.65]

67309248 godot.linuxbsd.template_release.x86_64

Size impact: -9.19%
Time impact: -0.25%

template_debug

`lto=none`

scons p=linux verbose=yes target=template_debug production=yes lto=none

[Time elapsed: 00:05:02.00]

72275712 godot.linuxbsd.template_debug.x86_64

`lto=full`

scons p=linux verbose=yes target=template_debug production=yes lto=full

[Time elapsed: 00:05:21.60]

67694368 godot.linuxbsd.template_debug.x86_64

Size impact: -6.34%
Time impact: +6.49%

Linux (LLVM)

$ clang --version
clang version 18.1.6 (Fedora 18.1.6-3.fc40)

template_release

`lto=none`

scons p=linux verbose=yes target=template_release production=yes use_llvm=yes lto=none

[Time elapsed: 00:04:28.10]

68099496 godot.linuxbsd.template_release.x86_64.llvm

`lto=thin`

scons p=linux verbose=yes target=template_release production=yes use_llvm=yes lto=thin

[Time elapsed: 00:05:06.13]

77995208 godot.linuxbsd.template_release.x86_64.llvm

Size impact: +14.53%
Time impact: +14.18%

`lto=full`

scons p=linux verbose=yes target=template_release production=yes use_llvm=yes lto=full

[Time elapsed: 00:13:33.94]

75940456 godot.linuxbsd.template_release.x86_64.llvm

Size impact: +11.51%
Time impact: +203.60%

(Forgot to do template_debug builds too, but the results should be consistent with template_release and other LLVM platforms tested earlier.)

Riteo · 2024-09-11T09:24:19Z

After searching random things online, it's possible that the extra size is due to inlining. From what I'm reading, the whole point of LTO is inlining between compilation units, besides more accurate dead code elimination and stuff like that, so extra size is to be expected.

It looks like there are some extra settings that might be useful, pointed out by this link: https://discourse.llvm.org/t/clang-lld-thin-lto-footprint-and-run-time-performance-outperformed-by-gcc-ld/78997

I haven't read it fully but it seems very relevant to this PR. Probably the easiest change would be to pass -Oz instead of -Os:

One other thing worth noting is that Clang -Os isn’t as aggressive at size optimization as GCC -Os. Clang does have an -Oz option that may do more size optimization at the expense of performance.

It looks like LLVM is way more focused on performance, which might explain why even -Os seems to inline more, or at least not optimize for size as much as GCC does.

The link above mentions a lot of other things we could try. It also mentions an interesting feature called "remarks", which apparently makes the compiler tell us why it hasn't optimized something: https://llvm.org/docs/Remarks.html

I have no idea how it works, but it might be insightful if there's an easy way to parse the resulting file (the docs mention some tools like opt-viewer, but I have no idea how they're supposed to work).
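
As a rough idea of what parsing such a file could look like: LLVM's YAML remarks are a stream of documents that each start with a type tag such as `--- !Missed` followed by a `Pass:` field. The sketch below only tallies remarks by type and pass using plain string handling. The sample text is hand-written for illustration and is not output from a Godot build, and the helper name `count_remarks` is made up. Clang's `-fsave-optimization-record` is one way to produce such a file; how that interacts with the LTO link step is not covered here.

```python
from collections import Counter


def count_remarks(yaml_text: str) -> Counter:
    """Tally remarks by (type, pass) from LLVM YAML remark text."""
    counts = Counter()
    kind = None
    for line in yaml_text.splitlines():
        if line.startswith("--- !"):
            # Start of a new remark document, e.g. '--- !Missed'.
            kind = line[len("--- !"):].strip()
        elif kind is not None and line.startswith("Pass:"):
            pass_name = line.split(":", 1)[1].strip()
            counts[(kind, pass_name)] += 1
            kind = None  # Only count the first Pass field of each remark.
    return counts


# Hand-written sample in the documented layout (illustrative only).
SAMPLE = """\
--- !Missed
Pass:            inline
Name:            NoDefinition
Function:        foo
...
--- !Passed
Pass:            inline
Name:            Inlined
Function:        bar
...
--- !Missed
Pass:            inline
Name:            TooCostly
Function:        baz
...
"""

for (kind, pass_name), n in sorted(count_remarks(SAMPLE).items()):
    print(f"{kind:8s} {pass_name:10s} {n}")
```

Counting missed versus passed `inline` remarks would be one way to check the inlining hypothesis, assuming the remarks file covers the LTO stage.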

@akien-mga if it isn't too annoying, could you also post a verbose sample of the flags passed to the compiler/linker for each build test? I think that might help with getting an idea of what's already done and what we could do to improve output size.

akien-mga · 2024-09-11T11:30:10Z

Based on findings so far, I updated this PR to only change the platforms which currently used lto=full on LLVM targets for production=yes, to now use lto=thin.

For our official builds, this means it only affects:

Web builds will now use thin instead of full LTO. Based on my numbers above, that does mean a size increase of +3.89% (wasm) / +1.63% (zip), but a reduction in build time by half (likely more on the official buildsystem which builds with -j64, so is even more bottlenecked by the slow single threaded linking of full LTO on LLVM).
Windows arm64 builds (using llvm-mingw) will now use thin instead of full LTO. This might also increase size slightly but should reduce build times very significantly (adding arm64 Windows builds with full LTO added almost 2h of build times to official builds).
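
The thin-versus-full Web figures quoted above follow from the template_release measurements posted earlier in this thread. A quick check, with `relative_change` as an illustrative helper name:

```python
def relative_change(reference: float, value: float) -> float:
    """Change of value relative to reference, in percent."""
    return (value - reference) / reference * 100.0


# Web template_release measurements: full LTO is the reference here.
full = {"wasm": 34667428, "zip": 8261894, "seconds": 10 * 60 + 47.08}
thin = {"wasm": 36016994, "zip": 8396460, "seconds": 5 * 60 + 18.45}

print(f"wasm: {relative_change(full['wasm'], thin['wasm']):+.2f}%")
print(f"zip:  {relative_change(full['zip'], thin['zip']):+.2f}%")
print(f"time: {thin['seconds'] / full['seconds']:.2f}x of full LTO")
# wasm: +3.89%
# zip:  +1.63%
# time: 0.49x of full LTO
```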

To evaluate the actual gains of various LTO configurations, and to see whether we should start using LTO for Android/macOS/Windows clang-cl (and maybe iOS, but this caused slow linking in Xcode and could maybe be re-assessed with ThinLTO), I'll open a new issue where I'll share my metrics again (and @Riteo can add the research they wrote here).

bruvzg · 2024-09-11T12:02:49Z

I have not tested LTO on current master for macOS/iOS (will do in a few hours), but last time I did, the pattern was the same as for other clang builds: size increase for both thin and full LTO, and a huge difference in build time (and memory usage) for full LTO.

akien-mga · 2024-09-11T12:30:45Z

I opened #96851 to continue the in-depth review of the different configuration options for each target.

This PR in the meantime just switches LLVM full LTO to thin LTO for the targets that currently use LTO.

Calinou

Makes sense to me. The time and memory usage required for full LTO can be prohibitive for casual builders who just use production=yes to create their own binaries, so ThinLTO is a better default.

akien-mga added enhancement platform:windows platform:linuxbsd platform:web platform:android platform:macos topic:buildsystem needs testing performance labels Sep 10, 2024

akien-mga added this to the 4.4 milestone Sep 10, 2024

akien-mga requested review from a team as code owners September 10, 2024 10:33

akien-mga marked this pull request as draft September 10, 2024 10:33

akien-mga requested a review from bruvzg September 10, 2024 10:33

Repiteo mentioned this pull request Sep 10, 2024

SCons: Fix clang-cl link/ar flags #96813

Merged

akien-mga force-pushed the scons-lto-use-thinlto-llvm branch from bbe5449 to d4655b1 Compare September 11, 2024 11:22

akien-mga changed the title ~~SCons: Make lto=auto enable/prefer ThinLTO for LLVM targets~~ SCons: Make lto=auto prefer ThinLTO over full LTO for LLVM targets Sep 11, 2024

akien-mga marked this pull request as ready for review September 11, 2024 11:30

akien-mga force-pushed the scons-lto-use-thinlto-llvm branch from d4655b1 to c814952 Compare September 11, 2024 11:39

akien-mga mentioned this pull request Sep 11, 2024

Evaluation of LTO configuration for all targets, and its impact on build time, build size, and performance #96851

Open

Calinou approved these changes Oct 21, 2024

akien-mga force-pushed the scons-lto-use-thinlto-llvm branch from c814952 to 26db0bb Compare January 9, 2025 12:04

akien-mga requested a review from a team as a code owner January 9, 2025 12:04

akien-mga merged commit fcc9e3a into godotengine:master Jan 9, 2025
20 checks passed

akien-mga deleted the scons-lto-use-thinlto-llvm branch January 9, 2025 12:53

AThousandShips removed the needs testing label Jan 9, 2025

akien-mga mentioned this pull request Jan 9, 2025

Inline String::utf8 and String::utf16 for their simplicity. #101356

Open
