Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: use more aggressive LTO on gcc #56138

Closed
wants to merge 3 commits into from

Conversation

anonrig
Copy link
Member

@anonrig anonrig commented Dec 5, 2024

Please validate locally before landing this pull-request. This is an area that I'm not familiar with.

Using more aggressive LTO, the startup time of Node.js can be reduced from 38ms to 28ms.

Before

➜  node git:(main) ✗ hyperfine 'out/Release/node index.js' --warmup 30
Benchmark 1: out/Release/node index.js
  Time (mean ± σ):      38.7 ms ±   0.8 ms    [User: 27.1 ms, System: 8.1 ms]
  Range (min … max):    37.9 ms …  42.2 ms    70 runs

After

➜  node git:(main) ✗ hyperfine 'out/Release/node index.js' --warmup 30
Benchmark 1: out/Release/node index.js
  Time (mean ± σ):      32.0 ms ±   2.7 ms    [User: 26.7 ms, System: 5.3 ms]
  Range (min … max):    28.5 ms …  42.3 ms    53 runs

@nodejs-github-bot
Copy link
Collaborator

Review requested:

  • @nodejs/gyp

@nodejs-github-bot nodejs-github-bot added build Issues and PRs related to build files or the CI. needs-ci PRs that need a full CI run. labels Dec 5, 2024
@anonrig
Copy link
Member Author

anonrig commented Dec 5, 2024

cc @nodejs/build @nodejs/performance please take a look

@anonrig anonrig added request-ci Add this label to start a Jenkins CI on a PR. and removed request-ci Add this label to start a Jenkins CI on a PR. labels Dec 5, 2024
Copy link
Member

@juanarbol juanarbol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.

Just as a question, do we distribute binaries built w/ gcc?

@anonrig
Copy link
Member Author

anonrig commented Dec 5, 2024

Just as a question, do we distribute binaries built w/ gcc?

Yes, I think so, but @nodejs/build can answer this better than me.

Copy link
Member

@RafaelGSS RafaelGSS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you measured it in different environments?

@anonrig
Copy link
Member Author

anonrig commented Dec 5, 2024

Have you measured it in different environments?

Only macOS for now. I can measure it on Linux tomorrow, if someone doesn't beat me to it. Please run and share outputs from your machine regardless. Let's make sure the numbers are correct.

@juanarbol
Copy link
Member

Only macOS for now. I can measure it on Linux tomorrow, if someone doesn't beat me to it. Please run and share outputs from your machine regardless. Let's make sure the numbers are correct.

I can test linux in a couple hours

@joyeecheung
Copy link
Member

Only macOS for now.

You build with gcc on macOS? 🧐

@targos targos removed their request for review December 5, 2024 08:16
@targos
Copy link
Member

targos commented Dec 5, 2024

I know nothing about LTO

@@ -13,6 +13,7 @@
'enable_pgo_generate%': '0',
'enable_pgo_use%': '0',
'python%': 'python',
'enable_lto': 'true',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this bypasses / overrides the --enable-lto configure flag?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, yeah this should just flip the flag in configure.py, at least flip it to true on known safe platforms.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think this gets overridden by the config.gypi generated by ./configure, so it's not actually enabling LTO

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried enabling it with --enable-lto and it in fact took longer than the current build, but it didn't take forever to link mkdsnapshot. It took longer but it finished. Am I missing something?

Copy link
Member

@joyeecheung joyeecheung Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be that you Mac is beefier than mine/your lld is newer than mine/you just have more patience than I do ;) (I think I waited for ~20 minutes before I gave up)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably. I have the new mac mini with the highest spec.

@targos
Copy link
Member

targos commented Dec 5, 2024

@joyeecheung The improvements on macOS probably come from 'enable_lto': 'true', not the gcc config changes.

@ptr1337
Copy link

ptr1337 commented Dec 5, 2024

If -ffunctions and -fdata-sections is passed, should be not also --gc-sections passed to the linker?

@joyeecheung
Copy link
Member

joyeecheung commented Dec 5, 2024

So I tried building with configure --ninja --enable-lto, but V8's own mksnapshot never finishes linking....trying to debug that stuck process shows that it's hard at work actuallhy doing LTO, but in the CI of this PR, it seems to just finish everything on macOS in 42m, which makes me wonder if this PR actually manages to enable LTO..

Which now reminds me, we need to disable ICF for both node_mksnapshot and v8's mksnapshot, because otherwise the external references may not match after functions are folded and the program may crash. Which means porting this

node/deps/v8/BUILD.gn

Lines 7351 to 7366 in 5ab3140

# This config disables a link time optimization "ICF", which may merge
# different functions into one if the function signature and body of them are
# identical.
#
# ICF breaks 1:1 mappings of the external references for V8 snapshot, so we
# disable it while taking a V8 snapshot.
config("disable_icf") {
visibility = [ ":*" ] # Only targets in this file can depend on this.
if (is_win) {
ldflags = [ "/OPT:NOICF" ] # link.exe, but also lld-link.exe.
} else if (is_apple && !use_lld) {
ldflags = [ "-Wl,-no_deduplicate" ] # ld64.
} else if (use_lld) {
ldflags = [ "-Wl,--icf=none" ]
}
}

To the gyp configs.

(IIRC windows enables LTO + ICF and it seems to have no problem, but maybe I am reading it wrong, but anyway it can be risky not to disable it)

@sxa
Copy link
Member

sxa commented Dec 5, 2024

Just as a question, do we distribute binaries built w/ gcc?

Yes :-)

@joyeecheung
Copy link
Member

joyeecheung commented Dec 5, 2024

I tried building this PR locally on macOS, and AFAICT, it doesn't enable LTO.

❯ hyperfine "./node-c4aa34aa test/fixtures/empty.js" "./node-lto  test/fixtures/empty.js" --warmup 30
Benchmark 1: ./node-c4aa34aa test/fixtures/empty.js
  Time (mean ± σ):      32.4 ms ±   2.1 ms    [User: 29.4 ms, System: 2.3 ms]
  Range (min … max):    31.4 ms …  47.4 ms    87 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./node-lto  test/fixtures/empty.js
  Time (mean ± σ):      33.6 ms ±   2.3 ms    [User: 29.5 ms, System: 2.4 ms]
  Range (min … max):    32.4 ms …  52.3 ms    84 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  './node-c4aa34aa test/fixtures/empty.js' ran
    1.04 ± 0.10 times faster than './node-lto  test/fixtures/empty.js'

And the config.gypi generated by configure is the same as the main branch - enable_lto is still false in there, which overrides common.gypi, and as a result the command used to link the node executable is the same as the main branch.

ccache clang++ -Wl,-force_load,libnode.a -Wl,-force_load,libv8_base_without_compiler.a -Wl,-force_load,libzlib.a -Wl,-force_load,libuv.a -Wl,-force_load,libv8_snapshot.a -Wl,-force_load,libopenssl.a -Wl,-search_paths_first -mmacosx-version-min=11.0 -arch arm64 -L./ -stdlib=libc++ -o node obj/gen/node.node_snapshot.o obj/src/node.node_main.o libhistogram.a libnode.a libv8_snapshot.a libv8_libplatform.a libicui18n.a libzlib.a libllhttp.a libcares.a libuv.a libuvwasi.a libnghttp2.a libada.a libsimdjson.a libsimdutf.a libbrotli.a libsqlite.a libopenssl.a libngtcp2.a libnghttp3.a libnbytes.a libncrypto.a libicuucx.a libicudata.a libv8_base_without_compiler.a libv8_libbase.a libv8_abseil.a libv8_zlib.a libv8_compiler.a libv8_turboshaft.a libv8_initializers.a libv8_initializers_slow.a libzlib_inflate_chunk_simd.a libzlib_adler32_simd.a libzlib_arm_crc32.a  -framework CoreFoundation -lm

Whereas when I actually build the main branch with ./configure --ninja --enable-lto, it gets stuck forever when linking V8's mksnapshot executable doing a lot of LTO work. That command looks like this (from ps output, which I attached a debugger to and confirmed that it's actually doing LTO...)

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld -demangle -lto_library /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/libLTO.dylib -dynamic -arch arm64 -platform_version macos 11.0.0 13.0 -syslibroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk -o mksnapshot -L./ -L/usr/local/lib -search_paths_first obj/deps/v8/src/snapshot/embedded/mksnapshot.embedded-empty.o obj/deps/v8/src/snapshot/embedded/mksnapshot.embedded-file-writer.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-aix.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-base.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-generic.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-mac.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-win.o obj/deps/v8/src/snapshot/embedded/mksnapshot.platform-embedded-file-writer-zos.o obj/deps/v8/src/snapshot/mksnapshot.mksnapshot.o obj/deps/v8/src/snapshot/mksnapshot.snapshot-empty.o obj/deps/v8/src/snapshot/mksnapshot.static-roots-gen.o libv8_base_without_compiler.a libv8_init.a libv8_libbase.a libv8_libplatform.a libv8_turboshaft.a libv8_abseil.a libicui18n.a libicuucx.a libicudata.a libv8_zlib.a libv8_compiler.a libv8_initializers.a libv8_initializers_slow.a -lc++ -lSystem /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/14.0.0/lib/darwin/libclang_rt.osx.a

@richardlau
Copy link
Member

richardlau commented Dec 5, 2024

Just as a question, do we distribute binaries built w/ gcc?

https://github.com/nodejs/node/blob/main/BUILDING.md#official-binary-platforms-and-toolchains

node/BUILDING.md

Lines 158 to 190 in c4aa34a

### Official binary platforms and toolchains
Binaries at <https://nodejs.org/download/release/> are produced on:
| Binary package | Platform and Toolchain |
| ----------------------- | ------------------------------------------------------------------------------------------------------------- |
| aix-ppc64 | AIX 7.2 TL04 on PPC64BE with GCC 12[^5] |
| darwin-x64 | macOS 11, Xcode 13 with -mmacosx-version-min=11.0 |
| darwin-arm64 (and .pkg) | macOS 11 (arm64), Xcode 13 with -mmacosx-version-min=11.0 |
| linux-arm64 | RHEL 8 with gcc-toolset-12[^6] |
| linux-armv7l | Cross-compiled on RHEL 9 x64 with a [custom GCC toolchain](https://github.com/rvagg/rpi-newer-crosstools)[^7] |
| linux-ppc64le | RHEL 8 with gcc-toolset-12[^6] |
| linux-s390x | RHEL 8 with gcc-toolset-12[^6] |
| linux-x64 | RHEL 8 with gcc-toolset-12[^6] |
| win-arm64 | Windows Server 2022 (x64) with Visual Studio 2022 |
| win-x64 | Windows Server 2022 (x64) with Visual Studio 2022 |
<!--lint disable final-definition-->
[^5]: Binaries produced on these systems require libstdc++12, available
from the [AIX toolbox][].
[^6]: Binaries produced on these systems are compatible with glibc >= 2.28
and libstdc++ >= 6.0.25 (`GLIBCXX_3.4.25`). These are available on
distributions natively supporting GCC 8.1 or higher, such as Debian 10,
RHEL 8 and Ubuntu 20.04.
[^7]: Binaries produced on these systems are compatible with glibc >= 2.28
and libstdc++ >= 6.0.30 (`GLIBCXX_3.4.30`). These are available on
distributions natively supporting GCC 12.1 or higher, such as Debian 12,
Ubuntu 22.04.
<!--lint enable final-definition-->

@richardlau
Copy link
Member

richardlau commented Dec 5, 2024

Re. LTO, we currently do not have that enabled by default on the official binaries. Anecdotally, I understand that downstream Fedora/RHEL Node.js had to turn off LTO due to the increased memory requirements/build times with it enabled.

There's also #49063, which may help.

@joyeecheung
Copy link
Member

So I gave it a bit more patience and managed to build the main branch with ./configure --ninja --enable-lto after waiting for 31 minutes since linking of V8's mksnapshot started. In comparison the same steps configured without lto takes just 5s.

Locally it doesn't look like LTO makes much of a difference on macOS despite waiting for more than half of an hour to build...

❯ hyperfine "./node-c4aa34aa test/fixtures/empty.js" "./node-lto-c4aa34aa test/fixtures/empty.js" --warmup 30
Benchmark 1: ./node-c4aa34aa test/fixtures/empty.js
  Time (mean ± σ):      32.8 ms ±   3.2 ms    [User: 29.5 ms, System: 2.5 ms]
  Range (min … max):    31.6 ms …  52.1 ms    87 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: ./node-lto-c4aa34aa test/fixtures/empty.js
  Time (mean ± σ):      31.3 ms ±   2.9 ms    [User: 27.1 ms, System: 2.4 ms]
  Range (min … max):    30.1 ms …  49.7 ms    91 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  './node-lto-c4aa34aa test/fixtures/empty.js' ran
    1.05 ± 0.14 times faster than './node-c4aa34aa test/fixtures/empty.js'

Also to my surprise, binary size is bigger after LTO?!

❯ ls -lah ./node-lto-c4aa34aa
-rwxr-xr-x  1 joyee  staff   122M Dec  5 16:14 ./node-lto-c4aa34aa
❯ ls -lah ./node-c4aa34aa
-rwxr-xr-x  1 joyee  staff   118M Dec  5 12:39 ./node-c4aa34aa

@anonrig
Copy link
Member Author

anonrig commented Dec 5, 2024

I tried enabling PGO as well, and it didn't make a difference at all. Maybe we are missing something @joyeecheung

@anonrig anonrig marked this pull request as draft December 5, 2024 15:41
@anonrig
Copy link
Member Author

anonrig commented Dec 5, 2024

If -ffunctions and -fdata-sections is passed, should be not also --gc-sections passed to the linker?

@ptr1337 I found some examples of --gc-sections not optimizing correctly. https://stackoverflow.com/questions/31521326/gc-sections-discards-used-data Is this not true in 2024/2025?

@anonrig anonrig closed this Dec 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Issues and PRs related to build files or the CI. needs-ci PRs that need a full CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants