-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rustc: Implement ThinLTO #44841
rustc: Implement ThinLTO #44841
Conversation
r? @arielb1 (rust_highfive has picked a reviewer for you, use r? to override) |
Note that the first commit here is from #44783, so shouldn't need extra review, the thin support should all be in the second. Example output of the modified |
if codegen_unit_name.contains(NUMBERED_CODEGEN_UNIT_MARKER) { | ||
// If we use the numbered naming scheme for modules, we don't want | ||
// the files to look like <crate-name><extra>.<crate-name>.<index>.<ext> | ||
// but simply <crate-name><extra>.<index>.<ext> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michaelwoerister mind helping me understand what was going on here? For ThinLTO we need to make sure that all the objects and such have unique names, which this comment seems to indicate we will have achieved. (although in practice I didn't see crate name hashes and such in those names)
With this name munging left in though I found that lots of objects were overwriting one another by accident, because I think the backend of shuffling files around was "getting weird". I couldn't find a downside to removing this logic, though, so I was curious if you knew what this was originally added for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic looks like a measure for keeping filenames/paths short. We could probably just get rid of the "numbered" codegen unit naming scheme (incremental does not use that). That was only introduced to stay close to the then existing behavior. That would have the downside of codegen units having somewhat misleading names but they should be unique.
That's a nice surprise
cc @rust-lang/compiler |
Oops an excellent question! I had 8 cores.
AFAIK no, I realized when doing all this that we should just turn this on by default. I was gonna do that in a separate PR orthogonal to this though.
Hm sort of! I think it's a lot better than it looks though. For example, here's a comparison between 1 and 16 codegen units, without ThinLTO enabled (of the regex benchmark suite, threshold 5% regression at least). And then here's the same comparison when those 16 CGUs are compiled with ThinLTO. Notably nearly every benchmark regesses (sometimes by 10x) with just vanilla codegen units, whereas with ThinLTO the worst regression is 18ns/iter -> 27ns/iter, and ThinLTO even improved the performance in one case! In other words, my conclusion is that runtime performance is "basically the same" modulo compiler wizardry details. My guess is that any possible performance loss could be easily fixed by Overally though I definitely wouldn't classify it as overall a significant performance loss, only a very minor performance loss in some esoteric situations, at best. From what I've seen it does everything you'd expect it to do across the board. Then again, that's why this is unstable to start out with :) |
☔ The latest upstream changes (presumably #44085) made this pull request unmergeable. Please resolve the merge conflicts. |
Amazing!! On the subject of performance, note that ThinLTO has historically been missing some optimizations of classical LTO (just due to them not being implemented). One class of such optimizations specifically called out on LLVM's "open projects" page is "propagating more global informations across the program". I assume upstream has made progress on this since LLVM 4.0, so maybe we'll see improved performance just by updating to LLVM 5.0? |
This commit changes the default of rustc to use 32 codegen units when compiling in debug mode, typically an opt-level=0 compilation. Since their inception codegen units have matured quite a bit, gaining features such as: * Parallel translation and codegen enabling codegen units to get worked on even more quickly. * Deterministic and reliable partitioning through the same infrastructure as incremental compilation. * Global rate limiting through the `jobserver` crate to avoid overloading the system. The largest benefit of codegen units has forever been faster compilation through parallel processing of modules on the LLVM side of things, using all the cores available on build machines that typically have many available. Some downsides have been fixed through the features above, but the major downside remaining is that using codegen units reduces opportunities for inlining and optimization. This, however, doesn't matter much during debug builds! In this commit the default number of codegen units for debug builds has been raised from 1 to 32. This should enable most `cargo build` compiles that are bottlenecked on translation and/or code generation to immediately see speedups through parallelization on available cores. Work is being done to *always* enable multiple codegen units (and therefore parallel codegen) but it requires rust-lang#44841 at least to be landed and stabilized, but stay tuned if you're interested in that aspect!
@alexcrichton I've just ran some build time benchmarks on my project that uses a lot of popular rust libraries and codegen (diesel, hyper, serde, tokio, futures, reqwest) on my Intel Core i5 laptop (skylake 2c/4t) and got these results:
Cargo profile:
rustc 1.22.0-nightly (17f56c5 2017-09-21) As expected, the best results is for codegen units of number of system threads, and 32 is way to much for an average machine. Did you concider an option to select number of codegen units depending on number of cpus, with Thank you for working on compile times! |
@mersinvald very interesting! Did you mean to comment on #44853 though? If so, maybe we can continue over there? |
@alexcrichton ok, sorry :) |
d8ae4f8
to
d3b0c49
Compare
@alexcrichton FYI if you were trying to convince us that you weren't a robot, this and #44853 back-to-back aren't helping. ;) |
…rister rustc: Default 32 codegen units at O0 This commit changes the default of rustc to use 32 codegen units when compiling in debug mode, typically an opt-level=0 compilation. Since their inception codegen units have matured quite a bit, gaining features such as: * Parallel translation and codegen enabling codegen units to get worked on even more quickly. * Deterministic and reliable partitioning through the same infrastructure as incremental compilation. * Global rate limiting through the `jobserver` crate to avoid overloading the system. The largest benefit of codegen units has forever been faster compilation through parallel processing of modules on the LLVM side of things, using all the cores available on build machines that typically have many available. Some downsides have been fixed through the features above, but the major downside remaining is that using codegen units reduces opportunities for inlining and optimization. This, however, doesn't matter much during debug builds! In this commit the default number of codegen units for debug builds has been raised from 1 to 32. This should enable most `cargo build` compiles that are bottlenecked on translation and/or code generation to immediately see speedups through parallelization on available cores. Work is being done to *always* enable multiple codegen units (and therefore parallel codegen) but it requires #44841 at least to be landed and stabilized, but stay tuned if you're interested in that aspect!
☔ The latest upstream changes (presumably #44853) made this pull request unmergeable. Please resolve the merge conflicts. |
d3b0c49
to
51e37aa
Compare
Rebased and should be ready for review! |
⌛ Testing commit 5f66c99efd92eef321dbe007e5a71b7b757c8f8b with merge f8577605e7e59dd1354533f9465a365cc4cb814d... |
💔 Test failed - status-travis |
Multiple
|
@bors: r=michaelwoerister |
📌 Commit 1cb0c99 has been approved by |
⌛ Testing commit 1cb0c99d055072c9dfcaf922b26fc66f25cbbb43 with merge 54b42bae05a52565bcf4ebb2780d7f94deae900d... |
💔 Test failed - status-travis |
let foo = foo as usize as *const u8; | ||
let bar = bar::bar as usize as *const u8; | ||
|
||
assert_eq!(*foo, *bar); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
asm.js failed on this line...
[01:25:14] failures:
[01:25:14]
[01:25:14] ---- [run-pass] run-pass/thin-lto-inlines.rs stdout ----
[01:25:14]
[01:25:14] error: test run failed!
[01:25:14] status: exit code: 101
[01:25:14] command: "/emsdk-portable/node/4.1.1_64bit/bin/node" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/run-pass/thin-lto-inlines.stage2-asmjs-unknown-emscripten.js"
[01:25:14] stdout:
[01:25:14] ------------------------------------------
[01:25:14] 3 3
[01:25:14]
[01:25:14] ------------------------------------------
[01:25:14] stderr:
[01:25:14] ------------------------------------------
[01:25:14] thread 'main' panicked at 'assertion failed: `(left == right)`
[01:25:14] left: `115`,
[01:25:14] right: `99`', /checkout/src/test/run-pass/thin-lto-inlines.rs:36:8
[01:25:14] note: Run with `RUST_BACKTRACE=1` for a backtrace.
[01:25:14]
[01:25:14] ------------------------------------------
[01:25:14]
[01:25:14] thread '[run-pass] run-pass/thin-lto-inlines.rs' panicked at 'explicit panic', /checkout/src/tools/compiletest/src/runtest.rs:2433:8
[01:25:14] note: Run with `RUST_BACKTRACE=1` for a backtrace.
This commit is an implementation of LLVM's ThinLTO for consumption in rustc itself. Currently today LTO works by merging all relevant LLVM modules into one and then running optimization passes. "Thin" LTO operates differently by having more sharded work and allowing parallelism opportunities between optimizing codegen units. Further down the road Thin LTO also allows *incremental* LTO which should enable even faster release builds without compromising on the performance we have today. This commit uses a `-Z thinlto` flag to gate whether ThinLTO is enabled. It then also implements two forms of ThinLTO: * In one mode we'll *only* perform ThinLTO over the codegen units produced in a single compilation. That is, we won't load upstream rlibs, but we'll instead just perform ThinLTO amongst all codegen units produced by the compiler for the local crate. This is intended to emulate a desired end point where we have codegen units turned on by default for all crates and ThinLTO allows us to do this without performance loss. * In anther mode, like full LTO today, we'll optimize all upstream dependencies in "thin" mode. Unlike today, however, this LTO step is fully parallelized so should finish much more quickly. There's a good bit of comments about what the implementation is doing and where it came from, but the tl;dr; is that currently most of the support here is copied from upstream LLVM. This code duplication is done for a number of reasons: * Controlling parallelism means we can use the existing jobserver support to avoid overloading machines. * We will likely want a slightly different form of incremental caching which integrates with our own incremental strategy, but this is yet to be determined. * This buys us some flexibility about when/where we run ThinLTO, as well as having it tailored to fit our needs for the time being. * Finally this allows us to reuse some artifacts such as our `TargetMachine` creation, where all our options we used today aren't necessarily supported by upstream LLVM yet. My hope is that we can get some experience with this copy/paste in tree and then eventually upstream some work to LLVM itself to avoid the duplication while still ensuring our needs are met. Otherwise I fear that maintaining these bindings may be quite costly over the years with LLVM updates!
@bors: r=michaelwoerister |
📌 Commit 4ca1b19 has been approved by |
⌛ Testing commit 4ca1b19 with merge 33f1f8654b50e49db838b64e76c4af59bc55ddb5... |
💔 Test failed - status-appveyor |
|
rustc: Implement ThinLTO This commit is an implementation of LLVM's ThinLTO for consumption in rustc itself. Currently today LTO works by merging all relevant LLVM modules into one and then running optimization passes. "Thin" LTO operates differently by having more sharded work and allowing parallelism opportunities between optimizing codegen units. Further down the road Thin LTO also allows *incremental* LTO which should enable even faster release builds without compromising on the performance we have today. This commit uses a `-Z thinlto` flag to gate whether ThinLTO is enabled. It then also implements two forms of ThinLTO: * In one mode we'll *only* perform ThinLTO over the codegen units produced in a single compilation. That is, we won't load upstream rlibs, but we'll instead just perform ThinLTO amongst all codegen units produced by the compiler for the local crate. This is intended to emulate a desired end point where we have codegen units turned on by default for all crates and ThinLTO allows us to do this without performance loss. * In anther mode, like full LTO today, we'll optimize all upstream dependencies in "thin" mode. Unlike today, however, this LTO step is fully parallelized so should finish much more quickly. There's a good bit of comments about what the implementation is doing and where it came from, but the tl;dr; is that currently most of the support here is copied from upstream LLVM. This code duplication is done for a number of reasons: * Controlling parallelism means we can use the existing jobserver support to avoid overloading machines. * We will likely want a slightly different form of incremental caching which integrates with our own incremental strategy, but this is yet to be determined. * This buys us some flexibility about when/where we run ThinLTO, as well as having it tailored to fit our needs for the time being. * Finally this allows us to reuse some artifacts such as our `TargetMachine` creation, where all our options we used today aren't necessarily supported by upstream LLVM yet. My hope is that we can get some experience with this copy/paste in tree and then eventually upstream some work to LLVM itself to avoid the duplication while still ensuring our needs are met. Otherwise I fear that maintaining these bindings may be quite costly over the years with LLVM updates!
☀️ Test successful - status-appveyor, status-travis |
🎉 🎉 🎉 🎉 🎉 |
This commit is an implementation of LLVM's ThinLTO for consumption in rustc
itself. Currently today LTO works by merging all relevant LLVM modules into one
and then running optimization passes. "Thin" LTO operates differently by having
more sharded work and allowing parallelism opportunities between optimizing
codegen units. Further down the road Thin LTO also allows incremental LTO
which should enable even faster release builds without compromising on the
performance we have today.
This commit uses a
-Z thinlto
flag to gate whether ThinLTO is enabled. It thenalso implements two forms of ThinLTO:
In one mode we'll only perform ThinLTO over the codegen units produced in a
single compilation. That is, we won't load upstream rlibs, but we'll instead
just perform ThinLTO amongst all codegen units produced by the compiler for
the local crate. This is intended to emulate a desired end point where we have
codegen units turned on by default for all crates and ThinLTO allows us to do
this without performance loss.
In anther mode, like full LTO today, we'll optimize all upstream dependencies
in "thin" mode. Unlike today, however, this LTO step is fully parallelized so
should finish much more quickly.
There's a good bit of comments about what the implementation is doing and where
it came from, but the tl;dr; is that currently most of the support here is
copied from upstream LLVM. This code duplication is done for a number of
reasons:
avoid overloading machines.
integrates with our own incremental strategy, but this is yet to be
determined.
having it tailored to fit our needs for the time being.
TargetMachine
creation, where all our options we used today aren't necessarily supported by
upstream LLVM yet.
My hope is that we can get some experience with this copy/paste in tree and then
eventually upstream some work to LLVM itself to avoid the duplication while
still ensuring our needs are met. Otherwise I fear that maintaining these
bindings may be quite costly over the years with LLVM updates!