
improve rustc wrapper startup time? #2626

Open
matthiaskrgr opened this issue Jan 1, 2021 · 31 comments

Comments

@matthiaskrgr
Member

time ~/.rustup/toolchains/master/bin/rustc >& /dev/null
0,040 total

time ~/.cargo/bin/rustc >& /dev/null
0,272 total

That's almost 7x slower.
I wonder if there is something we can do to get these numbers a bit closer?

@jyn514
Member

jyn514 commented Jan 1, 2021

@matthiaskrgr this is about the rustup wrapper, not the x.py rustc shim, right? If so I think this should be moved to https://github.com/rust-lang/rustup/.

@matthiaskrgr
Member Author

Ah, ~/.cargo/bin/rustc is rustup? Yes, the issue should be moved then. :)

@pietroalbini pietroalbini transferred this issue from rust-lang/rust Jan 1, 2021
@kinnison
Contributor

kinnison commented Jan 1, 2021

I can confirm that the difference in performance is around 6x:

listless% hyperfine ~/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rustc rustc 
Benchmark #1: /home/dsilvers/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rustc
  Time (mean ± σ):      11.3 ms ±   0.9 ms    [User: 6.9 ms, System: 4.3 ms]
  Range (min … max):    10.1 ms …  15.4 ms    252 runs
 
Benchmark #2: rustc
  Time (mean ± σ):      71.2 ms ±   2.3 ms    [User: 60.0 ms, System: 11.2 ms]
  Range (min … max):    67.3 ms …  76.2 ms    41 runs
 
Summary
  '/home/dsilvers/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rustc' ran
    6.31 ± 0.54 times faster than 'rustc'
listless%                                                                                

@kinnison
Contributor

kinnison commented Jan 1, 2021

Looking locally, ca. 15ms are before main() is entered (presumably due to dynamic link time for all the libraries rustup uses by dint of being linked to curl).

It looks like about 25 ms is taken parsing manifest files; sadly this is also done twice, and thus accounts for the vast majority of the extra time. The TOML parser must be quite inefficient.

It might be possible to knock one of those parse steps on the head since in theory we only need to parse once, but it'll involve quite a bit of refactoring.

If someone wants to begin to look at this, then the place to start is the config.rs file, around line 620 which is find_or_install_override_toolchain_or_default(). In the sub-function components_exist() you can see it acquire the manifest (since it needs it) and then it lists the components in order to check that everything needed is installed. The list_components() function though ends up loading the manifest itself. Perhaps the distributable toolchain could cache the manifest having loaded it?
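As an illustration only (rustup itself is Rust, and all names here are hypothetical, not rustup's actual API), the caching suggestion amounts to parsing the manifest at most once and having both callers share the result:

```python
# Hypothetical sketch of caching the parsed manifest so that
# components_exist() and list_components() share one parse instead of
# each loading the manifest independently.

class Toolchain:
    def __init__(self, manifest_text):
        self._manifest_text = manifest_text
        self._manifest = None  # cached parse result

    def _parse_manifest(self):
        # Stand-in for the real (slow) TOML parse of the channel manifest.
        return dict(
            line.split("=", 1) for line in self._manifest_text.splitlines() if line
        )

    def manifest(self):
        # Parse at most once; later callers reuse the cached value.
        if self._manifest is None:
            self._manifest = self._parse_manifest()
        return self._manifest

    def list_components(self):
        # Previously this would re-load the manifest; now it hits the cache.
        return sorted(self.manifest())

    def components_exist(self, wanted):
        return all(c in self.manifest() for c in wanted)
```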

@rbtcollins
Contributor

To help triage this: how much does this contribute to typical build times? Sure, we can put a chunk of time into making this snappy, and that would be good, but if a typical build is 1500 ms...?

@jyn514
Member

jyn514 commented Jan 1, 2021

@rbtcollins I wouldn't notice so much on builds, but this hurts things like rustc --explain E0111 or rustc --help.

$ time /home/joshua/.local/lib/rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rustc --help >/dev/null
real	0m0.050s
user	0m0.026s
sys	0m0.020s
$ time rustc --help >/dev/null
real	0m0.157s
user	0m0.129s
sys	0m0.028s
$ time pip --help >/dev/null
real	0m0.206s
user	0m0.178s
sys	0m0.028s
$ time go help >/dev/null
real	0m0.022s
user	0m0.008s
sys	0m0.017s

@matthiaskrgr
Member Author

To add to the list:

time clang >&/dev/null 
clang >&/dev/null  0,04s user 0,03s system 96% cpu 0,064 total

time gcc >& /dev/null
gcc >&/dev/null  0,00s user 0,01s system 86% cpu 0,009 total

@kinnison
Contributor

kinnison commented Jan 2, 2021

@matthiaskrgr

Comparing with C compilers is disingenuous, since the number of rustc invocations for a similarly sized codebase will be far fewer. For example, rustup is around 130 .rs files, but including its dependencies the total number of rustc invocations is only 320 or so. Assume 35 ms of wasted CPU time (the repeated manifest parse) on my computer per rustc invocation, and generously assume that all of that waste is reflected in the wallclock compile time at the same parallelism the build runs at. My computer just built rustup in 83 s at 6.1x parallelism, so that's 11375 ms wasted, which at 6.1x parallelism is 1.864 s of wallclock time, or 2.2% of the total build time in a debug build.

The same maths on a release build: 11375 ms wasted against a total build time of 195 s is only 0.95% of the total runtime of the build.
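The back-of-envelope arithmetic above can be checked directly (the 11375 ms figure implies 325 invocations at 35 ms each, rather than exactly 320):

```python
# Reproduce the estimate from the comment above.
invocations = 325            # "only 320 or so" rustc invocations
waste_ms_per_invocation = 35 # repeated manifest parse per invocation
parallelism = 6.1

wasted_cpu_ms = invocations * waste_ms_per_invocation  # 11375 ms of CPU time
wasted_wall_s = wasted_cpu_ms / parallelism / 1000     # ~1.86 s wallclock

debug_pct = 100 * wasted_wall_s / 83     # 83 s debug build   -> ~2.2%
release_pct = 100 * wasted_wall_s / 195  # 195 s release build -> just under 1%
```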

@jyn514

Considering the rustc --explain or rustc --help scenarios is also disingenuous. Most humans don't need to run those repeatedly in tight loops, and will take far longer than the wasted 50ms or so even noticing the screen updated let alone reading and grokking the output.

however

Having said all that, I do think it's worth correcting the double-load of the manifest since it's wasteful to do that anyway, and that shouldn't be too complex to do in a grungy way; or only medium hard to do nicely.

What would be of even more use though is working out how it takes 35ms on a 4GHz CPU core to parse the manifest, that's utterly abysmal.

@kinnison
Contributor

kinnison commented Jan 2, 2021

Further investigation yields:

The majority of the time spent parsing the manifest is in the TOML load itself, i.e. before we apply any semantic understanding to the manifest. The channel manifest is around 700k and will only get larger over time. Another possible improvement would be to trim out anything not relevant to the toolchain being installed when fetching the manifest. That ought not to be too complex to do, and it would speed up future operations since the manifest would be significantly smaller.
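The trimming idea can be sketched like so (a hypothetical illustration: the shape of the data here is invented, not the real channel manifest schema):

```python
# Given a parsed manifest mapping package name -> {target triple -> metadata},
# keep only entries relevant to the one target being installed, plus
# target-independent ("*") components.

def trim_manifest(packages, target):
    trimmed = {}
    for name, targets in packages.items():
        if target in targets:
            trimmed[name] = {target: targets[target]}
        elif "*" in targets:
            # target-independent components (e.g. rust-src) stay as-is
            trimmed[name] = {"*": targets["*"]}
    return trimmed
```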

@kinnison
Contributor

kinnison commented Jan 2, 2021

I wrote a simple trimmer for the manifests to reduce the toml quantity in installed toolchains. Doing that results in the following (stable is untrimmed, beta is trimmed):

listless% echo stable > rust-toolchain                                        
listless% hyperfine rustc                                                     
Benchmark #1: rustc
  Time (mean ± σ):      69.8 ms ±   1.9 ms    [User: 59.5 ms, System: 10.2 ms]
  Range (min … max):    67.4 ms …  76.1 ms    42 runs
 
listless% echo beta > rust-toolchain                                          
listless% hyperfine rustc                                                    
Benchmark #1: rustc
  Time (mean ± σ):      19.6 ms ±   0.7 ms    [User: 14.2 ms, System: 5.4 ms]
  Range (min … max):    18.7 ms …  22.9 ms    128 runs
 
listless%                                                                    

(For reference, directly invoking rustc (stable or beta) without rustup in the way is ca. 11 ms.)

@kinnison
Contributor

kinnison commented Jan 2, 2021

@matthiaskrgr Assuming #2627 goes CI-green, you might want to try the binary from there and see how it affects performance for you.

@jyn514
Member

jyn514 commented Jan 2, 2021

Most humans don't need to run those repeatedly in tight loops, and will take far longer than the wasted 50ms or so even noticing the screen updated let alone reading and grokking the output.

Well, I somewhat disagree: https://danluu.com/keyboard-latency/#humans-can-t-notice-100ms-or-200ms-latency. But the numbers after your change are a lot better :)

@matthiaskrgr
Member Author

matthiaskrgr commented Jan 2, 2021

Well, I made a small ICE-finder that runs rustc on all the files inside the rustc repo. Luckily, I had already chosen the master toolchain (via a direct path to the master rustc) instead of the generic rustc.

Using rustc through the wrapper, it would take at least 1.5 hours just to start rustc that many times (20K files). With the direct path (skipping the rustup wrapping), that's only 13 minutes.

@rbtcollins
Contributor

Running it serially, individually? Not terribly surprising.

I'm totally fine with rustup being faster, but development of Rust itself isn't our primary use case: shipping Rust toolchains to our users is, and running rustc in such a non-idiomatic fashion is very much a specialised need of Rust's own developers.

To me that says: if someone wants to make this faster, great, but we're unlikely to steer folk to this as a priority issue, unlike e.g. the corruption issues, locking issues, etc.

@rbtcollins
Contributor

Most humans don't need to run those repeatedly in tight loops, and will take far longer than the wasted 50ms or so even noticing the screen updated let alone reading and grokking the output.

Well, I somewhat disagree: https://danluu.com/keyboard-latency/#humans-can-t-notice-100ms-or-200ms-latency. But the numbers after your change are a lot better :)

If folk are running error explains in their game update loops - the context of that blog post - we have a whole new set of requirements to feed into rustup design.

Fast is good, and it is a feature.

But language servers and dev environments can avoid making the user wait for the invocation of the compiler in various ways; and should simply because remote dev environments pay latency in many ways, so having latency hiding techniques in play is just good design.

@kinnison
Contributor

kinnison commented Jan 2, 2021

When I was playing with my dodgy PR above, I ran into the fact that rustc-dev appears to be present in all toolchains in all targets, like rust-std is. @Mark-Simulacrum is going to have a look at why this might be, since regardless of whether or not we end up merging this "optimisation" we could do with knowing why rustc-dev is like this.

@joshtriplett
Member

Manifest parsing aside, would it potentially make sense for the top-level rustup-enabled cargo invocation to determine the toolchain and then set the RUSTC environment variable (if not already set) to point directly to the appropriate binary? That would eliminate almost all of the repeated startup times and manifest parses, and let rustc run directly.

@joshtriplett
Member

I just tested this approach building tide, which builds a total of 209 crates. I compared time cargo build to time RUSTC=$HOME/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin/rustc cargo build. Passing RUSTC reliably shaved off at least 20 CPU seconds, for around 6 seconds of build time on my 4-way laptop; that's around an 8% speedup to the build.

I think that'd be a worthwhile optimization.

@kinnison
Contributor

Is setting RUSTC purely something that cargo would want, or is it indicative of a general thing we should consider: setting RUSTC, CARGO, RLS, etc. depending on which tools are installed? I assume setting this means that cargo simply doesn't invoke rustup's proxy?

One concern with this, rather than simply improving the startup performance of rustup, is how this might interact with how rustup sets up fallbacks for incomplete linked toolchains (i.e. as used in cargo/rustc development flows). If we can be sure that the fallback approach won't be broken by this, or that we can somehow detect when we'd invoke fallback case and not set the variables in that context, then it ought to be plausible.

@joshtriplett
Member

Is setting RUSTC purely something that cargo would want, or is it indicative of a general thing we should consider of setting RUSTC CARGO RLS etc. depending on which tools are installed?

I think it'd be reasonable to always set RUSTC from the rustup proxy, as long as the toolchain has rustc and RUSTC isn't already set in the environment.

I assume setting this means that cargo simply doesn't invoke rustup's proxy?

Right. See https://doc.rust-lang.org/cargo/reference/environment-variables.html ; cargo will invoke RUSTC when it wants to run rustc.

One concern with this, rather than simply improving the startup performance of rustup, is how this might interact with how rustup sets up fallbacks for incomplete linked toolchains (i.e. as used in cargo/rustc development flows). If we can be sure that the fallback approach won't be broken by this, or that we can somehow detect when we'd invoke fallback case and not set the variables in that context, then it ought to be plausible.

I think we can handle that case by just not setting RUSTC if our toolchain doesn't have rustc. We could take our path to rustc, resolve all symlinks, and make sure the result exists as a file.
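The guard described here can be sketched as follows (an illustrative outline in Python, not rustup's actual code; the function name is hypothetical): export RUSTC only when the toolchain really ships a rustc binary, after resolving symlinks, and never clobber a RUSTC the user already set.

```python
import os

def with_rustc_env(toolchain_dir, env):
    """Return a copy of env, adding RUSTC only when it is safe to do so."""
    if "RUSTC" in env:
        return dict(env)  # user already chose a compiler; leave it alone
    # Resolve symlinks so linked toolchains point at the real file.
    candidate = os.path.realpath(os.path.join(toolchain_dir, "bin", "rustc"))
    if os.path.isfile(candidate):
        return dict(env, RUSTC=candidate)
    # Incomplete/linked toolchain without rustc: keep going through the proxy.
    return dict(env)
```

With this shape, the fallback flows for incomplete linked toolchains are unaffected, because the variable is simply not set when rustc is absent.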

@alex
Member

alex commented May 20, 2021

Is it worth reconsidering rustup as a chimera binary, given just how much stuff gets linked that isn't relevant to rustc (on my machine, in order of comedy value: libkrb5, libssh2, librtmp, libcurl, etc.)?

Alternatively, if a chimera is still desirable, splitting most of rustup into a dynamic library that's only loaded after we've checked if we're a proxy binary.

@rbtcollins
Contributor

Yes, worth considering. I have a project (#2441) to make proxies safe on Windows as well, which is related; that said, how much warm startup-time impact does linking those libraries have today? Cold time is irrelevant, since a single process invocation is trivially amortised over a full build. There is some complexity to consider in having a separate binary, though, such as distribution and updating. I certainly don't have time to engage with it at the moment, though if someone else wanted to take it on, I do have a few ideas related to #2441 that would be worth considering.

@kinnison
Contributor

IME, having profiled it a bit, the majority of the startup time of the proxies, by a LONG margin, is the parsing of the channel TOML. Hence I played with #2627; though it's the wrong solution, it's an approach to consider.

@rbtcollins
Contributor

@kinnison it may be different on Windows

@kinnison
Contributor

@rbtcollins good point, I suppose I'd best test there too.

@alexcrichton
Member

I was poking around recently when I felt that cargo test had a surprisingly large amount of overhead outside of the tests themselves. Some profiling showed that rustup is doing a large amount of TOML parsing when it starts up, and some digging showed that it's parsing a multirust-channel-manifest.toml file which is 22k lines of TOML.

I'd expect that ideally when a toolchain is already installed and such the cargo and rustc wrappers would be extremely thin shims over calling the relevant executables themselves. As-is it looks like parsing 20k+ lines of TOML is adding a nontrivial amount of overhead, and even if a faster serialization format were used (like bincode in the linked PR) it seems like that would only marginally improve things and the best solution is to not do parsing at all.

Cargo itself has already landed a change to avoid rustup wrappers entirely because of the large performance boost to the test suite, and I suspect that users in general can tell as well, with 100+ ms added to each command invocation.

@kinnison
Contributor

Switching away from toml to bincode basically made the manifest decode drop into the noise: sub-1 ms. Not loading the manifest at all would mean introducing lazy manifest loading, because we do manifest/component reconciliation during proxy startup in case a rust-toolchain.toml or similar is requesting components, or a directory override needs to trigger something.

Everything except TOML parsing is around 10-12 ms on my computer, and TOML parsing is another 100 to 120 ms. I agree it needs solving, but if we're going to go for a binary-format cache, we need to organise versioning it properly so we can fall back to recreating it on format changes.
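The versioned-cache-with-fallback idea can be sketched like this (an illustrative Python outline; JSON stands in for the real binary codec, e.g. bincode on the Rust side):

```python
import json
import struct

CACHE_VERSION = 1  # bump on any change to the cached layout

def encode_cache(manifest):
    # 4-byte little-endian format version, followed by the encoded payload.
    return struct.pack("<I", CACHE_VERSION) + json.dumps(manifest).encode()

def decode_cache(blob, reparse):
    # Fall back to the slow TOML parse when the cache is truncated,
    # corrupt, or written by a different format version.
    try:
        (version,) = struct.unpack("<I", blob[:4])
        if version != CACHE_VERSION:
            return reparse()
        return json.loads(blob[4:])
    except (struct.error, ValueError):
        return reparse()
```

The key property is that a version mismatch or decode failure is never an error: it just silently re-runs the slow path and (in a real implementation) rewrites the cache.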

@joshtriplett
Member

I still think it would be valuable to export the appropriate environment variables so that invocations of rustc underneath a rustup'd cargo don't need to re-do the parsing and selection logic, even if we can substantially speed up that logic by switching to bincode.

@konstin

konstin commented Feb 25, 2024

This is an issue for uv, too. We've been asked to include the output of rustc --version in our user agent when making requests to the Python package index, so the Python ecosystem gets usage stats. A minimal resolution with a network request (a revalidation request) takes ~100 ms on my machine, so 20 ms extra before the first network request is noticeable. I'd also be happy to read the default rustc version from another location, provided that works with alternative installation methods.
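For the user-agent use case, only the version number is needed; a minimal sketch of extracting it from the first line of `rustc --version` output (the sample strings below are typical of rustc's output format, not taken from this thread):

```python
import re

def rustc_version(output):
    # `rustc --version` prints e.g. "rustc 1.76.0 (07dca489a 2024-02-04)";
    # pull out just the semver (plus an optional channel suffix like
    # "-nightly") for embedding in a user-agent string.
    m = re.match(r"rustc (\d+\.\d+\.\d+(?:-[A-Za-z]+)?)", output)
    return m.group(1) if m else None
```

In practice the version would be read once (via subprocess or a cached file) rather than on every request.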

The benchmark runs from my user home, and I've included Python as well as Node (with the volta shim and without) for comparison.

$ hyperfine --warmup 3 --shell=none "rustc --version" ".rustup/toolchains/nightly-x86_64-unknown-linux-gnu/bin/rustc --version" "python --version" "node --version" ".volta/tools/image/node/18.18.2/bin/node --version"
Benchmark 1: rustc --version
  Time (mean ± σ):      19.5 ms ±   1.3 ms    [User: 14.2 ms, System: 5.2 ms]
  Range (min … max):    17.2 ms …  28.2 ms    160 runs
 
Benchmark 2: .rustup/toolchains/nightly-x86_64-unknown-linux-gnu/bin/rustc --version
  Time (mean ± σ):       7.8 ms ±   2.7 ms    [User: 3.6 ms, System: 4.0 ms]
  Range (min … max):     3.6 ms …  17.6 ms    469 runs
 
Benchmark 3: python --version
  Time (mean ± σ):       1.3 ms ±   0.5 ms    [User: 0.9 ms, System: 0.3 ms]
  Range (min … max):     0.3 ms …   2.5 ms    2166 runs
 
Benchmark 4: node --version
  Time (mean ± σ):       8.8 ms ±   3.3 ms    [User: 3.3 ms, System: 5.5 ms]
  Range (min … max):     2.7 ms …  13.5 ms    230 runs
 
Benchmark 5: .volta/tools/image/node/18.18.2/bin/node --version
  Time (mean ± σ):       5.9 ms ±   2.2 ms    [User: 2.0 ms, System: 3.7 ms]
  Range (min … max):     1.9 ms …  10.2 ms    704 runs
 
Summary
  python --version ran
    4.44 ± 2.40 times faster than .volta/tools/image/node/18.18.2/bin/node --version
    5.84 ± 3.08 times faster than .rustup/toolchains/nightly-x86_64-unknown-linux-gnu/bin/rustc --version
    6.64 ± 3.62 times faster than node --version
   14.67 ± 5.92 times faster than rustc --version

8 ms without the rustup shim still seems slow; should I file a separate ticket for that in rust-lang/rust?

@joshtriplett
Member

@konstin Yes, please do file a ticket for that in rust-lang/rust. And please give the details about bypassing the wrapper and invoking rustc directly, so that anyone trying to reproduce it can easily do so without getting sidetracked by the wrapper.

@konstin

konstin commented Feb 26, 2024

rust-lang/rust issue: rust-lang/rust#121631
