Make resolution with correct lock file faster #5321

Closed
matklad opened this issue Apr 8, 2018 · 20 comments

@matklad
Member

matklad commented Apr 8, 2018

This came up when profiling no-op builds recently. Although the majority of time is spent shelling out to rustc, we still spend about 50ms in resolve, even though we have a lockfile and it is valid.

Perhaps we can make this happy path much faster? Theory says that we can verify a solution to an NP problem in polynomial time, and practice hints that verifying a lockfile should probably be linear?

@Eh2406 perhaps you know some ways to make no-op case specifically faster?

@Eh2406
Contributor

Eh2406 commented Apr 8, 2018

I don't know of anything in particular off the top of my head, but I am happy to take a look. I'd think that the resolver is linear with a lock file, but I have been wrong before.

If you are doing profiling anyway, maybe check which nightly made it regress?

@matklad
Member Author

matklad commented Apr 8, 2018

I don't think this is a regression. At least, not a recent one.

I get 50 ms in resolve on current stable for https://github.com/matklad/fall/, which has about 100 packages in the lockfile. 50ms seems way too much for a linear algorithm with an input of size 100. I haven't profiled it too closely though, so maybe we do some costly IO during resolution as well?

@Eh2406
Contributor

Eh2406 commented Apr 8, 2018

I used to profile `cargo generate-lockfile -Z no-index-update` on servo, which takes 2-3 sec, but most of that time was spent making sure that the git dependencies are up to date, so it was not terribly helpful. What is the best way to reproduce your fall example?

@matklad
Member Author

matklad commented Apr 8, 2018

~/trash
λ git clone https://github.com/matklad/fall

~/trash
λ cd fall && cargo build

~/trash
λ CARGO_PROFILE=3 cargo build

For CARGO_PROFILE, you might want to check out PR #5319, because the current implementation of profiling is buggy and prints some stuff twice.

When I do this, I see

        54ms - resolving
              0ms - updating: registry `https://github.com/rust-lang/crates.io-index`

It might be prudent to insert some more fine-grained profiling calls as well, to see the breakdown of those 54 ms.
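For illustration, a scoped timer of roughly this shape (a sketch only, not Cargo's actual profiler code) is enough to get such a breakdown:

```rust
use std::time::Instant;

// Illustration only: a scoped timer that prints its label and elapsed time on
// drop, giving output of the same flavor as the CARGO_PROFILE breakdown above.
struct ScopedTimer {
    label: &'static str,
    start: Instant,
}

impl ScopedTimer {
    fn new(label: &'static str) -> Self {
        ScopedTimer { label, start: Instant::now() }
    }
}

impl Drop for ScopedTimer {
    fn drop(&mut self) {
        println!("{:>8}ms - {}", self.start.elapsed().as_millis(), self.label);
    }
}

fn resolve_with_lockfile() {
    let _t = ScopedTimer::new("resolving");
    // ... the code being measured ...
}
```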

@Eh2406
Contributor

Eh2406 commented Apr 8, 2018

fall>cargo +nightly generate-lockfile -Z no-index-update

and I can reproduce:

        132ms - resolving
              0ms - updating: registry `https://github.com/rust-lang/crates.io-index`

So now to dig in

@Eh2406
Contributor

Eh2406 commented Apr 8, 2018

Unfortunately, I don't know how to share flame graphs from WPA in a targeted way, so you'll have to settle for my summary. First, a local build of master is slower than nightly, at ~370ms. I don't know why; maybe the debug symbols I have turned on so as to get flame graphs, or the profiling.

For a random invocation WPA claims:

  • cargo::main took 307ms, of which
  • cargo::commands::generate_lockfile::exec took 284ms, of which
  • cargo::core::resolver::activate_deps_loop took 168ms (ignoring the other stacks, as this is about the resolver), of which
  • cargo::sources::registry::index::RegistryIndex::query took 165ms (I am glad to see that all the conflict-resolution stuff is nowhere to be seen, even though that is the only part of the code I really know), of which
  • cargo::sources::registry::index::RegistryIndex::summaries took 155ms (the rest is cargo::core::registry::lock), of which
  • cargo::sources::registry::remote::{{impl}}::load took 124ms (the rest is flock::Filesystem::open and drop),

which seems to have 6 stacks:

  • 32ms -> cargo::sources::registry::index::{{impl}}::load_summaries::{{closure}}
  • 30ms -> git2::tree::Tree::get_path
  • 22ms -> git2::repo::Repository::find_commit
  • 21ms -> git2::tree::TreeEntry::to_object
  • 10ms -> git_repository_open_ext
  • 6ms -> git2::repo::Repository::refname_to_id

That takes us well into parts of the code I know nothing about, but maybe some of those queries could be cached? The repo won't move, nor will the head commit.
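For illustration, here is roughly what caching those handles could look like with git2 (a sketch only, not Cargo's actual code, and assuming a bare index clone whose head is at refs/remotes/origin/master): open the repository and resolve the head tree once, so only the per-crate lookups stay in the hot loop.

```rust
use std::path::Path;
use git2::Repository;

// Sketch: resolve the index repo and its head tree once, then read many
// per-crate JSON blobs from that single tree, so repository_open_ext,
// refname_to_id and find_commit don't repeat for every query.
fn load_index_files(index_path: &Path, crate_paths: &[&Path]) -> Result<Vec<Vec<u8>>, git2::Error> {
    let repo = Repository::open(index_path)?;                      // once
    let head = repo.refname_to_id("refs/remotes/origin/master")?;  // once (assumed ref name)
    let tree = repo.find_commit(head)?.tree()?;                    // once

    let mut blobs = Vec::with_capacity(crate_paths.len());
    for &path in crate_paths {
        // Only the per-crate lookups remain in the hot loop.
        let entry = tree.get_path(path)?;
        let object = entry.to_object(&repo)?;
        if let Some(blob) = object.as_blob() {
            blobs.push(blob.content().to_vec());
        }
    }
    Ok(blobs)
}
```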

@matklad
Member Author

matklad commented Apr 8, 2018

Interesting! So this is indeed IO bound. I wonder, however, whether we actually need all this IO. In theory, it should be possible to verify Cargo.lock against Cargo.toml without actually loading any crates, just blindly trusting that Cargo.lock lists correct dependencies.

This assumption is, of course, invalid, because path dependencies can change under our feet. But it looks like we can assume that registry dependencies, which are the absolute majority, do not change? @alexcrichton do you think this could be a valid optimization? It is, of course, much less important than waiting 100ms for rustc -vV, but if we implement rustc caching, it'll become the next bottleneck.
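A minimal sketch of that verification idea (illustrative only, using the semver crate and ignoring path dependencies entirely): every requirement in the manifest must be matched by some locked package of the same name.

```rust
use std::collections::HashMap;
use semver::{Version, VersionReq};

// Purely illustrative check, not what Cargo actually does: trust the lockfile's
// dependency edges and only confirm that each manifest requirement is satisfied
// by some locked version of the same name.
fn lockfile_satisfies_manifest(
    manifest_deps: &[(String, VersionReq)],
    locked: &HashMap<String, Vec<Version>>,
) -> bool {
    manifest_deps.iter().all(|(name, req)| {
        locked
            .get(name)
            .map_or(false, |versions| versions.iter().any(|v| req.matches(v)))
    })
}
```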

@Eh2406
Contributor

Eh2406 commented Apr 8, 2018

This is the second time we have noted that something is slow because querying is slow. I added a cache to the conflict path in #5112 so that slow resolutions don't call this slow path. I suspect we will continue to find things that are slowed down by querying. I am not objecting to a cache-in-the-large approach for the happy path, but my 2c is to find ways to make querying faster if possible. (Of course they are not mutually exclusive.)

@alexcrichton
Member

@Eh2406's findings definitely make sense to me; that's what I've historically seen as well: slow registry times are primarily related to tons of crates.io crates where we need to go back to the git repository to load information and parse the JSON coming out, both of which are relatively costly.

I'm personally ok making changes here as well but we need to be quite careful as determining whether a lockfile matches the previous manifests is certainly non-trivial (and I've gotten it wrong a bunch in the past). I'd prefer to first optimize the actual pieces here if we can.

Historically we actually checked out the index on disk but we stopped doing that one day to cut down on checkout size and now it's all stored as a raw git database. While that has clear space savings I'm not sure that it made anything faster (and in fact may have made things slower!). That's one possible route for optimization I think.

The other possibility to optimize here is that when we query a registry with a dependency that looks like = "1.0.1" (which is what lock files do) we shouldn't parse every single line of JSON in the registry file. Instead we should quickly look only for 1.0.1 versions and then parse just those.
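A sketch of that idea (assuming serde/serde_json with the derive feature, and that each index line is a JSON object with a "vers" field; the exact needle format is an assumption, not Cargo's implementation):

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct IndexLine {
    name: String,
    vers: String,
    // other fields (deps, cksum, yanked, ...) omitted in this sketch
}

// For an exact `= "1.0.1"` requirement, skip lines that cannot possibly match
// with a cheap substring check before paying for a full JSON parse.
fn find_exact(index_file: &str, version: &str) -> Option<IndexLine> {
    let needle = format!("\"vers\":\"{}\"", version); // assumed index byte layout
    index_file
        .lines()
        .filter(|line| line.contains(&needle))                // cheap pre-filter
        .filter_map(|line| serde_json::from_str(line).ok())   // full parse only on candidates
        .find(|entry: &IndexLine| entry.vers == version)
}
```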

@matklad
Member Author

matklad commented Apr 16, 2018

Now that #5359 is merged, I believe this is the main bottleneck of no-op builds. I was going to close this issue because it doesn't suggest anything too actionable, but I think I have at least an idea!

Performance is easy: you just make stuff faster somehow! (nrc at all-hands)

What if we pull essentially the same trick as we did with rustc caching? I think that the main bottleneck with queries is the actual git operations and JSON parsing. It seems unlikely that we can optimize those further: for git we use libgit2 and we cache all the relevant handles; for JSON, we could parse only a single line with a specific version, but to do that you need to at least partially parse all of the input anyway. So it might be interesting to cache those things across Cargo invocations.

What if we store a per-project cache as a bincode-encoded HashMap<String, Vec<(Summary, Yanked)>>, which is fingerprinted and invalidated by the registry commit hash, populated lazily by the load_summaries call, and written to disk if there are changes? Code-wise, I think we just need to make this cache persistent and add an invalidation strategy?
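Roughly the shape this could take (names like IndexCache and registry_rev are invented for the sketch, and a plain string stands in for the real Summary type); the important part is the commit-hash fingerprint check on load:

```rust
use std::collections::HashMap;
use std::fs;
use std::path::Path;
use serde::{Deserialize, Serialize};

// Hypothetical per-project cache shape for this sketch.
#[derive(Serialize, Deserialize, Default)]
struct IndexCache {
    // Commit hash of the registry the cache was built from; a mismatch invalidates it.
    registry_rev: String,
    // crate name -> (raw summary line, yanked flag)
    summaries: HashMap<String, Vec<(String, bool)>>,
}

fn load_cache(path: &Path, current_rev: &str) -> IndexCache {
    fs::read(path)
        .ok()
        .and_then(|bytes| bincode::deserialize::<IndexCache>(&bytes).ok())
        .filter(|cache| cache.registry_rev == current_rev) // bust on registry update
        .unwrap_or_else(|| IndexCache {
            registry_rev: current_rev.to_string(),
            ..Default::default()
        })
}

fn save_cache(path: &Path, cache: &IndexCache) -> std::io::Result<()> {
    let bytes = bincode::serialize(cache).expect("cache is always serializable");
    fs::write(path, bytes)
}
```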

@alexcrichton
Member

@matklad an excellent idea! I've toyed with the idea of doing that in the past but it was never pressing enough to attempt to tackle until now!

I think it's possible to lazily compute a faster-to-access version of the index on the fly whenever Cargo is executing. We could, crate by crate, create a much faster on-disk format for accessing files, maybe even mmap them if we're intrepid! We could then bust the cache as soon as the registry updates (easy to do, we already have a git rev).

One more possible optimization to explore, though, would be to avoid using libgit2 at all for reading JSON files. We initially stopped checking out the index to save disk space, but that may have come at the cost of performance if libgit2 isn't the speediest at loading these files. It may actually be faster to have a checkout of the index which we then read file by file. This wouldn't solve the JSON parse times, but I wanted to mention it as a possible optimization!

@alexcrichton
Member

Er, you actually said most of what I was already gonna say; I should read past the first few sentences...

In any case, I agree! I think it's worthwhile to leave this open. I'd love for huge projects to have at most 5ms of overhead from Cargo per build.

@matklad
Member Author

matklad commented Apr 17, 2018

So, I've cooked up a quick and dirty patch to add persistent caching; the code is here: https://github.com/matklad/cargo/tree/cache-more-of-the-things.

The result is marginally faster than the status quo. This means three things:

  • libgit2 and serde_json are not too slow
  • Amdahl's law: resolution is not the only bottleneck; we actually spend a fair amount of time computing dependencies between units
  • it probably makes sense to implement this caching, but it has to be done carefully (that is, not allocating all over the place, as my patch does)

Here are the flamegraphs:

  • With caching. Note three distinct parts: cache saving on drop (we can avoid it altogether), building the units map, and loading of the cache.
  • Without caching. Note the absence of libgit2 frames: I think this is just an artifact, most likely due to the absence of debug symbols?

@alexcrichton
Member

Hm are you sure the "with caching" one is optimized well enough and working? 10% of the runtime is dropping the registry index which I'd expect to be close to 0 for a fresh build, right? The fact that Context::compile takes 25% of the time is also sorta crazy. It does look though like there's a lot of TOML parsing going on which may be slowing things down.

Overall it looks like there's just a ton of allocations which we may be able to work around with more strategic caching/interning (maybe even arenas?). I suspect though that TOML may be one of the lowest hanging fruit to speed up (although not super easy either)

@matklad
Member Author

matklad commented Apr 17, 2018

Hm are you sure the "with caching" one is optimized well enough and working?

I am sure it is not optimized at all :) Like, it indeed dumps the registry to disk every time, because it doesn't track "dirty" state at all. It looks like it mostly works though (custom registries are mishandled, but otherwise Cargo with this patch successfully builds stuff).

I was just hoping that libgit2 and JSON parsing were orders of magnitude slower than everything else, so I was expecting that simplistic caching would remove lockfile verification from the profile altogether. Sadly, this hasn't happened, so we'll have to hunt down those allocations :-)

As another interesting metric, the size of the cache itself, for cargo, is 1.3M, which is kinda huge, given that it is bincode-encoded data (though the data is mostly short strings, and I think this would be an especially bad case for bincode?)

Context::compile takes 25% of the time is also sorta crazy.

This might have regressed after my refactorings of unit_dependencies, which is now eager and might calculate too much, perhaps?

Overall it looks like there's just a ton of allocations which we may be able to work around with more strategic caching/interning

Definitely! I think carefully implemented caching would do the trick. And we probably want to intern at the storage layer as well, so as not to store 100500 copies of the same string on disk. I unfortunately don't know of an off-the-shelf solution for string interning during serialization in Rust; a rough sketch of the idea is below.
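A purely illustrative sketch of interning at the storage layer (invented types, not an existing crate): store each distinct string once in a table and have the serialized records reference it by index, so bincode never writes the same name twice.

```rust
use std::collections::HashMap;
use serde::{Deserialize, Serialize};

// Hypothetical on-disk shape: a string table plus index-based records.
#[derive(Serialize, Deserialize, Default)]
struct InternedCache {
    strings: Vec<String>,      // every distinct string, stored once
    summaries: Vec<Vec<u32>>,  // each record is a list of indices into `strings`
}

#[derive(Default)]
struct Interner {
    map: HashMap<String, u32>,
    cache: InternedCache,
}

impl Interner {
    // Return the index of `s`, adding it to the table on first sight.
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&idx) = self.map.get(s) {
            return idx;
        }
        let idx = self.cache.strings.len() as u32;
        self.cache.strings.push(s.to_string());
        self.map.insert(s.to_string(), idx);
        idx
    }
}
```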

I suspect though that TOML may be one of the lowest hanging fruit to speed up (although not super easy either)

Overall, Context::compile takes less time than resolve.

@alexcrichton
Member

Yeah, overall I'm also sort of perplexed why Cargo is taking so long. Nothing in Cargo should really be more than O(n^2), and if the dependency graph has 100 crates then O(10_000) allocations isn't really that bad? Weird!

@matklad
Member Author

matklad commented Apr 17, 2018

This is a flamegraph for a no-op build of servo: https://github.com/matklad/cargo/blob/aa181fd28a9418fc5c03dafbe05b4b7a06d88816/servo.svg. The culprit there is PathSource::list_files_git, which we have to do because build scripts in Servo do not use rerun-if-changed.

@alexcrichton
Member

@matklad btw for the "best" flamegraphs I've found that I've needed to:

  • Force compilation of zlib and ensure frame pointers are enabled
  • Precompile OpenSSL with frame pointers
  • Compile all Rust code w/ debuginfo (which I think enables frame pointers)
  • Compile with LTO to give all functions in libstd frame pointers
  • Switch away from jemalloc which doesn't have frame pointers

AKA I think if you get frame pointers everywhere the stack traces look a little nicer, and even more so with the musl target nowadays, which I think has debuginfo by default as well.

@Eh2406
Contributor

Eh2406 commented Jul 23, 2018

Just re-did the benchmarks and master (773e3b6) looks the same.

@Eh2406
Contributor

Eh2406 commented May 24, 2019

I think this got worked on a lot since this issue was filed, for example in #6880 and #6867. Most of the profiling work here is now out of date. I will close this issue; we can open a new one if we find a new slow case.

Eh2406 closed this as completed May 24, 2019