-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make resolution backtracking smarter #4834
Conversation
r? @matklad (rust_highfive has picked a reviewer for you, use r? to override) |
I'm not sure that my two new tests should actually be in tests, since in the context of this PR they exist as benchmarks. Though they do verify correct behavior I guess. |
Thanks so much for the PR! (and great seeing work in this space!) This is historically a super tricky part of Cargo so I'm currently still trying to wrap my head around the changes here. To make sure I understand this I'm going to say it back to you and you can tell me where I'm totally wrong :) Right now when cargo activates a package A to be part of the dependency graph, it then also queues up a bunch of pending candidates for the dependencies of A into Additionally when Cargo activates a dependency that has a bunch of candidates, it'll push candidates onto the "backtrack stack" in case the current candidate fails. Candidates are always filtered to be "not semver compat with other activated versions of the crate". This means that if B fails to activate for whatever reason, we need to do some backtracking. This... though is where I get lost in how this change works. My naive reading of this is that when B fails to get selected then if A (the parent I think?) is either not activated or a different number of versions of B are activated then backtracking could be successful. If, however, A is activated and B's number of activated versions remains the same we assume that the resolution failure will always remain the same. That last part though isn't quite clicking with me, I'm not sure how A (maybe A isn't the parent?) would affect the resolution of B, and just comparing the lengths seems like it could go awry, right? (as in, different sets of different versions maybe?) Perhaps though you can help me clear up these thoughts! I imagine it's all much more fresh in your mind :) |
88e9699
to
3bb9cb0
Compare
Briefly: your understanding is correct and there is a justification for the two points you raise. I wrote a long comment explaining it all, then in the process realised there's an additional factor I overlooked that means this PR is not correct as-is :( I've saved the comment and will post it here when I either update this PR or give up on it. Either way, I'll make sure I submit the tests at some point since they're interesting edge cases. |
Heh ok sounds good to me! |
@aidanhs Can you elaborate on where this is at? I was just thinking of trying this optimization, for my own edification. So I would love to know what additional factor I had not thout of. And I'd Love to help push this to be done! |
There's no need to try every candidate for every dependency - instead, only try alternative candidates if they *could* change the eventual failure that caused backtracking in the first place.
3bb9cb0
to
46cc02c
Compare
46cc02c
to
034b93c
Compare
I'm back with good news - I think this PR is actually fine! I am indending to run it against the whole of crates.io though, I'll report back when that's done.
Exactly. If you think the comment above the function could be improved to make this clearer, please let me know (I pushed a tweak to make the English a bit better).
It is a bit subtle, so let's see if I can convince you (and if I can't, maybe it's wrong)! Let's say we have a subgraph with A depending on both B and C. A and B were activated, but C failed and we're now backtracking. So, what could change this failure situation?
But let's say we're backtracking and haven't yet gone past A (or 2, 3, 4 would have us trying again by now). So, what other situation could unstick us? The requirements of A won't change before backtracking past it, so we need the version requirements to relax elsewhere to possibly permit one of A's candidates for C. The only thing we take into account when finding remaining candidates is the list of previous activations for a dep: cargo/src/cargo/core/resolver/mod.rs Lines 566 to 569 in fac7e25
So:
And that's it! Either the parent is backtracked or activations have changed - the only two ways dependency requirements may change and make a failed activation succeed. Some additional thoughts: Note that (from my reading) dependency resolution is NP-complete so there are no easy solutions. This link mentions heuristics that aptitude uses (we use the heuristic of choosing most constrained packages first), but it's worth being wary - heuristics tend to just shift the worst cases to be less common. I believe this PR is an actual optimisation, but I'm wary after my previous uncertainty, hence my running against crates.io. One additional route to consider is faster data structures, but given that this improves the terrible case I originally saw, I'm happy. Here are my notes on what I thought was wrong: Imagine a new subgraph:
Let's say we resolved from left to right, up to the D (under A) at which point it failed. We backtrack, trying the candidates of D (under C), but still no joy - the selected versions of A and C (under A) have irreconcilable requirements for the version of D. However, alternatives to C (under A) may have different requirements for D that can reconcile with what A wants. These effects are transitive, as backtracking all the way back to B could affect choices of C, which then affects D. (it turns out that this is fine - as soon as D (under C) is backtracked, activations for D change and we instantly try the next possible candidate. I added a test for this scenario) |
Thank you! This is a very clear explanation, and concurs with the way I read the code. Another explanation, based on conflict-driven SAT concepts:
I.E. We can backjump until our parent is no longer active or until number of activated versions of this crate have changed. If you are running against all of crates.io then can I ask:
Also, I think this would be a good place for property based testing. Bild random graphs and see if they resolve the same. Just a someday thought. |
Ok thanks so much for the explanations @aidanhs and @Eh2406, I'm now thoroughly convinced :) @aidanhs whenever you're ready to land this just let me know, although I'd be ready to land it whenever, I'd be ok not needing all of crates.io. @Eh2406's explanation is also what really made this click for me, so if you want to link that comment from the code, that'd also be great!
Certainly! I'm sort of wary of this helping much though as most of the time the bad cases I've seen are moreso because of O(??) behavior rather than "if I make this 2x faster it's reasonable". In that sense I see switching data structures as not changing the O(??) runtime but rather the constant factors in play. That being said right now cloning the
Definitely! I feel like we've never really had great testing of the resolver... |
I've added the reference to @Eh2406's comment and I think this is ready to go. I did do a run against the whole of crates.io in the end, sha1ing the generated Cargo.locks. It only took about 4 hours (thanks to @Eh2406's
So that's 5 crates that previously wouldn't resolve within a minute, now resolving almost instantly. Of particular note is that this affects the one and only version of slog-retry! @vorner, out of curiosity had you noticed this slow generation of lockfiles for slog-retry? I don't know what's up with I also analysed my results as requested by @Eh2406, based on the user cpu time consumed by cargo. I caution you to verify these yourself - the machine was otherwise idle while doing this, but I did not do repeated runs.
In this collapsible section I've documented the collection/analysis process along with my scripts. Collection and analysis process
The scripts I used are below: bench_locking.py
analyse.py
|
I am glad this looks good! The list of things that timed out may be really interesting for finding more performance wins. |
📌 Commit a85f9b6 has been approved by |
Make resolution backtracking smarter There's no need to try every candidate for every dependency when backtracking - instead, only try candidates if they *could* change the eventual failure that caused backtracking in the first place, i.e. 1. if we've backtracked past the parent of the dep that failed 2. the number of activations for the dep has changed (activations are only ever added during resolution I believe) The two new tests before: ``` $ /usr/bin/time cargo test --test resolve -- --test-threads 1 --nocapture resolving_with_constrained_sibling_ Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs Running target/debug/deps/resolve-19b2a13e5a19eed8 38.45user 2.16system 0:42.00elapsed 96%CPU (0avgtext+0avgdata 47672maxresident)k 0inputs+1664096outputs (0major+10921minor)pagefaults 0swaps ``` After: ``` $ /usr/bin/time cargo test --test resolve -- --test-threads 1 --nocapture resolving_with_constrained_sibling_ Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs Running target/debug/deps/resolve-19b2a13e5a19eed8 [...] 0.36user 0.01system 0:00.49elapsed 76%CPU (0avgtext+0avgdata 47464maxresident)k 0inputs+32outputs (0major+11602minor)pagefaults 0swaps ``` You observe the issue yourself with the following (it should fail, but hangs for a while instead - I didn't bother timing it and waiting for it to finish. With this PR it terminates almost instantly): ``` $ cargo new --bin x Created binary (application) `x` project $ /bin/echo -e 'serde = "=1.0.9"\nrust-s3 = "0.8"' >> x/Cargo.toml $ cd x && cargo build Updating registry `https://github.com/rust-lang/crates.io-index` Resolving dependency graph... ```
☀️ Test successful - status-appveyor, status-travis |
@Eh2406 these are the crates that appeared to time out:
|
Thank you! I will try to use these as a real world benchmark, along with Servo. |
Conflict tracking This is an alternative implementation of #4834. This is slower but hopefully more flexible and clearer. The idea is to keep a list of `PackageId`'s that have caused us to skip a `Candidate`. Then we can use the list when we are backtracking if any items in our list have not been activated then we will have new `Candidate`'s to try so we should stop backtracking. Or to say that another way; We can continue backtracking as long as all the items in our list is still activated. Next this new framework was used to make the error messages more focused. We only need to list the versions that conflict, as opposed to all previously activated versions. Why is this more flexible? 1. It is not limited to conflicts within the same package. If `RemainingCandidates.next` skips something because of a links attribute, that is easy to record, just add the `PackageId` to the set `conflicting_prev_active`. 2. Arbitrary things can add conflicts to the backtracking. If we fail to activate because some dependency needs a feature that does not exist, that is easy to record, just add the `PackageId` to the set `conflicting_activations`. 3. All things that could cause use to fail will be in the error messages, as the error messages loop over the set. 4. With a simple extension, replacing the `HashSet` with a `HashMap<_, Reason>`, we can customize the error messages to show the nature of the conflict. @alexcrichton, @aidanhs, Does the logic look right? Does this seem clearer to you?
Faster resolver: Cache past conflicting_activations, prevent doing the same work repeatedly. This work is inspired by @alexcrichton's [comment](#4066 (comment)) that a slow resolver can be caused by all versions of a dependency being yanked. Witch stuck in my brain as I did not understand why it would happen. If a dependency has no candidates then it will be the most constrained and will trigger backtracking in the next tick. Eventually I found a reproducible test case. If the bad dependency is deep in the tree of dependencies then we activate and backtrack `O(versions^depth)` times. Even tho it is fast to identify the problem that is a lot of work. **The set up:** 1. Every time we backtrack cache the (dep, `conflicting_activations`). 2. Build on the work in #5000, Fail to activate if any of its dependencies will just backtrack to this frame. I.E. for each dependency check if any of its cached `conflicting_activations` are already all activated. If so we can just skip to the next candidate. We also add that bad `conflicting_activations` to our set of `conflicting_activations`, so that we can... **The pay off:** If we fail to find any candidates that we can activate in lite of 2, then we cannot be activated in this context, add our (dep, `conflicting_activations`) to the cache so that next time our parent will not bother trying us. I hear you saying "but the error messages, what about the error messages?" So if we are at the end `!has_another` then we disable this optimization. After we mark our dep as being not activatable then we activate anyway. It won't resolve but it will have the same error message as before this PR. If we have been activated for the error messages then skip straight to the last candidate, as that is the only backtrack that will end with the user. I added a test in the vain of #4834. With the old code the time to run was `O(BRANCHING_FACTOR ^ DEPTH)` and took ~3min with DEPTH = 10; BRANCHING_FACTOR = 5; with the new code it runs almost instantly with 200 and 100.
Faster resolver: clean code and the `backtrack_stack` This is a small extension to #5168 and is inspired by #4834 (comment) After #5168 these work (and don't on cargo from nightly.): - `safe_core = "=0.22.4"` - `safe_vault = "=0.13.2"` But these don't work (and do on cargo from this PR.) - `crust = "=0.24.0"` - `elastic = "=0.3.0"` - `elastic = "=0.4.0"` - `elastic = "=0.5.0"` - `safe_vault = "=0.14.0"` It took some work to figure out why they are not working, and make a test case. This PR remove use of `conflicting_activations` before it is extended with the conflicting from next. #5187 (comment) However the `find_candidate(` is still needed so it now gets the conflicting from next before being called. It often happens that the candidate whose child will fail leading to it's failure, will have older siblings that have already set up `backtrack_frame`s. The candidate knows that it's failure can not be saved by its siblings, but sometimes we activate the child anyway for the error messages. Unfortunately the child does not know that is uncles can't save it, so it backtracks to one of them. Leading to a combinatorial loop. The solution is to clear the `backtrack_stack` if we are activating just for the error messages. Edit original end of this message, no longer accurate. #5168 means that when we find a permanent problem we will never **activate** its parent again. In practise there afften is a lot of work and `backtrack_frame`s between the problem and reactivating its parent. This PR removes `backtrack_frame`s where its parent and the problem are present. This means that when we find a permanent problem we will never **backtrack** into it again. An alternative is to scan all cashed problems while backtracking, but this seemed more efficient.
Public dependency refactor and re-allow backjumping There were **three** attempts at vanquishing exponential time spent in Public dependency resolution. All failures. All three started with some refactoring that seams worth saving. Specifically the data structure `public_dependency` that is used to test for Public dependency conflicts is large, tricky, and modified in line. So lets make it a type with a name and move the interactions into methods. Next each attempt needed to know how far back to jump to undo any given dependency edge. I am fairly confident that any full solution will need this functionality. I also think any solution will need a way to represent richer conflicts than the existing "is this pid active". So let's keep the `still_applies` structure from the last attempt. Last each attempt needs to pick a way to represent a Public dependency conflict. The last attempt used three facts about a situation. - `a1`: `PublicDependency(p)` witch can be read as the package `p` can see the package `a1` - `b`: `PublicDependency(p)` witch can be read as the package `p` can see the package `b` - `a2`: `PubliclyExports(b)` witch can be read as the package `b` has the package `a2` in its publick interface. This representation is good enough to allow for `backjumping`. I.E. `find_candidate` can go back several frames until the `age` when the Public dependency conflict was introduced. This optimization, added for normal dependencies in #4834, saves the most time in practice. So having it for Public dependency conflicts is important for allowing real world experimentation of the Public dependencies feature. We will have to alter/improve/replace this representation to unlock all of the important optimizations. But I don't know one that will work for all of them and this is a major step forward. Can be read one commit at a time.
There's no need to try every candidate for every dependency when backtracking - instead, only try candidates if they could change the eventual failure that caused backtracking in the first place, i.e.
The two new tests before:
After:
You observe the issue yourself with the following (it should fail, but hangs for a while instead - I didn't bother timing it and waiting for it to finish. With this PR it terminates almost instantly):