Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abort crate resolution if too many candidates have been tried #4066

Open
alexcrichton opened this issue May 17, 2017 · 32 comments
Open

Abort crate resolution if too many candidates have been tried #4066

alexcrichton opened this issue May 17, 2017 · 32 comments
Labels
A-dependency-resolution Area: dependency resolution and the resolver A-diagnostics Area: Error and warning messages generated by Cargo itself. C-bug Category: bug S-triage Status: This issue is waiting on initial triage.

Comments

@alexcrichton
Copy link
Member

We should have a hard limit for the number of steps the resolution algorithm can take before it just hard aborts and exits saying "this probably isn't gonna work".

Right now the crate resolution algorithm will only return an error if it proves that literally every possible crate resolution graph is not workable. This can take quite a long time for some larger graphs! Ideally we'd also abort with nice error messages "these tended to conflict a lot" etc, but at the very least Cargo shouldn't look like it's infinite looping.

@brson
Copy link
Contributor

brson commented May 22, 2017

I also see this with https://github.com/koute/cargo-web

This is a new development that I have not seen before and it has caused cargobomb to stop working because it hangs when building the lockfile for that crate.

@brson
Copy link
Contributor

brson commented May 22, 2017

cc @koute

@brson
Copy link
Contributor

brson commented May 22, 2017

I'm not sure on how this behavior would have begun to appear on stable between the last cargobomb and this cargobomb run. Both should have run the generate-lockfiles phase using the same stable compiler.

Hm...

@brson
Copy link
Contributor

brson commented May 22, 2017

Oh cargobomb should kill this process after a lengthy timeout. I probably just need to wait 10 minutes or so. So maybe cargobomb is still working and this is not a new behavior for cargo.

@brson
Copy link
Contributor

brson commented May 22, 2017

cargobomb doesn't seem to be timing out. Just hanging forever.

@alexcrichton
Copy link
Member Author

@brson does this reproduce on cargo-web? O rwas this related to something else?

@koute
Copy link
Member

koute commented May 23, 2017

@alexcrichton If you try to do cargo build in cargo-web it just hangs indefinitely.

It looks like upgrading hyper-rustls from 0.3 to at least 0.5 fixes the issue. (It starts to compile properly again.) Also, if in my Cargo.toml I change:

hyper-rustls = "0.3"

into:

hyper-rustls = "0.3.3"

it also fixes the hanging and instead spews out this error:

error: no matching version `^0.10` found for package `webpki` (required by `rustls`)
location searched: registry https://github.com/rust-lang/crates.io-index
versions found: 0.12.1

I've already pushed a fix to master, but in case you'd want to reproduce this issue again you can use commit 3b17eff93402bfd160ee944932aedb45298a0421.

@alexcrichton
Copy link
Member Author

@koute I think that's an example of an un-resolvable resolution graph again as well (what this issue is). All versions of webpki but the most recent are yanked, so if you need to update a dependency on webpki there's nothing to choose from.

@drandreaskrueger
Copy link

Is this related? cargo install parity needs 4 hours ... to find out it can't.

@alexcrichton
Copy link
Member Author

@drandreaskrueger ah yeah unfortunately that's this issue :(

@drandreaskrueger
Copy link

thx, @alexcrichton ... 6 hours now ;-)

@sunshowers
Copy link
Contributor

sunshowers commented Sep 8, 2017

Just chiming in here a little bit -- FB has been hitting this problem for a couple of months now. This prevents us from updating our pseudocrate which contains all our third-party dependencies.

Has there been any discussion around using a SAT or SMT solver like Z3 for dependency resolution? I found #2064 which has some brief discussion, but the only substantive thing there appears to be Graydon's test which shows how to offload semver optimization to Z3.

(I generally believe that people should be reaching for SAT solvers far more often than they do today.)

@alexcrichton
Copy link
Member Author

@Sid0 there's been discussion of "we should probably do it" but no serious design of how it would actually be done yet.

Also to be clear, this issue is primarily about "this crate graph is not resolvable, and Cargo isn't quickly telling you that". In more cases than not we've found today that if there's a resolvable graph Cargo will reach it quickly-ish. Not true for all cases, but true for most. We likely want a different issue for "this crate graph can be resolved can Cargo can take forever to conclude that"

@sunshowers
Copy link
Contributor

Hmm, interesting. Is it possible for a Cargo.toml that just has a bunch of * dependencies like in https://gist.github.com/sid0/064738981955029054d74de50c90c49f to lead to an unresolvable graph?

@alexcrichton
Copy link
Member Author

IIRC last I investigate an apparent deadlock w/ y'alls Cargo.toml it was a case of "cargo takes too long to reach the right solution" rather than "there is no solution", so you'd fall in the category of "we'd benefit from a SAT solver" for sure! (my memory may be hazy though)

@sunshowers
Copy link
Contributor

sunshowers commented Sep 8, 2017

Yeah -- that matches my understanding as well.

FWIW I've had a cargo update running on our Cargo.toml for the last 90 minutes or so. Still hasn't completed :)

@alexcrichton
Copy link
Member Author

Heh you may have more luck with ctrl-c :) (in that each time Cargo resolves it may take different paths, so new resolutions may complete quicker).

Note though that the crate graph may still be unresolvable for whatever reason, it's not guaranteed to be resolvable even with * dependencies.

facebook-github-bot pushed a commit to facebookarchive/mononoke that referenced this issue Sep 9, 2017
Summary:
Finally got an update working by removing the `mysql_async` crate.

Some notes:

* The `mysql_async` crate was responsible in this case: see rust-lang/cargo#4066 (comment) for why.
* tokio/futures deprecated a bunch of stuff. I've filed a TODO for now.
* We finally pulled in error-chain 0.11, which has a bunch of nice improvements.

Reviewed By: kulshrax

Differential Revision: D5798282

fbshipit-source-id: a38a7b17ee0205428e2ea63334722aa408582493
@alexcrichton
Copy link
Member Author

why do we not pull in multiple semver-compatible versions of a crate if necessary? I think that if it's a private dependency (as defined in #2064) it should be safe.

You're 100% correct! Right now the thinking is that we're "somewhat in the middle" between the npm strategy of duplicate everything and the rubygems strategy of "only one version of anything". That is, we were initially wary to go the npm route of allowing duplicates everywhere because of binary size blowup concerns, but we wanted to be more flexible than rubygems by allowing multiple version to coexist.

Additionally, though, up to this point we haven't actually had the knowledge of public/private dependencies. My hope is that as soon as we have public/private dependencies we can lift the restriction here and allow semver-compatible duplicates in a dependency graph, so long as the public/shared types all have the same version number.

This is an ecosystem change that would need to be carefully communicated and managed, but hopefully it will make upstreams more careful and the crates.io ecosystem healthier overall.

Yeah I don't think we've done a great job discouraging usage of the ~ operator and encouraging "semver compatible", I think we can definitely do better here!

@carols10cents carols10cents added A-dependency-resolution Area: dependency resolution and the resolver A-diagnostics Area: Error and warning messages generated by Cargo itself. C-bug Category: bug labels Sep 26, 2017
@jacwah
Copy link

jacwah commented Oct 21, 2017

Natalie Weizenbaum talked about how they intend to solve this same problem in the Dart package manager on a recent episode of The Manifest. I'll drop a link if anyone is interested: https://manifest.fm/5. It sound like it is in early research stages, but boils down to using a version of DPLL.

@fenollp
Copy link

fenollp commented Dec 6, 2017

Hopefully the solution to this issue will be concurrent. It can take ages and also uses only one of my many cores! Is it taunting me? :)

EDIT: happening while building git@github.com:gnunicorn/clippy-service.git with cargo 0.24.0-nightly (5bb478a51 2017-11-29).
EDIT: no progress after 59 hours (haha)

@Eh2406
Copy link
Contributor

Eh2406 commented Feb 26, 2018

This I think would just involve changing:

if config.shell().is_err_tty() && !printed && ticks % 1000 == 0
to bail if ticks gets to big. Extra credit to make to big an argument up to the user.

I am still learning cargo internals, but I would be willing to help someone who wanted to take this on!

bors added a commit that referenced this issue Mar 14, 2018
Faster resolver: Cache past conflicting_activations, prevent doing the same work repeatedly.

This work is inspired by @alexcrichton's [comment](#4066 (comment)) that a slow resolver can be caused by all versions of a dependency being yanked. Witch stuck in my brain as I did not understand why it would happen. If a dependency has no candidates then it will be the most constrained and will trigger backtracking in the next tick. Eventually I found a reproducible test case. If the bad dependency is deep in the tree of dependencies then we activate and backtrack `O(versions^depth)`  times. Even tho it is fast to identify the problem that is a lot of work.

**The set up:**
1. Every time we backtrack cache the (dep, `conflicting_activations`).
2. Build on the work in #5000, Fail to activate if any of its dependencies will just backtrack to this frame. I.E. for each dependency check if any of its cached `conflicting_activations` are already all activated. If so we can just skip to the next candidate. We also add that bad `conflicting_activations` to our set of  `conflicting_activations`, so that we can...

**The pay off:**
 If we fail to find any candidates that we can activate in lite of 2, then we cannot be activated in this context, add our (dep, `conflicting_activations`) to the cache so that next time our parent will not bother trying us.

I hear you saying "but the error messages, what about the error messages?" So if we are at the end `!has_another` then we disable this optimization. After we mark our dep as being not activatable then we activate anyway. It won't resolve but it will have the same error message as before this PR. If we have been activated for the error messages then skip straight to the last candidate, as that is the only backtrack that will end with the user.

I added a test in the vain of #4834. With the old code the time to run was `O(BRANCHING_FACTOR ^ DEPTH)` and took ~3min with DEPTH = 10; BRANCHING_FACTOR = 5; with the new code it runs almost instantly with 200 and 100.
@Eh2406
Copy link
Contributor

Eh2406 commented Jul 26, 2018

@dwijnand want to pick this one off next :-)

@dwijnand
Copy link
Member

Alright, I'll have a go.

@dwijnand
Copy link
Member

Ok, I've tried about 4-5 different reproduction in or linked to here and everything's either succeeded or failed fast. 😄 Probably due to the related fixed that have landed.

Does anyone have a reproduction that fails with recent cargo?

@Eh2406
Copy link
Contributor

Eh2406 commented Jul 27, 2018

Sorry, I should have clarified that. I don't know of any that are currently slow. I think I have gone thru every bug report to look for them, I was unable to reproduce or have fixed the resolver for each. (Mostly by reading up on suggestions made in hear.) Then I spent several days fuzzing to find them, nothing over 40 second and then only because I disable one of the Fixes by accident.

So at this point anything that takes more then say a minute, and more than 5_000_000 ticks is something I would like to see! So I'd like there to be cargo to stop and report "It is always possible that this is valid but taking a long time, but It is probably a bug so please report it".

The reason for the message is that legitimately I want to know and fix if anyone hits a new bad case in use.
The reason to stop is that it will probably come up in a CI/Build Server type context and I want people to see it. Including the tests for the fixes I have made, currently if you comment out any of the fixes then the test sweet hangs, it would be better for it to error.

@dwijnand
Copy link
Member

Ah, right, I see! I'll have a look at those tests and try and work off those. Thanks for the info.

@Eh2406
Copy link
Contributor

Eh2406 commented Oct 2, 2018

#5921 added something like this with hard coded stops that is using debug_asserts. So this would now involve making them configurable and return an error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-dependency-resolution Area: dependency resolution and the resolver A-diagnostics Area: Error and warning messages generated by Cargo itself. C-bug Category: bug S-triage Status: This issue is waiting on initial triage.
Projects
None yet
Development

No branches or pull requests