-
-
Notifications
You must be signed in to change notification settings - Fork 640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
An option to test/run/repl against the entire lockfile. #13732
Conversation
This cuts out the subsetting, which can add a lot of time to runs, and can also bloat the cache dramatically. One downside is less hermetic runs - you may be incidentally relying on requirements that you don't depend on but happen to be present in the lockfile (but this seems like less of an issue with dep inference). A more significant downside is that any changes to the lockfile will invalidate all tests. Still, this is a performance knob that users can decide how to use. In many cases that may be a very acceptable consequence. [ci skip-rust] [ci skip-build-wheels]
What ended up being the status on data collected backing this? It's no doubt true, but it would be great to have reference to some numbers too. Having to resort to this is a real problem it would be great to solve 1st class. |
That's actually surprising (to me) and would be super good to have numbers for. The packed layout should mean we have less cache bloating. If a wheel version doesn't change, its packed zip shouldn't either, and any subset including that wheel / zip should use the same cache entry for that particular wheel / zip - this is where the merkle should be dancing in our favor.. |
In the case I was debugging the traces were so huge that they crashed the Zipkin browser, so it's been hard to get hard data. But anecdotally this has been a big problem (those users noticing that significant time is spent in subsetting, at least according to the dynamic UI). The cache bloat in question is not the lmdb_store but the pex_root named cache, which as far as I know contains loose files in venvs. I believe they are hard links under the covers, so actual storage is not being consumed unnecessarily, but the huge number of files appears to be a problem (it can take those users several minutes to clean out the cache after just a few hours of usage). And it also looks like this interacts badly with CI caches - I don't believe those are optimized for hard link farms, so my guess is you end up with duplicate inodes after saving and restoring the cache dir. Anecdotally, I'm having what appears to be a snappier experience even running in the Pants repo, by avoiding all the subsetting, but that could just be bias. |
BTW I'm not sure why having to resort to this is a problem per se - the more rigorous degree of isolation may be overkill for many users. If you're not using Pants you're very likely running all your tests in a single venv already, and many people are fine with that. |
Aha - OK. That's good to know since it at least says packed wasn't wasted effort.
Well, by definition, it should be snappier. Subsetting is more work full stop. I tried to minimize the work, but, IIRC its still O(100ms) typically.
Well, its just that this means we need to be continually worried as devs that we may need to provide an escape hatch / be less precise. Since Pants currently has 0 always run in CI benchmarks or load tests - that's problematic in my mind. |
@stuhood any thoughts on this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that it's ok to land this given how small the implementation ended up being after the refactoring.
In the case I was debugging the traces were so huge that they crashed the Zipkin browser, so it's been hard to get hard data. But anecdotally this has been a big problem (those users noticing that significant time is spent in subsetting, at least according to the dynamic UI).
Traces can be filtered: let me know if you need help with that.
It would be really good to actually collect performance numbers with/without and to then work on more deeply understanding why there is a difference. Not having done that before implementing this feature remains unsettling, but since it is not safe enough to be enabled by default, we're not free from the burden of figuring the problem out and fixing the performance of the default.
I think why there is a difference is fairly obvious: subsetting is extra work, that can massively bloat the pex_root named cache. There is only so much fixing we can do to that. TBH I don't think I agree that this is unsafe to enable by default (although I'm not necessarily advocating that we do so). Running all tests in a single virtualenv with the union of all dependencies is pretty much the standard behavior in ~every non-Pants repo (with the caveat that I don't know what Bazel does here). What is an example of a case where having extra deps present at test time is a correctness/safety problem? The only likely one I can think of is that you're missing a dep and that gets masked, so your binary fails at runtime. But with dep inference that is far less likely, and in cases where that happens the fix should probably be to improve dep inference. But in cases where this performance issue kicks in users may well want to also deploy against the entire lockfile (possibly using a "skinny pex" a la #13358). In which case testing against the entire lockfile is arguably the right thing to do. More generally, since code under test is not deployed on its own, the exact set of runtime deps present for some code is ~always going to be different than the exact set of test-time deps present for its unittest. And the difference can already run both ways: Obviously the code might see deps at runtime that it hasn't seen at test time, but we can also see deps at test time that we won't see at runtime, due to deps of the test file itself. We accept both of these cases as reasonable today, and it's not clear to me that one is worse than the other. So I don't think there's a general-purpose correctness argument here. Therefore I think the choice should come down to performance - cache granularity cost vs subsetting cost, and that tradeoff will be different for different repos (and maybe in some cases correctness will play a role in the decision) but giving users this tradeoff knob seems valuable in practice. |
[ci skip-rust] [ci skip-build-wheels]
# Rust tests and lints will be skipped. Delete if not intended. [ci skip-rust] # Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]
Re correctness, good point on |
The subsetting itself should not affect the pex_root size at all, since "packed" PEXes have completely shared content (each unpacked whl should exist exactly once in the pex_root, afaik). I expect that the size of the pex_root is because running the test will create an actual long lived virtualenv under the pex_root with loose files, etc. But actually profiling is the answer, rather than asserting that the problem is unsolvable.
Any tool which loads things reflectively based on filename patterns would be subject to loading "too much" stuff in the test: so, dependencies with plugin systems. A common example is
Sure, but that's something that would clearly be good to fix, for all the same reasons this isn't safe to enable by default. |
What constitutes a fix depends on specific circumstances, I don't see a universally correct answer here. There may be cases in which depending on the entire lockfile at test time is actually the most correct thing to do (e.g., if the entire lockfile is known to be present at runtime). Declaring that "testing code only against its inferred and explicit deps and nothing else" is de-facto the most correct behavior seems arbitrary, even if we could actually do that, which would require shading all the test's deps. |
Looking at the description of pex-tool/pex#1438 again, the fact that the |
This cuts out the subsetting, which can add a lot of time
to runs, and can also bloat the cache dramatically.
One downside is less hermetic runs - you may be incidentally
relying on requirements that you don't depend on but happen to
be present in the lockfile (but this seems like less
of an issue with dep inference).
A more significant downside is that any changes to the lockfile
will invalidate all tests. Still, this is a performance knob that
users can decide how to use. In many cases that may be a very
acceptable consequence.
[ci skip-rust]
[ci skip-build-wheels]