Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An option to test/run/repl against the entire lockfile. #13732

Merged
merged 3 commits into from
Dec 2, 2021

Conversation

benjyw
Copy link
Contributor

@benjyw benjyw commented Nov 29, 2021

This cuts out the subsetting, which can add a lot of time
to runs, and can also bloat the cache dramatically.

One downside is less hermetic runs - you may be incidentally
relying on requirements that you don't depend on but happen to
be present in the lockfile (but this seems like less
of an issue with dep inference).

A more significant downside is that any changes to the lockfile
will invalidate all tests. Still, this is a performance knob that
users can decide how to use. In many cases that may be a very
acceptable consequence.

[ci skip-rust]

[ci skip-build-wheels]

This cuts out the subsetting, which can add a lot of time
to runs, and can also bloat the cache dramatically.

One downside is less hermetic runs - you may be incidentally
relying on requirements that you don't depend on but happen to
be present in the lockfile (but this seems like less
of an issue with dep inference).

A more significant downside is that any changes to the lockfile
will invalidate all tests. Still, this is a performance knob that
users can decide how to use. In many cases that may be a very
acceptable consequence.

[ci skip-rust]

[ci skip-build-wheels]
@jsirois
Copy link
Contributor

jsirois commented Nov 29, 2021

What ended up being the status on data collected backing this? It's no doubt true, but it would be great to have reference to some numbers too. Having to resort to this is a real problem it would be great to solve 1st class.

@jsirois
Copy link
Contributor

jsirois commented Nov 29, 2021

and can also bloat the cache dramatically

That's actually surprising (to me) and would be super good to have numbers for. The packed layout should mean we have less cache bloating. If a wheel version doesn't change, its packed zip shouldn't either, and any subset including that wheel / zip should use the same cache entry for that particular wheel / zip - this is where the merkle should be dancing in our favor..

@benjyw
Copy link
Contributor Author

benjyw commented Nov 29, 2021

What ended up being the status on data collected backing this? It's no doubt true, but it would be great to have reference to some numbers too. Having to resort to this is a real problem it would be great to solve 1st class.

In the case I was debugging the traces were so huge that they crashed the Zipkin browser, so it's been hard to get hard data. But anecdotally this has been a big problem (those users noticing that significant time is spent in subsetting, at least according to the dynamic UI).

The cache bloat in question is not the lmdb_store but the pex_root named cache, which as far as I know contains loose files in venvs. I believe they are hard links under the covers, so actual storage is not being consumed unnecessarily, but the huge number of files appears to be a problem (it can take those users several minutes to clean out the cache after just a few hours of usage). And it also looks like this interacts badly with CI caches - I don't believe those are optimized for hard link farms, so my guess is you end up with duplicate inodes after saving and restoring the cache dir.

Anecdotally, I'm having what appears to be a snappier experience even running in the Pants repo, by avoiding all the subsetting, but that could just be bias.

@benjyw
Copy link
Contributor Author

benjyw commented Nov 29, 2021

BTW I'm not sure why having to resort to this is a problem per se - the more rigorous degree of isolation may be overkill for many users. If you're not using Pants you're very likely running all your tests in a single venv already, and many people are fine with that.

@jsirois
Copy link
Contributor

jsirois commented Nov 29, 2021

The cache bloat in question is not the lmdb_store but the pex_root named cache

Aha - OK. That's good to know since it at least says packed wasn't wasted effort.

Anecdotally, I'm having what appears to be a snappier experience even running in the Pants repo, by avoiding all the subsetting, but that could just be bias.

Well, by definition, it should be snappier. Subsetting is more work full stop. I tried to minimize the work, but, IIRC its still O(100ms) typically.

BTW I'm not sure why having to resort to this is a problem per se

Well, its just that this means we need to be continually worried as devs that we may need to provide an escape hatch / be less precise. Since Pants currently has 0 always run in CI benchmarks or load tests - that's problematic in my mind.

@benjyw
Copy link
Contributor Author

benjyw commented Dec 2, 2021

@stuhood any thoughts on this?

Copy link
Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it's ok to land this given how small the implementation ended up being after the refactoring.

In the case I was debugging the traces were so huge that they crashed the Zipkin browser, so it's been hard to get hard data. But anecdotally this has been a big problem (those users noticing that significant time is spent in subsetting, at least according to the dynamic UI).

Traces can be filtered: let me know if you need help with that.

It would be really good to actually collect performance numbers with/without and to then work on more deeply understanding why there is a difference. Not having done that before implementing this feature remains unsettling, but since it is not safe enough to be enabled by default, we're not free from the burden of figuring the problem out and fixing the performance of the default.

src/python/pants/backend/python/subsystems/setup.py Outdated Show resolved Hide resolved
@benjyw
Copy link
Contributor Author

benjyw commented Dec 2, 2021

I think why there is a difference is fairly obvious: subsetting is extra work, that can massively bloat the pex_root named cache. There is only so much fixing we can do to that.

TBH I don't think I agree that this is unsafe to enable by default (although I'm not necessarily advocating that we do so). Running all tests in a single virtualenv with the union of all dependencies is pretty much the standard behavior in ~every non-Pants repo (with the caveat that I don't know what Bazel does here).

What is an example of a case where having extra deps present at test time is a correctness/safety problem? The only likely one I can think of is that you're missing a dep and that gets masked, so your binary fails at runtime. But with dep inference that is far less likely, and in cases where that happens the fix should probably be to improve dep inference.

But in cases where this performance issue kicks in users may well want to also deploy against the entire lockfile (possibly using a "skinny pex" a la #13358). In which case testing against the entire lockfile is arguably the right thing to do.

More generally, since code under test is not deployed on its own, the exact set of runtime deps present for some code is ~always going to be different than the exact set of test-time deps present for its unittest. And the difference can already run both ways: Obviously the code might see deps at runtime that it hasn't seen at test time, but we can also see deps at test time that we won't see at runtime, due to deps of the test file itself. We accept both of these cases as reasonable today, and it's not clear to me that one is worse than the other.

So I don't think there's a general-purpose correctness argument here. Therefore I think the choice should come down to performance - cache granularity cost vs subsetting cost, and that tradeoff will be different for different repos (and maybe in some cases correctness will play a role in the decision) but giving users this tradeoff knob seems valuable in practice.

[ci skip-rust]

[ci skip-build-wheels]
# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
@Eric-Arellano
Copy link
Contributor

So I don't think there's a general-purpose correctness argument here.

Re correctness, good point on test already diverging from production dependencies. The main place this PR really risks hurting with correctness is run vs. package diverging more than before. I agree that it's acceptable.

@stuhood
Copy link
Member

stuhood commented Dec 2, 2021

I think why there is a difference is fairly obvious: subsetting is extra work, that can massively bloat the pex_root named cache. There is only so much fixing we can do to that.

The subsetting itself should not affect the pex_root size at all, since "packed" PEXes have completely shared content (each unpacked whl should exist exactly once in the pex_root, afaik).

I expect that the size of the pex_root is because running the test will create an actual long lived virtualenv under the pex_root with loose files, etc.

But actually profiling is the answer, rather than asserting that the problem is unsolvable.

TBH I don't think I agree that this is unsafe to enable by default (although I'm not necessarily advocating that we do so). Running all tests in a single virtualenv with the union of all dependencies is pretty much the standard behavior in ~every non-Pants repo (with the caveat that I don't know what Bazel does here).

What is an example of a case where having extra deps present at test time is a correctness/safety problem? The only likely one I can think of is that you're missing a dep and that gets masked, so your binary fails at runtime. But with dep inference that is far less likely, and in cases where that happens the fix should probably be to improve dep inference.

Any tool which loads things reflectively based on filename patterns would be subject to loading "too much" stuff in the test: so, dependencies with plugin systems. A common example is django, where config is loaded reflectively from certain file names: fetching django apps as thirdparty may not be common, but having every test load all of the modules would definitely affect its behavior (and cause it to fail in many cases).

We accept both of these cases as reasonable today, and it's not clear to me that one is worse than the other.

Sure, but that's something that would clearly be good to fix, for all the same reasons this isn't safe to enable by default.

@benjyw
Copy link
Contributor Author

benjyw commented Dec 2, 2021

Sure, but that's something that would clearly be good to fix, for all the same reasons this isn't safe to enable by default.

What constitutes a fix depends on specific circumstances, I don't see a universally correct answer here. There may be cases in which depending on the entire lockfile at test time is actually the most correct thing to do (e.g., if the entire lockfile is known to be present at runtime). Declaring that "testing code only against its inferred and explicit deps and nothing else" is de-facto the most correct behavior seems arbitrary, even if we could actually do that, which would require shading all the test's deps.

@benjyw benjyw merged commit beefb1a into pantsbuild:main Dec 2, 2021
@benjyw benjyw deleted the use_repository_pex2 branch December 2, 2021 21:21
@stuhood
Copy link
Member

stuhood commented Dec 3, 2021

Looking at the description of pex-tool/pex#1438 again, the fact that the venv mode uses more space compared to the default is called out pretty directly. Perhaps another interesting knob to expose would be to toggle between latency and space savings for internal PEXes... although obviously it would be great to be able to have our cake and eat it too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants