An option to test/run/repl against the entire lockfile. #13732

benjyw · 2021-11-29T18:43:42Z

This cuts out the subsetting, which can add a lot of time
to runs, and can also bloat the cache dramatically.

One downside is less hermetic runs - you may be incidentally
relying on requirements that you don't depend on but happen to
be present in the lockfile (but this seems like less
of an issue with dep inference).

A more significant downside is that any changes to the lockfile
will invalidate all tests. Still, this is a performance knob that
users can decide how to use. In many cases that may be a very
acceptable consequence.

[ci skip-rust]

[ci skip-build-wheels]

This cuts out the subsetting, which can add a lot of time to runs, and can also bloat the cache dramatically. One downside is less hermetic runs - you may be incidentally relying on requirements that you don't depend on but happen to be present in the lockfile (but this seems like less of an issue with dep inference). A more significant downside is that any changes to the lockfile will invalidate all tests. Still, this is a performance knob that users can decide how to use. In many cases that may be a very acceptable consequence. [ci skip-rust] [ci skip-build-wheels]

src/python/pants/backend/python/util_rules/pex_from_targets.py

jsirois · 2021-11-29T19:51:44Z

What ended up being the status on data collected backing this? It's no doubt true, but it would be great to have reference to some numbers too. Having to resort to this is a real problem it would be great to solve 1st class.

jsirois · 2021-11-29T19:55:47Z

and can also bloat the cache dramatically

That's actually surprising (to me) and would be super good to have numbers for. The packed layout should mean we have less cache bloating. If a wheel version doesn't change, its packed zip shouldn't either, and any subset including that wheel / zip should use the same cache entry for that particular wheel / zip - this is where the merkle should be dancing in our favor..

benjyw · 2021-11-29T20:59:52Z

What ended up being the status on data collected backing this? It's no doubt true, but it would be great to have reference to some numbers too. Having to resort to this is a real problem it would be great to solve 1st class.

In the case I was debugging the traces were so huge that they crashed the Zipkin browser, so it's been hard to get hard data. But anecdotally this has been a big problem (those users noticing that significant time is spent in subsetting, at least according to the dynamic UI).

The cache bloat in question is not the lmdb_store but the pex_root named cache, which as far as I know contains loose files in venvs. I believe they are hard links under the covers, so actual storage is not being consumed unnecessarily, but the huge number of files appears to be a problem (it can take those users several minutes to clean out the cache after just a few hours of usage). And it also looks like this interacts badly with CI caches - I don't believe those are optimized for hard link farms, so my guess is you end up with duplicate inodes after saving and restoring the cache dir.

Anecdotally, I'm having what appears to be a snappier experience even running in the Pants repo, by avoiding all the subsetting, but that could just be bias.

benjyw · 2021-11-29T21:00:46Z

BTW I'm not sure why having to resort to this is a problem per se - the more rigorous degree of isolation may be overkill for many users. If you're not using Pants you're very likely running all your tests in a single venv already, and many people are fine with that.

jsirois · 2021-11-29T21:12:45Z

The cache bloat in question is not the lmdb_store but the pex_root named cache

Aha - OK. That's good to know since it at least says packed wasn't wasted effort.

Anecdotally, I'm having what appears to be a snappier experience even running in the Pants repo, by avoiding all the subsetting, but that could just be bias.

Well, by definition, it should be snappier. Subsetting is more work full stop. I tried to minimize the work, but, IIRC its still O(100ms) typically.

BTW I'm not sure why having to resort to this is a problem per se

Well, its just that this means we need to be continually worried as devs that we may need to provide an escape hatch / be less precise. Since Pants currently has 0 always run in CI benchmarks or load tests - that's problematic in my mind.

src/python/pants/backend/python/subsystems/setup.py

benjyw · 2021-12-02T16:25:57Z

@stuhood any thoughts on this?

stuhood

I think that it's ok to land this given how small the implementation ended up being after the refactoring.

In the case I was debugging the traces were so huge that they crashed the Zipkin browser, so it's been hard to get hard data. But anecdotally this has been a big problem (those users noticing that significant time is spent in subsetting, at least according to the dynamic UI).

Traces can be filtered: let me know if you need help with that.

It would be really good to actually collect performance numbers with/without and to then work on more deeply understanding why there is a difference. Not having done that before implementing this feature remains unsettling, but since it is not safe enough to be enabled by default, we're not free from the burden of figuring the problem out and fixing the performance of the default.

src/python/pants/backend/python/subsystems/setup.py

benjyw · 2021-12-02T20:33:14Z

I think why there is a difference is fairly obvious: subsetting is extra work, that can massively bloat the pex_root named cache. There is only so much fixing we can do to that.

TBH I don't think I agree that this is unsafe to enable by default (although I'm not necessarily advocating that we do so). Running all tests in a single virtualenv with the union of all dependencies is pretty much the standard behavior in ~every non-Pants repo (with the caveat that I don't know what Bazel does here).

What is an example of a case where having extra deps present at test time is a correctness/safety problem? The only likely one I can think of is that you're missing a dep and that gets masked, so your binary fails at runtime. But with dep inference that is far less likely, and in cases where that happens the fix should probably be to improve dep inference.

But in cases where this performance issue kicks in users may well want to also deploy against the entire lockfile (possibly using a "skinny pex" a la #13358). In which case testing against the entire lockfile is arguably the right thing to do.

More generally, since code under test is not deployed on its own, the exact set of runtime deps present for some code is ~always going to be different than the exact set of test-time deps present for its unittest. And the difference can already run both ways: Obviously the code might see deps at runtime that it hasn't seen at test time, but we can also see deps at test time that we won't see at runtime, due to deps of the test file itself. We accept both of these cases as reasonable today, and it's not clear to me that one is worse than the other.

So I don't think there's a general-purpose correctness argument here. Therefore I think the choice should come down to performance - cache granularity cost vs subsetting cost, and that tradeoff will be different for different repos (and maybe in some cases correctness will play a role in the decision) but giving users this tradeoff knob seems valuable in practice.

[ci skip-rust] [ci skip-build-wheels]

# Rust tests and lints will be skipped. Delete if not intended. [ci skip-rust] # Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]

Eric-Arellano · 2021-12-02T20:49:12Z

So I don't think there's a general-purpose correctness argument here.

Re correctness, good point on test already diverging from production dependencies. The main place this PR really risks hurting with correctness is run vs. package diverging more than before. I agree that it's acceptable.

stuhood · 2021-12-02T21:11:22Z

I think why there is a difference is fairly obvious: subsetting is extra work, that can massively bloat the pex_root named cache. There is only so much fixing we can do to that.

The subsetting itself should not affect the pex_root size at all, since "packed" PEXes have completely shared content (each unpacked whl should exist exactly once in the pex_root, afaik).

I expect that the size of the pex_root is because running the test will create an actual long lived virtualenv under the pex_root with loose files, etc.

But actually profiling is the answer, rather than asserting that the problem is unsolvable.

TBH I don't think I agree that this is unsafe to enable by default (although I'm not necessarily advocating that we do so). Running all tests in a single virtualenv with the union of all dependencies is pretty much the standard behavior in ~every non-Pants repo (with the caveat that I don't know what Bazel does here).

What is an example of a case where having extra deps present at test time is a correctness/safety problem? The only likely one I can think of is that you're missing a dep and that gets masked, so your binary fails at runtime. But with dep inference that is far less likely, and in cases where that happens the fix should probably be to improve dep inference.

Any tool which loads things reflectively based on filename patterns would be subject to loading "too much" stuff in the test: so, dependencies with plugin systems. A common example is django, where config is loaded reflectively from certain file names: fetching django apps as thirdparty may not be common, but having every test load all of the modules would definitely affect its behavior (and cause it to fail in many cases).

We accept both of these cases as reasonable today, and it's not clear to me that one is worse than the other.

Sure, but that's something that would clearly be good to fix, for all the same reasons this isn't safe to enable by default.

benjyw · 2021-12-02T21:20:32Z

Sure, but that's something that would clearly be good to fix, for all the same reasons this isn't safe to enable by default.

What constitutes a fix depends on specific circumstances, I don't see a universally correct answer here. There may be cases in which depending on the entire lockfile at test time is actually the most correct thing to do (e.g., if the entire lockfile is known to be present at runtime). Declaring that "testing code only against its inferred and explicit deps and nothing else" is de-facto the most correct behavior seems arbitrary, even if we could actually do that, which would require shading all the test's deps.

stuhood · 2021-12-03T18:37:00Z

Looking at the description of pex-tool/pex#1438 again, the fact that the venv mode uses more space compared to the default is called out pretty directly. Perhaps another interesting knob to expose would be to toggle between latency and space savings for internal PEXes... although obviously it would be great to be able to have our cake and eat it too.

benjyw requested review from stuhood, Eric-Arellano and jsirois November 29, 2021 18:43

benjyw commented Nov 29, 2021

View reviewed changes

src/python/pants/backend/python/util_rules/pex_from_targets.py Show resolved Hide resolved

Eric-Arellano approved these changes Dec 1, 2021

View reviewed changes

src/python/pants/backend/python/subsystems/setup.py Outdated Show resolved Hide resolved

src/python/pants/backend/python/subsystems/setup.py Outdated Show resolved Hide resolved

src/python/pants/backend/python/subsystems/setup.py Show resolved Hide resolved

stuhood approved these changes Dec 2, 2021

View reviewed changes

src/python/pants/backend/python/subsystems/setup.py Outdated Show resolved Hide resolved

benjyw added 2 commits December 2, 2021 12:33

Merge branch 'main' into use_repository_pex2

875282d

[ci skip-rust] [ci skip-build-wheels]

Improve option help

8835ac2

# Rust tests and lints will be skipped. Delete if not intended. [ci skip-rust] # Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]

benjyw merged commit beefb1a into pantsbuild:main Dec 2, 2021

benjyw deleted the use_repository_pex2 branch December 2, 2021 21:21

stuhood mentioned this pull request Dec 22, 2021

Improve MyPy performance #10864

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

An option to test/run/repl against the entire lockfile. #13732

An option to test/run/repl against the entire lockfile. #13732

benjyw commented Nov 29, 2021

jsirois commented Nov 29, 2021

jsirois commented Nov 29, 2021 •

edited

Loading

benjyw commented Nov 29, 2021

benjyw commented Nov 29, 2021

jsirois commented Nov 29, 2021 •

edited

Loading

benjyw commented Dec 2, 2021

stuhood left a comment

benjyw commented Dec 2, 2021

Eric-Arellano commented Dec 2, 2021

stuhood commented Dec 2, 2021 •

edited

Loading

benjyw commented Dec 2, 2021

stuhood commented Dec 3, 2021

An option to test/run/repl against the entire lockfile. #13732

An option to test/run/repl against the entire lockfile. #13732

Conversation

benjyw commented Nov 29, 2021

jsirois commented Nov 29, 2021

jsirois commented Nov 29, 2021 • edited Loading

benjyw commented Nov 29, 2021

benjyw commented Nov 29, 2021

jsirois commented Nov 29, 2021 • edited Loading

benjyw commented Dec 2, 2021

stuhood left a comment

Choose a reason for hiding this comment

benjyw commented Dec 2, 2021

Eric-Arellano commented Dec 2, 2021

stuhood commented Dec 2, 2021 • edited Loading

benjyw commented Dec 2, 2021

stuhood commented Dec 3, 2021

jsirois commented Nov 29, 2021 •

edited

Loading

jsirois commented Nov 29, 2021 •

edited

Loading

stuhood commented Dec 2, 2021 •

edited

Loading