Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve MyPy performance #10864

Closed
Eric-Arellano opened this issue Sep 25, 2020 · 22 comments · Fixed by #16276
Closed

Improve MyPy performance #10864

Eric-Arellano opened this issue Sep 25, 2020 · 22 comments · Fixed by #16276

Comments

@Eric-Arellano
Copy link
Contributor

This is non-trivial because MyPy's cache is not append-only and mutates itself. We need a way to safely use its cache in a way that still works with things like remote execution and how we partition MyPy runs.

@stuhood
Copy link
Member

stuhood commented Sep 25, 2020

Improve MyPy performance by leveraging its cache

Or by parenting instances of its daemon: see https://docs.google.com/document/d/1n_MVVGjrkTKTPKHqRPlyfFzQyx2QioclMG_Q3DMUgYk/edit#bookmark=id.rla7dsf86tib

@cosmicexplorer
Copy link
Contributor

cosmicexplorer commented Sep 27, 2020

It's possible this could be accomplished with an extension of the proposal in #10870 to allow non-append-only caches, by re-snapshotting the cache directory on each process execution. This would make it remote-friendly without any further work after #10870 is implemented. I have noted this use case as a possible extension in that issue.

I'm less familiar with how we plan to implement parenting instances of the mypy daemon, but I think it seems very clear we would want to support persistent processes somehow and this seems like exactly the right use case to try that on. It's possible that these two approaches are 100% complementary (parenting the daemon, and snapshotting the cache directory). I can't tell which approach would be more immediately useful.

@stuhood
Copy link
Member

stuhood commented Sep 27, 2020

It's possible this could be accomplished with an extension of the proposal in #10870 to allow non-append-only caches, by re-snapshotting the cache directory on each process execution. This would make it remote-friendly without any further work after #10870 is implemented. I have noted this use case as a possible extension in that issue.

The non-append-only-ness has more to do with not being concurrency-safe/shareable, for a few reasons:

  1. it expects to be located inside the buildroot, and uses absolute paths. locating it outside of the buildroot in a non-buildroot-specific absolute path might require rewriting files (similar to what we did for Zinc, which was ... painful/error-prone)
  2. it doesn't expect to have multiple writers (which is probably the key differentiator for an "append-only" or well behaved cache) but we do expect to have multiple concurrent mypy processes (although perhaps we could constrain to one instance per interpreter-constraint at a time)

A few pieces that don't really fit together into a complete story:

  • Maybe possible to invoke recursively (per-target), and do lots of re-writing (both paths and timestamps) of process cache CAS entries (nb: not mutable caches) in wrapper code during per-process setup/teardown to allow them to be used at relative paths within the sandbox directory, but be stored elsewhere.
  • Keep instance(s) of the mypy daemon (or a Bazel-worker-API-like wrapper around the basic CLI API) up in dedicated/owned working directories, and then "sync" relevant files into the workspace (if they use file watching... or maybe FUSE otherwise?)

@cosmicexplorer
Copy link
Contributor

I had totally missed the absolute paths part of mypy. Either of the bullet points at the bottom are quite exciting to consider. I really like the idea of invoking recursively, as I suspect we can make the path rewriting extremely fast as well as generic enough to apply to other tools that write absolute paths.

I would also really like to dive into using FUSE, like so, so much. I created this project which makes it super easy (ideally) to virtualize all of the I/O of a JVM process: https://github.com/cosmicexplorer/upc/blob/7dc4d0ae3219e014b00de62adfb415432f6c0850/local/virtual-cli/client/MainWrapper.scala#L108-L122, but I think possibly the entire thing could be deleted if we could instead have pants create very-high-performance FUSE instances. Especially given @eed3si9n's success with virtualizing zinc's I/O (http://eed3si9n.com/cached-compilation-for-sbt), I think that there is reason to believe FUSE would be a winning strategy that could be generic enough to avoid tons of separate incidental complexity.

I'll probably look into reviving brfs and incorporating some of the ideas from https://github.com/cosmicexplorer/upc.

@cosmicexplorer
Copy link
Contributor

cosmicexplorer commented Sep 27, 2020

Hm. So after investigating a .mypy_cache/ directory from the pex repo a bit, I found something somewhat surprising: it appears that the only absolute paths it contains are for types from the stdlib (which I believe should not change at all). The rest actually seem to be...relative to the working directory? I'm not 100% sure how to fit this together yet, but here are the results of my quick investigation:

# The paths are all located at a top-level key 'path'.
> <.mypy_cache/3.5/uuid.meta.json jq '.' | grep '/Users'
(standard input):62:  "path": "/Users/dmcclanahan/tools/pex/.tox/typecheck/lib/python3.8/site-packages/mypy/typeshed/stdlib/2and3/uuid.pyi"
# The only files in the cache are '.json' files, with a lone '.gitignore'.
> find .mypy_cache -type f | sed -re 's#.*(\.[^\.]+)#\1#g' | sort -u
.gitignore
.json
# And it appears all of the paths are relative to the working directory, except ones from the stdlib!
> find .mypy_cache -type f | parallel "echo -n '{}:' && jq -r '.path' <{}" | head -n3
.mypy_cache/3.5/test_resolver.meta.json:tests/test_resolver.py
.mypy_cache/3.5/atexit.meta.json:/Users/dmcclanahan/tools/pex/.tox/typecheck/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/atexit.pyi
.mypy_cache/3.5/pex/testing.data.json:pex/testing.py

It feels like we might be able to make use of a mutable cache here for the .mypy_cache directory, without having to rewrite any files. It would end up reaching out to somewhere on the disk to get the stdlib typeshed stubs, but I believe that can also be configured with a command-line option (so we could resolve mypy into a pex, extract the .deps/, then point it into there).

I think that materializing the typeshed stubs would definitely be a (smallish) slowdown -- but we could consider something like #8905 in that case, extending the symlinks we add to include specific large directories that we don't want to keep materializing. The API could be similar to the existing mutable cache API.

Let me know if something's missing from that.

@cosmicexplorer
Copy link
Contributor

cosmicexplorer commented Sep 27, 2020

Ok, after some more investigation, there are some other unstable fields. For posterity, I'm going to post a gist containing the cache contents for the uuid module: https://gist.github.com/cosmicexplorer/258a9947589fb8d3500da2bd6ec5cea5.

At first glance, the only bad entries in uuid.data.json are just path, which is only absolute for typeshed types, which could be mitigated by materializing them into the execution directory as per #10864 (comment).

The bad entries in uuid.meta.json are more numerous, but the file is also much smaller. path is still there with the same caveats, but in addition we have data_mtime and mtime, which presumably contain the modification times for uuid.meta.json and uuid.data.json. I believe that everything else should remain constant if the mypy version and command-line args are kept constant. There's a platform key, but that should be fine since we are discussing a local cache.

The files aren't that large, so while we could attempt to rewrite them with a regex, it's likely to be safer and not much slower to rewrite them by parsing some json (this could be way wrong). I'm not quite sure yet where we would rewrite them -- if someone could give some pointers as to what we need to be making deterministic here, I would love a tip. This should be enough info to go ahead and implement this, though.

@cosmicexplorer
Copy link
Contributor

cosmicexplorer commented Sep 27, 2020

Also, the mypy cache partitions itself by python version. I believe that should satisfy (from the OP):

how we partition MyPy runs.

I think that the fully recursive method described in #10864 (comment) should possibly be doable now? And I think I'm getting the idea -- if we can reset the data_mtime and mtime to 0 for all json files in the cache after all invocations, and materialize typeshed into the dir instead of letting it pick an absolute path, I think we should be able to possibly just digest the "cache directory" in each invocation for each target and merge the results? That seems 100% remotable too.

There is a concern there about having to scan all the json files in the cache after each mypy invocation. If we could just look for the module owned by the current target, that would involve only scanning two files (the *.data.json and the *.meta.json) after each invocation, resetting their modified timestamps, then digesting them. I think that's what we would do if we did a fully recursive invocation as Stu described above, but I would want someone to check my work on that. If we do really have to check all the files in the cache on each invocation, we should consider using a mutable cache, or we should consider using regex from rust instead of parsing and rewriting json from python.

If we strictly merge cache directories from leaf to root, we should be able to avoid redoing a ton of work. However, since the cache directory is (I believe) supposed to be consistent across the whole repo, it's possible we could be a little more intense, and at each depth of a breadth-first search of the dependency closure, merge the cache digests and use them for all mypy invocations at the next depth level of the BFS. I think the latter might be necessary, but would be a little bit (I don't think too much) more complex to implement.

@cosmicexplorer
Copy link
Contributor

cosmicexplorer commented Sep 27, 2020

To be clear, I'm thinking of:

  • Put typeshed types into the input_digest for each mypy execution, and use the CLI option to make it look for types there.
  • Run mypy recursively, once per python target (which I believe is now once per python source file, plus any stub).
  • After every invocation, read all the files (?) in the "cache dir" (which is just a normal process exec output dir), and rewrite their json contents to ensure data_mtime and mtime are set to 0 in all *.meta.json files.
  • In the mypy recursive invocation, merge digests for the mypy "cache dir" at each depth level of a BFS of all targets and their transitive dependencies (see details above).

To make it remotable, we need one more thing (I think):

  • Rewrite the platform key in all *.meta.json files to point to the correct platform.

@stuhood
Copy link
Member

stuhood commented Sep 28, 2020

I think that the fully recursive method described in #10864 (comment) should possibly be doable now? And I think I'm getting the idea -- if we can reset the data_mtime and mtime to 0, and materialize typeshed into the dir instead of letting it pick an absolute path, I think we should be able to possibly just digest the "cache directory" in each invocation for each target and merge the results? That seems 100% remotable too.

Yep, that's what I was thinking. As mentioned in slack, the main issue with it is just that it's hard to know how fast or slow it would be (relative to daemonization approaches) without prototyping it.

@stuhood stuhood changed the title Improve MyPy performance by leveraging its cache Improve MyPy performance Nov 5, 2020
@jsirois
Copy link
Contributor

jsirois commented Feb 11, 2021

This is non-trivial because MyPy's cache is not append-only and mutates itself.

This is mostly false if we use the sqlite cache:
https://github.com/python/mypy/blob/538d36481526135c44b90383663eaa177cfc32e3/mypy/metastore.py#L140-L223

Actually ... ditto for the FS based cache which uses an atomic rename for inserts depending on how they use their own store API.

Currently entries are keyed on (path, mtime) so we'd need to get in a patch to make this hash-based. With that though, afaict, we could store a sqlite database in a Process.append_only_caches entry.

@yoav-orca
Copy link
Contributor

We were looking at improving our mypy performance and we saw there's mypy bazel integration.
There's an undocumented flag --bazel that changes the way mypy works in a few ways:

  1. normalizes the path to be relative
  2. returns 0 mtime
  3. changes the cache dir to be the cwd of mypy
  4. Generally delegates the cache consistency checks to bazel
  5. Changes the search path of modules

python/mypy#3912

Can pants exploit this to provide better performance in both CI and local builds?

@stuhood
Copy link
Member

stuhood commented Dec 20, 2021

Sorry for the long delay responding!

We were looking at improving our mypy performance and we saw there's mypy bazel integration.
There's an undocumented flag --bazel that changes the way mypy works in a few ways:

Yes, this would likely be beneficial. I had read through that thread, but totally missed that that flag actually existed on main as undocumented.

Based on your summary (which is great, thanks), I think that the outcome is that you could treat mypy more like a compiler which consumes the outputs of "compiling"/checking its dependencies. The cache would become a relocatable artifact passed between invokes of mypy through the dependency graph.

If so, there are some great incrementality benefits there: although cold performance for an entire repository might be a little bit lower, check would actually be able hit caches, and would have much much better incremental performance by running only for transitively changed files (again, similar to compilation).

So yea: definitely worth looking into.

@yoav-orca
Copy link
Contributor

@stuhood do we have any other compiler examples that I can look at and try to hack? do you feel it's a weekend project?

@stuhood
Copy link
Member

stuhood commented Dec 22, 2021

This is likely to be a little bit of an undertaking.

Python allows import cycles and would be checked at the file level, so it's probably most like the JVM. If you have familiarity with how JVM compilation works, then I could try to work up an explanation for this one. To deal with cycles, it would involve launching mypy recursively on CoarsenedTarget nodes: each "coarsened" (i.e. cycle broken: see the link) target would then recursively depend on the mypy output for each of its dependencies.

All that is relatively straightforward, and very similar to what we do for the JVM. But virtualenvs are built up differently from how a classpath can be composed (classpaths are "just" lists of files, whereas Python artifacts have an install step). We've experimented with recursive composition of Python artifacts, but it's likely that for the purposes of mypy we'd need to provide all of the thirdparty dependencies as a single virtualenv instead (rather than building a precise virtualenv per recursive mypy invoke), similar to what @benjyw did with #13732.

So: all together then (but in relatively low detail, because this is probably a 1-2k line change):

  1. Compute CoarsenedTarget roots for the input addresses (example).
  2. For each root, determine which thirdparty resolve to use (cc @Eric-Arellano), and build a "repository"-"all-requirements" PEX for it (á la An option to test/run/repl against the entire lockfile. #13732).
  3. Create a dataclass "request" type that contains both the CT and the resolve, and kick off recursive mypy checking with the request.
  4. The recursive @rule to check mypy would look a lot like a JVM compile @rule: the best example of that is probably scalac, which:
    a. Requests the outputs of mypy for its dependencies,
    b. Mashes together the sources of all of the Targets inside the CoarsenedTargets (basically: all of the files within an import cycle),
    c. Sets up the thirdparty inputs and dependencies for the tool invoke,
    d. Invokes the tool, and captures its outputs.

The existing mypy rules would also be really helpful to look at.

@stuhood
Copy link
Member

stuhood commented Jan 5, 2022

Also, if we were going to be invoking mypy on much smaller sets of files, #14070 is likely a blocker (in order to reduce per-invoke overheads). There is also some discussion on pex-tool/pex#1423 of whether the overheads would even be low enough to support invoking at this granularity.

@stuhood
Copy link
Member

stuhood commented Jan 20, 2022

Also, if we were going to be invoking mypy on much smaller sets of files, #14070 is likely a blocker (in order to reduce per-invoke overheads). There is also some discussion on pantsbuild/pex#1423 of whether the overheads would even be low enough to support invoking at this granularity.

OR, it's possible that CoarsenedTarget could be expanded to batch files (á la #14186) while coarsening. That would have the effect of invoking mypy on a "chunkier" graph of files, which could still get cache hits for chunks, without paying the sandbox setup cost for individual files. This would also benefit the JVM.

@thejcannon
Copy link
Member

We were looking at improving our mypy performance and we saw there's mypy bazel integration. There's an undocumented flag --bazel that changes the way mypy works in a few ways:
1. normalizes the path to be relative
2. returns 0 mtime
3. changes the cache dir to be the cwd of mypy
4. Generally delegates the cache consistency checks to bazel
5. Changes the search path of modules

python/mypy#3912
Can pants exploit this to provide better performance in both CI and local builds?

I looked into this today. mypy certainly doesn't make this easy.

2 is the hot ticket here really.

3 is actually really annoying because we've littered CWD (and additional subdirs) with cache files. We can work around this by playing tricks with CWD, however we run into a second problem: this cache would ideally be shared (a la append_only_caches) but then we need to atomically write the files and figure out the "how to add and not edit" issue which is almost prohibibitively hard. (The atomicity is hard too because mypy uses 2 files for metadata, so there's almost guaranteed to be a race condition).

I think if we reeeeeeeeeally wanted to we could:

  1. Initialize the sandbox with a copy of the cache (using append_only_caches)
  2. Run mypy using the copy cache as CWD. Also for every source file, pass --cache-mapping and use a file whose name is unique based on the file's hash (this way we only append)
  3. After mypy runs, atomically copy files from CWD to the real cache (and here "atomically" doesn't just mean a single file, but the cache data file-pairs. Or copy them "in the right order" e.g. the data then the meta)

@Eric-Arellano
Copy link
Contributor Author

Something like that might be worth it...disregarding Pants and running MyPy directly, its performance is pretty dramatic between cold vs warm runs. We are wasting a lot of developer productivity time by not leveraging MyPy's cache.

@thejcannon
Copy link
Member

Actually --skip-cache-mtime-check "works". So the only challenge here might be a repo-specific subdir in the named cache.

thejcannon added a commit that referenced this issue Jul 23, 2022
Leverages mypy's cache to allow for "incremental" runs. The cache is an `append_only_cache` which is a bit disingenuous here, but AFAICT mypy is process-safe w.r.t. the cache (I'll do more digging to be very sure).

Fixes #10864

[ci skip-rust]
[ci skip-build-wheels]
@benjyw
Copy link
Contributor

benjyw commented Jul 24, 2022

@thejcannon , this is awesome! For posterity, can you explain why it is OK after all to use an append_only_cache ?

@thejcannon
Copy link
Member

I'll comment on PR

jyggen pushed a commit to jyggen/pants that referenced this issue Jul 27, 2022
…16276)

Leverages mypy's cache to allow for "incremental" runs. The cache is an `append_only_cache` which is a bit disingenuous here, but AFAICT mypy is process-safe w.r.t. the cache (I'll do more digging to be very sure).

Fixes pantsbuild#10864

[ci skip-rust]
[ci skip-build-wheels]
@wimax-grapl
Copy link
Contributor

(Also linking #16290 to this ticket)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

Successfully merging a pull request may close this issue.

8 participants