Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track mypy performance changes automatically #14187

Closed
JukkaL opened this issue Nov 25, 2022 · 15 comments
Closed

Track mypy performance changes automatically #14187

JukkaL opened this issue Nov 25, 2022 · 15 comments
Assignees
Labels
performance topic-developer Issues relevant to mypy developers

Comments

@JukkaL
Copy link
Collaborator

JukkaL commented Nov 25, 2022

Currently it's difficult to find the root cause of a performance regression. We could automatically profile each mypy commit and generate a report with historical performance information, similar to how we track the performance of mypyc (e.g. https://github.com/mypyc/mypyc-benchmark-results/blob/master/reports/summary-main.md).

Add another benchmark to the mypyc benchmarks runner that type checks a fixed version of mypy using the current compiled mypy.

The benchmark will live in https://github.com/mypyc/mypyc-benchmarks.

@JukkaL JukkaL added topic-developer Issues relevant to mypy developers performance labels Nov 25, 2022
@JukkaL JukkaL self-assigned this Nov 25, 2022
@JukkaL
Copy link
Collaborator Author

JukkaL commented Nov 25, 2022

I've started working on this. Some details:

  • We'll have a vendored copy of mypy in the mypyc-benchmarks repo so that the workload is always the same. This also makes it easy to make minor tweaks to e.g. fix errors when compiled with a newer mypy.
  • We'll need to pin PEP 561 dependencies such as pytest for repeatable results.
  • It would be nice to generate historical data to find older regressions that we might still want to fix, but it's not clear whether we can properly type check a recent mypy with very old mypy versions.
  • When updating to a newer Python or Ubuntu version there will be a discontinuity. We can adjust the old results with a relative performance factor to preserve a continuous serious of measurements -- this is similar to what we've done before with mypyc benchmark results. Re-generating all historical benchmark results is too time consuming.

@hauntsaninja
Copy link
Collaborator

Related hauntsaninja/mypy_primer#30

@JukkaL
Copy link
Collaborator Author

JukkaL commented Nov 28, 2022

Additional context: The mypyc benchmarks are run on a dedicated host that has been tweaked for precise performance measurements (e.g. no turbo boost, automatic updates disabled). I hope that we will able to detect performance regressions below 0.5% (stretch goal: detect a 0.1% regression).

@JukkaL
Copy link
Collaborator Author

JukkaL commented Nov 30, 2022

Based on initial results, detecting regressions that are under 0.5% seems hard. A 1.0% regression is probably clearly visible, and a 0.5% regression might be visible. Perhaps we can improve precision by tweaking the measurement logic (e.g. by dropping outliers), but it's not clear yet. I'll perform a few additional experiments.

@hauntsaninja
Copy link
Collaborator

@JukkaL I think it would be helpful to type check repositories other than mypy. Here are the 20 slowest repositories to type check from mypy_primer (which also contains the exact commands and deps used).

https://github.com/pandas-dev/pandas
https://github.com/sympy/sympy
https://github.com/home-assistant/core
https://github.com/graphql-python/graphql-core
https://github.com/MaterializeInc/materialize
https://github.com/canonical/cloud-init
https://github.com/arviz-devs/arviz
https://github.com/pandas-dev/pandas-stubs
https://github.com/apache/spark
https://github.com/python/mypy
https://github.com/common-workflow-language/cwltool
https://github.com/Rapptz/discord.py
https://github.com/trailofbits/manticore
https://github.com/streamlit/streamlit
https://github.com/astropenguin/xarray-dataclasses
https://github.com/google/jax
https://github.com/python-poetry/poetry
https://github.com/urllib3/urllib3
https://github.com/pypa/pip
https://github.com/psycopg/psycopg

If I had to pick five from these, I'd maybe do:

https://github.com/pandas-dev/pandas
https://github.com/sympy/sympy
https://github.com/home-assistant/core
https://github.com/Rapptz/discord.py
https://github.com/FasterSpeeding/Tanjun

@A5rocks
Copy link
Contributor

A5rocks commented Nov 30, 2022

Based on initial results, detecting regressions that are under 0.5% seems hard. A 1.0% regression is probably clearly visible, and a 0.5% regression might be visible. Perhaps we can improve precision by tweaking the measurement logic (e.g. by dropping outliers), but it's not clear yet. I'll perform a few additional experiments.

I notice that wall time is used. Maybe CPU instruction count would have less noise? I know it's what stuff like Rust's perf site uses by default (https://perf.rust-lang.org).

See also stuff like rust-lang/rust#104646 (comment)

I'm not sure it would be as useful given that as I understand it some amount of mypy time is spent in IO.

@JukkaL
Copy link
Collaborator Author

JukkaL commented Dec 7, 2022

I'm not sure it would be as useful given that as I understand it some amount of mypy time is spent in IO.

Mypy does some IO and it would be good to track regressions related to IO as well, so this doesn't sound ideal for mypy.

I think it would be helpful to type check repositories other than mypy.

Agreed, this would be nice! I'm focusing on mypy at first, but we could add some extra repositories once the whole thing works reliably for mypy.

@JukkaL
Copy link
Collaborator Author

JukkaL commented Dec 7, 2022

I tried a few different things to get more precise results, but I'm still stuck at about 1% noise floor. It's still enough to detect major performance regressions. I'm now collecting historical data, over the most recent 1000 mypy commits or so. It will take about two weeks of calendar time to complete (if everything goes smoothly).

I will also experiment with collecting performance data about interpreted mypy runs, in the hope of getting more accurate timings.

@JukkaL
Copy link
Collaborator Author

JukkaL commented Dec 19, 2022

I now have collected performance data for about one year of commits, though the most recent commits are still missing.

Here's the data: https://github.com/mypyc/mypyc-benchmark-results/blob/master/reports/benchmarks/mypy_self_check.md

Performance has regressed by about 50% over a year, so it's time to focus on fixing the regressions! I hope that in the future we can keep the overall level of slowdown under 10% in any year (and preferably under 5%) so that perceived mypy performance stays at least roughly the same over time, when accounting for hardware improvements. Hardware has improved about 5-10% per year for single-threaded workloads in the last 10 years, on average.

The noise floor is about 1.5%. Perhaps with some smoothing we can bring it closer to 1.0%. Anyway, it's already good enough to find major regressions and improvements.

I started looking at reducing the impact of 48c4a47 some time ago. This was a previously known regression that got me interested in finding other regressions like that.

@JukkaL
Copy link
Collaborator Author

JukkaL commented Dec 19, 2022

Here are some optimizations based on the performance data: #14316

@A5rocks
Copy link
Contributor

A5rocks commented Dec 20, 2022

These changes are very nice! Would it be possible to get a preview of whether a PR introduces a performance regression before merging it (a la mypy_primer)? I'm not sure if that's planned or not, but IMO would be useful.

@JukkaL
Copy link
Collaborator Author

JukkaL commented Dec 20, 2022

Would it be possible to get a preview of whether a PR introduces a performance regression before merging it (a la mypy_primer)?

I agree that this would be useful. I don't have immediate plans to implement this, since this would likely require additional dedicated runner host to avoid noisy results on shared infra.

@A5rocks
Copy link
Contributor

A5rocks commented Dec 22, 2022

I was looking at some other topics and found scala/scala-dev#338

Presumably most viable things listed there have already been tried (given e.g. turbo boost has already been disabled), but it's incredibly comprehensive and it would be great to catch even smaller regressions!

@JukkaL
Copy link
Collaborator Author

JukkaL commented Dec 28, 2022

I'm closing this now, as we have basic tracking of performance changes. I'll create another issue to track further improvements. In particular, it would be great to get the noise floor down to below 0.5%.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance topic-developer Issues relevant to mypy developers
Projects
None yet
Development

No branches or pull requests

3 participants