
Use Airspeed Velocity for Regression Testing #35046

Draft: wants to merge 21 commits into develop

Conversation

@saraedum (Member) commented Feb 9, 2023

📚 Description

…the changeset is probably not useful anymore. I just resurrected what was on #25262.

📝 Checklist

  • I have made sure that the title is self-explanatory and the description concisely explains the PR.
  • I have linked an issue or discussion.
  • I have created tests covering the changes.
  • I have updated the documentation accordingly.

TODO

See comments in the source code. Additionally:

  • Import in asv is slow for methods with a [...] in the name.
  • Setting docstring as the module name fixed the import timings. However, now everything shows up as docstring in the benchmark grid :(

@mantepse (Collaborator) commented Feb 9, 2023

I'd still find this wonderful!

@@ -0,0 +1,128 @@
from sage.all import *
Review comment (Member):

FWIW: in #35049 (trivial PR moving some existing benchmark files to the same place) I used sage.tests.benchmarks instead of sage.benchmark. No opinion on which is best, but I guess we don't want to keep both.

@codecov-commenter commented Feb 9, 2023

Codecov Report

Patch coverage: 29.46% and project coverage change: -0.01 ⚠️

Comparison is base (52a81cb) 88.57% compared to head (5bdfec0) 88.57%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #35046      +/-   ##
===========================================
- Coverage    88.57%   88.57%   -0.01%     
===========================================
  Files         2140     2141       +1     
  Lines       397273   397370      +97     
===========================================
+ Hits        351891   351963      +72     
- Misses       45382    45407      +25     
Impacted Files Coverage Δ
src/sage/benchmark/docstring.py 0.00% <0.00%> (ø)
src/sage/doctest/forker.py 78.91% <18.75%> (-1.87%) ⬇️
src/sage/doctest/reporting.py 68.01% <48.00%> (-6.88%) ⬇️
src/sage/doctest/control.py 71.97% <54.54%> (-0.34%) ⬇️
src/sage/doctest/parsing.py 87.31% <100.00%> (+0.04%) ⬆️
src/sage/doctest/sources.py 83.51% <100.00%> (+0.18%) ⬆️
src/sage/groups/affine_gps/euclidean_group.py 88.88% <0.00%> (-11.12%) ⬇️
src/sage/groups/affine_gps/affine_group.py 96.62% <0.00%> (-2.25%) ⬇️
...e/combinat/cluster_algebra_quiver/mutation_type.py 50.11% <0.00%> (-2.02%) ⬇️
... and 27 more


☔ View full report at Codecov.

roed314 and others added 8 commits February 10, 2023 13:43
it appears that you cannot send arbitrary amounts of data through a
message queue in Python, see
https://stackoverflow.com/questions/10028809/maximum-size-for-multiprocessing-queue-item.

The queue will block pretty early. So we take load off the queue by
using the local file system for some of the transfers.

We still need to delete the files, which is not done in this proof of concept yet.
… can be resolved by get_benchmark_from_name() in asv.benchmark
by making sure that a module.class.method lookup works, i.e., by not
having any . in the wrong places.
@saraedum (Member, Author) commented Feb 11, 2023

I ran matrix/ and rings/ for all the releases between 9.7 and 9.8.rc1. The results are https://www.imo.universite-paris-saclay.fr/~ruth/asv/html/.

tl;dr the output is already useful but also very noisy. We should improve the regression detection to get rid of much of the noise.

If you go to the regressions page, there's a bunch of noise at the beginning where (I guess) just the rc1 was much slower than all the previous releases. So likely there was just load on the machine when these samples were taken (maybe we can tune asv's regression detection to not flag these?)

[more noise]

The first benchmark that is not of this kind is [docstring.sage․rings․power_series_poly.track_PowerSeries_poly․truncate](https://www.imo.universite-paris-saclay.fr/~ruth/asv/html/#docstring.sage%E2%80%A4rings%E2%80%A4power_series_poly.track_PowerSeries_poly%E2%80%A4truncate?commits=047281e0-9116c558). This seems to be noise, however; at least I cannot reproduce it.

The next benchmark that is not of this kind is docstring.sage․rings․polynomial․multi_polynomial.track_MPolynomial․is_symmetric. However, you can see that this one has already recovered in the latest rc. (Upon closer inspection, this also seems to have been noise.)

Then there's docstring.sage․matrix․matrix0.track_Matrix․is_singular, and this one I can finally reproduce. So there does seem to be a speed regression here.

I can also reproduce docstring.sage․rings․lazy_series.track_LazyPowerSeries_gcd_mixin․gcd; however, since this feature was only introduced in the 9.8 series, it's likely just a bugfix that makes the computation slower.

Some things look quite striking in the graphs, like docstring.sage․rings․polynomial․polynomial_ring.track_PolynomialRing_general․element_constructor, but I just cannot reproduce them. While this could be noise, I somewhat doubt it. It could simply be that the surrounding tests have changed, so that, due to some changed caching, the timings shift at some point. Supporting this theory somewhat, I get the slow timings on a 9.7 build when running just this one doctest in a SageMath session.

PS: I note that printing of exceptions is slightly slower with 9.8.rc1 than with 9.7. So maybe this explains some of the noise in the doctest timings? I should probably run all the tests again and see whether I observe the same thing.

@mantepse (Collaborator)

Very general question: what is actually compared, in particular, if the doctests are modified?

@saraedum (Member, Author) commented Feb 12, 2023

Very general question: what is actually compared, in particular, if the doctests are modified?

The doctests are extracted (hover over a test and you should see the extracted doctest). When the doctest changes, the results are not related anymore. Currently, the old results are not shown at all when this happens. We could probably change this behavior and still show them, but label them differently.

@saraedum (Member, Author) commented Feb 16, 2023

Also, is it useful if we try to reproduce your experiments on a different system to have a second machine in the results?

I don't think that's very useful. We are going to have similar (or worse) amounts of noise when we use GitHub CI instead. So what we come up with should work for a very noisy runner.

How did you run the code with older versions of Sage, given that it does not rebase cleanly? (I suppose git rebase -X ours should suffice?...)

I just extracted (the relevant bits of) develop...asv into a patch file and it applied cleanly to all these versions that I built from source.

@saraedum (Member, Author) commented Feb 16, 2023

Let me run things for 9.8. Let's see how that changes the pictures.

Same picture:

[screenshot of the benchmark graph]

@mantepse And actually, I can reproduce this regression. I paste the following in a sage repl (the preceding doctests)

DecoratedPermutation([2, 1, 3])
DecoratedPermutation([2, 1, -3])
S = DecoratedPermutations(3)
elt = S([2, 1, -3])
TestSuite(elt).run()
S = DecoratedPermutations(3)
elt = S([2, 1, -3])
elt.check()
elt = S([2, -1, 3])
len(set(hash(pi) for pi in DecoratedPermutations(3)))
S = DecoratedPermutations(3)
elt1 = S([2, 1, -3])
elt2 = DecoratedPermutation([2, 1, -3])
elt1 == elt2
S = DecoratedPermutations(3)
elt1 = S([2, 1, -3])
elt2 = DecoratedPermutation([2, 1, 3])
elt1 != elt2
DecoratedPermutation([2, 1, -3]).size()

and then I call

reset()
%time DecoratedPermutation([2, 1, -3]).to_signed_permutation()
Wall time: 3.64 ms  # 9.7
Wall time: 11.3 ms  # 10.0.beta0

This is on a different CPU than the one that created the graphs. So the timings are different. But the change is on a similar scale.

@saraedum (Member, Author)

So, one conclusion here could be that we need a little script to run all doctests up to some line to make it easier to "debug" these kinds of reports.
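
A rough sketch of what such a helper could look like (purely hypothetical, not part of this changeset): it collects the doctest input of a source file up to a given line and prints it with the prompts stripped, so it can be piped into a sage session.

# replay_doctests.py -- hypothetical helper script, not part of this PR
# Collect the doctest input of a source file up to a given line so that a
# reported regression can be replayed in a Sage session.
import sys

def doctest_input(path, stop_line):
    commands = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if lineno > stop_line:
                break
            stripped = line.lstrip()
            if stripped.startswith("sage: "):
                commands.append(stripped[len("sage: "):].rstrip())
            elif stripped.startswith("....: "):
                # keep continuation lines of multi-line statements
                commands.append(stripped[len("....: "):].rstrip())
    return "\n".join(commands)

if __name__ == "__main__":
    # usage: python3 replay_doctests.py FILE LINE | sage
    print(doctest_input(sys.argv[1], int(sys.argv[2])))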

@saraedum (Member, Author)

Without any tricks, I can reproduce this slowdown (since @mantepse probably cares about species?)

[screenshot of the benchmark graph]

@saraedum (Member, Author)

I tweaked the regression detection data a bit and updated the deployed demo at https://www.imo.universite-paris-saclay.fr/~ruth/asv/html/#regressions?sort=3&dir=desc

Now the detected major regressions look legit to me (at least the graphs on the first page are mostly convincing).

@mantepse (Collaborator)

@mantepse And actually, I can reproduce this regression. I paste the following in a sage repl (the preceding doctests)
...
%time DecoratedPermutation([2, 1, -3]).to_signed_permutation()

Hm, on my box (Ubuntu 22.04.1, but I guess that's irrelevant), I can only reproduce a slight difference in a fresh session; I get about 2.6 ms vs 3.45 ms.

If I do the test (including the stuff that came before) again, the difference is gone. What puzzles me, however, is that this regression (if it is real) should also be visible in other doctests.

Still, I'm not sure how relevant this is. I really think it would be better to have a few specially marked tests (i.e., # benchmark) which take slightly longer, and only for some exposed methods. I guess that this wouldn't change the framework a lot, would it? Do you currently run the long doctests, too?

In the case at hand, for example:

sage: l = DecoratedPermutations(8)
sage: _ = [pi.to_signed_permutation() for pi in l]

sage: set_random_seed(1783)
sage: pi = DecoratedPermutations(100).random_element()
sage: pi.to_signed_permutation()

(and this won't work, because it exhibits the missing random generator, which is #34925)

@mezzarobba (Member) commented Feb 19, 2023

How hard would it be to aggregate the timings from several runs (and detect regressions based on the minimum run time of the test for each revision, say)?

That would be very easy. But the idea here is to use the timings from the test run in GitHub's CI where we only want to run all the tests once.

Yes, I was thinking more about older releases.

I don't think that's very useful. We are going to have similar (or worse) amounts of noise when we use GitHub CI instead. So what we come up with should work for a very noisy runner.

Same here.

@saraedum (Member, Author)

@mantepse did you run the test as I had posted it, i.e., with the reset that happens between doctests? Without it, I also don't get the regression.

@saraedum (Member, Author)

Still, I'm not sure how relevant this is. I really think it would be better to have a few specially marked tests (i.e., # benchmark) which take slightly longer, and only for some exposed methods.

For benchmarking explicitly, I propose to use asv the way it is meant to be used, i.e., just drop some benchmarks in the src/sage/benchmark/ directory, see https://asv.readthedocs.io/en/stable/writing_benchmarks.html#writing-benchmarks
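
For illustration, a hand-written asv benchmark could be as small as the following sketch; the module path src/sage/benchmark/combinat.py and the chosen target are made up, and the naming conventions (time_* methods) are the ones described in the asv documentation linked above.

# src/sage/benchmark/combinat.py (hypothetical example)
from sage.all import Poset, divisors, attrcall

class TimeConstructors:
    # asv times every method whose name starts with time_
    def time_divisor_poset(self):
        # construct the divisibility poset on the divisors of 1200
        Poset((divisors(1200), attrcall("divides")))

Such a benchmark would then also be runnable locally with asv's command line, independently of the doctest timings.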

I guess that this wouldn't change the framework a lot, would it? Do you currently run the long doctests, too?

No, I only run the normal doctests. This is also what is being run by the GitLab CI on each merge currently.

@mantepse (Collaborator)

Still, I'm not sure how relevant this is. I really think it would be better to have a few specially marked tests (i.e., # benchmark) which take slightly longer, and only for some exposed methods.

For benchmarking explicitly, I propose to use asv the way it is meant to be used, i.e., just drop some benchmarks in the src/sage/benchmark/ directory, see https://asv.readthedocs.io/en/stable/writing_benchmarks.html#writing-benchmarks

The problem is that this disconnects the tests from the code. I would guess that we would be much more encouraged to write benchmark tests if they were kept in the same source file.

Possibly one could also automate some benchmarks for certain categories, but I think it would be great if we could start somehow and refine later.

If benchmark tests take around 1 second, I suspect that noise would also be less of an issue.

I guess that this wouldn't change the framework a lot, would it? Do you currently run the long doctests, too?

No, I only run the normal doctests. This is also what is being run by the GitLab CI on each merge currently.

Did you try to run the long doctests, or do they take too long?

More explicitly, I propose to provide infrastructure for either a new section BENCHMARKS::, or a new marker # benchmark, which is ignored for testing.

I think we could specify that these tests should be independent of other tests and should, if at all possible, never be modified.
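
For concreteness, a doctest marked with the proposed # benchmark marker could look as follows (illustrative syntax only; nothing here implements it):

    EXAMPLES::

        sage: P = Permutations(5)
        sage: P.cardinality()
        120
        sage: sum(pi.number_of_inversions() for pi in Permutations(7))  # benchmark
        52920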

What do you think?

@saraedum (Member, Author)

What do you think?

I am a bit torn here, but I think we should try to just use the tools the way they are meant to be used. We already have some benchmarks in SageMath and nobody seems to care about them, which suggests that maybe we should not create a sage.benchmark namespace. And with the doctest approach proposed in this PR, we are actually not using asv the way it is meant to be used. But I guess it just feels like the minimal amount of work necessary to get at least some benchmarks, while not creating lots of benchmark-filtering logic that needs to be maintained.

A marker # benchmark would indeed be easy to implement; a BENCHMARKS:: section is much more work, I think. If we instead call it # optional: benchmark, then it's quite easy. It is, however, not clear to me how these benchmarks would get executed: do the lines above that do not say # benchmark also get executed first? In any case, the downside is that we need to write documentation and maintain this feature. So maybe it's just easier to say: all our doctests are benchmarks (long or not, optional or not, depending on how our CI is configured), and for proper benchmarks, please read the asv documentation.

@mantepse (Collaborator)

I am mostly interested in avoiding performance regressions, because I spent quite some time fixing constructors that were unnecessarily slow. In particular, these were an obstacle for FindStat; some, for example the constructor for posets, still are.

Thus, I would love to see a bot that runs on every ticket, or at least on every beta release, which reliably warns us of (major) regressions.

Is this possible with the current approach? Are the timings always taken on the same machine, or is there some magic that accounts for different machines?

If we can run the long doctests on GitHub, then maybe we do not need an extra marker; just mention in the developer's guide that these are used to detect performance regressions. However, I'm afraid that we will hit the problem that we cannot compare doctests which changed. Do I understand correctly that the doctests for a method cannot be compared if even one of them was modified?

Here is yet another idea to avoid the problem of changing doctests and to get started more quickly. We could have a marker # optional: benchmark XX.X[.betaX], with the following meaning:

  • for every method only doctests marked such are run and measured.
  • XX.X[.betaX] specifies the first version number for which the doctest should be run (most importantly, any version number for which the test passes)
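
For illustration (hypothetical syntax), reusing the example from above, a marked doctest could look like:

    sage: l = DecoratedPermutations(8)                         # optional: benchmark 9.8
    sage: _ = [pi.to_signed_permutation() for pi in l]         # optional: benchmark 9.8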

Then (if this is possible at all), the bot would run this test with all available binaries that are newer than XX.X[.betaX], and additionally with the patch proposed in the ticket.

A prerequisite is, of course, that we can access pre-built binaries. Is this possible - difficult - easy?

To get started, we could add this marker to the long doctests for which it makes sense.

This approach might have some advantages:

  • timings would be taken on the same machine
  • by focussing on doctests which do not just take a few milliseconds we might have less noise
  • we would avoid the problem of changing doctests

If this approach is easy to implement, there are some later optimizations possible:

  • we might be able to store timings taken on a particular machine for a particular doctest.
  • if this is not possible, I guess that most of the time, it would only be interesting to compare with the betas of the current release, so we could avoid running the tests on old releases.

@saraedum (Member, Author)

Thus, I would love to see a bot that runs on every ticket, or at least on every beta release, which reliably warns us of (major) regressions.
Is this possible with the current approach? Are the timings always taken on the same machine, or is there some magic that accounts for different machines?

I don't think this is possible reliably. Other projects have experimented with this in the past, and the examples I know of are not using such a setup anymore. Even if you run the benchmarks on the same machine every time, you need to go to some lengths to get reproducible timings (disable power management; make sure nobody else is on the same hardware, so no virtual machines; make sure nothing else runs in parallel, so no parallel execution of tests; …)

In principle, the timings are taken on very similar machines. The tests run on GitHub CI virtual machines that, in my experience, have very similar hardware. If the hardware changes a lot, one can detect the generation of the machine and set the machine name in asv to separate runs on different hardware. (There's no magic to relate timings on different hardware; they are treated as independent data.)

However, regarding your "Is this possible?" question: you could add actual benchmarks that test the constructors properly to sage.benchmark. You can configure these to run the constructors many times so you get meaningful timings. The asv website publishes an RSS feed of regressions that you could subscribe to and filter for the things that you care about. Then you would get notified about regressions in your benchmarks with every merge into the develop branch. You can then verify these regressions and do something about them, e.g., complain on the PR that broke things. (I strongly advise against automating this. The project I know of removed this again after ignoring it for a while because it was way too noisy.)

Do I understand correctly that doctests for a method cannot be compared if even one of them was modified?

Yes, if one line of the actual doctested content changes, the old timings are not shown anymore. I think this is the correct approach. I don't want to use doctests of actively developed code for benchmarking; we should write proper benchmarks for this. But I want to see that much of number theory got a bit slower because somebody changed a small detail of how right_kernel works ;)

Here is yet another idea to avoid the problem of changing doctests [...]

Sorry, this is not easily possible. We don't even have all of the recent releases available as Docker images or conda builds, and in some cases their dependencies have moved out of sync with how the SageMath distribution has updated them. We have no recent betas available as binaries in those formats.

@saraedum (Member, Author)

I don't think that's very useful. We are going to have similar (or worse) amounts of noise when we use GitHub CI instead. So what we come up with should work for a very noisy runner.

Same here.

@mezzarobba do you think that the regressions shown at https://www.imo.universite-paris-saclay.fr/~ruth/asv/html/#regressions?sort=3&dir=desc now look reasonable enough?

@mezzarobba (Member) commented Feb 21, 2023

@mezzarobba do you think that the regressions shown at https://www.imo.universite-paris-saclay.fr/~ruth/asv/html/#regressions?sort=3&dir=desc now look reasonable enough?

I only see regressions in combinat/; did you remove the data from matrix/ and rings/ that you used in the first experiments?

@saraedum (Member, Author) commented Feb 21, 2023

Yes, I only ran combinat. Let me run all of SageMath. I'll ping you tomorrow with the full output :)

Edit: The machine I use to run these tests is currently not functioning properly (somebody spawned a job that ate all the RAM) so it might take some more time for the admins to take care of this first.

@mantepse (Collaborator)

Sorry for asking so many questions: if it is not possible to reliably detect major regressions, what is the aim of this ticket?

@saraedum (Member, Author)

I think we can reliably detect major regressions a while after they happened, and we can then also detect where they happened. We can also get good hints at where a regression might have happened. For actual benchmarks in sage.benchmark, we get a tool to run these benchmarks locally during development and time things properly without having to manually resort to %%timeit.

@mantepse (Collaborator)

Concerning

Sorry, this is not easily possible. We don't even have all of the recent releases available as Docker images or conda builds, and in some cases their dependencies have moved out of sync with how the SageMath distribution has updated them. We have no recent betas available as binaries in those formats.

and

I think we can reliably detect major regressions a while after they happened, and we can then also detect where they happened. We can also get good hints at where a regression might have happened. For actual benchmarks in sage.benchmark, we get a tool to run these benchmarks locally during development and time things properly without having to manually resort to %%timeit.

I don't really understand.

  1. Why would it be hard to create binaries for the last few releases and betas? I admit I don't know how much space a binary takes; is that the problem? To make benchmarking as proposed interesting, I think that having binaries for a few releases after Sage 9.0 and for the last betas up to the current one would suffice completely. If some are missing (because of dependency problems), that wouldn't really hurt either, would it?
  2. What is sage.benchmark - is this a proposal, or something I am missing? In any case, what I believe is most important is that benchmarks are not separated from other code. The mental barrier to writing a simple benchmark in the same file is much lower.

Having such a benchmarking tool as a bot would be nice, but I don't think it is absolutely necessary. In fact, I think it would even be sufficient if it is run for every beta, and not for every ticket, if that's a problem.

If it is less work for you: maybe it would be more efficient to have a video conference, so you could explain the problems to me in person?

@saraedum (Member, Author)

Why would it be hard to create binaries for the last few releases and betas?

We have failed to find a maintainer for the Docker images for the past few years, i.e., somebody to build binaries and make sure that the script used to build those binaries works somewhat reliably on some sort of CI hardware. (Part of the story was also that we couldn't find anybody to maintain that CI infrastructure…)

I am very reluctant to create any process that becomes more than just a tiny added step in our current setup, because it creates a maintenance burden. (We run the CI on the doctests anyway, so let's just export the data it collects. And while we parse that data, asv will also run the benchmarks that are explicitly marked as such.)

What is sage.benchmark?

Sorry if that's not clear. With the changes in this PR, asv runs all benchmarks that are in sage.benchmark. Currently, there's only the "benchmark" that parses the output from a previous doctest run and reports those timings to asv. For a start, we should also convert the existing benchmarks (somewhat spread out in special benchmark files that probably no one is aware of...) into asv benchmarks that reside in sage.benchmark.

If you want to write benchmarks that are implemented outside of sage.benchmark, you can just add a benchmark class to your module and make sure it gets imported in some way by say sage.benchmark.all or sage.benchmark.combinat.all or whatever feels most natural over time.
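
A minimal sketch of what that could look like (the module placement and all names are just an illustration):

# e.g. in src/sage/combinat/decorated_permutation.py, next to the code it benchmarks,
# so that DecoratedPermutations is already defined in this module
class TimeDecoratedPermutations:
    def time_to_signed_permutation(self):
        [pi.to_signed_permutation() for pi in DecoratedPermutations(8)]

# and in a hypothetical src/sage/benchmark/all.py, so that asv discovers it
from sage.combinat.decorated_permutation import TimeDecoratedPermutations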

If it is less work for you: maybe it would be more efficient to have a video conference, so you could explain the problems to me in person?

Sure. I can do today at 6pm Germany time or basically any time tomorrow. Let's see if we can agree on some time at https://sagemath.zulipchat.com/#narrow/stream/271079-doctest/topic/tracking.20performance.20regressions

@github-actions bot commented Mar 2, 2023

Documentation preview for this PR is ready! 🎉
Built with commit: 5bdfec0

@mantepse (Collaborator)

Any news here?

@saraedum (Member, Author) commented Jul 6, 2023

No, I had no time to work on this. (And had missed your question.)

@mantepse (Collaborator) commented Jul 7, 2023

No, I had no time to work on this. (And had missed your question.)

Looking forward to the invention of the time turner :-)
