Track performance regressions in CI #25262
Comments
comment:1
I think we have to work with the Advanced API (https://docs.python.org/2/library/doctest.html#advanced-api) and hook into [...]
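For concreteness, here is a minimal sketch (nothing that exists in Sage; the `TimingRunner` class and the usage lines are made up for illustration) of one way to hook into the doctest Advanced API and record per-doctest timings:

```python
import doctest
import time

class TimingRunner(doctest.DocTestRunner):
    """Run doctests and record how long each DocTest takes."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.timings = {}

    def run(self, test, **kwargs):
        start = time.perf_counter()
        result = super().run(test, **kwargs)
        self.timings[test.name] = time.perf_counter() - start
        return result

# Hypothetical usage: time all doctests found in some module.
# runner = TimingRunner(verbose=False)
# for test in doctest.DocTestFinder().find(some_module):
#     runner.run(test)
# print(sorted(runner.timings.items(), key=lambda kv: -kv[1]))
```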
comment:2
This is great, and I'm happy to help! We're already using the advanced API. See |
comment:3
Just to repeat something I have said before: measuring timings is the easy part. The hard part is doing something useful with those timings.
comment:4
Duplicate of #12720.
comment:5
I don't think this is a duplicate. This is about integrating speed regression checks into CI (GitLab CI, CircleCI). Please reopen.
comment:7
Replying to @jdemeyer:
That's what airspeed velocity is good for.
comment:8
Great! It's an excellent tool and I've wanted to see it used for Sage for a long time, but wasn't sure where to begin. In case it helps, I know and have worked with its creator personally.
comment:10
Replying to @saraedum:
Well, I'd love to be proven wrong. I thought it was just a tool to benchmark a given set of commands across versions and display fancy graphs.
comment:12
Not just across versions but across commits, even (though I think you can change the granularity). Here are Astropy's ASV benchmarks: http://www.astropy.org/astropy-benchmarks/. There are numerous benchmark tests for various common and/or time-critical operations. For example, we can track how coordinate transformations perform over time (which is one example of complex code that can fairly easily be thrown into bad performance by just a few small changes somewhere).
comment:14
Update milestone: 8.3 -> 8.4
Author: Julian Rüth |
comment:15
Adding this to all doctests is probably hard and would require too much hacking on asv. It's probably best to use the tool as it was intended to be used. Once #24655 is in, I would like to set up a prototype within Sage. Any area that you would like to have benchmarked from the start?
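For reference, using the tool "as intended" means adding a benchmarks directory with classes whose `time_*` methods asv discovers and times. A minimal sketch (the file name and the polynomial operations are placeholder examples, not an agreed-upon starting area):

```python
# benchmarks/polynomials.py -- hypothetical file name
class PolynomialArithmetic:
    def setup(self):
        # setup() runs before the timed methods and is not itself timed.
        from sage.all import QQ, PolynomialRing
        R = PolynomialRing(QQ, 'x')
        self.f = R.random_element(degree=200)
        self.g = R.random_element(degree=200)

    def time_multiplication(self):
        self.f * self.g

    def time_gcd(self):
        self.f.gcd(self.g)
```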
comment:16
Replying to @saraedum:
This is the "hard part" that I mentioned in [comment:3]. Ideally, we shouldn't have to guess where regressions might occur; the tool would do that for us. I believe that the intention of #12720 was to integrate this in the doctest framework such that all(?) doctests would also be regression tests. But that's probably not feasible, so here is a more productive answer:
Replying to @saraedum:
Adding a new method for each regression test sounds quite heavy. Would it be possible to integrate this in doctests instead? I would love to do [...]
comment:18
Replying to @saraedum:
I didn't realize you were trying to do that. And yeah, I think benchmarking every test would be overkill and would produce too much noise to be useful. Better to write specific benchmark tests, and also add new ones as regression tests whenever some major performance regression is noticed.
comment:45
Replying to @jdemeyer:
I see. I think it would be easy to track lines that say, e.g., [...]. Let me try to start with the benchmarking of blocks that say [...].
comment:46
Just two cents, without having thought too much about it. I like the [...]. I'd rather have a different annotation than [...]. Of course, at this stage using [...]. Thanks!
comment:47
Replying to @jdemeyer:
Yes, something like that could be done. Again, it all comes down to providing a different benchmark discovery plugin for ASV. For discovering benchmarks in our doctests, all lines leading up to a [...]. Multiple [...]. It might be trickier to do this in such a way that avoids duplication, but I'll think about that. I think it could still be done.
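As a rough illustration of the discovery idea (purely a sketch under assumptions: the marker string and the helper name are made up, and this is not an actual ASV plugin):

```python
import doctest

MARKER = "# benchmark"  # hypothetical marker; nothing has been decided yet

def discover_doctest_benchmarks(docstring, name, globs=None):
    """Yield (benchmark_name, callable) pairs for docstrings containing MARKER."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(docstring, globs or {}, name, None, 0)
    if any(MARKER in example.source for example in test.examples):
        runner = doctest.DocTestRunner(verbose=False)

        def run_block(test=test, runner=runner):
            # Run the whole block so that the marked lines see all the
            # setup lines leading up to them.
            runner.run(test)

        yield (name, run_block)
```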
comment:48
I think that this is wonderful. Since I have tried to improve the performance of certain things recently, and will likely continue to do so, I would like to add doctests for speed regressions already now. Should I use [...]?
comment:49
Thanks for the feedback. Replying to @mantepse:
Nothing has been decided upon yet. I could imagine something like [...]
comment:50
Presumably time benchmarking is more usual than memory benchmarking, so I would tend to [...]. For memory usage, do you foresee using fine-grained tools that instrument the code and actually slow down the execution? Otherwise, could "benchmark" just do both always?
comment:51
I would actually like [...]. Of course, I agree time benchmarks are going to be the most common, so we could still have [...]
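For context on the time-vs-memory question: asv itself already separates the two kinds by method prefix (`time_`, `mem_`, `peakmem_`), so a single annotation could in principle map onto both. A minimal sketch, with a placeholder matrix example that is only for illustration:

```python
class MatrixBenchmarks:
    def setup(self):
        from sage.all import ZZ, random_matrix
        self.m = random_matrix(ZZ, 200, 200)

    def time_multiply(self):
        # Timed by asv because of the time_ prefix.
        self.m * self.m

    def peakmem_multiply(self):
        # Peak memory of the process, because of the peakmem_ prefix.
        self.m * self.m
```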
comment:52
Once we're past this deliverable due date, I'll spend some more time poking at ASV to get the features we would need in it to make it easier to extend how benchmark collection is performed, and also to integrate it more directly into our existing test runner.
comment:53
Replying to @embray:
I very much like this (well-informed!) proposal.
comment:54
What is the status of this ticket? There is a branch attached. So, is it really new? Are people working on it? For the record, I too think that having [...]
comment:55
Right now we need to get the GitLab CI pipeline going again. I need to see about getting some more build runners up and running; it's been on my task list for ages. That, or we need to get more time from GCE (if anyone knows anyone at Google or other cloud computing providers who can help get CPU time donated to the project, it would be very helpful).
Branch pushed to git repo; I updated commit sha1. New commits:
comment:58
Now that the CI seems to be mostly stable (except for the docbuild timing out for [...]), I would like to get a minimal version of this working somehow. We should probably not attempt to get the perfect solution in the first run. The outputs this created are actually quite useful already, imho. If our contributors actually end up looking at the results, we can add more features (more keywords, more iterations, memory benchmarking, comparisons to other CAS, …). So, my proposal would be to go with this version (modulo cleanup & documentation & CI integration). If somebody wants to improve/reimplement this in a better way, I am very happy to review that later. I am not sure how much time I will have to work on this, so if anybody wants to get more involved, please let me know :)
Work Issues: documentation, doctests, CI |
Changed keywords from none to ContinuousIntegration |
comment:61
Rebased. New commits:
Changed branch from u/saraedum/25262 to public/airspeed_velo |
comment:62
This needs adaptation to Python 3, apparently.
comment:63
I am thinking about reviving this issue with a different application in mind that is a bit easier than regression testing: namely, to get a better understanding of how different values for the [...]. I find that we rarely update the default algorithms. However, this could be quite beneficial, say, when we upgrade a dependency such as PARI or FLINT. It would be very nice to easily see how the different algorithms perform after an update, and also to have a way to document the instances that have been used to determine the cutoffs that we are using. Currently, we are using some homegrown solutions for this, e.g., [...]
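One possible shape for this would be asv's parametrized benchmarks sweeping the `algorithm=` keyword and a few sizes. A sketch under assumptions: the charpoly example and the algorithm names below are only for illustration and may differ across Sage versions, not a vetted list:

```python
from copy import copy

class CharpolyAlgorithms:
    # asv runs every combination of these parameter values.
    params = [['linbox', 'generic'], [100, 200, 400]]
    param_names = ['algorithm', 'size']

    def setup(self, algorithm, size):
        from sage.all import ZZ, random_matrix
        self.m = random_matrix(ZZ, size, size)

    def time_charpoly(self, algorithm, size):
        # Work on a copy so that Sage's caching of charpoly results
        # does not turn repeated timings into cache lookups.
        copy(self.m).charpoly(algorithm=algorithm)
```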
comment:64
What is actually the problem with the original goal?
comment:65
Replying to @mantepse:
There's no fundamental problem. But doing the CI setup is quite a bit of work.
CC @roed314 @seblabbe @alexjbest @mezzarobba. @roed314 and I started to work on this again at Sage Days 117.
Sorry that I missed the discussion. I'm happy to help too, but will have very little time for that after the end of the Sage Days. |
I am currently playing with airspeed velocity to track speed regressions in Sage. I would like to benchmark every doctest that has a `long time` or `benchmark` marker in it and also benchmark every method that has a `time_` prefix (probably only in some benchmark module). We have something similar set up for https://github.com/MCLF/mclf/tree/master/mclf/benchmarks now. There are only two benchmarks, but it works nicely.
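To make that concrete, a doctest picked up by this proposal might look as follows (a hypothetical snippet reusing the existing `# long time` tag, not an actual doctest from the library):

```
sage: n = factorial(10^6)      # long time
sage: n.ndigits() > 5*10^6     # long time
True
```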
I ran the above proposal for all the tags from 8.3.beta0 to 8.3. There's a lot of noise (because there was other activity on the machine) but you get the idea: https://saraedum.github.io/sage/
Another interesting demo of airspeed velocity that is not related to Sage is here: https://pv.github.io/numpy-bench/#/regressions
Depends on #24655
CC: @roed314 @embray @nthiery @koffie @videlec
Component: doctest framework
Keywords: ContinuousIntegration
Work Issues: documentation, doctests, CI
Author: Julian Rüth
Branch/Commit: public/airspeed_velo @ 68869ae
Issue created by migration from https://trac.sagemath.org/ticket/25262