Reduce randomness in doctests #35522

Open · 1 task done · tobiasdiez opened this issue Apr 16, 2023 · 21 comments

@tobiasdiez (Contributor)

Is there an existing issue for this?

  • I have searched the existing issues for a bug report that matches the one I want to file, without success.

Problem Description

Currently, a few doctests use random input (e.g. random_element()), which leads to unpredictable doctest failures (since the tests might fail depending on the input) and unreliable code coverage metrics (since different code paths are executed depending on the input).

Proposed Solution

Replace random input with deterministic/static input. If a test requires many different inputs to be convincing, it can be refactored using pytest parameterized tests, as sketched below.
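
For illustration, a minimal sketch of such a refactoring (the function and inputs here are hypothetical, not taken from the Sage sources):

import pytest

# Stand-in for code that is currently exercised with random_element() inputs.
def double(n):
    return 2 * n

# Deterministic inputs exercise the same code paths on every run.
@pytest.mark.parametrize("n", [0, 1, -7, 2**31])
def test_double(n):
    assert double(n) == n + n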

Alternatives Considered

Accept random test failures and unreliable coverage metrics.

Additional Information

A good starting point to see which tests are particularly affected is to go to https://app.codecov.io/gh/sagemath/sage/pulls, pick a PR at random, and look at the "Indirect changes" tab. These are changes in the coverage metrics that (most likely) do not come from changes in the PR. For example, comparing https://app.codecov.io/gh/sagemath/sage/pull/35521/indirect-changes and https://app.codecov.io/gh/sagemath/sage/pull/35507/indirect-changes indicates that the tests in the following files are prone to lead to different coverage:

and a few more. This can then be compared to the codecov report for the develop branch. For example, the line https://app.codecov.io/gh/sagemath/sage/blob/develop/src/sage/modular/modsym/boundary.py#L1385 is not covered in the develop branch but is covered in the PRs.

@tornaria (Contributor)

Maybe codecov could be run with --random-seed=0 so it is more deterministic.

Also, is there a way you can set up the codecov/project check to be just "informational", i.e. report the same as now and offer a link to codecov, but make the test result a PASS (or neutral if that's an option) so the PR as a whole has a chance to PASS?

@tobiasdiez (Contributor, Author) commented Apr 18, 2023

> Maybe codecov could be run with --random-seed=0 so it is more deterministic.

I'm not sure about the other implications the random seed has, but sure, this would be a workaround. Feel free to open a PR if you think it's a good change. The following command needs to be changed:

../sage -python -m coverage run ./bin/sage-runtests --all -p2
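
With the suggested flag appended, the command would presumably become (assuming sage-runtests accepts --random-seed in this form):

../sage -python -m coverage run ./bin/sage-runtests --all -p2 --random-seed=0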

> Also, is there a way you can set up the codecov/project check to be just "informational", i.e. report the same as now and offer a link to codecov, but make the test result a PASS (or neutral if that's an option) so the PR as a whole has a chance to PASS?

There is https://docs.codecov.com/docs/commit-status#informational, but I'm not sure if this is a good idea. The coverage check is not a "required check" for merging, and only other devs can judge whether missing line coverage is acceptable or not. I agree a "warning" status would be better than a complete red failure, but sadly, GitHub only has a binary concept of pass or failure.
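
For reference, per the linked docs, the informational mode would be enabled by something like the following in codecov.yml (an untested sketch):

coverage:
  status:
    project:
      default:
        informational: true  # status always passes; coverage is still reported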

@kwankyu (Collaborator) commented Apr 21, 2023

> Is there a way you can set up the codecov/project check to be just "informational", i.e. report the same as now and offer a link to codecov, but make the test result a PASS (or neutral if that's an option) so the PR as a whole has a chance to PASS?

I agree with setting the "informational" flag on. The failure in codecov/project is not directly related to the PR and is insignificant. The coverage result (even from codecov/patch) was only "informational" in the trac era.

@tobiasdiez (Contributor, Author)

Can we try out the "random-seed" fix first? A PR should almost never decrease the code coverage, so I think it's a good idea to have the /project check fail in case the coverage does decrease.

@tornaria (Contributor)

> Can we try out the "random-seed" fix first? A PR should almost never decrease the code coverage, so I think it's a good idea to have the /project check fail in case the coverage does decrease.

It does, if I only add lines that are visited by # long time tests.

@kwankyu (Collaborator) commented Apr 21, 2023

As I understand it:

(1) We want to give a positive review only to a PR that passes all checks.
(2) It is very easy to decrease codecov/project because of random doctests.
(3) It is very easy to decrease codecov/patch because, except for simple PRs, we need many doctests to fully cover all code paths.

Then we have choices:

(a) If we want to keep the same less stringent level of code coverage as in the trac era, make both codecov/project and codecov/patch informational.

(b) If we aim for full code coverage only for the patch, make codecov/project informational.

(c) If we aim for full code coverage for both the project and the patch, do nothing but add more doctests.

(d) If we aim for full code coverage for both the project and the patch, do only the "random-seed" fix.

I think random-seed doctesting was introduced to make the code robust against varied inputs, which is the same aim as full code coverage. So it seems to me that (d) is self-contradictory.

@tornaria (Contributor)

Also, it seems to compare only percentages, so it's not that useful and is easily confused. If a patch adds coverage for some lines, then other lines that are no longer covered will not be found.

It'd be much more useful if we got a report on which lines were covered before but are not covered now. And if we keep it random (with the check informational for now), we might be able to figure out which lines randomly stop being tested, etc.
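
Such a report could presumably be scripted from two coverage JSON exports (a rough sketch; the file names are hypothetical and assume coverage.py's "coverage json" output format):

import json

def lost_coverage(base_json, pr_json):
    """Report lines covered in the base run but no longer covered in the PR run."""
    with open(base_json) as f:
        base = json.load(f)["files"]
    with open(pr_json) as f:
        pr = json.load(f)["files"]
    lost = {}
    for fname, data in base.items():
        before = set(data["executed_lines"])
        after = set(pr.get(fname, {}).get("executed_lines", []))
        if before - after:
            lost[fname] = sorted(before - after)
    return lost

print(lost_coverage("base.json", "pr.json"))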

@kwankyu (Collaborator) commented Apr 21, 2023

> Also, it seems to compare only percentages, so it's not that useful and is easily confused. If a patch adds coverage for some lines, then other lines that are no longer covered will not be found.

codecov works basically by counting the lines that are run by doctests and dividing that by the total number of lines. If a patch adds lines that are not run by the doctests, the coverage is sure to decrease. So, to avoid a decrease, we need many tests covering all code paths. We didn't do this in the trac era. If we want to change the policy, I think we need a sage-devel discussion (or voting...)
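
For instance (illustrative numbers only): if 900 of 1000 lines are covered, the project coverage is 90%; a patch that adds 10 new, untested lines drops it to 900/1010 ≈ 89.1%.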

@tornaria
Copy link
Contributor

> > Also, it seems to compare only percentages, so it's not that useful and is easily confused. If a patch adds coverage for some lines, then other lines that are no longer covered will not be found.
>
> codecov works basically by counting the lines that are run by doctests and dividing that by the total number of lines. If a patch adds lines that are not run by the doctests, the coverage is sure to decrease. So, to avoid a decrease, we need many tests covering all code paths. We didn't do this in the trac era. If we want to change the policy, I think we need a sage-devel discussion (or voting...)

My point was: in a given PR there could be some new lines of code that are not covered by tests, or some tests may be modified in such a way that old lines of code stop being covered. This, in principle, is what the check wants to catch. But maybe the same PR also adds some new coverage (for a different set of lines of code). Then the percentage may be the same, even though the PR does not maintain coverage.
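
For a concrete (made-up) illustration: a PR that newly covers 50 lines while 50 previously covered lines silently lose coverage leaves the overall percentage unchanged, so a purely percentage-based check cannot flag it.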

IMO there are two tests that should be made:
a. whether all the new lines of code are covered (here it's good to aim for 100%, but see * below);
b. whether all the old lines of code that were already covered are still covered.

(*) Not sure if we must require coverage with non-long tests... so either we need to run the whole test suite in long mode, or at least run the modified files in long mode (a possible command is sketched below).
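
For the last option, the invocation would presumably look something like this (the file path is a placeholder):

./sage -t --long --random-seed=0 src/sage/path/to/modified_file.py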

@kwankyu (Collaborator) commented Apr 22, 2023

> My point was: in a given PR there could be some new lines of code that are not covered by tests, or some tests may be modified in such a way that old lines of code stop being covered. This, in principle, is what the check wants to catch. But maybe the same PR also adds some new coverage (for a different set of lines of code). Then the percentage may be the same, even though the PR does not maintain coverage.

I see. Yes.

> ... or at least run the modified files in long mode.

+1

@kwankyu (Collaborator) commented Apr 22, 2023

> Can we try out the "random-seed" fix first? A PR should almost never decrease the code coverage, so I think it's a good idea to have the /project check fail in case the coverage does decrease.

Isn't the constantly failing codecov/project check a CI breakage? Why not apply a quick fix first, and then take the time to do the "random-seed" fix right?

On the other hand, I am not sure about the "random-seed" fix. There is really nothing broken with the (old) random-seed doctests. The problem is with our (new) code coverage tool.

@kwankyu (Collaborator) commented Apr 22, 2023

I am testing the quick fix #35544. I don't know how it works yet.

@tobiasdiez (Contributor, Author)

> > Can we try out the "random-seed" fix first? A PR should almost never decrease the code coverage, so I think it's a good idea to have the /project check fail in case the coverage does decrease.
>
> It does, if I only add lines that are visited by # long time tests.

Sorry for not being clear. Sure, right now it's not working as it should and there is room for improvement. What I meant is that after the random tests are fixed, the project test coverage should almost never decrease. At least I cannot come up with a scenario where a decrease is expected. More likely, a change accidentally disables some tests or code branches, and you do want to be warned about such situations, ideally by a failing check.

@tobiasdiez (Contributor, Author)

> IMO there are two tests that should be made: a. whether all the new lines of code are covered (here it's good to aim for 100%, but *); b. whether all the old lines of code that were already covered are still covered.

While not perfect, the codecov checks are trying to do exactly this. a) is covered by the "/patch" check and b) by the "/project" check.

@tobiasdiez (Contributor, Author)

> On the other hand, I am not sure about the "random-seed" fix. There is really nothing broken with the (old) random-seed doctests. The problem is with our (new) code coverage tool.

There are quite a few reports about "random" doctest failures that disappear when run again. That codecov picks up on these instances is a symptom of the same issue. In my opinion, randomness is not a good tool for testing different inputs, except in very exceptional situations.

@kwankyu (Collaborator) commented Apr 22, 2023

> > On the other hand, I am not sure about the "random-seed" fix. There is really nothing broken with the (old) random-seed doctests. The problem is with our (new) code coverage tool.
>
> There are quite a few reports about "random" doctest failures that disappear when run again.

Do you mean doctests that randomly fail? These are different from random doctests, which take random inputs. If a random doctest randomly fails, then the doctest is doing its job. No?

@tobiasdiez (Contributor, Author)

> > > On the other hand, I am not sure about the "random-seed" fix. There is really nothing broken with the (old) random-seed doctests. The problem is with our (new) code coverage tool.
> >
> > There are quite a few reports about "random" doctest failures that disappear when run again.
>
> Do you mean doctests that randomly fail? These are different from random doctests, which take random inputs. If a random doctest randomly fails, then the doctest is doing its job. No?

Sometimes yes, sometimes you just hit an edge case that was not accounted for when designing the test input (e.g. #35449). There are a few instances where random inputs are a good idea (advantages and disadvantages are, for example, briefly discussed at https://en.wikipedia.org/wiki/Monkey_testing). But usually you would like to complement them with static tests that cover all relevant code branches: if a random-input test discovers a bug, you add that input as a static test to make sure the bug doesn't resurface in the future. The flaky codecov reports suggest that some random tests often hit very different code paths that are not covered by static-input tests. Ideally, you would only rarely hit paths that are not covered by static tests. A sketch of this pattern follows.
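
To illustrate the pattern in doctest form (a hypothetical sketch, not an actual Sage test):

sage: p = random_prime(10^6)   # random input: a different prime on every run
sage: p.is_prime()
True
sage: Integer(2).is_prime()    # static input pinned after a random run hit this edge case
True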

@kwankyu (Collaborator) commented Apr 22, 2023

Testing of the quick fix #35544 succeeded: a PR decreasing codecov/project still passes the check.

My suggestion is to merge #35544 as a quick fix for broken CI.

@tobiasdiez (Contributor, Author)

Sorry, but I still don't get why this is better than the "random-seed" fix (nor why there is a need for a 'quick fix' instead of a proper fix).

@tornaria (Contributor) commented May 6, 2023

> Sorry, but I still don't get why this is better than the "random-seed" fix (nor why there is a need for a 'quick fix' instead of a proper fix).

Because meanwhile all PRs are getting a failing CI. The top priority should be that CI always works, so that the state of a PR is meaningful.

The coverage check is really useful, as long as it's possible to make it pass in a sensible and deterministic way! Otherwise we will just learn to ignore it (hiding other failures, because the state of a PR is a single bit which is the logical AND of all checks).

@kwankyu (Collaborator) commented May 8, 2023

I have learned to ignore the "codecov/project" check.

vbraun pushed a commit that referenced this issue May 28, 2023
    
### 📚 Description

Related to #35522 and #35544. Should at least fix the changed codecov
between runs.

### 📝 Checklist

- [x] The title is concise, informative, and self-explanatory.
- [x] The description explains in detail what this PR is about.
- [x] I have linked a relevant issue or discussion.
- [ ] I have created tests covering the changes.
- [ ] I have updated the documentation accordingly.

URL: #35632
Reported by: Tobias Diez
Reviewer(s): Matthias Köppe