Better primer tests (False positives detection) #5364

Closed
Tracked by #5306
Pierre-Sassoulas opened this issue Nov 22, 2021 · 36 comments
Labels
Enhancement ✨ Improvement to a component High effort 🏋 Difficult solution or problem to solve primer

Comments

@Pierre-Sassoulas
Member

Pierre-Sassoulas commented Nov 22, 2021

Current problem

Right now the primer tests only check for fatal messages and crashes, but anticipating false positives would be a huge plus in terms of release quality.

Desired solution

A way to anticipate false positives by running pylint on external repositories.

Additional context

Originally posted by @DanielNoord (#5173 (comment)):

Some points I found while working on this:

  1. The initial approach of asserting ex.code == 0 doesn't work since many packages will return error messages, even the stdlib packages. I think we should therefore only look for crashes (ex.code == 32) or fatal messages (ex.code == 1); see the exit-code sketch after this list.

  2. I have created a new CI job to do the primer tests on Linux on every push or pull request. We don't want to run them only when bumping a version (as we did with the previous acceptance tests), but I think it is good to separate them from the other tests, especially since they will probably take much longer to complete (more on that later) and early failures are sometimes helpful in finalising a PR (especially after GitHub review comments break a test).

  3. The lazy loading of repos checks the SHA1 hashes of the local and remote commits and re-clones when it finds a difference.

  4. tests/primer/primer_external_packages.py is used to store a list of packages to download. We currently include numpy but I would argue against including it. numpy has its tests included in the source files, and normally we don't want to run pylint over tests, as this can create problems when tests use frameworks that rely on non-standard Python code. Besides, running over the tests also really inflates the time it takes to run the primer tests. I think we could come up with two or three other projects that might be better. Note that these projects do not need to use pylint; we just need to be able to assume that their code is "normal" and therefore shouldn't crash pylint or astroid.

  5. We might also want to improve the message that gets raised when the test fails. Currently it is not immediately clear where pylint crashed; improving the message might help expedite the process of fixing it.
    Perhaps we should add --only-fatal to make the output only print fatal errors?
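
For context, a minimal sketch (not the actual primer code) of how the bit-encoded exit status could be checked so that only fatal messages and crashes fail a run; pylint documents the exit-status bits as 1 = fatal, 2 = error, 4 = warning, 8 = refactor, 16 = convention, 32 = usage error:

```python
import subprocess
import sys

# Bits of pylint's exit status (see pylint's documentation on exit codes).
FATAL_BIT = 1         # a fatal message was issued (fatal, astroid-error, parse-error, ...)
USAGE_ERROR_BIT = 32  # pylint itself could not run properly

def primer_run_is_ok(package_path: str) -> bool:
    """Lint an external package and only fail on fatal messages or crashes."""
    result = subprocess.run(
        [sys.executable, "-m", "pylint", package_path],
        capture_output=True,
        text=True,
        check=False,
    )
    # Error/warning/refactor/convention bits are expected on external code,
    # so they are deliberately ignored here.
    return not result.returncode & (FATAL_BIT | USAGE_ERROR_BIT)
```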

Originally posted by @cdce8p (#5173 (comment)):

Maybe the --capture=tee-sys option works?
https://pytest.org/en/6.2.x/capture.html#setting-capturing-methods-or-disabling-capturing

Originally posted by @DanielNoord (#5173 (comment)):

The reasoning for checking for errors is that if there is an "error" in the main branch of a well-tested, widely used package, it means pylint created a false positive. We don't need to use the --errors-only option as we can bit-decode the exit code to check whether there were errors or not, but it does force us to install the project dependencies.

I'm not sure this is the case. Take undefined-variable for example:
https://github.com/PyCQA/pylint/blob/e5082d3f9401dbcf65b40ce6a819d2a09beccb5c/pylint/checkers/variables.py#L393-L397

We know there are issues with this message: there are false positives and negatives, and due to the lack of control-flow analysis we are not able to solve this (for now).
If we make the primer tests fail whenever an Error-level message is found in a project, there cannot be any undefined-variable messages. This is (currently) unfeasible. There are many more Error-level messages where we know there are issues which we cannot always fix so easily.

We can (sort of) avoid this by only including projects that use pylint, as they are likely to have already disabled current false positives, but that doesn't fully help us.
What if a commit changes the way undefined-variable behaves and the project emits 10 new warnings? We would need to investigate whether each of these is correct or a false positive. For larger projects this number might be much higher. Even if we identify 1 false positive and fix it, the primer will still fail because of the 9 remaining undefined-variable messages. We know these messages are correct (the project just hasn't updated to the WIP pylint version) and can merge the commit, but from then on every primer CI job will fail because of those 9 messages.

By only checking for Fatal and Crash we make sure that pylint can parse any type of code (pylint-enforced and non-enforced) without breaking/crashing.
For reference the only Fatal messages are: method-check-failed, fatal, astroid-error and parse-error.

We might want to create a command --use-all-extensions

I agree with your three points, but this one is really good as it can be useful for those who want to use all extensions. I'm also thinking about making it possible to disable some extensions, because preventing while loops, for example, is really opinionated. Maybe a disable-plugin option to mirror the load-plugin one?

So you would want to allow the use of --use-all-extensions and disable-plugin at the same time?

@Pierre-Sassoulas Pierre-Sassoulas added Enhancement ✨ Improvement to a component High effort 🏋 Difficult solution or problem to solve labels Nov 22, 2021
@cdce8p cdce8p added the primer label Nov 22, 2021
Pierre-Sassoulas added a commit that referenced this issue Nov 24, 2021
* Add changelog and warning about unstable API in testutil

* Add primer tests, (running pylint on external libs during tests)

This helps anticipate crashes and fatal messages; false positives are harder
to anticipate. Follow-up will be #5359 and later on #5364 

Add '__tracebackhide__ = True' so the traceback is manageable

Co-authored-by: Daniël van Noord <13665637+DanielNoord@users.noreply.github.com>
@jacobtylerwalls
Member

What if a commit changes the way undefined-variable behaves and the project emits 10 new warnings. We would need to investigate whether any of these are correct or false positive. For larger projects this number might be much higher.

A data point. Since I'm back for round 2 on used-before-assignment, I linted django against main and my open PRs.
I spotted 1 new warning on main, and distinct new warnings on each of my PRs. One was a true positive. Two were false positives (#5586 and #5583 (comment)). So I agree with Pierre that this isn't necessarily true:

The reasoning for checking for errors is that if there is an "error" in the main branch of a well-tested, widely used package, it means pylint created a false positive.

Still, it was easy to tell at a glance whether they were true or false positives (easier than debugging the root cause in pylint), so even some artifact printed out during the CI job would be nifty.

@jacobtylerwalls
Member

Likely doing #5403 first would help.

@jacobtylerwalls
Member

@DanielNoord this might be the place to continue the discussion about false negative/positive changes in primer repos?

home-assistant is awesome as a primer project but we would need to run the primer over it twice to catch false negatives and that would really inflate CI runs.

What if we committed the baseline json to the repo? So we only lint a project once and compare it to the baseline on disk. Or am I missing why it would need to be run twice?

@DanielNoord
Collaborator

One thing I considered is uploading the primer json output as an artefact on main and then downloading it in CI of PRs.

However, I noticed that the mypy primer doesn't do this. They use git merge-base to calculate the common ancestor and then do a diff. I implemented a similar process in DanielNoord/pydocstringformatter#78.
My rationale for this was/is:

  1. It makes sure that the primer uses the two correct commits
  2. It sounds logical to upload the main artefact and re-use it, people over at mypy will have thought of it as well. The fact that they (with all their contributors) haven't implemented it and also don't mention it on the todo list for mypy primer tells me that there are probably problems with that approach.

Of course, we could definitely explore this, but I feel like they would do it too if it was actually feasible.
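
For illustration, a minimal sketch of that merge-base approach; the run_primer_at helper is hypothetical, standing in for whatever actually checks out a pylint commit and collects its output:

```python
import subprocess

def merge_base(main_ref: str = "origin/main", pr_ref: str = "HEAD") -> str:
    """Return the common ancestor commit of main and the PR branch."""
    return subprocess.run(
        ["git", "merge-base", main_ref, pr_ref],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def run_primer_at(pylint_commit: str, project: str) -> set[str]:
    """Hypothetical: lint `project` with pylint built from `pylint_commit`
    and return the emitted messages as a set of strings."""
    raise NotImplementedError

def primer_diff(project: str) -> tuple[set[str], set[str]]:
    """Messages introduced and messages fixed by the PR, relative to the merge base."""
    base = merge_base()
    old_messages = run_primer_at(base, project)
    new_messages = run_primer_at("HEAD", project)
    return new_messages - old_messages, old_messages - new_messages
```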

@cdce8p
Member

cdce8p commented Mar 26, 2022

Artifacts might be difficult to deal with since you would probably need to know the Actions run that created them. What might work, however, is using the cache for that. With a good cache key, we should be able to store almost all results we need. The fallback can still be to just run it twice.
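
As an illustration only (not an agreed design), such a cache key would need to cover everything that can change the baseline result, roughly:

```python
import hashlib

def primer_cache_key(pylint_sha: str, astroid_sha: str,
                     project: str, project_sha: str) -> str:
    """Build a cache key for a stored primer run; if any of these inputs
    changes, the cached result is stale and the run has to be repeated."""
    raw = f"{pylint_sha}:{astroid_sha}:{project}:{project_sha}"
    return "primer-" + hashlib.sha256(raw.encode()).hexdigest()[:16]
```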

@DanielNoord
Collaborator

Yeah, I mean it's definitely something we should explore but perhaps in V3? V4? (😄).

You know what, I'm going to ask the guys over at mypy primer if there is a specific reason why they don't use a cache.

@cdce8p
Member

cdce8p commented Mar 26, 2022

Yeah, I mean it's definitely something we should explore but perhaps in V3? V4? (😄).

You know what, I'm going to ask the guys over at mypy primer if there is a specific reason why they don't use a cache.

If I had to guess, nobody has had the idea and/or time to implement it yet.

@cdce8p
Member

cdce8p commented Mar 26, 2022

@DanielNoord but both for baseline and primer we just need to fix the json output. There has been some discussion about that already.

Don't know. Isn't a simple line diff also possible? I believe that's what mypy_primer does. With that, no additional formatting would be necessary.

--
You might have already thought of that. I run my Home Assistant pylint tests with useless-suppression enabled. Should be useful to see improvements.

@DanielNoord
Collaborator

@DanielNoord but both for baseline and primer we just need to fix the json output. There has been some discussion about that already.

Don't know. Isn't a simple line diff also possible? I believe that's what mypy_primer does. With that, no additional formatting would be necessary.

-- You might have already thought of that. I run my Home Assistant pylint tests with useless-suppression enabled. Should be useful to see improvements.

Yeah I was thinking about that, but then started to think about what would happen if:
main:
module a: no message
module b: 1 message
PR:
module a: 1 message
module b: 2 messages

The diff would include the Module a "name line" (what do we call that? the line that gets printed at the start of a module's messages), and the second message from b would sit directly under it as well.
Actually there is an open issue about that line and how it makes the standard output awful to parse programmatically.

I foresaw some headaches getting that to look pretty, while a parseable JSON output makes life just that much easier (and is requested by other projects and users as well). It felt like it would be better to kill two birds with one stone here rather than spend time working with the less parseable diff output.

@cdce8p
Member

cdce8p commented Mar 26, 2022

Yeah I was thinking about that, but then started to think about what would happen if: main: module a: no message module b: 1 message PR: module a: 1 message module b: 2 messages

The diff would include the Module a "name line" (what do we call that? the line that gets printed at the start of a module's messages), and the second message from b would sit directly under it as well. Actually there is an open issue about that line and how it makes the standard output awful to parse programmatically.

I would just strip out every line that starts with ****. Should be fairly straightforward. However, one issue might be the non-deterministic output; especially with multiple jobs, I don't think the order is guaranteed. That would require at least some parsing, ordering, and set comparisons (see the sketch after the example output) 🤔

************* Module homeassistant.components.decora.light
homeassistant/components/decora/light.py:73:16: W0212: Access to a protected member _switch of a client class (protected-access)
************* Module homeassistant.components.spotify.media_player
homeassistant/components/spotify/media_player.py:101:12: W0212: Access to a protected member _attr_available of a client class (protected-access)
homeassistant/components/spotify/media_player.py:104:12: W0212: Access to a protected member _attr_available of a client class (protected-access)
homeassistant/components/spotify/media_player.py:106:12: W0212: Access to a protected member _attr_available of a client class (protected-access)
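
A rough sketch of that stripping-and-set-comparison idea, assuming the plain-text output of both runs has been captured (not the implementation that was eventually merged):

```python
def message_lines(output: str) -> set[str]:
    """Keep only the message lines, dropping the '************* Module ...' headers."""
    return {
        line.strip()
        for line in output.splitlines()
        if line.strip() and not line.startswith("*****")
    }

def diff_runs(main_output: str, pr_output: str) -> tuple[list[str], list[str]]:
    """Return (messages introduced by the PR, messages that disappeared),
    sorted so the result does not depend on the order pylint printed them in."""
    main_msgs = message_lines(main_output)
    pr_msgs = message_lines(pr_output)
    return sorted(pr_msgs - main_msgs), sorted(main_msgs - pr_msgs)
```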

@DanielNoord
Collaborator

To get back to the response from mypy: they just don't have a real need for caching results, as their unit tests are slower than the primer runs.

I might have some time to look at this in the near future. Although I do think that fixing the json output at the same time might be helpful as well.

@DanielNoord
Collaborator

I've got a first proof of concept over at: DanielNoord#145

That PR creates an intentional false positive for method-cache-max-size-none.
The proof of concept only runs on astroid and does nothing to compare the output, but at least it shows that it is possible.

It does cache the projects that we run over, although very roughly. Also it gets the main run from the main branch, so we don't need to duplicate those runs.

@Pierre-Sassoulas
Member Author

I've got a first proof of concept over at: DanielNoord#145

It looks pretty dandy ❤️ ! I'm wondering if showing the code around the new warning (maybe ~10 lines, 5 before, 5 after) is reasonable. It would certainly facilitate the review by making it possible to understand what's happening in most cases without opening the analyzed code in an external IDE.

@DanielNoord
Collaborator

With #6723 this is now merged on main. However, let's keep this open for a little bit to have a central place for any discussion about any potential follow-up issues.

@Pierre-Sassoulas Pierre-Sassoulas added this to the 2.14.0 milestone May 28, 2022
@DanielNoord
Collaborator

@jacobtylerwalls @Pierre-Sassoulas See #6636 (comment).

Do we want to exclude messages where only the actual message text changes? Or do we want to keep warning about those?

@Pierre-Sassoulas
Member Author

Pierre-Sassoulas commented May 29, 2022

I think we can show the diff of the message in that case. We know there's a change, but we can probably benefit from comparing the generated messages.

I.e. something like:

The following message content changed:

  1. consider-using-in:
- *Consider merging these comparisons with "in" to 't in (token.LPAR, token.RPAR)'*
+ *Consider merging these comparisons with "in" by using "t in (token.LPAR, token.RPAR, )". Use a set instead if elements are hashable."*

https://github.com/psf/black.git/blob/main/src/black/nodes.py#L313
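
For instance, something along these lines could render such a change (a sketch using difflib; not necessarily how the primer ends up formatting it):

```python
import difflib

def format_message_change(symbol: str, old: str, new: str) -> str:
    """Show an old/new diff for a message whose text (but not location) changed."""
    diff = difflib.ndiff([old], [new])
    changed = "\n".join(line for line in diff if line.startswith(("- ", "+ ")))
    return f"{symbol}:\n{changed}"

print(format_message_change(
    "consider-using-in",
    "Consider merging these comparisons with \"in\" to 't in (token.LPAR, token.RPAR)'",
    "Consider merging these comparisons with \"in\" by using "
    "\"t in (token.LPAR, token.RPAR,)\". Use a set instead if elements are hashable.",
))
```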

@DanielNoord
Collaborator

See:
https://github.com/DanielNoord/pylint/runs/6641534527?check_suite_focus=true

We don't actually fail on a crash. We should probably do so.

@Pierre-Sassoulas
Member Author

If we fail on a crash but it's bleeding-edge astroid, I don't know... Do we also have a check for stable astroid? I don't want a situation where the pylint pipeline is unstable because it depends on astroid.

@DanielNoord
Collaborator

I just added an assert False to see how we would handle fatal errors. Please see:
DanielNoord#149 and the last run.

Although they do show up, I don't like how easily they can be missed. I'll try to think of something to make them more visible. Perhaps a new "New fatal messages:" section at the top of the comment.

@Pierre-Sassoulas
Member Author

💥 (:boom:) 💣 (:bomb:), 💀 (:skull:) ❌ (:x:) or ⁉️ (:interrobang:) might come handy 😄

@DanielNoord
Collaborator

I'm assigning myself to the fatal error issue.

@jacobtylerwalls I just thought of another issue which we should probably solve:

  1. commit 1 gets merged on main and Primer / Main is triggered with astroid at commit a
  2. astroid merges commit b which creates a regression which we don't catch
  3. A PR gets opened with commit 2 which triggers Primer / Run with commit b on astroid
  4. PR fails and shows as if commit 2 is triggering fatal messages

I think we should probably do something about that..

@jacobtylerwalls
Member

If we truly end up with a "race condition" like that I'm happy to explain it to a pylint contributor. The pipeline wouldn't be failing. And they would probably complain "but my changes couldn't have done THAT!" so I doubt they would spin too many wheels. If they do want to pop the hood it's never too early to start learning about astroid!

Anyway, as soon as something else is pushed to main, there would be a new run and no longer any drift the next time someone pushes to the PR. You can construct a similar false negative problem. Until pylint and astroid become the same project (or a submodule, or we run the pylint test suite on astroid's CI...) we're always going to have small corners of problems. We push to main so much and to nontrivial PRs so much I doubt it would amount to much in practice.

I also think we shouldn't put outsized attention on fatal errors and crashes; astroid changes can involve a lot more than just crashes! :-)

@jacobtylerwalls
Member

I guess I'm saying I much prefer one commit of drift across astroid during a one-commit interval on a PR to a four-month drift between CI and what we already know we will have to depend on for four months of PRs!

@DanielNoord
Collaborator

Hmm, although I agree that this probably won't happen often and can be easily explained, I do think we can improve the situation somewhat. What about running a cron-timed Primer job every morning at 12am and letting it comment/ping us whenever it finds an issue with bleeding edge astroid? That way we have the best of both worlds.

I'm just thinking of the previous issues we had with the CI. Although these should be resolved after a single commit to pylint main, having to explain the faulty commit to all contributors becomes annoying quite fast 😄

@Pierre-Sassoulas
Member Author

Pierre-Sassoulas commented May 29, 2022

having to explain the faulty commit to all contributors becomes annoying quite fast

Yeah, this is what I'm trying to avoid too. Maybe if bleeding edge fails we could launch on the normal astroid. Something like: primer bleeding-edge || (echo "Something is wrong with bleeding edge astroid" && primer stable-astroid). If stable passes, we add a message to the GitHub comment that we maintainers will notice but that should not alarm occasional contributors, something like "bleeding edge astroid creates new fatals, do not bother with this, it's not related to your changes". It should happen seldom enough that it's not a performance issue.

@jacobtylerwalls
Member

I'm just thinking of the previous issues we had with the CI.

Right, this is sort of what I'm worried about. I don't want to focus too much on "fighting the last war" -- if we're talking about changes in astroid that cause failing pipelines in pylint, not merely vanilla behavior changes, we should solve it directly: run a subset of pylint tests in astroid, enforce pylint-tested labels on every astroid PR, or just manually trigger primer-run / main after every astroid merge, which is one click. We should be encouraging returning pylint contributors to be developing against bleeding-edge astroid anyway. We just don't want to scare away the first-timers, which I'm sympathetic to!

Something like: primer bleeding-edge || (echo "Something is wrong with bleeding edge astroid" && primer stable-astroid).

This is potentially even scarier than the status quo for a first-timer, no?

Maybe if bleeding edge fails we could launch on the normal astroid

Stable astroid might not be stable, of course. We might have multiple contributors waiting a week for a patch release instead of a "race condition" for a few hours.

having to explain the fault commit to all contributors become annoying quite fast

I would push back on this -- we're talking about people who happen to push once and then stop pushing completely during a very small window of time.

I won't veto anything, I just am not seeing the upside of working on this. If we really wanted to do it, my alternative proposal would be to have the compare job compare the astroid SHA and then relaunch the Primer / Main job. That should prevent any drift right at the source.

@jacobtylerwalls
Member

What about running a cron-timed Primer job every morning at 12am and letting it comment/ping us whenever it finds an issue with bleeding edge astroid?

I can't remember the last time pylint didn't have a commit every morning, so the cron job would probably need to be every four hours! :D

@jacobtylerwalls
Member

jacobtylerwalls commented May 29, 2022

Actually, the cron job is a good idea for the old primer, i.e. the one that doesn't compare messages, but just looks for crashes. If we had a daily job that ran the old primer on bleeding edge astroid and exited with a failing code on fatal messages it wouldn't block/interact with any PRs, and the failing jobs would be useful information.

@jacobtylerwalls
Member

OTOH, a more efficient way to get to the same result is to just run the "old pylint primer" on every commit to main in astroid.

@DanielNoord
Collaborator

Or just run the current primer with #6746.

Let's tweak the current one a bit more and then decide what to do. We could (for example) extract it into its own repository and package it with associated shareable GitHub Actions. mypy primer does this as well, so that the primer is available both for mypy and typeshed.

@DanielNoord
Collaborator

Some issue analysis in #6769 (comment). Might be relevant for later discussion. It shows why we need to run primer on all pushes to main.

The run that I have just restarted in https://github.com/PyCQA/pylint/actions/runs/2414623971 shows that the system is still correctly picking up new runs after they have completed. With the longer run time it just means that quick successive PRs can use different main runs, depending on when Primer / Run was triggered.

@DanielNoord
Collaborator

Some weird things are going on with the runs in:
#6791

We have been seeing it on other PRs as well. For some reason the redefined-variable-type output is not consistent across main and PR. I can't reproduce locally, so any help with fixing this is much appreciated.
It's getting to the point where I'm considering turning off the comments until we can fix this, as they are now mostly becoming noise...

@jacobtylerwalls
Member

Maybe just replace pandas and sentry with other packages in the meantime?

@jacobtylerwalls
Member

It seems that the first push to a PR after a new push to main passes, and then the next push to the PR fails again.

@DanielNoord
Collaborator

Maybe just replace pandas and sentry with other packages in the meantime?

Yeah, although I'm not sure which packages wouldn't create this issue. I can't really test that locally 😅

@DanielNoord
Collaborator

I'm considering this closed. We have the primer comment and some detection.

Any remaining issues can be tracked in #5359.

@DanielNoord DanielNoord removed this from the 2.15.0 milestone Aug 23, 2022