Better primer tests (False positives detection) #5364

Closed
Tracked by #5306
Pierre-Sassoulas opened this issue Nov 22, 2021 · 36 comments
Labels
Enhancement ✨ Improvement to a component High effort 🏋 Difficult solution or problem to solve primer

Comments

@Pierre-Sassoulas
Member

Pierre-Sassoulas commented Nov 22, 2021

Current problem

Right now the primer tests only check for fatal messages and crashes, but anticipating false positives would be a huge plus in terms of release quality.

Desired solution

A way to anticipate false positives by running pylint on external repositories.

Additional context

Originally posted by @DanielNoord (#5173 (comment)):

Some points I found while working on this:

  1. The initial approach of asserting ex.code == 0 doesn't work since many packages will return error messages, even the stdlib packages. I think we should therefore only look for crashes (ex.code == 32) or fatal messages (ex.code == 1); see the exit-code sketch after this list.

  2. I have created a new CI job to do the primer tests on Linux on every push or pull request. We don't want to run them only when bumping a version (as we did with the previous acceptance tests), but I think it is good to separate them from the other tests, especially since they will probably take much longer to complete (more on that later) and early failures are sometimes helpful in finalising a PR (especially after GitHub review comments break a test).

  3. The lazy loading of repos checks the SHA1 hashes of the local and remote commits and re-clones when it finds a difference.

  4. tests/primer/primer_external_packages.py is used to store a list of packages to download. We currently include numpy but I would argue against including it. numpy has its tests included in the source files, and normally we don't want to run pylint over tests, as this can create problems when tests use frameworks that rely on non-standard Python code. Besides, running over the tests also really inflates the time it takes to run the primer tests. I think we could come up with two or three other projects that might be better. Note that these projects do not need to use pylint; we just need to be able to assume that their code is "normal" and therefore shouldn't crash pylint or astroid.

  5. We might also want to improve the message that gets raised when the test fails. Currently it is not immediately clear where pylint crashed; improving the message might help expedite the process of fixing it.
    Perhaps we should add --only-fatal to make the output only print fatal errors?
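
For context, a minimal sketch (not the actual primer code) of how the bit-encoded exit status could be checked so that only fatal messages and crashes fail a run; pylint documents the exit-status bits as 1 = fatal, 2 = error, 4 = warning, 8 = refactor, 16 = convention, 32 = usage error:

```python
import subprocess
import sys

# Bits of pylint's exit status (see pylint's documentation on exit codes).
FATAL_BIT = 1         # a fatal message was issued (fatal, astroid-error, parse-error, ...)
USAGE_ERROR_BIT = 32  # pylint itself could not run properly

def primer_run_is_ok(package_path: str) -> bool:
    """Lint an external package and only fail on fatal messages or crashes."""
    result = subprocess.run(
        [sys.executable, "-m", "pylint", package_path],
        capture_output=True,
        text=True,
        check=False,
    )
    # Error/warning/refactor/convention bits are expected on external code,
    # so they are deliberately ignored here.
    return not result.returncode & (FATAL_BIT | USAGE_ERROR_BIT)
```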

Originally posted by @cdce8p (#5173 (comment)):

Maybe the --capture=tee-sys option works?
https://pytest.org/en/6.2.x/capture.html#setting-capturing-methods-or-disabling-capturing

Originally posted by @DanielNoord (#5173 (comment)):

The reasoning for checking for errors is that if there is an "error" in the main branch of a well-tested, widely used package, it means pylint created a false positive. We don't need to use the --errors-only option as we can bit-decode the exit code to check whether there were errors or not, but it does force us to install the project dependencies.

I'm not sure this is the case. Take undefined-variable for example:
https://github.com/PyCQA/pylint/blob/e5082d3f9401dbcf65b40ce6a819d2a09beccb5c/pylint/checkers/variables.py#L393-L397

We know there are issues with this message: there are false positives and negatives, and due to the lack of control-flow analysis we are not able to solve this (for now).
If we make the primer tests fail whenever an Error-level message is found in a project, there cannot be any undefined-variable messages. This is (currently) unfeasible. There are many more Error-level messages where we know there are issues which we cannot always fix so easily.

We can (sort of) avoid this by only including projects that use pylint, as they are likely to have already disabled current false positives, but that doesn't fully help us.
What if a commit changes the way undefined-variable behaves and the project emits 10 new warnings? We would need to investigate whether each of these is correct or a false positive. For larger projects this number might be much higher. Even if we identify 1 false positive and fix it, the primer will still fail because of the 9 remaining undefined-variable messages. We know these messages are correct (the project just hasn't updated to the WIP pylint version) and can merge the commit, but from then on every primer CI job will fail because of those 9 messages.

By only checking for Fatal and Crash we make sure that pylint can parse any type of code (pylint-enforced and non-enforced) without breaking/crashing.
For reference the only Fatal messages are: method-check-failed, fatal, astroid-error and parse-error.

We might want to create a command --use-all-extensions

I agree with your three points, but this one is really good as it can be useful for those who want to use all extensions. I'm also thinking about making it possible to disable some extensions, because preventing while loops, for example, is really opinionated. Maybe a disable-plugin option to mirror the load-plugin one?

So you would want to allow the use of --use-all-extensions and disable-plugin at the same time?

@Pierre-Sassoulas Pierre-Sassoulas added Enhancement ✨ Improvement to a component High effort 🏋 Difficult solution or problem to solve labels Nov 22, 2021
@cdce8p cdce8p added the primer label Nov 22, 2021
Pierre-Sassoulas added a commit that referenced this issue Nov 24, 2021
* Add changelog and warning about unstable API in testutil

* Add primer tests, (running pylint on external libs during tests)

This helps anticipate crashes and fatal messages; false positives are harder
to anticipate. Follow-up will be #5359 and later on #5364 

Add '__tracebackhide__ = True' so the traceback is manageable

Co-authored-by: Daniël van Noord <13665637+DanielNoord@users.noreply.github.com>
@jacobtylerwalls
Member

What if a commit changes the way undefined-variable behaves and the project emits 10 new warnings. We would need to investigate whether any of these are correct or false positive. For larger projects this number might be much higher.

A data point. Since I'm back for round 2 on used-before-assignment, I linted django against main and my open PRs.
I spotted 1 new warning on main, and distinct new warnings on each of my PRs. One was a true positive. Two were false positives (#5586 and #5583 (comment)). So I agree with Pierre that this isn't necessarily true:

The reasoning for checking for errors is that if there is an "error" in the main branch of a well-tested, widely used package, it means pylint created a false positive.

Still, it was easy to tell at a glance whether they were true or false positives (easier than debugging the root cause in pylint), so even some artifact printed out during the CI job would be nifty.

@jacobtylerwalls
Member

Likely doing #5403 first would help.

@jacobtylerwalls
Member

@DanielNoord this might be the place to continue the discussion about false negative/positive changes in primer repos?

home-assistant is awesome as a primer project but we would need to run the primer over it twice to catch false negatives and that would really inflate CI runs.

What if we committed the baseline json to the repo? So we only lint a project once and compare it to the baseline on disk. Or am I missing why it would need to be run twice?

@DanielNoord
Collaborator

One thing I considered is uploading the primer json output as an artefact on main and then downloading it in CI of PRs.

However, I noticed that the mypy primer doesn't do this. They use git merge-base to calculate the common ancestor and then do a diff. I implemented a similar process in DanielNoord/pydocstringformatter#78.
My rationale for this was/is:

  1. It makes sure that the primer uses the two correct commits
  2. It sounds logical to upload the main artefact and re-use it, people over at mypy will have thought of it as well. The fact that they (with all their contributors) haven't implemented it and also don't mention it on the todo list for mypy primer tells me that there are probably problems with that approach.

Of course, we could definitely explore this, but I feel like they would do it too if it was actually feasible.
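
For illustration, a minimal sketch of that merge-base approach; the run_primer_at helper is hypothetical, standing in for whatever actually checks out a pylint commit and collects its output:

```python
import subprocess

def merge_base(main_ref: str = "origin/main", pr_ref: str = "HEAD") -> str:
    """Return the common ancestor commit of main and the PR branch."""
    return subprocess.run(
        ["git", "merge-base", main_ref, pr_ref],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def run_primer_at(pylint_commit: str, project: str) -> set[str]:
    """Hypothetical: lint `project` with pylint built from `pylint_commit`
    and return the emitted messages as a set of strings."""
    raise NotImplementedError

def primer_diff(project: str) -> tuple[set[str], set[str]]:
    """Messages introduced and messages fixed by the PR, relative to the merge base."""
    base = merge_base()
    old_messages = run_primer_at(base, project)
    new_messages = run_primer_at("HEAD", project)
    return new_messages - old_messages, old_messages - new_messages
```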

@cdce8p
Member

cdce8p commented Mar 26, 2022

Artifacts might be difficult to deal with since you would probably need to know the Actions run that created them. What might work, however, is using the cache for that. With a good cache key, we should be able to store almost all results we need. The fallback can still be to just run it twice.
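
As an illustration only (not an agreed design), such a cache key would need to cover everything that can change the baseline result, roughly:

```python
import hashlib

def primer_cache_key(pylint_sha: str, astroid_sha: str,
                     project: str, project_sha: str) -> str:
    """Build a cache key for a stored primer run; if any of these inputs
    changes, the cached result is stale and the run has to be repeated."""
    raw = f"{pylint_sha}:{astroid_sha}:{project}:{project_sha}"
    return "primer-" + hashlib.sha256(raw.encode()).hexdigest()[:16]
```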

@DanielNoord
Collaborator

Yeah, I mean it's definitely something we should explore but perhaps in V3? V4? (😄).

You know what, I'm going to ask the guys over at mypy primer if there is a specific reason why they don't use a cache.

@cdce8p
Member

cdce8p commented Mar 26, 2022

Yeah, I mean it's definitely something we should explore but perhaps in V3? V4? (😄).

You know what, I'm going to ask the guys over at mypy primer if there is a specific reason why they don't use a cache.

If I had to guess, nobody has had the idea and/or time to implement it yet.

@cdce8p
Member

cdce8p commented Mar 26, 2022

@DanielNoord but both for baseline and primer we just need to fix the json output. There has been some discussion about that already.

Don't know. Isn't a simple line diff also possible? I believe that's what mypy_primer does. With that, no additional formatting would be necessary.

--
You might have already thought of that. I run my Home Assistant pylint tests with useless-suppression enabled. Should be useful to see improvements.

@DanielNoord
Collaborator

@DanielNoord but both for baseline and primer we just need to fix the json output. There has been some discussion about that already.

Don't know. Isn't a simple line diff also possible? I believe that's what mypy_primer does. With that, no additional formatting would be necessary.

-- You might have already thought of that. I run my Home Assistant pylint tests with useless-suppression enabled. Should be useful to see improvements.

Yeah I was thinking about that, but then started to think about what would happen if:
main:
module a: no message
module b: 1 message
PR:
module a: 1 message
module b: 2 messages

The diff would include the Module a "name line" (what do we call that? the line that gets printed at the start of a module's messages), and the second message from b would sit directly under it as well.
Actually there is an open issue about that line and how it makes the standard output awful to parse programmatically.

I foresaw some headaches getting that to look pretty, while a parseable JSON output makes life just that much easier (and is requested by other projects and users as well). It felt like it would be better to kill two birds with one stone here rather than spend time working with the less parseable diff output.

@cdce8p
Member

cdce8p commented Mar 26, 2022

Yeah I was thinking about that, but then started to think about what would happen if: main: module a: no message module b: 1 message PR: module a: 1 message module b: 2 messages

The diff would include the Module a "name line" (what do we call that? the line that gets printed at the start of a module's messages), and the second message from b would sit directly under it as well. Actually there is an open issue about that line and how it makes the standard output awful to parse programmatically.

I would just strip out every line that starts with ****. Should be fairly straightforward. However, one issue might be the non-deterministic output; especially with multiple jobs, I don't think the order is guaranteed. That would require at least some parsing, ordering, and set comparisons (see the sketch after the example output) 🤔

************* Module homeassistant.components.decora.light
homeassistant/components/decora/light.py:73:16: W0212: Access to a protected member _switch of a client class (protected-access)
************* Module homeassistant.components.spotify.media_player
homeassistant/components/spotify/media_player.py:101:12: W0212: Access to a protected member _attr_available of a client class (protected-access)
homeassistant/components/spotify/media_player.py:104:12: W0212: Access to a protected member _attr_available of a client class (protected-access)
homeassistant/components/spotify/media_player.py:106:12: W0212: Access to a protected member _attr_available of a client class (protected-access)
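
A rough sketch of that stripping-and-set-comparison idea, assuming the plain-text output of both runs has been captured (not the implementation that was eventually merged):

```python
def message_lines(output: str) -> set[str]:
    """Keep only the message lines, dropping the '************* Module ...' headers."""
    return {
        line.strip()
        for line in output.splitlines()
        if line.strip() and not line.startswith("*****")
    }

def diff_runs(main_output: str, pr_output: str) -> tuple[list[str], list[str]]:
    """Return (messages introduced by the PR, messages that disappeared),
    sorted so the result does not depend on the order pylint printed them in."""
    main_msgs = message_lines(main_output)
    pr_msgs = message_lines(pr_output)
    return sorted(pr_msgs - main_msgs), sorted(main_msgs - pr_msgs)
```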

@DanielNoord
Collaborator

To get back to the response from mypy: they just don't have a real need for caching results, as their unit tests are slower than the primer runs.

I might have some time to look at this in the near future. Although I do think that fixing the json output at the same time might be helpful as well.

@DanielNoord
Collaborator

I've got a first proof of concept over at: DanielNoord#145

That PR creates an intentional false positive for method-cache-max-size-none.
The proof of concept only runs on astroid and does nothing to compare the output, but at least it shows that it is possible.

It does cache the projects that we run over, although very roughly. Also it gets the main run from the main branch, so we don't need to duplicate those runs.

@Pierre-Sassoulas
Member Author

I've got a first proof of concept over at: DanielNoord#145

It looks pretty dandy ❤️ ! I'm wondering if showing the code around the new warning (maybe ~10 lines, 5 before, 5 after) is reasonable. It would certainly facilitate the review by making it possible to understand what's happening in most cases without opening the analyzed code in an external IDE.

@DanielNoord
Collaborator

With #6723 this is now merged on main. However, let's keep this open for a little bit to have a central place for any discussion about any potential follow-up issues.

@Pierre-Sassoulas Pierre-Sassoulas added this to the 2.14.0 milestone May 28, 2022
@DanielNoord
Collaborator

@jacobtylerwalls @Pierre-Sassoulas See #6636 (comment).

Do we want to exclude messages where only the actual message text changes? Or do we want to keep warning about those?

@Pierre-Sassoulas
Member Author

Pierre-Sassoulas commented May 29, 2022

I think we can show the diff of the message in that case. We know there's a change, but we can probably benefit from comparing the generated messages.

I.e. something like:

The following message content changed:

  1. consider-using-in:
- *Consider merging these comparisons with "in" to 't in (token.LPAR, token.RPAR)'*
+ *Consider merging these comparisons with "in" by using "t in (token.LPAR, token.RPAR, )". Use a set instead if elements are hashable."*

https://github.com/psf/black.git/blob/main/src/black/nodes.py#L313
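
For instance, something along these lines could render such a change (a sketch using difflib; not necessarily how the primer ends up formatting it):

```python
import difflib

def format_message_change(symbol: str, old: str, new: str) -> str:
    """Show an old/new diff for a message whose text (but not location) changed."""
    diff = difflib.ndiff([old], [new])
    changed = "\n".join(line for line in diff if line.startswith(("- ", "+ ")))
    return f"{symbol}:\n{changed}"

print(format_message_change(
    "consider-using-in",
    "Consider merging these comparisons with \"in\" to 't in (token.LPAR, token.RPAR)'",
    "Consider merging these comparisons with \"in\" by using "
    "\"t in (token.LPAR, token.RPAR,)\". Use a set instead if elements are hashable.",
))
```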

@DanielNoord
Collaborator

See:
https://github.com/DanielNoord/pylint/runs/6641534527?check_suite_focus=true

We don't actually fail on a crash. We should probably do so.

@Pierre-Sassoulas
Member Author

If we fail on a crash but it's bleeding-edge astroid, I don't know... Do we also have a check for stable astroid? I don't want a situation where the pylint pipeline is unstable because it depends on astroid.

@DanielNoord
Collaborator

I just added an assert False to see how we would handle fatal errors. Please see:
DanielNoord#149 and the last run.

Although they do show up, I don't like how easily they can be missed. I'll try to think of something to make them more visible. Perhaps a new "New fatal messages:" section at the top of the comment.

@Pierre-Sassoulas
Member Author

💥 (:boom:) 💣 (:bomb:), 💀 (:skull:) ❌ (:x:) or ⁉️ (:interrobang:) might come handy 😄

@DanielNoord
Collaborator

I'm assigning myself to the fatal error issue.

@jacobtylerwalls I just thought of another issue which we should probably solve:

  1. commit 1 gets merged on main and Primer / Main is triggered with astroid at commit a
  2. astroid merges commit b which creates a regression which we don't catch
  3. A PR gets opened with commit 2 which triggers Primer / Run with commit b on astroid
  4. PR fails and shows as if commit 2 is triggering fatal messages

I think we should probably do something about that..

@jacobtylerwalls
Member

If we truly end up with a "race condition" like that I'm happy to explain it to a pylint contributor. The pipeline wouldn't be failing. And they would probably complain "but my changes couldn't have done THAT!" so I doubt they would spin too many wheels. If they do want to pop the hood it's never too early to start learning about astroid!

Anyway, as soon as something else is pushed to main, there would be a new run and no longer any drift the next time someone pushes to the PR. You can construct a similar false negative problem. Until pylint and astroid become the same project (or a submodule, or we run the pylint test suite on astroid's CI...) we're always going to have small corners of problems. We push to main so much and to nontrivial PRs so much I doubt it would amount to much in practice.

I also think we shouldn't put outsized attention on fatal errors and crashes; astroid changes can involve a lot more than just crashes! :-)

@jacobtylerwalls
Member

I guess I'm saying I much prefer one commit of drift across astroid during a one-commit interval on a PR to a four-month drift between CI and what we already know we will have to depend on for four months of PRs!

@DanielNoord
Collaborator

Hmm, although I agree that this probably won't happen often and can be easily explained, I do think we can improve the situation somewhat. What about running a cron-timed Primer job every morning at 12am and letting it comment/ping us whenever it finds an issue with bleeding edge astroid? That way we have the best of both worlds.

I'm just thinking of the previous issues we had with the CI. Although these should be resolved after a single commit to pylint main, having to explain the faulty commit to all contributors becomes annoying quite fast 😄

@Pierre-Sassoulas
Member Author

Pierre-Sassoulas commented May 29, 2022

having to explain the faulty commit to all contributors becomes annoying quite fast

Yeah, this is what I'm trying to avoid too. Maybe if bleeding edge fails we could launch on the normal astroid. Something like: primer bleeding-edge || (echo "Something is wrong with bleeding edge astroid" && primer stable-astroid). If stable passes, we add a message to the GitHub comment that we maintainers will notice but that should not alarm occasional contributors, something like "bleeding edge astroid creates new fatals, do not bother with this, it's not related to your changes". It should happen seldom enough that it's not a performance issue.

@jacobtylerwalls
Member

I'm just thinking of the previous issues we had with the CI.

Right, this is sort of what I'm worried about. I don't want to focus too much on "fighting the last war" -- if we're talking about changes in astroid that cause failing pipelines in pylint, not merely vanilla behavior changes, we should solve it directly: run a subset of pylint tests in astroid, enforce pylint-tested labels on every astroid PR, or just manually trigger primer-run / main after every astroid merge, which is one click. We should be encouraging returning pylint contributors to be developing against bleeding-edge astroid anyway. We just don't want to scare away the first-timers, which I'm sympathetic to!

Something like: primer bleeding-edge || (echo "Something is wrong with bleeding edge astroid" && primer stable-astroid).

This is potentially even scarier than the status quo for a first-timer, no?

Maybe if bleeding edge fails we could launch on the normal astroid

Stable astroid might not be stable, of course. We might have multiple contributors waiting a week for a patch release instead of a "race condition" for a few hours.

having to explain the fault commit to all contributors become annoying quite fast

I would push back on this -- we're talking about people who happen to push once and then stop pushing completely during a very small window of time.

I won't veto anything, I just am not seeing the upside of working on this. If we really wanted to do it, my alternative proposal would be to have the compare job compare the astroid SHA and then relaunch the Primer / Main job. That should prevent any drift right at the source.

@jacobtylerwalls
Member

What about running a cron-timed Primer job every morning at 12am and letting it comment/ping us whenever it finds an issue with bleeding edge astroid?

I can't remember the last time pylint didn't have a commit every morning, so the cron job would probably need to be every four hours! :D

@jacobtylerwalls
Member

jacobtylerwalls commented May 29, 2022

Actually, the cron job is a good idea for the old primer, i.e. the one that doesn't compare messages, but just looks for crashes. If we had a daily job that ran the old primer on bleeding edge astroid and exited with a failing code on fatal messages it wouldn't block/interact with any PRs, and the failing jobs would be useful information.

@jacobtylerwalls
Member

OTOH, a more efficient way to get to the same result is to just run the "old pylint primer" on every commit to main in astroid.

@DanielNoord
Collaborator

Or just run the current primer with #6746.

Let's tweak the current one a bit more and then decide what to do. We could (for example) extract it into its own repository and package it with associated shareable GitHub Actions. mypy primer does this as well, so that the primer is available both for mypy and typeshed.

@DanielNoord
Collaborator

Some issue analysis in #6769 (comment). Might be relevant for later discussion. It shows why we need to run primer on all pushes to main.

The run that I have just restarted in https://github.com/PyCQA/pylint/actions/runs/2414623971 shows that the system is still correctly picking up new runs after they have completed. With the longer run time it just means that quick successive PRs can use different main runs, depending on when Primer / Run was triggered.

@DanielNoord
Collaborator

Some weird things are going on with the runs in:
#6791

We have been seeing it on other PRs as well. For some reason the redefined-variable-type output is not consistent across main and PR. I can't reproduce locally, so any help with fixing this is much appreciated.
It's getting to the point where I'm considering turning off the comments until we can fix this, as they are now mostly becoming noise...

@jacobtylerwalls
Member

Maybe just replace pandas and sentry with other packages in the meantime?

@jacobtylerwalls
Member

It seems that the first push to a PR after a new push to main passes, and then the next push to the PR fails again.

@DanielNoord
Collaborator

Maybe just replace pandas and sentry with other packages in the meantime?

Yeah, although I'm not sure which packages wouldn't create this issue. I can't really test that locally 😅

@DanielNoord
Collaborator

I'm considering this closed. We have the primer comment and some detection.

Any remaining issues can be tracked in #5359.

@DanielNoord DanielNoord removed this from the 2.15.0 milestone Aug 23, 2022