Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parsing of long lines when missing-final-newline is enabled #5786

Closed
wants to merge 2 commits into from

Conversation

DanielNoord
Copy link
Collaborator

  • Add yourself to CONTRIBUTORS if you are a new contributor.
  • Add a ChangeLog entry describing what your PR does.
  • If it's a new feature, or an important bug fix, add a What's New entry in
    doc/whatsnew/<current release.rst>.
  • Write a good description on what the PR does.

Type of Changes

Type
🐛 Bug fix

Description

Closes #5724.

The stuff at the beginning of this regex pattern isn't actually needed for us to extract our disable comments. This massively speeds up the parsing and unstucks us from the code in the issue.
I haven't added a test case as it is a bit difficult to test "don't look for things we don't need"...

@DanielNoord DanielNoord added Enhancement ✨ Improvement to a component performance labels Feb 10, 2022
@DanielNoord
Copy link
Collaborator Author

The first pattern does not allow:
# a comment # pylint: disable=this.
This can be fixed with the second pattern, which disallows:
# a comment # something in between pylint: disable=this.

However, do we really want to support something in between? Is there any use case for adding words in between the # and pylint?

@DanielNoord
Copy link
Collaborator Author

Seems like we're using the second pattern ourselves in:
# -*- pylint: disable=no-init,too-few-public-methods, useless-object-inheritance.

I think it makes sense to disallow this, but this would probably need to be deprecated. That would also mean that the performance fix can only go in 3.x.

  1. Do we want to disallow any non-whitespace characters between the # and pylint for disable comments with the knowledge that this would boost the performance of the regex pattern that looks for these?
  2. If yes, do we want to make this change in 3.x or in 2.x?

@DanielNoord DanielNoord marked this pull request as draft February 10, 2022 08:50
@@ -129,6 +129,10 @@ Release date: TBA

Closes #5569

* Fix parsing of long lines when ``missing-final-newline`` is enabled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Fix parsing of long lines when ``missing-final-newline`` is enabled.
* Optimize parsing of long lines when ``missing-final-newline`` is enabled.

@@ -180,6 +180,10 @@ Other Changes

Closes #5177, #5212

* Fix parsing of long lines when ``missing-final-newline`` is enabled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Fix parsing of long lines when ``missing-final-newline`` is enabled.
* Optimize parsing of long lines when ``missing-final-newline`` is enabled.

\# # Beginning of comment
.*? # Anything (as little as possible)
\s*? # Any whitespaces (as little as possible)
Copy link
Member

@Pierre-Sassoulas Pierre-Sassoulas Feb 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change, because every change is a breaking change (heavy sigh), but it's probably worth it.

We could do:

Suggested change
\s*? # Any whitespaces (as little as possible)
.*? # Anything (as little as possible)

Until we release 3.0, but then we need a way to make the deprecation known. So should we warn the user if we match a pylint comment with .*? but not with \s*? ? it would actually decrease performance in 2.X, possibly by a lot [citation required].

Another solution is to consider that it was an unintended feature and just consider this a performance fix. How many users are going to be pissed off by this decision... ? I wish I knew. We already know that the user affected by the catastrophic backtracking aren't happy 😄

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a real performance change if we make .*? a capturing group and check if it contains anything other than spaces whenever we use this regex pattern. That should be relatively straightforward.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it solve the breaking change issue ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. The breaking change would be to remove .* completely and replace it with nothing or with \s*. To deprecate this we can use a capturing group and emit a warning when we match anything other than spaces.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why would we be doing the breaking change if there is no performance issues if we keep the current behavior ?

@skirpichev
Copy link
Contributor

I would expect, that the backtracking issue might be solved without incompatible changes.

@DanielNoord
Copy link
Collaborator Author

I would expect, that the backtracking issue might be solved without incompatible changes.

Please feel free to see if you can find a pattern that satisfies the current tests. I thought it should be quite easy as well, but stumbled upon some roadblocks..

@skirpichev
Copy link
Contributor

This works for me:

--- pragma_parser.py.old        2022-02-11 15:35:07.864685538 +0300
+++ pragma_parser.py    2022-02-11 15:35:14.156481194 +0300
@@ -9,11 +9,8 @@
 # so that an option can be continued with the reasons
 # why it is active or disabled.
 OPTION_RGX = r"""
-    \s*                # Any number of whithespace
-    \#?                # One or zero hash
-    .*                 # Anything (as much as possible)
-    (\s*               # Beginning of first matched group and any number of whitespaces
-    \#                 # Beginning of comment
+    (?:\s*|\s*\#.*(?=\#.*?\bpylint:))
+    (\#                 # Beginning of comment
     .*?                # Anything (as little as possible)
     \bpylint:          # pylint word and column
     \s*                # Any number of whitespaces

@DanielNoord
Copy link
Collaborator Author

This works for me:

--- pragma_parser.py.old        2022-02-11 15:35:07.864685538 +0300
+++ pragma_parser.py    2022-02-11 15:35:14.156481194 +0300
@@ -9,11 +9,8 @@
 # so that an option can be continued with the reasons
 # why it is active or disabled.
 OPTION_RGX = r"""
-    \s*                # Any number of whithespace
-    \#?                # One or zero hash
-    .*                 # Anything (as much as possible)
-    (\s*               # Beginning of first matched group and any number of whitespaces
-    \#                 # Beginning of comment
+    (?:\s*|\s*\#.*(?=\#.*?\bpylint:))
+    (\#                 # Beginning of comment
     .*?                # Anything (as little as possible)
     \bpylint:          # pylint word and column
     \s*                # Any number of whitespaces

Have you tested this against tests/functional/l/line_too_long.py? Locally that fails with this change for me.

@skirpichev
Copy link
Contributor

Hmm, I see. Probably, we should have another alternative to match comments, i.e. ^\s*\#.*. With that change - tests pass for me.

@DanielNoord
Copy link
Collaborator Author

DanielNoord commented Feb 11, 2022

diff --git a/pylint/utils/pragma_parser.py b/pylint/utils/pragma_parser.py
index 5ef4ef481..482e295ef 100644
--- a/pylint/utils/pragma_parser.py
+++ b/pylint/utils/pragma_parser.py
@@ -12,8 +12,8 @@ OPTION_RGX = r"""
-    \s*                # Any number of whitespace
-    \#?                # One or zero hash
-    .*                 # Anything (as much as possible)
-    (\s*               # Beginning of first matched group and any number of whitespaces
-    \#                 # Beginning of comment
+    (?:^\s*\#.*|\s*|\s*\#.*(?=\#.*?\bpylint:))
+    (\#                 # Beginning of comment
     .*?                # Anything (as little as possible)
     \bpylint:          # pylint word and column
     \s*                # Any number of whitespaces

Nice job! I think that actually works. If you want you can create a PR yourself @skirpichev, you are the one that came up with it!

@skirpichev
Copy link
Contributor

If you want you can create a PR yourself

No, I think you did almost everything.

BTW, probably different conditions in the (?:) group require comments. And there should be a regression test, isn't? I don't know how to do this for the pylint. (I use pytest-timeout for the diofant for similar performance issues.)

@Pierre-Sassoulas
Copy link
Member

(I use pytest-timeout for the diofant for similar performance issues.)

We're using.. nothing I guess. We have a global timeout for the tests in CI, and a benchmark using pytest-benchmark that we do not look at (so it's basically a time waster in our CI right now, that we hope to use later). Is pytest-timeout is something that would solve this issue ?

@skirpichev
Copy link
Contributor

Is pytest-timeout is something that would solve this issue?

Yes, that's how I address such issues (local timeouts for tests). It's a little tricky, because you need to find out a timeout, that triggers problem on different machines (e.g. locally for you and on the Github Actions CI).

@DanielNoord
Copy link
Collaborator Author

We have a similar system in place, but don't use pytest-timeout. I have used it before, so I can probably come with something for this once as well!

@DanielNoord
Copy link
Collaborator Author

Superseded by #5925, thanks to @skirpichev 😄

@DanielNoord DanielNoord deleted the missing-final branch March 16, 2022 09:16
Pierre-Sassoulas added a commit that referenced this pull request Mar 16, 2022
…ine' (#5925)

* Fix parsing of long lines when ``missing-final-newline`` is enabled

* Adapt fa31b6b to be backward-compatible

Fixes #5724

Also address comments from the PR #5786

Co-authored-by: Daniël van Noord <13665637+DanielNoord@users.noreply.github.com>
Co-authored-by: Pierre Sassoulas <pierre.sassoulas@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement ✨ Improvement to a component performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

'missing-final-newline' is surprisingly slow on a large codebase
3 participants