bpo-37723: fix performance regression on regular expression parsing #15030
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Our records indicate we have not received your CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. If you have recently signed the CLA, please wait at least one business day. You can check yourself to see if the CLA has been received. Thanks again for your contribution, we look forward to reviewing it!
Lib/sre_parse.py
Outdated
return newitems
seen_items = set()
seen_items_add = seen_items.add
return [item for item in items if not (item in seen_items or seen_items_add(item))]
How about this?
def _uniq(items):
    return list(dict.fromkeys(items))
Since dict.fromkeys() already jumbles up the items, unless there's a need to return a list, returning a set might be more optimal when checking for uniqueness, i.e.
def _uniq(items):
    return set(dict.fromkeys(items))
But it seems like the original _uniq() function returns a list. Is there a reason to do that?
set(dict.fromkeys(X)) is totally useless. When you need a set, set(X) is enough.
But in this case, we need a list, not a set. So I used list(dict.fromkeys(X)).
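The distinction drawn here can be checked with a short sketch. The sample data is hypothetical; the point is only that going through `dict.fromkeys` adds nothing when the result is a set, while `list(dict.fromkeys(...))` keeps the first-seen order.

```python
items = ["b", "a", "b", "c", "a"]  # hypothetical sample data

as_set = set(items)                   # unique items, order not preserved
via_dict = set(dict.fromkeys(items))  # same set, built with an extra dict
ordered = list(dict.fromkeys(items))  # unique items in first-seen order

print(as_set == via_dict)  # True
print(ordered)  # ['b', 'a', 'c']
```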
We indeed need a list. And I guess that preserving the order is important, looking at the initial implementation.
I'm fine with list(dict.fromkeys(items)) (and it's much prettier), but I have 2 minor concerns:
- list(dict.fromkeys(items)) might be a bit slower (though I didn't check, and the difference is negligible I think)
- we have to make sure this won't be backported to python < 3.7 as the dict order is not necessarily the insert order for python < 3.7 (but again, the bug only affects >=3.7 and the function does not exist on <3.7, so it won't be backported anyway)
What do you think?
> list(dict.fromkeys(items)) might be a bit slower (though I didn't check, and the difference is negligible I think)
I believe list(dict.fromkeys(items)) is significantly faster than [it for it in items if it not in seen or seen.add(it)]
> - we have to make sure this won't be backported to python < 3.7 as the dict order is not necessarily the insert order for python < 3.7 (but again, the bug only affects >=3.7 and the function does not exist on <3.7, so it won't be backported anyway)
You said "sre_parse._uniq function, which does not exist in Python <= 3.6." So no need to think about 3.6.
@methane you were right, list(dict.fromkeys(items)) seems to be a bit faster. Changed ✅.
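One way to check the speed claim is a small `timeit` comparison such as the sketch below. The function names, input size, and repeat count are arbitrary choices for illustration, not values from the PR, and results will vary by machine and Python version, so no timings are quoted.

```python
import timeit


def uniq_seen_set(items):
    # Variant from the first revision of the patch: a set of seen items
    # plus a list comprehension. seen_add() returns None, which is falsy.
    seen = set()
    seen_add = seen.add
    return [it for it in items if not (it in seen or seen_add(it))]


def uniq_fromkeys(items):
    # Variant suggested in review; relies on dict insertion order (3.7+).
    return list(dict.fromkeys(items))


data = [i % 5000 for i in range(50000)]  # hypothetical input with many duplicates
assert uniq_seen_set(data) == uniq_fromkeys(data)

t_seen = timeit.timeit(lambda: uniq_seen_set(data), number=20)
t_dict = timeit.timeit(lambda: uniq_fromkeys(data), number=20)
print(f"seen-set comprehension: {t_seen:.4f}s")
print(f"dict.fromkeys:          {t_dict:.4f}s")
```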
Please add a NEWS entry.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again".
I have made the requested changes; please review again
Thanks for making the requested changes! @serhiy-storchaka: please review the changes made to this pull request.
@@ -0,0 +1 @@
Fix performance regression on regular expression parsing
Maybe add more details? Like "... on parsing regular expression with huge character set".
Please add also "Patch by your name." at the end and add your name in Misc/ACKS.
Good first contribution!
> Please add also "Patch by your name." at the end and add your name in Misc/ACKS.
Thanks for the reminder!
Thanks @yannvgn for the PR, and @serhiy-storchaka for merging it 🌮🎉. I'm working now to backport this PR to: 3.7, 3.8.
GH-15059 is a backport of this pull request to the 3.8 branch.
GH-15060 is a backport of this pull request to the 3.7 branch.
…ythonGH-15030) Improve performance of sre_parse._uniq function. (cherry picked from commit 9f55551) Co-authored-by: yannvgn <hi@yannvgn.io>
…ythonGH-15030) Improve performance of sre_parse._uniq function.
On complex cases, parsing regular expressions takes much, much longer on Python >= 3.7
Example (ipython):
The test was run on Amazon Linux AMI 2017.03.
On Python 3.6.1, the regexp compiled in ~2.6 seconds:
On Python 3.7.3, the regexp compiled in ~15 minutes (~350x increase in this case):
Doing some profiling with cProfile shows that the issue is caused by the sre_parse._uniq function, which does not exist in Python <= 3.6.
The complexity of this function is on average O(N^2) but can be easily reduced to O(N).
The issue might not be noticeable with simple regexps, but programs like text tokenizers - which use complex regexps - might really be impacted by this regression.
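The complexity difference can be illustrated with the sketch below. The quadratic version is an assumed reconstruction, based only on the `return newitems` fragment visible in the diff, and is not copied from CPython. The linear version is the one adopted in review. The function names and test data are hypothetical.

```python
def uniq_quadratic(items):
    # Assumed shape of the slow version: "item not in newitems" scans a
    # list, so each membership test is O(N) and the whole loop is O(N^2).
    newitems = []
    for item in items:
        if item not in newitems:
            newitems.append(item)
    return newitems


def uniq_linear(items):
    # Dict insertion and lookup are O(1) on average, so this is O(N) overall.
    return list(dict.fromkeys(items))


data = list(range(2000)) + list(range(2000))  # hypothetical input with duplicates
assert uniq_quadratic(data) == uniq_linear(data) == list(range(2000))
```

Both functions return the same ordered result; only the cost of the membership test differs.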
https://bugs.python.org/issue37723
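The profiling step mentioned in the report can be approximated with the sketch below. The pattern is a hypothetical stand-in (a large character class), not the regexp from the original report, which is not shown here. On an affected 3.7 build, `_uniq` would be expected near the top of the output, while fixed versions should show a much smaller total.

```python
import cProfile
import io
import pstats
import re

# Hypothetical pattern: one character class with a few thousand members.
pattern = "[" + "".join(chr(c) for c in range(0x4E00, 0x4E00 + 3000)) + "]"

profiler = cProfile.Profile()
profiler.enable()
re.compile(pattern)  # parsing happens here; the pattern is not cached yet
profiler.disable()

# Collect the ten most expensive calls by cumulative time into a string.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```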