bpo-37723: fix performance regression on regular expression parsing #15030

Merged

yannvgn
Contributor

@yannvgn yannvgn commented Jul 30, 2019

On complex cases, parsing regular expressions takes much, much longer on Python >= 3.7

Example (ipython):

In [1]: import re
In [2]: char_list = ''.join([chr(i) for i in range(0xffff)])
In [3]: long_char_list = char_list * 10
In [4]: pattern = f'[{re.escape(long_char_list)}]'
In [5]: %time compiled = re.compile(pattern)

The test was run on Amazon Linux AMI 2017.03.

On Python 3.6.1, the regexp compiled in ~2.6 seconds:

CPU times: user 2.59 s, sys: 30 ms, total: 2.62 s
Wall time: 2.64 s

On Python 3.7.3, the regexp compiled in ~15 minutes (~350x increase in this case):

CPU times: user 15min 6s, sys: 240 ms, total: 15min 7s
Wall time: 15min 9s

Profiling with cProfile shows that the issue is caused by the sre_parse._uniq function, which does not exist in Python <= 3.6.
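For readers who want to repeat the profiling step, a minimal sketch with cProfile and pstats is shown below. The character range is deliberately much smaller than in the example above so that it finishes quickly; the sizes and variable names are illustrative assumptions, not taken from the original report.

```python
import cProfile
import io
import pstats
import re

# Assumption: a much smaller character set than in the report, to keep the run short.
chars = ''.join(chr(i) for i in range(0x400))
pattern = f'[{re.escape(chars * 2)}]'

re.purge()  # clear the re module's cache so the pattern is really parsed

profiler = cProfile.Profile()
profiler.enable()
re.compile(pattern)
profiler.disable()

# Collect the ten entries with the highest cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats('cumulative')
stats.print_stats(10)
report = buffer.getvalue()
print(report)
```

On an affected interpreter with a large enough pattern, the description above suggests that _uniq would show up near the top of such a report.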

The complexity of this function is on average O(N^2) but can be easily reduced to O(N).
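To illustrate the difference in complexity, here is a rough sketch of two order-preserving de-duplication helpers. The function names are hypothetical and neither is a verbatim copy of the sre_parse code; the first tests membership against a list (a linear scan per item), the second against a set (an average constant-time lookup).

```python
def uniq_quadratic(items):
    """Order-preserving dedup using list membership: O(N^2) in the worst case."""
    result = []
    for item in items:
        if item not in result:  # linear scan of the list on every iteration
            result.append(item)
    return result


def uniq_linear(items):
    """Order-preserving dedup using a set for membership: O(N) on average."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:  # hash lookup, O(1) on average
            seen.add(item)
            result.append(item)
    return result


sample = ["a", "b", "a", "c", "b"]
print(uniq_quadratic(sample))  # ['a', 'b', 'c']
print(uniq_linear(sample))     # ['a', 'b', 'c']
```

Both helpers return the same list; only the cost of the membership test differs.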

The issue might not be noticeable with simple regexps, but programs like text tokenizers - which use complex regexps - might really be impacted by this regression.

https://bugs.python.org/issue37723

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Our records indicate we have not received your CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day before our records are updated.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

Lib/sre_parse.py Outdated
-    return newitems
+    seen_items = set()
+    seen_items_add = seen_items.add
+    return [item for item in items if not (item in seen_items or seen_items_add(item))]
Member

How about this?

def _uniq(items):
    return list(dict.fromkeys(items))

@alvations alvations Jul 31, 2019

Since dict.fromkeys() already jumbles up the items, unless there's a need to return a list, returning a set might be more efficient when checking for uniqueness, i.e.

def _uniq(items):
    return set(dict.fromkeys(items))

But it seems like the original _uniq() function returns a list. Is there a reason to do that?

Member

set(dict.fromkeys(X)) is totally useless. When you need a set, set(X) is enough.
But in this case, we need list, not set. So I used list(dict.fromkeys(X)).
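A small sketch of the distinction being made here, using made-up sample data: on Python 3.7+, where dicts keep insertion order, list(dict.fromkeys(...)) keeps the first occurrence of each item in its original position, whereas wrapping the result in set() discards that order and yields the same set as set(...) on the input.

```python
items = ["z", "a", "z", "m", "a"]

# Order-preserving dedup: keys are inserted in first-seen order (Python 3.7+).
deduped = list(dict.fromkeys(items))
print(deduped)  # ['z', 'a', 'm']

# Going through dict.fromkeys adds nothing when a set is the goal.
as_set = set(items)
print(set(dict.fromkeys(items)) == as_set)  # True
```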

Contributor Author

@yannvgn yannvgn Jul 31, 2019

We indeed need a list. And I guess that preserving the order is important, looking at the initial implementation.

I'm fine with list(dict.fromkeys(items)) (and it's much prettier), but I have 2 minor concerns:

  • list(dict.fromkeys(items)) might be a bit slower (though I didn't check, and the difference is negligible I think)
  • we have to make sure this won't be backported to python < 3.7 as the dict order is not necessarily the insert order for python < 3.7 (but again, the bug only affects >=3.7 and the function does not exist on <3.7, so it won't be backported anyway)
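As an aside on the ordering concern, a version that does not rely on plain dict ordering could be sketched with collections.OrderedDict. This is only an illustration with a hypothetical helper name; it is not part of the patch under discussion.

```python
from collections import OrderedDict


def uniq_ordered(items):
    # OrderedDict remembers insertion order regardless of how plain dict behaves.
    return list(OrderedDict.fromkeys(items))


print(uniq_ordered([3, 1, 3, 2, 1]))  # [3, 1, 2]
```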

What do you think?

Member

  • list(dict.fromkeys(items)) might be a bit slower (though I didn't check, and the difference is negligible I think)

I believe list(dict.fromkeys(items)) is significantly faster than [it for it in items if it not in seen or seen.add(it)]

  • we have to make sure this won't be backported to python < 3.7 as the dict order is not necessarily the insert order for python < 3.7 (but again, the bug only affects >=3.7 and the function does not exist on <3.7, so it won't be backported anyway)

You said "sre_parse._uniq function, which does not exist in Python <= 3.6."
So no need to think about 3.6.

Contributor Author

@methane you were right, list(dict.fromkeys(items)) seems to be a bit faster. Changed ✅.
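One way to check the relative speed locally is a timeit comparison along these lines. This is a minimal sketch: the helper names and the test data are assumptions, and the actual numbers will vary by machine and Python version.

```python
import timeit


def uniq_set_comprehension(items):
    # Set-based approach from the first revision of the patch.
    seen = set()
    seen_add = seen.add
    return [item for item in items if not (item in seen or seen_add(item))]


def uniq_dict_fromkeys(items):
    # Approach suggested in review; relies on dict insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


# Assumed workload: 50,000 items with 5,000 distinct values.
data = [i % 5000 for i in range(50000)]
assert uniq_set_comprehension(data) == uniq_dict_fromkeys(data)

for func in (uniq_set_comprehension, uniq_dict_fromkeys):
    elapsed = timeit.timeit(lambda: func(data), number=20)
    print(f"{func.__name__}: {elapsed:.4f} s for 20 runs")
```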

Member

@serhiy-storchaka serhiy-storchaka left a comment

Please add a NEWS entry.

@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@yannvgn
Contributor Author

yannvgn commented Jul 31, 2019

I have made the requested changes; please review again

@bedevere-bot

Thanks for making the requested changes!

@serhiy-storchaka: please review the changes made to this pull request.

@@ -0,0 +1 @@
Fix performance regression on regular expression parsing
Member

Maybe add more details? Like "... on parsing regular expression with huge character set".

Please add also "Patch by your name." at the end and add your name in Misc/ACKS.

Good first contribution!

Contributor Author

Please add also "Patch by your name." at the end and add your name in Misc/ACKS.

Thanks for the reminder!

@miss-islington
Contributor

Thanks @yannvgn for the PR, and @serhiy-storchaka for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7, 3.8.
🐍🍒⛏🤖

@bedevere-bot

GH-15059 is a backport of this pull request to the 3.8 branch.

@bedevere-bot

GH-15060 is a backport of this pull request to the 3.7 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jul 31, 2019
…ythonGH-15030)

Improve performance of sre_parse._uniq function.
(cherry picked from commit 9f55551)

Co-authored-by: yannvgn <hi@yannvgn.io>
miss-islington added a commit that referenced this pull request Jul 31, 2019
…H-15030)

Improve performance of sre_parse._uniq function.
(cherry picked from commit 9f55551)

Co-authored-by: yannvgn <hi@yannvgn.io>
lisroach pushed a commit to lisroach/cpython that referenced this pull request Sep 10, 2019
DinoV pushed a commit to DinoV/cpython that referenced this pull request Jan 14, 2020
websurfer5 pushed a commit to websurfer5/cpython that referenced this pull request Jul 20, 2020
Labels
performance Performance or resource usage