bpo-37723: fix performance regression on regular expression parsing #15030

yannvgn · 2019-07-30T22:03:48Z

In complex cases, parsing regular expressions takes much, much longer on Python >= 3.7.

Example (ipython):

In [1]: import re
In [2]: char_list = ''.join([chr(i) for i in range(0xffff)])
In [3]: long_char_list = char_list * 10
In [4]: pattern = f'[{re.escape(long_char_list)}]'
In [5]: %time compiled = re.compile(pattern)

The test was run on Amazon Linux AMI 2017.03.

On Python 3.6.1, the regexp compiled in ~2.6 seconds:

CPU times: user 2.59 s, sys: 30 ms, total: 2.62 s
Wall time: 2.64 s

On Python 3.7.3, the regexp compiled in ~15 minutes (~350x increase in this case):

CPU times: user 15min 6s, sys: 240 ms, total: 15min 7s
Wall time: 15min 9s

Profiling with cProfile shows that the issue is caused by the sre_parse._uniq function, which does not exist in Python <= 3.6.
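For readers who want to repeat the profiling step, here is a minimal sketch using the standard cProfile and pstats modules. The input is deliberately much smaller than the one in the report so it finishes quickly on any version; the sizes chosen (0x800 characters, repeated 4 times) are arbitrary assumptions, not values from the original measurement.

```python
import cProfile
import pstats
import re

# Much smaller input than in the report above (sizes are arbitrary assumptions).
chars = "".join(chr(i) for i in range(0x800))
pattern = f"[{re.escape(chars * 4)}]"

re.purge()  # clear the regex cache so the pattern is really parsed

profiler = cProfile.Profile()
profiler.enable()
compiled = re.compile(pattern)
profiler.disable()

# Show the 10 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```

On an affected interpreter, the deduplication helper would be expected to appear near the top of this listing once the character set is large enough.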

The complexity of this function is on average O(N^2) but can be easily reduced to O(N).
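To illustrate the complexity claim, here is a minimal sketch contrasting two order-preserving deduplication strategies. The names uniq_quadratic and uniq_linear are hypothetical and the code is not copied from CPython; it only shows why testing membership against a list is quadratic while hashing is linear on average.

```python
def uniq_quadratic(items):
    """Order-preserving dedup; membership is tested against a list."""
    newitems = []
    for item in items:
        if item not in newitems:  # linear scan, so O(N^2) overall
            newitems.append(item)
    return newitems


def uniq_linear(items):
    """Order-preserving dedup; dict keys are hashed, O(N) on average."""
    return list(dict.fromkeys(items))


sample = [3, 1, 3, 2, 1]
assert uniq_quadratic(sample) == uniq_linear(sample) == [3, 1, 2]
```

Both functions return the same list; only the cost of the membership test differs.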

The issue might not be noticeable with simple regexps, but programs like text tokenizers - which use complex regexps - might really be impacted by this regression.

https://bugs.python.org/issue37723

the-knights-who-say-ni · 2019-07-30T22:03:52Z

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Our records indicate we have not received your CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day before our records are updated.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

methane · 2019-07-31T02:29:45Z

Lib/sre_parse.py

-    return newitems
+    seen_items = set()
+    seen_items_add = seen_items.add
+    return [item for item in items if not (item in seen_items or seen_items_add(item))]


How about this?

def _uniq(items):
    return list(dict.fromkeys(items))

alvations · Jul 31, 2019

Since dict.fromkeys() already jumbles up the items, unless there's a need to return a list, returning a set might be more optimal when checking for uniqueness, i.e.

def _uniq(items):
    return set(dict.fromkeys(items))

But it seems like the original _uniq() function returns a list. Is there a reason to do that?

methane · Jul 31, 2019

set(dict.fromkeys(X)) is totally useless. When you need a set, set(X) is enough.
But in this case, we need a list, not a set. So I used list(dict.fromkeys(X)).
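The difference between the two candidates can be seen on a small input. This sketch assumes Python 3.7 or later, where dict preserves insertion order; the sample data is made up for illustration.

```python
items = ["b", "a", "b", "c", "a"]

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
ordered = list(dict.fromkeys(items))
print(ordered)  # ['b', 'a', 'c']

# Wrapping the dict in set() adds nothing over set(items),
# and a set makes no promise about iteration order.
assert set(dict.fromkeys(items)) == set(items)
```

This is why a list built from dict.fromkeys() is the natural fit when the caller needs unique items in their original order.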

yannvgn · Jul 31, 2019

We indeed need a list. And I guess that preserving the order is important, looking at the initial implementation.

I'm fine with list(dict.fromkeys(items)) (and it's much prettier), but I have 2 minor concerns:

list(dict.fromkeys(items)) might be a bit slower (though I didn't check, and the difference is negligible I think)

we have to make sure this won't be backported to python < 3.7 as the dict order is not necessarily the insert order for python < 3.7 (but again, the bug only affects >=3.7 and the function does not exist on <3.7, so it won't be backported anyway)

What do you think?

methane · Jul 31, 2019

> list(dict.fromkeys(items)) might be a bit slower (though I didn't check, and the difference is negligible I think)

I believe list(dict.fromkeys(items)) is significantly faster than [it for it in items if not (it in seen or seen.add(it))]

> we have to make sure this won't be backported to python < 3.7 as the dict order is not necessarily the insert order for python < 3.7 (but again, the bug only affects >=3.7 and the function does not exist on <3.7, so it won't be backported anyway)

You said "sre_parse._uniq function, which does not exist in Python <= 3.6."
So no need to think about 3.6.

yannvgn · Jul 31, 2019

@methane you were right, list(dict.fromkeys(items)) seems to be a bit faster. Changed ✅.
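The speed comparison discussed in this thread can be checked with a small timeit script. This is only a sketch: the function names uniq_seen and uniq_fromkeys, the input size, and the repeat count are assumptions, and the actual timings depend on the interpreter and machine.

```python
import timeit


def uniq_seen(items):
    # Set-based variant from the first revision of this patch.
    seen_items = set()
    seen_items_add = seen_items.add
    return [item for item in items
            if not (item in seen_items or seen_items_add(item))]


def uniq_fromkeys(items):
    # Variant suggested in review; relies on ordered dicts (Python 3.7+).
    return list(dict.fromkeys(items))


data = list(range(10_000)) * 10  # arbitrary test input with many duplicates
assert uniq_seen(data) == uniq_fromkeys(data)

for func in (uniq_seen, uniq_fromkeys):
    elapsed = timeit.timeit(lambda: func(data), number=20)
    print(f"{func.__name__}: {elapsed:.4f} s for 20 runs")
```

Both variants are linear; the script only measures the constant-factor difference between them.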

serhiy-storchaka

Please add a NEWS entry.

bedevere-bot · 2019-07-31T16:36:58Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again". I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

yannvgn · 2019-07-31T16:50:20Z

I have made the requested changes; please review again

bedevere-bot · 2019-07-31T16:50:23Z

Thanks for making the requested changes!

@serhiy-storchaka: please review the changes made to this pull request.

serhiy-storchaka · 2019-07-31T16:55:30Z

Misc/NEWS.d/next/Library/2019-07-31-16-49-01.bpo-37723.zq6tw8.rst

@@ -0,0 +1 @@
+Fix performance regression on regular expression parsing


Maybe add more details? Like "... on parsing regular expression with huge character set".

Please add also "Patch by your name." at the end and add your name in Misc/ACKS.

Good first contribution!

yannvgn · Jul 31, 2019

> Please add also "Patch by your name." at the end and add your name in Misc/ACKS.

Thanks for the reminder!

miss-islington · 2019-07-31T18:50:42Z

Thanks @yannvgn for the PR, and @serhiy-storchaka for merging it 🌮🎉. I'm working now to backport this PR to: 3.7, 3.8.
🐍🍒⛏🤖

bedevere-bot · 2019-07-31T18:50:55Z

GH-15059 is a backport of this pull request to the 3.8 branch.

bedevere-bot · 2019-07-31T18:51:03Z

GH-15060 is a backport of this pull request to the 3.7 branch.

…ythonGH-15030) Improve performance of sre_parse._uniq function. (cherry picked from commit 9f55551) Co-authored-by: yannvgn <hi@yannvgn.io>

…H-15030) Improve performance of sre_parse._uniq function. (cherry picked from commit 9f55551) Co-authored-by: yannvgn <hi@yannvgn.io>

…ythonGH-15030) Improve performance of sre_parse._uniq function.

improve performance of sre_parse _uniq function

9ba3f96

the-knights-who-say-ni added the CLA not signed label Jul 30, 2019

bedevere-bot added the awaiting review label Jul 30, 2019

yannvgn mentioned this pull request Jul 30, 2019

first call to MosesTokenizer.tokenize is very slow hplt-project/sacremoses#61

Closed

methane reviewed Jul 31, 2019

simplify sre_parse._uniq function

7f556a8

the-knights-who-say-ni added CLA signed and removed CLA not signed labels Jul 31, 2019

serhiy-storchaka requested changes Jul 31, 2019

bedevere-bot added awaiting changes and removed awaiting review labels Jul 31, 2019

📜🤖 Added by blurb_it.

fa25ce4

bedevere-bot added awaiting change review and removed awaiting changes labels Jul 31, 2019

serhiy-storchaka reviewed Jul 31, 2019

more detailed NEWS entry + update ACKS

63b28d9

serhiy-storchaka approved these changes Jul 31, 2019

bedevere-bot added awaiting merge and removed awaiting change review labels Jul 31, 2019

serhiy-storchaka added the needs backport to 3.7 and performance labels Jul 31, 2019

serhiy-storchaka merged commit 9f55551 into python:master Jul 31, 2019

bedevere-bot removed the awaiting merge and needs backport to 3.8 labels Jul 31, 2019

bedevere-bot removed the needs backport to 3.7 label Jul 31, 2019

lisroach pushed a commit to lisroach/cpython that referenced this pull request Sep 10, 2019

bpo-37723: Fix performance regression on regular expression parsing. (p…

86162c1

…ythonGH-15030) Improve performance of sre_parse._uniq function.

DinoV pushed a commit to DinoV/cpython that referenced this pull request Jan 14, 2020

bpo-37723: Fix performance regression on regular expression parsing. (p…

bdfaf02

…ythonGH-15030) Improve performance of sre_parse._uniq function.

websurfer5 pushed a commit to websurfer5/cpython that referenced this pull request Jul 20, 2020

bpo-37723: Fix performance regression on regular expression parsing. (p…

d4afaa7

…ythonGH-15030) Improve performance of sre_parse._uniq function.

yannvgn commented Jul 30, 2019 (edited by bedevere-bot)
the-knights-who-say-ni commented Jul 30, 2019
methane Jul 31, 2019
alvations Jul 31, 2019 (edited)
methane Jul 31, 2019
yannvgn Jul 31, 2019 (edited)
methane Jul 31, 2019
yannvgn Jul 31, 2019
serhiy-storchaka left a comment
bedevere-bot commented Jul 31, 2019
yannvgn commented Jul 31, 2019
bedevere-bot commented Jul 31, 2019
serhiy-storchaka Jul 31, 2019
yannvgn Jul 31, 2019
miss-islington commented Jul 31, 2019
bedevere-bot commented Jul 31, 2019
bedevere-bot commented Jul 31, 2019
