gh-96346: Use double caching for re._compile() #96347

serhiy-storchaka · 2022-08-27T20:18:26Z

Issue: Use double caching for re._compile() #96346

rhettinger · 2022-08-27T20:41:53Z

This looks like a reasonable change.

Consider whether OrderedDict should be used in the cache where the order changes. We've shown that next(iter(d)) is quadratic while od.popitem(False) is linear. For the cache sizes in the re module, this might not matter or might be outweighed by OrderedDict's slower __getitem__ and __setitem__.

It is fine to go ahead with this PR since it improves on what we have now.
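
For readers following along, the two ways of dropping the oldest entry mentioned in this comment can be sketched as below. This is only an illustration: the helper names `evict_oldest_dict` and `evict_oldest_ordered` are hypothetical and do not appear in the PR.

```python
from collections import OrderedDict


def evict_oldest_dict(d):
    """Remove and return the oldest key of a plain dict.

    Plain dicts preserve insertion order, so the first key yielded
    by iteration is the oldest one.
    """
    oldest = next(iter(d))
    del d[oldest]
    return oldest


def evict_oldest_ordered(od):
    """Remove and return the oldest key of an OrderedDict."""
    key, _value = od.popitem(last=False)
    return key


plain = {"a": 1, "b": 2, "c": 3}
ordered = OrderedDict(plain)

print(evict_oldest_dict(plain), list(plain))
print(evict_oldest_ordered(ordered), list(ordered))
```

Both helpers remove the same entry; the trade-off raised in the comment is only about their cost and about OrderedDict's slower item access.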

serhiy-storchaka · 2022-08-28T05:53:08Z

It was discussed in #76519; the performance difference is negligible. On the other hand, OrderedDict adds a dependency, and it would be much worse if OrderedDict is implemented in Python.

serhiy-storchaka · 2022-08-28T08:45:36Z

I added some data in the issue.

rhettinger · 2022-08-30T16:02:48Z

Lib/re/__init__.py

+# _cache2 uses the simple FIFO policy which has better latency.
+# _cache uses the LRU policy which has better hit rate.
+# OrderedDict is not used because it adds a new dependence, and
+# performance difference is negligible.


I would omit the OrderedDict part of the comment. It is debatable and doesn't need to be in the code. The important part is the two lines before that explain the two caches.

serhiy-storchaka · 2022-08-31

There should be an explanation of why OrderedDict is not used in the first place, no?

Also, I was afraid that a new contributor passing through the code, unaware of its history, could submit a PR with an "obvious" improvement, and that it could be merged while I am not here. The history of this code contains many changes and reversions.

I removed this, and hope the new comment about next(iter(_cache)) will be enough.
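
The two-cache idea described in the code comment under review (a small FIFO cache in front of a larger LRU cache) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from the PR: the class name `DoubleCache`, the tiny sizes, and the `compute` callback are all hypothetical, and the sketch assumes `compute` never returns `None`.

```python
class DoubleCache:
    """Sketch of a small FIFO front cache backed by a larger LRU cache."""

    def __init__(self, compute, lru_size=4, fifo_size=2):
        self._compute = compute
        self._lru = {}    # least recently used first, most recent last
        self._fifo = {}   # oldest inserted first
        self._lru_size = lru_size
        self._fifo_size = fifo_size

    def get(self, key):
        # Fast path: a single dict lookup with no reordering.
        try:
            return self._fifo[key]
        except KeyError:
            pass

        # Slow path: pop so that the re-insertion below moves the key
        # to the end of the LRU dict.
        value = self._lru.pop(key, None)
        if value is None:
            value = self._compute(key)
            if len(self._lru) >= self._lru_size:
                # Drop the least recently used entry.
                del self._lru[next(iter(self._lru))]
        self._lru[key] = value

        if len(self._fifo) >= self._fifo_size:
            # Drop the oldest inserted entry.
            del self._fifo[next(iter(self._fifo))]
        self._fifo[key] = value
        return value


calls = []


def compute(key):
    calls.append(key)
    return key.upper()


cache = DoubleCache(compute)
for k in "abaca":
    cache.get(k)
print(calls)
```

In this sketch a hit in the front cache never touches the ordering of either dict, which is where the latency advantage mentioned in the comment would come from; only a front-cache miss pays for the LRU bookkeeping.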

rhettinger · 2022-08-30T16:12:18Z

Side discussion: I was looking at the code in tomllib and its cache has maxsize=None. I wondered whether its use pattern is substantially different or whether the policy should be the same as for re. Perhaps the tomllib module should be bounded, or perhaps the re module didn't really need a limit. Another alternative is to set a limit that is very high so that most users never hit the limit and never trigger an eviction. After all, the objects being cached are very small, so setting a size limit is arguably a premature optimization. The code would be much simpler and a little faster without a size limit.

hukkin · 2022-08-30T17:51:47Z

> Side discussion: I was looking at the code in tomllib and its cache has maxsize=None. I wondered whether its use pattern is substantially different or whether the policy should be the same as for re

Hi there, tomllib author here! Tomllib's cache does not have an explicit bound, but is implicitly bounded to 2880 items. This stems from the fact that the function's input always has to go through a regex before the lru_cached function is called.

The number 2880 comes from 24 hours * 60 minutes * 2 (offset direction).
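
The arithmetic can be checked by enumerating the inputs. The sketch below assumes, as the comment describes, that the regex only lets through two-digit hours 00 to 23, two-digit minutes 00 to 59, and a sign; the variable names are illustrative and not taken from tomllib.

```python
from itertools import product

# Every distinct (sign, hour, minute) combination that could reach the cache,
# assuming the regex admits only valid two-digit hours and minutes.
signs = ["+", "-"]
hours = [f"{h:02d}" for h in range(24)]
minutes = [f"{m:02d}" for m in range(60)]

offsets = set(product(signs, hours, minutes))
print(len(offsets))  # 2880
```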

rhettinger · 2022-08-30T23:59:23Z

@hukkin I suggest adding a comment to that effect.

serhiy-storchaka · 2022-08-31T06:35:44Z

Thank you, Raymond, for looking at this once more.

As for the limit, I think it is needed here, because RE patterns can be created on the fly, depending on user data, and some programs may use millions of different patterns, each of which is used only once.
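
A rough illustration of that concern, using functools.lru_cache rather than the re module's own cache: the function names and the bound of 512 are assumptions chosen for the example.

```python
from functools import lru_cache


@lru_cache(maxsize=512)  # illustrative bound, chosen only for this example
def compile_bounded(pattern):
    return ("compiled", pattern)


@lru_cache(maxsize=None)  # no eviction at all
def compile_unbounded(pattern):
    return ("compiled", pattern)


# Simulate a program that builds a fresh pattern from user data each time
# and never reuses it.
for i in range(10_000):
    pattern = f"user-{i}"
    compile_bounded(pattern)
    compile_unbounded(pattern)

print(compile_bounded.cache_info().currsize)    # 512
print(compile_unbounded.cache_info().currsize)  # 10000
```

With one-off keys the unbounded cache grows with every call, while the bounded one stays at its limit.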

ezio-melotti · 2022-09-18T12:04:10Z

Lib/re/__init__.py

-    p = _compiler.compile(pattern, flags)
-    if not (flags & DEBUG):
+
+    key = (type(pattern), pattern, flags)


Can't this be defined before the try/except above?

I also wonder whether `if key in _cache2: return _cache2[key]` would be more efficient than the try/except.

serhiy-storchaka · 2022-10-05

All this would add a small but measurable overhead in the common case. I tested this when I wrote the current implementation.
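
Anyone wanting to repeat that measurement could start from a sketch like the one below. It is not the benchmark used for the PR; the function names and the sample key are hypothetical, and the timings will vary by interpreter and machine, so no expected numbers are given. On a hit, the try/except form performs one dict lookup, while the membership test followed by indexing performs two.

```python
import timeit

key = (str, "a+b", 0)
cache = {key: "compiled"}


def lookup_try(k):
    # One lookup on a hit; the exception path is taken only on a miss.
    try:
        return cache[k]
    except KeyError:
        return None


def lookup_in(k):
    # Two lookups on a hit: the membership test, then the indexing.
    if k in cache:
        return cache[k]
    return None


for fn in (lookup_try, lookup_in):
    elapsed = timeit.timeit(lambda: fn(key), number=100_000)
    print(f"{fn.__name__}: {elapsed:.4f}s")
```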

ezio-melotti · 2022-09-18T12:04:47Z

Lib/re/__init__.py

-    if not (flags & DEBUG):
+
+    key = (type(pattern), pattern, flags)
+    p = _cache.pop(key, None)


Why does it remove it from the _cache?
I think it would be better to add a comment to elaborate a bit.

serhiy-storchaka · 2022-10-05

Because we need to move it to the end.
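
The effect of popping and re-inserting can be seen with a plain dict. This is a small standalone illustration, not code from the PR.

```python
# Popping a key and inserting it again moves it to the end of the dict.
cache = {"a": 1, "b": 2, "c": 3}
value = cache.pop("a", None)
cache["a"] = value
print(list(cache))  # ['b', 'c', 'a']

# Assigning to a key that is still present keeps its original position.
other = {"a": 1, "b": 2, "c": 3}
other["a"] = 1
print(list(other))  # ['a', 'b', 'c']
```

Since eviction removes the first key, moving a key to the end on every hit is what makes the dict behave as an LRU cache.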

serhiy-storchaka added the performance Performance or resource usage label Aug 27, 2022

serhiy-storchaka requested a review from rhettinger August 27, 2022 20:18

bedevere-bot added the awaiting core review label Aug 27, 2022

pythongh-96346: Use double caching for re._compile()

9151549

serhiy-storchaka force-pushed the re-compile-cache branch from 5c69476 to 9151549 Compare August 27, 2022 20:20

rhettinger approved these changes Aug 27, 2022

bedevere-bot added awaiting merge and removed awaiting core review labels Aug 27, 2022

Add some comments.

5fe3232

rhettinger reviewed Aug 30, 2022

Address review comments.

d619f6d

ezio-melotti reviewed Sep 18, 2022

serhiy-storchaka added 2 commits October 5, 2022 13:25

Merge branch 'main' into re-compile-cache

077e6dd

Add a comment.

bd618bf

ambv merged commit c11b667 into python:main Oct 7, 2022

bedevere-bot removed the awaiting merge label Oct 7, 2022

mpage pushed a commit to mpage/cpython that referenced this pull request Oct 11, 2022

pythongh-96346: Use double caching for re._compile() (python#96347)

053555c

hukkin mentioned this pull request Oct 28, 2024

tomllib: Add a comment about implicit lru_cache bound #126078

Merged
