-
-
Notifications
You must be signed in to change notification settings - Fork 30.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-96346: Use double caching for re._compile() #96347
Conversation
serhiy-storchaka
commented
Aug 27, 2022
•
edited by bedevere-bot
Loading
edited by bedevere-bot
- Issue: Use double caching for re._compile() #96346
5c69476
to
9151549
Compare
This looks like a reasonable change. Consider whether OrderedDict should be used in the cache where the order changes. We've shown that It is fine to go ahead with this PR since it is improves on what we have now. |
It was discussed in #76519, performance difference is negligible. On other hand, OrderedDict adds a dependence. And it would be much worse if it is implemented in Python. |
I added some data in the issue. |
Lib/re/__init__.py
Outdated
# _cache2 uses the simple FIFO policy which has better latency. | ||
# _cache uses the LRU policy which has better hit rate. | ||
# OrderedDict is not used because it adds a new dependence, and | ||
# performance difference is negligible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would omit the OrderedDict part of the comment. It is debatable and doesn't need to be in the code. The important part is the two lines before that explain the two caches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be an explanation why OrderedDict is not used at first place, no?
Also, I was afraid that some new contributor passing through the code and not aware about the history of this code can submit a PR with an "obvious" improvement, and it can be merged while I am not here. The history of this code contains many changes and reversions.
I removed this, and hope the new comment about next(iter(_cache))
will be enough.
Side discussion: I was looking at the code in |
Hi there, it's tomllib author here! Tomllib's cache does not have an explicit bound, but is implicitly bound to 2880 items. This stems from the fact the the function's input always has to go through a regex before the lru_cached function is called. The number 2880 comes from |
@hukkin I suggest adding a comment to that effect. |
Thank you Raymond for looking at this yet one time. As for the limit, I think it is needed here, because RE patterns can be created on-fly, depending on the user data, and some programs may use millions of different patterns, but every pattern is only used once. |
p = _compiler.compile(pattern, flags) | ||
if not (flags & DEBUG): | ||
|
||
key = (type(pattern), pattern, flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this be defined before the try
/except
above?
I also wonder if an if key in _cache2: return _cache2[key]
would be more efficient than the try
/except
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All this would add a small but measurable overhead in the common case. I tested this when I wrote the current implementation.
if not (flags & DEBUG): | ||
|
||
key = (type(pattern), pattern, flags) | ||
p = _cache.pop(key, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does it remove it from the _cache
?
I think it would be better to add a comment to elaborate a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we need to move it to the end.
* main: (38 commits) pythongh-92886: make test_ast pass with -O (assertions off) (pythonGH-98058) pythongh-92886: make test_coroutines pass with -O (assertions off) (pythonGH-98060) pythongh-57179: Add note on symlinks for os.walk (python#94799) pythongh-94808: Fix regex on exotic platforms (python#98036) pythongh-90085: Remove vestigial -t and -c timeit options (python#94941) pythonGH-83901: Improve Signature.bind error message for missing keyword-only params (python#95347) pythongh-61105: Add default param, note on using cookiejar subclass (python#95427) pythongh-96288: Add a sentence to `os.mkdir`'s docstring. (python#96271) pythongh-96073: fix backticks in NEWS entry (pythonGH-98056) pythongh-92886: [clinic.py] raise exception on invalid input instead of assertion (pythonGH-98051) pythongh-97997: Add col_offset field to tokenizer and use that for AST nodes (python#98000) pythonGH-88968: Reject socket that is already used as a transport (python#98010) pythongh-96346: Use double caching for re._compile() (python#96347) pythongh-91708: Revert params note in urllib.parse.urlparse table (python#96699) pythongh-96265: Fix some formatting in faq/design.rst (python#96924) pythongh-73196: Add namespace/scope clarification for inheritance section (python#92840) pythongh-97646: Change `.js` and `.mjs` files mimetype to conform to RFC 9239 (python#97934) pythongh-97923: Always run Ubuntu SSL tests with others in CI (python#97940) pythongh-97956: Mention `generate_global_objects.py` in `AC How-To` (python#97957) pythongh-96959: Update HTTP links which are redirected to HTTPS (python#98039) ...