-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-43452: Microoptimizations to PyType_Lookup #24804
Conversation
@@ -32,8 +32,7 @@ class object "PyObject *" "&PyBaseObject_Type" | |||
& ((1 << MCACHE_SIZE_EXP) - 1)) | |||
|
|||
#define MCACHE_HASH_METHOD(type, name) \ | |||
MCACHE_HASH((type)->tp_version_tag, \ | |||
((PyASCIIObject *)(name))->hash) | |||
MCACHE_HASH((type)->tp_version_tag, ((Py_ssize_t)(name)) >> 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect that id(s) >> 3
is a much worse hash function than hash(s)
.
Is the cost of an extra memory read lower than the cost of the additional collisions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this cache is so wildly successful that it doesn't actually matter a whole lot one way or another. It does do a little bit worse on collisions, but it doesn't seem to impact the hit rate much. This is running python -m test.regrtest and logging the stats at the end of the run:
>>3
:
{'total_slots': 4096, 'occupied_slots': 4096, 'num_hits': 20931459, 'num_misses': 4292, 'num_collisions': 127068, 'num_uncacheable': 1, 'num_mro_steps': 1170327}
((PyASCIIObject *)(name))->hash
:
{'total_slots': 4096, 'occupied_slots': 4096, 'num_hits': 20922787, 'num_misses': 4304, 'num_collisions': 107251, 'num_uncacheable': 1, 'num_mro_steps': 1133300}
method_cache_hits is bumped in the hit case
method_cache_collisions is bumped when adding a new entry and replacing the old one
method_cache_misses is bumped when adding a new entry and there isn't an existing one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any evidence as to which is faster?
If not, then the status quo wins.
My intuition is that the better spread from using the hash is likely to have better worse case performance, but in general there would be no measurable difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think per @methane running the benchmark suite it does look better with the object hash. I've also setup a small micro-benchmark and put it in https://bugs.python.org/issue43452 and it shows a slight win there as well. I had to write it in C to get any clear signal one way or another though, just doing the tight loop in Python seemed to have too much noise.
I like this in general. I'm not convinced by (1) as it may lead to more collisions. I'd be happy to be proved wrong, if you have data. |
Can you try to write a micro-benchmark to show the speedup? |
I'll make an attempt at it and report back, but I think it'll be hard to get enough signal even w/ a microbenchmark. When I originally measured this against re-played Instagram traffic we saw a reduction in _PyType_Lookup from 1.7% of CPU usage to 1.6% of CPU usage. So it is a pretty small win overall, and may be dwarfed by all the other interpreter overhead in a benchmark. |
This is great work. I agree with @markshannon, I think that I am sold with (2) and (3) but I am quite not convinced by (1), but I think it should be fine because I expect quite a small amount of cache misses. |
One additional tweak here could be to do >> 2 on 32-bit platforms, although I'm not sure how much that'll really matter either. |
Objects are aligned by 16byte on 64bit and 8byte on 32bit platforms. So |
Since unicode objects is larger than minimum alignment, more than 4bits are not random.
But lower bits of |
I forgot about we use only 12bits for hash. All bits are important. I posted benchmark result of |
About
|
Should we update _Py_HashPointer()? |
Performance difference between |
I think this change is significant enough to have a news entry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. But a news entry would be nice.
@vstinner I've added a small C micro-benchmark in https://bugs.python.org/issue43452 |
Thanks for proving that the optimization actually makes the code faster :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, let’s land this. Thanks Dino for going the extra mile!
Thanks a lot @DinoV for the fantastic work! |
* master: (129 commits) bpo-43452: Micro-optimizations to PyType_Lookup (pythonGH-24804) bpo-43517: Fix false positive in detection of circular imports (python#24895) bpo-43494: Make some minor changes to lnotab notes (pythonGH-24861) Mention that code.co_lnotab is deprecated in what's new for 3.10. (python#24902) bpo-43244: Remove symtable.h header file (pythonGH-24910) bpo-43466: Add --with-openssl-rpath configure option (pythonGH-24820) Fix a typo in c-analyzer (pythonGH-24468) bpo-41561: Add workaround for Ubuntu's custom security level (pythonGH-24915) bpo-43521: Allow ast.unparse with empty sets and NaN (pythonGH-24897) bpo-43244: Remove the PyAST_Validate() function (pythonGH-24911) bpo-43541: Fix PyEval_EvalCodeEx() regression (pythonGH-24918) bpo-43244: Fix test_peg_generators on Windows (pythonGH-24913) bpo-39342: Expose X509_V_FLAG_ALLOW_PROXY_CERTS in ssl module (pythonGH-18011) bpo-43244: Fix test_peg_generator for PyAST_Validate() (pythonGH-24912) bpo-42128: Add 'missing :' syntax error message to match statements (pythonGH-24733) bpo-43244: Add pycore_ast.h header file (pythonGH-24908) bpo-43244: Rename pycore_ast.h to pycore_ast_state.h (pythonGH-24907) Remove unnecessary imports in the grammar parser (pythonGH-24904) bpo-35883: Py_DecodeLocale() escapes invalid Unicode characters (pythonGH-24843) Add PEP 626 to what's new in 3.10. (python#24892) ...
The common case going through _PyType_Lookup is to have a cache hit. There's some small tweaks which can make this a little cheaper:
the name field identity is used for a cache hit, and is kept alive by the cache. So there's no need to read the hash code of the name - instead the address can be used as the hash.
There's no need to check if the name is cachable on the lookup either, it probably is, and if it is, it'll be in the cache.
If we clear the version tag when invalidating a type then we don't actually need to check for a valid version tag bit.
https://bugs.python.org/issue43452