
bpo-43452: Microoptimizations to PyType_Lookup #24804

Merged
merged 3 commits into python:master on Mar 20, 2021

Conversation

@DinoV DinoV commented Mar 9, 2021

The common case going through _PyType_Lookup is a cache hit. There are a few small tweaks which can make this a little cheaper:

  1. The name's identity is used for a cache hit, and the name is kept alive by the cache. So there's no need to read the hash code of the name - instead its address can be used as the hash.

  2. There's no need to check whether the name is cacheable on the lookup either: it probably is, and if it is, it'll be in the cache.

  3. If we clear the version tag when invalidating a type, then we don't actually need to check for a valid version tag bit.

https://bugs.python.org/issue43452

@@ -32,8 +32,7 @@ class object "PyObject *" "&PyBaseObject_Type"
          & ((1 << MCACHE_SIZE_EXP) - 1))

 #define MCACHE_HASH_METHOD(type, name) \
-        MCACHE_HASH((type)->tp_version_tag, \
-                    ((PyASCIIObject *)(name))->hash)
+        MCACHE_HASH((type)->tp_version_tag, ((Py_ssize_t)(name)) >> 3)
Member

I suspect that id(s) >> 3 is a much worse hash function than hash(s).
Is the cost of an extra memory read lower than the cost of the additional collisions?

Contributor Author

I think this cache is so wildly successful that it doesn't actually matter a whole lot one way or another. It does do a little bit worse on collisions, but it doesn't seem to impact the hit rate much. This is running python -m test.regrtest and logging the stats at the end of the run:

>>3:
{'total_slots': 4096, 'occupied_slots': 4096, 'num_hits': 20931459, 'num_misses': 4292, 'num_collisions': 127068, 'num_uncacheable': 1, 'num_mro_steps': 1170327}

((PyASCIIObject *)(name))->hash:
{'total_slots': 4096, 'occupied_slots': 4096, 'num_hits': 20922787, 'num_misses': 4304, 'num_collisions': 107251, 'num_uncacheable': 1, 'num_mro_steps': 1133300}

method_cache_hits is bumped in the hit case
method_cache_collisions is bumped when adding a new entry and replacing the old one
method_cache_misses is bumped when adding a new entry and there isn't an existing one

Member

Do you have any evidence as to which is faster?
If not, then the status quo wins.

My intuition is that the better spread from using the hash is likely to give better worst-case performance, but in general there would be no measurable difference.

Contributor Author

I think, per @methane running the benchmark suite, it does look better with the object hash. I've also set up a small micro-benchmark and put it in https://bugs.python.org/issue43452, and it shows a slight win there as well. I had to write it in C to get any clear signal one way or another, though; just doing the tight loop in Python seemed to have too much noise.

@markshannon
Member

I like this in general.

I'm not convinced by (1) as it may lead to more collisions. I'd be happy to be proved wrong, if you have data.
(2) and (3) are definitely improvements.

@vstinner
Member

Can you try to write a micro-benchmark to show the speedup?

@DinoV
Contributor Author

DinoV commented Mar 10, 2021

Can you try to write a micro-benchmark to show the speedup?

I'll make an attempt at it and report back, but I think it'll be hard to get enough signal even w/ a microbenchmark. When I originally measured this against re-played Instagram traffic we saw a reduction in _PyType_Lookup from 1.7% of CPU usage to 1.6% of CPU usage. So it is a pretty small win overall, and may be dwarfed by all the other interpreter overhead in a benchmark.

@pablogsal
Member

This is great work.

I agree with @markshannon: I am sold on (2) and (3) but not quite convinced by (1). Still, I think it should be fine because I expect quite a small number of cache misses.

@DinoV
Contributor Author

DinoV commented Mar 10, 2021

This is great work.

I agree with @markshannon: I am sold on (2) and (3) but not quite convinced by (1). Still, I think it should be fine because I expect quite a small number of cache misses.

One additional tweak here could be to do >> 2 on 32-bit platforms, although I'm not sure how much that'll really matter either.

@methane
Member

methane commented Mar 11, 2021

One additional tweak here could be to do >> 2 on 32-bit platforms, although I'm not sure how much that'll really matter either.

Objects are aligned to 16 bytes on 64-bit platforms and 8 bytes on 32-bit platforms, so >>4 on 64-bit and >>3 on 32-bit would be better.

@methane
Member

methane commented Mar 11, 2021

Objects are aligned to 16 bytes on 64-bit platforms and 8 bytes on 32-bit platforms, so >>4 on 64-bit and >>3 on 32-bit would be better.

Since unicode objects are larger than the minimum alignment, more than just the low 4 bits are non-random.

>>> sys.getsizeof("a")
50
>>> (50+15)//16
4

But the lower bits of tp_version_tag are more random than the higher bits, so I think there is no need to throw away all of the non-random lower bits of the unicode pointer.

@methane
Member

methane commented Mar 11, 2021

But the lower bits of tp_version_tag are more random than the higher bits, so I think there is no need to throw away all of the non-random lower bits of the unicode pointer.

I forgot that we use only 12 bits for the hash, so all bits are important. I posted the benchmark result for >>3 on the bpo issue. I will benchmark >>4 and >>5 too.

@vstinner
Member

About the >>4 shift, see also the _Py_HashPointer() function:

    /* bottom 3 or 4 bits are likely to be 0; rotate y by 4 to avoid
       excessive hash collisions for dicts and sets */
    y = (y >> 4) | (y << (8 * SIZEOF_VOID_P - 4));

@vstinner
Member

Objects are aligned to 16 bytes on 64-bit platforms and 8 bytes on 32-bit platforms, so >>4 on 64-bit and >>3 on 32-bit would be better.

Should we update _Py_HashPointer()?

@methane
Member

methane commented Mar 11, 2021

I forgot that we use only 12 bits for the hash, so all bits are important. I posted the benchmark result for >>3 on the bpo issue. I will benchmark >>4 and >>5 too.

The performance differences between >>3, >>4, and >>5 are not significant.
All of them are faster than the status quo (result).

@methane
Member

methane commented Mar 19, 2021

I think this change is significant enough to have a news entry.

@gvanrossum gvanrossum left a comment

LGTM. But a news entry would be nice.

@DinoV
Contributor Author

DinoV commented Mar 20, 2021

Can you try to write a micro-benchmark to show the speedup?

@vstinner I've added a small C micro-benchmark in https://bugs.python.org/issue43452

@vstinner
Member

obj hash: 0m6.222s
str hash: 0m6.327s
baseline: 0m6.784s

Thanks for proving that the optimization actually makes the code faster :-)

@gvanrossum gvanrossum left a comment

Okay, let’s land this. Thanks Dino for going the extra mile!

@pablogsal pablogsal merged commit ee48c7d into python:master Mar 20, 2021
@pablogsal
Member

Thanks a lot @DinoV for the fantastic work!

jab added a commit to jab/cpython that referenced this pull request Mar 20, 2021
* master: (129 commits)
  bpo-43452: Micro-optimizations to PyType_Lookup (pythonGH-24804)
  ...
@DinoV DinoV deleted the pytype_lookup branch May 31, 2024 18:23