PERF: Always using panda's hashtable approach, dropping np.in1d #36611
Conversation
Nice find. Not sure if we should just lower the limit in the condition to keep performance for a low number of elements to check, but I think this is OK
i am pretty sure we have some other isin benchmarks, can you run them.
This is a very odd result; I am pretty sure this numpy approach, at least when we added it, was WAY faster.
could be wrong, but see if you can find the issues that changed this and make sure that this is not in error.
I have run all benchmarks which used
The only tests with a difference were:
which I would say is an improvement overall. However, I have also changed my introduced benchmarks because they had a bias in the first version: most of the elements from the series were not in the
Now the benchmarks have another bias: every element from the series is present in
Interesting difference to the original benchmark:
My assumption is that the hash-functions for float64 (
See also #28303
Just to illustrate how bad/severe the issue is. Here is the currently used hash-function for
For 1000 keys, the khash set/map will use 2048 buckets, thus we need to look at the hashes modulo 2048:
This gives us only 2 different hashes - 0 and 1024 - but for int64 there are 1000 different hashes. This means that the look-up is O(n^2) for float64.
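The clustering mechanism is easy to reproduce. The sketch below uses the raw IEEE-754 bit pattern of the double as its hash - an illustration, not pandas' exact hash function (which mixes the bits differently) - and shows how whole-number floats collapse into almost no distinct khash buckets:

```python
import struct

def float64_bits(x: float) -> int:
    # Reinterpret the IEEE-754 bits of a double as an unsigned 64-bit int.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

# 1000 keys -> khash grows to 2048 buckets; the bucket is hash % 2048,
# i.e. only the low 11 bits of the hash matter.
N_BUCKETS = 2048
buckets = {float64_bits(float(k)) % N_BUCKETS for k in range(1000)}
print(len(buckets))  # -> 1: the low mantissa bits of 0.0 .. 999.0 are all zero
```

Small whole numbers need only a few high mantissa bits, so their low bits - exactly the bits that pick the bucket - are identical, and every key probes the same chain.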
The most obvious solution would be to use
This changed hash-function on top of this branch gave (only floats):
The important results are:
It looks promising, however
there are no hash-collisions and that is all Python's dict needs. However, khash has another strategy, and such a regular hash function probably means problems (by the way, it also means problems for

Another alternative which should be considered/tested is murmur-hash (https://github.com/aappleby/smhasher), which is used in gcc's libstdc++ (https://github.com/gcc-mirror/gcc/blob/41d6b10e96a1de98e90a7c0378437c3255814b16/libstdc%2B%2B-v3/libsupc%2B%2B/hash_bytes.cc#L25) as well as in libc++ (https://github.com/llvm/llvm-project/blob/1cfde143e82aeb47cffba436ba7b5302d8e14193/libcxx/include/utility#L977).
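To see why a murmur-style hash helps, here is a sketch using MurmurHash3's public-domain 64-bit finalizer (a stand-in for illustration; not necessarily the exact variant pandas would adopt) applied to the bit patterns of the same 1000 whole-number floats:

```python
import struct

MASK64 = (1 << 64) - 1

def fmix64(k: int) -> int:
    # MurmurHash3's 64-bit finalizer: mixes every input bit into every
    # output bit, so the low-order bucket bits look random.
    k ^= k >> 33
    k = (k * 0xFF51AFD7ED558CCD) & MASK64
    k ^= k >> 33
    k = (k * 0xC4CEB9FE1A85EC53) & MASK64
    k ^= k >> 33
    return k

def float64_bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]

N_BUCKETS = 2048
buckets = {fmix64(float64_bits(float(k))) % N_BUCKETS for k in range(1000)}
print(len(buckets))  # hundreds of distinct buckets instead of one or two
```

With the bits well mixed, the occupancy is close to what a random assignment of 1000 keys into 2048 buckets would give, and khash look-ups stay O(1).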
not averse to having a better float hash impl but would need some serious benchmarking
It is hard to tell what brought the speed-up. My first suspect would be #13436, but I'm not sure. However, the performance testing for this PR should probably wait until judgement is passed on #36729.
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
```python
param_names = ["dtype", "M"]

def setup(self, dtype, M):
    self.s = Series(np.random.randint(0, M, 10 ** 7)).astype(dtype)
```
IIUC we can expect big perf differences if the values here are clustered, e.g. if they are all negative or all >M in the extreme cases. Do we need to measure any non-uniform cases?
@jbrockmendel You are right, the test is a little bit naive. We should test at least (I have pushed them):
- average (random) case with elements in the look-up-table
- average (random) case with elements not in the look-up-table
- monotone series with elements in the look-up-table (better cache utilization for numpy's approach)
- monotone series with elements not in the look-up-table (better cache utilization for numpy's approach)
it looks as if float32/float64 are somewhat problematic, hopefully after #36729 it will become clearer what is going on.
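The four benchmark inputs described above can be sketched as follows (names and sizes are my own, not the PR's):

```python
import numpy as np

N = 1_000_000   # length of the series (illustrative)
M = 10_000      # size of the look-up table (illustrative)
rng = np.random.default_rng(42)

values = np.arange(M)  # the look-up table: 0 .. M-1

series_variants = {
    # average (random) case, every element present in the look-up table
    "random_hits": rng.integers(0, M, N),
    # average (random) case, no element present in the look-up table
    "random_misses": rng.integers(M, 2 * M, N),
    # monotone series with hits (better cache utilization for numpy's
    # sort-based approach)
    "monotone_hits": np.sort(rng.integers(0, M, N)),
    # monotone series with misses
    "monotone_misses": np.sort(rng.integers(M, 2 * M, N)),
}
```

The hit/miss split matters because a miss must exhaust the probe sequence (hash table) or the binary search (numpy), while monotone inputs keep numpy's sorted-array search in cache.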
Current time comparisons:
We should see how it looks once #36729 is resolved - right now float64/float32 seem to be problematic.
this looks way better even w/o #36729, but if you can re-run we will see.
also pls add a release note in performance.
pandas/core/algorithms.py (Outdated)
```diff
 else:
     f = np.in1d
-elif is_integer_dtype(comps):
+if is_integer_dtype(comps.dtype):
```
can you put some commentary / a link here on why we are not using np.in1d
…the search in the sorted arrays will have too few cache misses
The timings are (pls use
When the look-up is dominated by the calculation of the hash-function (small numbers), we see the disadvantages of #36729 - it is costlier now (almost a factor of 3). However, already for n around 100 we see the advantages of a more robust hash-function: for some series we are almost 10 times faster (e.g.

The question is: is it worth keeping

Seeing the numbers, I would say "Yes", even if it makes the code harder to understand/maintain. What is your opinion @jreback @jbrockmendel @WillAyd ?
yeah, the question here is that there is a good benefit for hashing when you have x values to look up, but if fewer, the constant-time table creation hits you - so not averse to having a switchover point for perf reasons at the expense of complexity
The maintenance burden for this optimization is pretty small, so I'm OK with keeping the switchover. I think the bigger optimization available (that would also decrease the [non-cython] complexity) is getting hashtables with more dtypes so we don't have to cast to int64, xref #33287
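The switchover being discussed can be sketched like this (a minimal illustration; the thresholds and the plain-`set` table are stand-ins, not pandas' exact constants or khash implementation):

```python
import numpy as np

def isin_sketch(comps: np.ndarray, values: np.ndarray) -> np.ndarray:
    # For a huge candidate array and only a handful of look-up values,
    # numpy's sort-based np.in1d can beat paying the constant cost of
    # building a hash table; otherwise the O(1)-lookup table wins.
    # Thresholds below are illustrative only.
    if len(comps) > 1_000_000 and len(values) <= 26:
        return np.in1d(comps, values)
    table = set(values.tolist())  # stand-in for pandas' khash table
    return np.fromiter((c in table for c in comps), dtype=bool, count=len(comps))
```

Example: `isin_sketch(np.array([1, 2, 3, 4]), np.array([2, 4, 9]))` takes the hash-table branch and returns `[False, True, False, True]`.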
@realead ping when ready here
thanks @realead very nice!
```shell
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
```
numpy's `in1d` uses look-up in O(log n), compared to pandas' O(1), so for a large number of unique elements in `values` the pandas solution will become better.

Here is a comparison in the performance benchmark: it looks as if for more than 10 `values`-elements, pandas' approach is faster.

See also this comment (#22205 (comment)).

The question is whether it really makes sense to keep numpy's approach for fewer than 10 `values`-elements.
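The trade-off can be probed with a quick micro-benchmark sketch (sizes and repeat counts are arbitrary choices of mine; note that newer NumPy deprecates `np.in1d` in favor of `np.isin`):

```python
import timeit
import numpy as np
import pandas as pd

# A series of random candidates and look-up sets of increasing size.
comps = pd.Series(np.random.randint(0, 10**6, 10**5))

for n_values in (5, 100, 10_000):
    values = np.random.randint(0, 10**6, n_values)
    # sort-based numpy path vs. pandas' hash-table-backed isin
    t_np = timeit.timeit(lambda: np.in1d(comps.to_numpy(), values), number=5)
    t_pd = timeit.timeit(lambda: comps.isin(values), number=5)
    print(f"{n_values:>6} values: np.in1d {t_np:.3f}s   Series.isin {t_pd:.3f}s")
```

Both paths must of course agree on the result; only the crossover point in timings is at stake.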