-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: hashtable for 16, 32, 128 bit dtypes #33287
Comments
Hmm not sure I understand the issue (probably due to my lack of any expertise in hashing) - Khash uses 32 bits for hashing . See #28303 (comment) |
At a high level this shows upm in core.algorithms._ensure_data where we cast any float/int/uint dtype to float64/int64/uint64. Also shows up in libindex and in Categorical._values_for_factorize where we do casting that ideally we'd avoid. A level lower down from that in hashtable_class_helper.pxi.in we define Float64HashTable, Int64Hashtable, and Uint64Hastable, but not 16 or 32 bit variants. AFAICT in order to implement the smaller variants we would need corresponding functions in khash |
I think the "32 bits" is about the values being used as hashes, not about the values being hashed. The different khash functions are initialized here: pandas/pandas/_libs/src/klib/khash.h Lines 563 to 566 in 265b842
So apparently we do have 32bit signed integer (we just don't have a HashTable that uses it). What it would take to add other bits to this, I don't directly know. The original klib/khash also only has those 4 defined. I suppose one needs to define a hash function per type (but for lower bits as int64/int32 this might be relatively straightforward). The question might also be to what extent we want to deviate a lot from the original code (maybe we already did though). |
If we have a proper hash function 32bit->32bit (an example could be https://github.com/realead/cykhash/blob/master/src/cykhash/murmurhash.pxi#L14), we just need to do the same dance as for float64-map, here an example basically replacing 64 through 32.
Then this dance has to be continued in Lines 103 to 117 in 0cf3491
after which float32-map can be added here: pandas/pandas/_libs/hashtable_class_helper.pxi.in Lines 25 to 30 in 0cf3491
There is already an Lines 119 to 133 in 0cf3491
but I would improve upon currently used definition: pandas/pandas/_libs/src/klib/khash.h Lines 515 to 516 in 0cf3491
because the hash-function is just the identity: pandas/pandas/_libs/src/klib/khash.h Line 325 in 0cf3491
which can easily lead to O(n^2) behavior for such series like 0,1,2,3,4,5,6 and so on. |
@realead is there a viable way to extend what we have now to any of {float128, complex64, complex128}? |
@jbrockmendel I think complex128 is the most straight forward case right now:
complex64 would be the same as above, once the float32-version is available. I have never worked with np.float128, so don't really know how it supposed to work on x86 or x86_64 (or actually what it means, because Cython maps np.float128 onto |
@realead is this closeable? |
@jbrockmendel it depends on how serious you are about Other types are done. |
float128 is barely supported so let's close this |
@jorisvandenbossche has pointed out in a couple of threads recently that the hashtables we use only support 64 bit float/uint/int dtypes. As a result, whenever we need to factorize a smaller dtype, we have to copy+upcast.
khash.pxd looks like it has int32 functions; might there be uint32 or float32 variants available somewhere in the C code? If not, what would it take to implement them? cc @WillAyd @chris-b1
The text was updated successfully, but these errors were encountered: