PERF: hashtable for 16, 32, 128 bit dtypes #33287

jbrockmendel · 2020-04-04T17:33:08Z

@jorisvandenbossche has pointed out in a couple of threads recently that the hashtables we use only support 64 bit float/uint/int dtypes. As a result, whenever we need to factorize a smaller dtype, we have to copy+upcast.

khash.pxd looks like it has int32 functions; might there be uint32 or float32 variants available somewhere in the C code? If not, what would it take to implement them? cc @WillAyd @chris-b1

WillAyd · 2020-04-04T21:05:01Z

Hmm not sure I understand the issue (probably due to my lack of any expertise in hashing) - Khash uses 32 bits for hashing . See #28303 (comment)

jbrockmendel · 2020-04-04T21:55:09Z

Hmm not sure I understand the issue (probably due to my lack of any expertise in hashing)

At a high level this shows upm in core.algorithms._ensure_data where we cast any float/int/uint dtype to float64/int64/uint64. Also shows up in libindex and in Categorical._values_for_factorize where we do casting that ideally we'd avoid.

A level lower down from that in hashtable_class_helper.pxi.in we define Float64HashTable, Int64Hashtable, and Uint64Hastable, but not 16 or 32 bit variants. AFAICT in order to implement the smaller variants we would need corresponding functions in khash

jorisvandenbossche · 2020-04-05T07:28:50Z

I think the "32 bits" is about the values being used as hashes, not about the values being hashed.

The different khash functions are initialized here:

pandas/pandas/_libs/src/klib/khash.h

Lines 563 to 566 in 265b842

    
           KHASH_MAP_INIT_STR(str, size_t) 
        
           KHASH_MAP_INIT_INT(int32, size_t) 
        
           KHASH_MAP_INIT_INT64(int64, size_t) 
        
           KHASH_MAP_INIT_UINT64(uint64, size_t)

So apparently we do have 32bit signed integer (we just don't have a HashTable that uses it). What it would take to add other bits to this, I don't directly know. The original klib/khash also only has those 4 defined. I suppose one needs to define a hash function per type (but for lower bits as int64/int32 this might be relatively straightforward). The question might also be to what extent we want to deviate a lot from the original code (maybe we already did though).

realead · 2020-09-30T23:55:41Z

@jbrockmendel

If we have a proper hash function 32bit->32bit (an example could be https://github.com/realead/cykhash/blob/master/src/cykhash/murmurhash.pxi#L14), we just need to do the same dance as for float64-map, here an example basically replacing 64 through 32.

#define ZERO_HASH 0
#define NAN_HASH  0

khint32_t PANDAS_INLINE kh_float32_hash_func(double val) {
    // 0.0 and -0.0 should have the same hash:
    if (val == 0.0) {
        return ZERO_HASH;
    }
    // all nans should have the same hash:
    if ( val != val ) {
        return NAN_HASH;
    }
    khint32_t as_int = asint32(val);
    return HASH_FOR_32BIT(as_int);
}

#define kh_float32_hash_equal(a, b) ((a) == (b) || ((b) != (b) && (a) != (a)))

#define KHASH_MAP_INIT_FLOAT32(name, khval_t)								\
	KHASH_INIT(name, khfloat32_t, khval_t, 1, kh_float32_hash_func, kh_float32_hash_equal)

KHASH_MAP_INIT_FLOAT32(float32, size_t)

Then this dance has to be continued in khash.pxi:

pandas/pandas/_libs/khash.pxd

Lines 103 to 117 in 0cf3491

    
           ctypedef struct kh_float64_t: 
        
               khint_t n_buckets, size, n_occupied, upper_bound 
        
               uint32_t *flags 
        
               float64_t *keys 
        
               size_t *vals 
        
           kh_float64_t* kh_init_float64() nogil 
        
           void kh_destroy_float64(kh_float64_t*) nogil 
        
           void kh_clear_float64(kh_float64_t*) nogil 
        
           khint_t kh_get_float64(kh_float64_t*, float64_t) nogil 
        
           void kh_resize_float64(kh_float64_t*, khint_t) nogil 
        
           khint_t kh_put_float64(kh_float64_t*, float64_t, int*) nogil 
        
           void kh_del_float64(kh_float64_t*, khint_t) nogil 
        
           bint kh_exist_float64(kh_float64_t*, khiter_t) nogil

after which float32-map can be added here:

pandas/pandas/_libs/hashtable_class_helper.pxi.in

Lines 25 to 30 in 0cf3491

    
           dtypes = [('Float64', 'float64', 'float64_t'), 
        
                     ('Int64', 'int64', 'int64_t'), 
        
                     ('String', 'string', 'char *'), 
        
                     ('UInt64', 'uint64', 'uint64_t')] 
        
           }}

and in some other places, which should work more or less out of the box.

There is already an int32-implementation in pandas:

pandas/pandas/_libs/khash.pxd

Lines 119 to 133 in 0cf3491

    
           ctypedef struct kh_int32_t: 
        
               khint_t n_buckets, size, n_occupied, upper_bound 
        
               uint32_t *flags 
        
               int32_t *keys 
        
               size_t *vals 
        
           kh_int32_t* kh_init_int32() nogil 
        
           void kh_destroy_int32(kh_int32_t*) nogil 
        
           void kh_clear_int32(kh_int32_t*) nogil 
        
           khint_t kh_get_int32(kh_int32_t*, int32_t) nogil 
        
           void kh_resize_int32(kh_int32_t*, khint_t) nogil 
        
           khint_t kh_put_int32(kh_int32_t*, int32_t, int*) nogil 
        
           void kh_del_int32(kh_int32_t*, khint_t) nogil 
        
           bint kh_exist_int32(kh_int32_t*, khiter_t) nogil

but I would improve upon currently used definition:

pandas/pandas/_libs/src/klib/khash.h

Lines 515 to 516 in 0cf3491

    
           #define KHASH_MAP_INIT_INT(name, khval_t)								\ 
        
           	KHASH_INIT(name, khint32_t, khval_t, 1, kh_int_hash_func, kh_int_hash_equal)

because the hash-function is just the identity:

pandas/pandas/_libs/src/klib/khash.h

Line 325 in 0cf3491

#define kh_int_hash_func(key) (khint32_t)(key)

which can easily lead to O(n^2) behavior for such series like 0,1,2,3,4,5,6 and so on.

jbrockmendel · 2020-11-03T02:25:08Z

@realead is there a viable way to extend what we have now to any of {float128, complex64, complex128}?

realead · 2020-11-03T06:38:17Z

@jbrockmendel I think complex128 is the most straight forward case right now:

for the most cases we can define "equal" as float64-"equal" for real and imaginary part. There are some strange rules concerning complex numbers (C99 Appendix G), so in order to be consistent the simple approach might not be enough.
as hash function one could xor the hash values from real and imaginary parts
the tricky part is to get complex to work in C code on Linux, Windows and with Cython at the same time (VisualStudio doesn't fully support C99, Cython falls back to its own implementation on Windows, but it is not easy to access from C-code). But maybe I'm overcomplicating and it works out-of-the-box.

complex64 would be the same as above, once the float32-version is available.

I have never worked with np.float128, so don't really know how it supposed to work on x86 or x86_64 (or actually what it means, because Cython maps np.float128 onto long double which is only 80bit on x86), so cannot say what problems there could be with np.float128.

jbrockmendel · 2021-01-02T17:54:21Z

@realead is this closeable?

realead · 2021-01-02T19:13:39Z

@jbrockmendel it depends on how serious you are about float128. I have no experience with float128 - so it will take some time for me to understand its behavior on Linux/Windows and to come up with a solution.

Other types are done.

jreback · 2021-01-02T19:31:35Z

float128 is barely supported so let's close this

jbrockmendel added the Performance Memory or execution speed performance label Apr 4, 2020

jbrockmendel mentioned this issue Sep 29, 2020

PERF: using murmur hash for float64 khash-tables #36729

Merged

5 tasks

jbrockmendel changed the title ~~PERF: hashtable for 16, 32 bit dtypes~~ PERF: hashtable for 16, 32, 128 bit dtypes Nov 3, 2020

jbrockmendel mentioned this issue Nov 16, 2020

PERF: Always using panda's hashtable approach, dropping np.in1d #36611

Merged

4 tasks

realead mentioned this issue Nov 17, 2020

PERF: Introducing HashTables for datatypes with 8,16 and 32 bits #37920

Merged

4 tasks

realead mentioned this issue Nov 30, 2020

PERF: Introducing hash tables for complex64 and complex128 #38179

Merged

4 tasks

jreback added this to the No action milestone Jan 2, 2021

jreback closed this as completed Jan 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: hashtable for 16, 32, 128 bit dtypes #33287

PERF: hashtable for 16, 32, 128 bit dtypes #33287

jbrockmendel commented Apr 4, 2020

WillAyd commented Apr 4, 2020

jbrockmendel commented Apr 4, 2020

jorisvandenbossche commented Apr 5, 2020

realead commented Sep 30, 2020 •

edited

Loading

jbrockmendel commented Nov 3, 2020

realead commented Nov 3, 2020

jbrockmendel commented Jan 2, 2021

realead commented Jan 2, 2021

jreback commented Jan 2, 2021

PERF: hashtable for 16, 32, 128 bit dtypes #33287

PERF: hashtable for 16, 32, 128 bit dtypes #33287

Comments

jbrockmendel commented Apr 4, 2020

WillAyd commented Apr 4, 2020

jbrockmendel commented Apr 4, 2020

jorisvandenbossche commented Apr 5, 2020

realead commented Sep 30, 2020 • edited Loading

jbrockmendel commented Nov 3, 2020

realead commented Nov 3, 2020

jbrockmendel commented Jan 2, 2021

realead commented Jan 2, 2021

jreback commented Jan 2, 2021

realead commented Sep 30, 2020 •

edited

Loading