Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: hashtable for 16, 32, 128 bit dtypes #33287

Closed
jbrockmendel opened this issue Apr 4, 2020 · 9 comments
Closed

PERF: hashtable for 16, 32, 128 bit dtypes #33287

jbrockmendel opened this issue Apr 4, 2020 · 9 comments
Labels
Performance Memory or execution speed performance

Comments

@jbrockmendel
Copy link
Member

@jorisvandenbossche has pointed out in a couple of threads recently that the hashtables we use only support 64 bit float/uint/int dtypes. As a result, whenever we need to factorize a smaller dtype, we have to copy+upcast.

khash.pxd looks like it has int32 functions; might there be uint32 or float32 variants available somewhere in the C code? If not, what would it take to implement them? cc @WillAyd @chris-b1

@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Apr 4, 2020
@WillAyd
Copy link
Member

WillAyd commented Apr 4, 2020

Hmm not sure I understand the issue (probably due to my lack of any expertise in hashing) - Khash uses 32 bits for hashing . See #28303 (comment)

@jbrockmendel
Copy link
Member Author

Hmm not sure I understand the issue (probably due to my lack of any expertise in hashing)

At a high level this shows upm in core.algorithms._ensure_data where we cast any float/int/uint dtype to float64/int64/uint64. Also shows up in libindex and in Categorical._values_for_factorize where we do casting that ideally we'd avoid.

A level lower down from that in hashtable_class_helper.pxi.in we define Float64HashTable, Int64Hashtable, and Uint64Hastable, but not 16 or 32 bit variants. AFAICT in order to implement the smaller variants we would need corresponding functions in khash

@jorisvandenbossche
Copy link
Member

I think the "32 bits" is about the values being used as hashes, not about the values being hashed.

The different khash functions are initialized here:

KHASH_MAP_INIT_STR(str, size_t)
KHASH_MAP_INIT_INT(int32, size_t)
KHASH_MAP_INIT_INT64(int64, size_t)
KHASH_MAP_INIT_UINT64(uint64, size_t)

So apparently we do have 32bit signed integer (we just don't have a HashTable that uses it). What it would take to add other bits to this, I don't directly know. The original klib/khash also only has those 4 defined. I suppose one needs to define a hash function per type (but for lower bits as int64/int32 this might be relatively straightforward). The question might also be to what extent we want to deviate a lot from the original code (maybe we already did though).

@realead
Copy link
Contributor

realead commented Sep 30, 2020

@jbrockmendel

If we have a proper hash function 32bit->32bit (an example could be https://github.com/realead/cykhash/blob/master/src/cykhash/murmurhash.pxi#L14), we just need to do the same dance as for float64-map, here an example basically replacing 64 through 32.

#define ZERO_HASH 0
#define NAN_HASH  0

khint32_t PANDAS_INLINE kh_float32_hash_func(double val) {
    // 0.0 and -0.0 should have the same hash:
    if (val == 0.0) {
        return ZERO_HASH;
    }
    // all nans should have the same hash:
    if ( val != val ) {
        return NAN_HASH;
    }
    khint32_t as_int = asint32(val);
    return HASH_FOR_32BIT(as_int);
}

#define kh_float32_hash_equal(a, b) ((a) == (b) || ((b) != (b) && (a) != (a)))

#define KHASH_MAP_INIT_FLOAT32(name, khval_t)								\
	KHASH_INIT(name, khfloat32_t, khval_t, 1, kh_float32_hash_func, kh_float32_hash_equal)

KHASH_MAP_INIT_FLOAT32(float32, size_t)

Then this dance has to be continued in khash.pxi:

ctypedef struct kh_float64_t:
khint_t n_buckets, size, n_occupied, upper_bound
uint32_t *flags
float64_t *keys
size_t *vals
kh_float64_t* kh_init_float64() nogil
void kh_destroy_float64(kh_float64_t*) nogil
void kh_clear_float64(kh_float64_t*) nogil
khint_t kh_get_float64(kh_float64_t*, float64_t) nogil
void kh_resize_float64(kh_float64_t*, khint_t) nogil
khint_t kh_put_float64(kh_float64_t*, float64_t, int*) nogil
void kh_del_float64(kh_float64_t*, khint_t) nogil
bint kh_exist_float64(kh_float64_t*, khiter_t) nogil

after which float32-map can be added here:

dtypes = [('Float64', 'float64', 'float64_t'),
('Int64', 'int64', 'int64_t'),
('String', 'string', 'char *'),
('UInt64', 'uint64', 'uint64_t')]
}}
and in some other places, which should work more or less out of the box.


There is already an int32-implementation in pandas:

ctypedef struct kh_int32_t:
khint_t n_buckets, size, n_occupied, upper_bound
uint32_t *flags
int32_t *keys
size_t *vals
kh_int32_t* kh_init_int32() nogil
void kh_destroy_int32(kh_int32_t*) nogil
void kh_clear_int32(kh_int32_t*) nogil
khint_t kh_get_int32(kh_int32_t*, int32_t) nogil
void kh_resize_int32(kh_int32_t*, khint_t) nogil
khint_t kh_put_int32(kh_int32_t*, int32_t, int*) nogil
void kh_del_int32(kh_int32_t*, khint_t) nogil
bint kh_exist_int32(kh_int32_t*, khiter_t) nogil

but I would improve upon currently used definition:

#define KHASH_MAP_INIT_INT(name, khval_t) \
KHASH_INIT(name, khint32_t, khval_t, 1, kh_int_hash_func, kh_int_hash_equal)

because the hash-function is just the identity:

#define kh_int_hash_func(key) (khint32_t)(key)

which can easily lead to O(n^2) behavior for such series like 0,1,2,3,4,5,6 and so on.

@jbrockmendel jbrockmendel changed the title PERF: hashtable for 16, 32 bit dtypes PERF: hashtable for 16, 32, 128 bit dtypes Nov 3, 2020
@jbrockmendel
Copy link
Member Author

@realead is there a viable way to extend what we have now to any of {float128, complex64, complex128}?

@realead
Copy link
Contributor

realead commented Nov 3, 2020

@jbrockmendel I think complex128 is the most straight forward case right now:

  • for the most cases we can define "equal" as float64-"equal" for real and imaginary part. There are some strange rules concerning complex numbers (C99 Appendix G), so in order to be consistent the simple approach might not be enough.
  • as hash function one could xor the hash values from real and imaginary parts
  • the tricky part is to get complex to work in C code on Linux, Windows and with Cython at the same time (VisualStudio doesn't fully support C99, Cython falls back to its own implementation on Windows, but it is not easy to access from C-code). But maybe I'm overcomplicating and it works out-of-the-box.

complex64 would be the same as above, once the float32-version is available.

I have never worked with np.float128, so don't really know how it supposed to work on x86 or x86_64 (or actually what it means, because Cython maps np.float128 onto long double which is only 80bit on x86), so cannot say what problems there could be with np.float128.

@jbrockmendel
Copy link
Member Author

@realead is this closeable?

@realead
Copy link
Contributor

realead commented Jan 2, 2021

@jbrockmendel it depends on how serious you are about float128. I have no experience with float128 - so it will take some time for me to understand its behavior on Linux/Windows and to come up with a solution.

Other types are done.

@jreback
Copy link
Contributor

jreback commented Jan 2, 2021

float128 is barely supported so let's close this

@jreback jreback added this to the No action milestone Jan 2, 2021
@jreback jreback closed this as completed Jan 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

5 participants