Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Series.unique converts uint64 to int64 (with overflow) #14721

Closed
dnspies opened this issue Nov 23, 2016 · 16 comments
Closed

Series.unique converts uint64 to int64 (with overflow) #14721

dnspies opened this issue Nov 23, 2016 · 16 comments
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Milestone

Comments

@dnspies
Copy link

dnspies commented Nov 23, 2016

>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series([1,2,2**63], dtype=np.dtype('uint64'))
>>> s
0                      1
1                      2
2    9223372036854775808
dtype: uint64
>>> s.unique()
array([                   1,                    2, -9223372036854775808])

Series.unique should preserve whatever type the values are. But instead it changes from uint64 to int64.

INSTALLED VERSIONS

...
python: 3.4.3.final.0
python-bits: 64
...

...
pandas: 0.17.1
numpy: 1.10.4
...

@jreback
Copy link
Contributor

jreback commented Nov 23, 2016

this is a sympton of #4471 in general uint64 support is not great. PR's are welcome to fix that!

@jreback jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions Difficulty Intermediate labels Nov 23, 2016
@jreback jreback added this to the Next Major Release milestone Nov 23, 2016
@gfyoung
Copy link
Member

gfyoung commented Nov 23, 2016

@jreback : IIUC, your statement is incorrect. The bug traces back as follows:

When you call Series.unique(), it calls the super-class implementation here. That method then calls unique1d() here. The erroneous check is here, which catches np.uint64 but then incorrectly converts them to int64.

The correct patch I think is to change the check to np.signedinteger and then have a separate one for np.unsignedinteger. However, I cannot find a proper implementation of a uint64 hashtable. In addition, things seem to get insanely complicated as we move to hashtable.pyx (and related helper .pxi.in files) khash.pxd, and khash.h, which setup all of the hashtable functionality.

The easy way out but less desirable way I think is to modify this branch here and convert the resulting array back to the original dtype.

Thoughts?

@jreback
Copy link
Contributor

jreback commented Nov 23, 2016

@gfyoung I was pointing to a 'generic' issue with uint64. it doesn't work in lots of places!

@gfyoung
Copy link
Member

gfyoung commented Nov 23, 2016

@jreback : Ah, fair enough. But thoughts about this?

@jreback
Copy link
Contributor

jreback commented Nov 23, 2016

However, I cannot find a proper implementation of a uint64 hashtable. In addition, things seem to get insanely complicated as we move to hashtable.pyx (and related helper .pxi.in files) khash.pxd, and khash.h, which setup all of the hashtable functionality.

hmm

@gfyoung
Copy link
Member

gfyoung commented Nov 23, 2016

@jreback : If someone knows how to navigate through that swamp of hashtable code, then by all means, patch it! 😄 But yes, it's just a matter of adding another hashtable class.

@jreback
Copy link
Contributor

jreback commented Nov 23, 2016

you can use python hashing, no?

In [9]: hash(np.uint64(2**63+1))
Out[9]: 5

In [10]: hash(np.int64(2**63-1))
Out[10]: 3

@gfyoung
Copy link
Member

gfyoung commented Nov 23, 2016

@jreback : Perhaps, but then what's going on with this entire hashtable class? Why was it implemented that way if Python hashing is feasible?

@jreback
Copy link
Contributor

jreback commented Nov 23, 2016

prob not that hard to write some klib type stuff in cython

@jreback
Copy link
Contributor

jreback commented Nov 23, 2016

klib is >> faster than python for the types of things we do (and doesn't have as much overhead for pyobjects)

@jreback
Copy link
Contributor

jreback commented Nov 23, 2016

or just add to the klib stuff (might be easier)

@gfyoung
Copy link
Member

gfyoung commented Nov 23, 2016

@jreback : I think adding to the klib is the best solution, but I don't really understand how all of it works.

@jreback
Copy link
Contributor

jreback commented Nov 23, 2016

me neither :<

@jreback
Copy link
Contributor

jreback commented Nov 23, 2016

its just template stuff in a header only library

@jreback
Copy link
Contributor

jreback commented Nov 23, 2016

you can prob copy int64 and just do a replace. (maybe)

@gfyoung
Copy link
Member

gfyoung commented Nov 23, 2016

@jreback : True that file is, but I have no idea how it gets populated.

gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 19, 2016
Introduces a UInt64HashTable class to hash
uint64 elements and prevent overflow.

Closes pandas-devgh-14721.
gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 19, 2016
Introduces a UInt64HashTable class to hash
uint64 elements and prevent overflow in
functions like Series.unique.

Closes pandas-devgh-14721.
gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 19, 2016
Introduces a UInt64HashTable class to hash
uint64 elements and prevent overflow in
functions like Series.unique.

Closes pandas-devgh-14721.
@jreback jreback modified the milestones: 0.20.0, Next Major Release Dec 19, 2016
gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 20, 2016
Introduces a UInt64HashTable class to hash
uint64 elements and prevent overflow in
functions like Series.unique.

Closes pandas-devgh-14721.
ShaharBental pushed a commit to ShaharBental/pandas that referenced this issue Dec 26, 2016
Introduces a `UInt64HashTable` class to hash `uint64` elements and
prevent overflow in functions like `Series.unique`.    Closes pandas-dev#14721.

Author: gfyoung <gfyoung17@gmail.com>

Closes pandas-dev#14915 from gfyoung/uint64-hashtable-patch and squashes the following commits:

380c580 [gfyoung] BUG: Prevent uint64 overflow in Series.unique
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

No branches or pull requests

3 participants