Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Prevent uint64 overflow in Series.unique #14915

Closed

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Dec 19, 2016

Introduces a UInt64HashTable class to hash uint64 elements and prevent overflow in functions like Series.unique.

Closes #14721.

@sinhrks sinhrks added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug labels Dec 19, 2016
@codecov-io
Copy link

codecov-io commented Dec 19, 2016

Current coverage is 84.65% (diff: 100%)

Merging #14915 into master will increase coverage by <.01%

@@             master     #14915   diff @@
==========================================
  Files           144        144          
  Lines         51016      51020     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43184      43189     +5   
+ Misses         7832       7831     -1   
  Partials          0          0          

Powered by Codecov. Last update 3ccb501...380c580

np.arange(len(xs), dtype=np.int64))

def test_get_unique(self):
s = pd.Series([1, 2, 2**63, 2**63], dtype=np.uint64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think needs a bit more testing
most test are in algos for things like this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Needs a bit more testing" is vague. What sort of cases are you thinking of? Secondly, the affected function has no relation to algos. The code path is completely different.

However, you did catch a bug in pd.unique, which does travel through algorithms.py and can now benefit from my hashtable. 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not on the hashtable itself. ideally you should add uint64 to lots of places that we test int64 (construction type things). IOW, to actually exercise this code. I know this is vague. happy to merge this and to do a followup with more tests (and potentially breaking things). later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, well that's a larger issue beyond this PR. However, let's not merge until I patch pd.unique, which suffers from a similar issue as Series.unique.

@jreback
Copy link
Contributor

jreback commented Dec 19, 2016

you will need to rebase after #14919. I moved a couple of things, and added a HashTable testing section.

Though what I would really like is to systematically add unit64 to as many places as we test other dtypes and see what breaks / falls out (can do in another PR). Then create issues for incompat places.

@gfyoung gfyoung force-pushed the uint64-hashtable-patch branch from 3e21ad6 to 03e926e Compare December 19, 2016 19:20
@gfyoung gfyoung closed this Dec 19, 2016
@gfyoung gfyoung deleted the uint64-hashtable-patch branch December 19, 2016 19:22
@gfyoung gfyoung restored the uint64-hashtable-patch branch December 19, 2016 19:22
@gfyoung gfyoung reopened this Dec 19, 2016
@jreback
Copy link
Contributor

jreback commented Dec 19, 2016

what was the bug (from before I moved unique)?

@gfyoung
Copy link
Member Author

gfyoung commented Dec 19, 2016

@jreback : The diff explains it all. We were smashing all integer types together into one hashtable, which does not work if you have uint.

@jreback jreback added this to the 0.20.0 milestone Dec 19, 2016
@jreback
Copy link
Contributor

jreback commented Dec 19, 2016

ok, lgtm. ping on green.

@gfyoung
Copy link
Member Author

gfyoung commented Dec 19, 2016

@jreback : Sounds good. I changed my mind about changing pd.unique in this PR. core.algorithms is riddled with uint64 bugs that need to be checked. I'll tackle in a follow-up.

@jreback
Copy link
Contributor

jreback commented Dec 19, 2016

@jreback : Sounds good. I changed my mind about changing pd.unique in this PR. core.algorithms is riddled with uint64 bugs that need to be checked. I'll tackle in a follow-up.

totally fine. we can make a master issue of these if that is helpful.

@gfyoung gfyoung force-pushed the uint64-hashtable-patch branch from 03e926e to 1583235 Compare December 20, 2016 00:15
@gfyoung gfyoung closed this Dec 20, 2016
@gfyoung gfyoung force-pushed the uint64-hashtable-patch branch from 1583235 to 8e630b6 Compare December 20, 2016 00:15
Introduces a UInt64HashTable class to hash
uint64 elements and prevent overflow in
functions like Series.unique.

Closes pandas-devgh-14721.
@gfyoung gfyoung reopened this Dec 20, 2016
@gfyoung
Copy link
Member Author

gfyoung commented Dec 20, 2016

@jreback : Everything is green, so ready to merge if there are no other concerns.

@jreback jreback closed this in b35c689 Dec 20, 2016
@jreback
Copy link
Contributor

jreback commented Dec 20, 2016

thanks!

as I said, ideally have an issue which greatly expands test coverage for uint64 (and just mark the tests that are failing (e.g. assert that they are failing or comment out)).

Then can come back around and fix.

@gfyoung gfyoung deleted the uint64-hashtable-patch branch December 20, 2016 14:39
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 20, 2016
Uses UInt64HashTable to patch a uint64
overflow bug in pd.unique analogous to
that seen in Series.unique (patched in
pandas-devgh-14915).
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 20, 2016
Uses UInt64HashTable to patch a uint64
overflow bug in pd.unique analogous to
that seen in Series.unique (patched in
pandas-devgh-14915).
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 20, 2016
Uses UInt64HashTable to patch a uint64
overflow bug in pd.unique analogous to
that seen in Series.unique (patched in
pandas-devgh-14915).
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 20, 2016
Uses UInt64HashTable to patch a uint64
overflow bug in pd.unique analogous to
that seen in Series.unique (patched in
pandas-devgh-14915).
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 21, 2016
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
pandas-devgh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 21, 2016
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
pandas-devgh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 21, 2016
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
pandas-devgh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 22, 2016
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
pandas-devgh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 22, 2016
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
pandas-devgh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 22, 2016
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
pandas-devgh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 22, 2016
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
pandas-devgh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 23, 2016
1) duplicated()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and refactors to use duplicated_uint64.

2) mode()

Updates documentation to describe the "values"
parameter in the signature, adds tests for uint64,
and reactors to use mode_uint64.

3) unique()

Uses UInt64HashTable to patch a uint64 overflow bug
analogous to that seen in Series.unique (patched in
pandas-devgh-14915).

4) Types API

Introduces "is_signed_integer_dtype" and "is_unsigned
_integer_dtype" to the public API. Used in refactoring/
patching of 1-3.
jreback pushed a commit that referenced this pull request Dec 24, 2016
Adds `uint64` ranking functions to `algos.pyx` to allow for proper
ranking with `uint64`.    Also introduces partial patch for
`factorize()` by adding `uint64` hashtables and vectors for  usage.
However, this patch is only partial because the larger bug of non-
support for `uint64`  in `Index` has not been fixed (**UPDATE**:
tackled in #14937):  ~~~python  >>> from pandas import Index, np  >>>
Index(np.array([2**63], dtype=np.uint64))
Int64Index([-9223372036854775808], dtype='int64')  ~~~    Also patches
a bug in `UInt64HashTable` from #14915 that had an erroneous null
condition that was caught during testing and was hence removed.

Author: gfyoung <gfyoung17@gmail.com>

Closes #14935 from gfyoung/core-algorithms-uint64-two and squashes the following commits:

2598cea [gfyoung] BUG: Patch rank() uint64 behavior
ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016
Introduces a `UInt64HashTable` class to hash `uint64` elements and
prevent overflow in functions like `Series.unique`.    Closes pandas-dev#14721.

Author: gfyoung <gfyoung17@gmail.com>

Closes pandas-dev#14915 from gfyoung/uint64-hashtable-patch and squashes the following commits:

380c580 [gfyoung] BUG: Prevent uint64 overflow in Series.unique
ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016
Adds `uint64` ranking functions to `algos.pyx` to allow for proper
ranking with `uint64`.    Also introduces partial patch for
`factorize()` by adding `uint64` hashtables and vectors for  usage.
However, this patch is only partial because the larger bug of non-
support for `uint64`  in `Index` has not been fixed (**UPDATE**:
tackled in pandas-dev#14937):  ~~~python  >>> from pandas import Index, np  >>>
Index(np.array([2**63], dtype=np.uint64))
Int64Index([-9223372036854775808], dtype='int64')  ~~~    Also patches
a bug in `UInt64HashTable` from pandas-dev#14915 that had an erroneous null
condition that was caught during testing and was hence removed.

Author: gfyoung <gfyoung17@gmail.com>

Closes pandas-dev#14935 from gfyoung/core-algorithms-uint64-two and squashes the following commits:

2598cea [gfyoung] BUG: Patch rank() uint64 behavior
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants