BUG: Prevent uint64 overflow in Series.unique #14915
Current coverage is 84.65% (diff: 100%)
@@ master #14915 diff @@
==========================================
Files 144 144
Lines 51016 51020 +4
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43184 43189 +5
+ Misses 7832 7831 -1
Partials 0 0
            np.arange(len(xs), dtype=np.int64))

    def test_get_unique(self):
        s = pd.Series([1, 2, 2**63, 2**63], dtype=np.uint64)
I think this needs a bit more testing. Most tests for things like this are in algos.
"Needs a bit more testing" is vague. What sort of cases are you thinking of? Secondly, the affected function has no relation to `algos`. The code path is completely different.
However, you did catch a bug in `pd.unique`, which does travel through `algorithms.py` and can now benefit from my hashtable. 😄
Not on the hashtable itself. Ideally you should add uint64 to lots of places where we test int64 (construction-type things), IOW, to actually exercise this code. I know this is vague. Happy to merge this and do a followup later with more tests (and potentially breaking things).
Okay, well that's a larger issue beyond this PR. However, let's not merge until I patch `pd.unique`, which suffers from a similar issue as `Series.unique`.
You will need to rebase after #14919. I moved a couple of things and added a HashTable testing section. What I would really like, though, is to systematically add uint64 to as many places as we test other dtypes and see what breaks or falls out (can do in another PR), then create issues for the incompatible places.
3e21ad6 to 03e926e
what was the bug (from before I moved unique)?
ok, lgtm. ping on green.
@jreback : Sounds good. I changed my mind about changing |
totally fine. we can make a master issue of these if that is helpful.
03e926e to 1583235
1583235 to 8e630b6
Introduces a UInt64HashTable class to hash uint64 elements and prevent overflow in functions like Series.unique. Closes pandas-devgh-14721.
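To make the overflow concrete, here is a minimal stdlib-only sketch. It is not the pandas implementation: it assumes the problem comes from routing `uint64` values through an `int64`-keyed table, and the helper names `as_int64` and `unique_unsigned` are hypothetical. It reuses the values from the test in this PR.

```python
import struct


def as_int64(value):
    """Reinterpret the 64 bits of an unsigned integer as a signed int64."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


def unique_unsigned(values):
    """Return distinct values in order of first appearance.

    A plain dict keyed on the unsigned value stands in for an
    unsigned hash table (hypothetical model, not pandas code).
    """
    seen = {}
    for v in values:
        if not 0 <= v < 2**64:
            raise OverflowError("value does not fit in uint64")
        seen.setdefault(v, None)
    return list(seen)


data = [1, 2, 2**63, 2**63]      # same values as test_get_unique
wrapped = as_int64(2**63)        # what a signed view of 2**63 looks like
uniques = unique_unsigned(data)  # keyed on the unsigned value directly
```

Here `wrapped` is negative because 2**63 is the smallest value that no longer fits in `int64`, while `uniques` keeps the large value intact.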
@jreback : Everything is green, so ready to merge if there are no other concerns.
thanks! as I said, ideally have an issue which greatly expands test coverage for uint64 (and just mark the tests that are failing (e.g. assert that they are failing or comment out)). Then can come back around and fix.
Uses UInt64HashTable to patch a uint64 overflow bug in pd.unique analogous to that seen in Series.unique (patched in pandas-devgh-14915).
1) duplicated(): Updates documentation to describe the "values" parameter in the signature, adds tests for uint64, and refactors to use duplicated_uint64.
2) mode(): Updates documentation to describe the "values" parameter in the signature, adds tests for uint64, and refactors to use mode_uint64.
3) unique(): Uses UInt64HashTable to patch a uint64 overflow bug analogous to that seen in Series.unique (patched in pandas-devgh-14915).
4) Types API: Introduces "is_signed_integer_dtype" and "is_unsigned_integer_dtype" to the public API. Used in refactoring/patching of 1-3.
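As a rough reference for what correct results look like once values above the `int64` maximum are handled, here is a plain-Python model of keep-first duplicate flagging and of the mode. It only models the expected semantics; it is not the `duplicated_uint64` or `mode_uint64` routines mentioned above, and the function names are hypothetical.

```python
from collections import Counter


def duplicated_first(values):
    """Flag every repeat of a value after its first occurrence."""
    seen = set()
    flags = []
    for v in values:
        flags.append(v in seen)
        seen.add(v)
    return flags


def modes(values):
    """Return the most frequent value(s), sorted ascending."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Values on both sides of the int64 boundary (2**63 - 1 is the int64 max).
data = [1, 2**63, 2**63, 2**64 - 1, 1, 2**63]
dup_flags = duplicated_first(data)
top_values = modes(data)
```

Tests for `uint64` support could compare library output against a model like this for inputs that include 2**63 and 2**64 - 1.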
Adds `uint64` ranking functions to `algos.pyx` to allow for proper ranking with `uint64`. Also introduces partial patch for `factorize()` by adding `uint64` hashtables and vectors for usage. However, this patch is only partial because the larger bug of non-support for `uint64` in `Index` has not been fixed (**UPDATE**: tackled in #14937):

~~~python
>>> from pandas import Index, np
>>> Index(np.array([2**63], dtype=np.uint64))
Int64Index([-9223372036854775808], dtype='int64')
~~~

Also patches a bug in `UInt64HashTable` from #14915 that had an erroneous null condition that was caught during testing and was hence removed.

Author: gfyoung <gfyoung17@gmail.com>

Closes #14935 from gfyoung/core-algorithms-uint64-two and squashes the following commits:

2598cea [gfyoung] BUG: Patch rank() uint64 behavior
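The signed reinterpretation visible in the `Index` output above also suggests why ranking needs dedicated `uint64` handling. The following stdlib-only sketch assumes the failure mode is a value at or above 2**63 being viewed as `int64`; it does not reproduce the code in `algos.pyx`, and `as_int64` and `ordinal_ranks` are hypothetical helpers.

```python
import struct


def as_int64(value):
    """Reinterpret the 64 bits of an unsigned integer as a signed int64."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


def ordinal_ranks(values):
    """1-based rank of each element by ascending value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


data = [1, 2**63, 5]
correct = ordinal_ranks(data)                         # unsigned comparison
wrapped = ordinal_ranks([as_int64(v) for v in data])  # signed view of same bits
```

With unsigned comparison the largest value ranks last; under the signed view it wraps to a negative number and ranks first.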
Introduces a `UInt64HashTable` class to hash `uint64` elements and prevent overflow in functions like `Series.unique`.

Closes pandas-dev#14721.

Author: gfyoung <gfyoung17@gmail.com>

Closes pandas-dev#14915 from gfyoung/uint64-hashtable-patch and squashes the following commits:

380c580 [gfyoung] BUG: Prevent uint64 overflow in Series.unique
Introduces a `UInt64HashTable` class to hash `uint64` elements and prevent overflow in functions like `Series.unique`.
Closes #14721.