-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: using dtype='int64' argument of Series causes ValueError: values cannot be losslessly cast to int64 for integer strings #45017
Conversation
pandas/core/dtypes/cast.py
Outdated
@@ -2096,6 +2096,9 @@ def maybe_cast_to_integer_array( | |||
) | |||
return casted | |||
|
|||
if all(np.dtype(i) is dtype for i in casted): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is going to be super-slow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can maybe update the comment on L2102 if we've found cases that get here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't recall any cases that reach the ValueError that were there. There was one case which the added code solved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not have a faster method to check whether all the cast for all the elements is performed correctly.
Checking the dtypes is one way but individually we cannot check all elements.
This is executed last, so in many cases we will not reach here. So a bit slow can do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah we need to be really sure about this check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the check as requested.
As dtype will be an integer dtype there is no need to explicitly check it we can check all integer dtypes. |
pandas/core/dtypes/cast.py
Outdated
@@ -2096,6 +2096,9 @@ def maybe_cast_to_integer_array( | |||
) | |||
return casted | |||
|
|||
if all(isinstance(i, (int, np.integer)) for i in casted): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can infer_dtype(casted) == 'integer'
i think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@jbrockmendel @jreback any updates? |
@@ -1810,6 +1810,18 @@ def test_constructor_bool_dtype_missing_values(self): | |||
expected = Series(True, index=[0], dtype="bool") | |||
tm.assert_series_equal(result, expected) | |||
|
|||
def test_constructor_int64_dtype(self): | |||
# GH-44923 | |||
result = Series(["0", "1", "2"], dtype="int64") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you come up with a test case that goes through 2099-2100 but that shouldn't get cast? maybe ["0", "1", "1.1"]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Series(["0", "1", 0, 1], dtype="int64") this raises the specified ValueError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But shouldn't this get cast to int64
as all can be type casted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I add the above example and change the code to fit it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel there is no such case, so it is good to keep that line as it keeps the base covered.
pandas/core/dtypes/cast.py
Outdated
@@ -2096,6 +2096,9 @@ def maybe_cast_to_integer_array( | |||
) | |||
return casted | |||
|
|||
if lib.infer_dtype(casted) == "integer": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this always holds. The condition we're interested in is whether the casting was lossy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, but this test currently passes the extra test added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do the added tests pass w/o this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes they do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, because this condition always holds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But before adding this check the testcase in the issue was not passing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But before adding this check the testcase in the issue was not passing.
That's a reason to add some check for this, but this particular check is not the right check. ATM this is equivalent to (but slower than) if True:
@jreback @jbrockmendel any update on this? |
@shubham11941140 there is not need to ping. you need to answer @jbrockmendel last question. I guess you tests is not specific enough (if it already passed) |
Actually I have added the discussion point to another test case, which might indicate this. I was asking to add another testcase which gives a lossless conversion. |
great |
Do I add it? |
does that test fail before your change? that's the key, does this replicate the original issue. |
It was currently failing the test. |
@@ -1810,6 +1810,18 @@ def test_constructor_bool_dtype_missing_values(self): | |||
expected = Series(True, index=[0], dtype="bool") | |||
tm.assert_series_equal(result, expected) | |||
|
|||
def test_constructor_int64_dtype(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls parameterize these. also add a case for uint64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a case for uint64 will cause it to crash as this will implicitly type cast to int64 leading to an Assertion Error with uint64.
Parametrization is completed
@jbrockmendel any update on this? I think I have covered everything. |
@@ -1810,6 +1810,19 @@ def test_constructor_bool_dtype_missing_values(self): | |||
expected = Series(True, index=[0], dtype="bool") | |||
tm.assert_series_equal(result, expected) | |||
|
|||
@pytest.mark.parametrize("int_dtype", ["int64"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use this fixture instead: any_int_dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
expected = Series([-1, 0, 1, 2]) | ||
tm.assert_series_equal(result, expected) | ||
|
||
def test_constructor_float64_dtype(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use any_float_dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pandas/core/dtypes/cast.py
Outdated
@@ -2096,6 +2096,9 @@ def maybe_cast_to_integer_array( | |||
) | |||
return casted | |||
|
|||
if lib.infer_dtype(casted) == "integer": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do the added tests pass w/o this change?
@@ -1810,6 +1810,20 @@ def test_constructor_bool_dtype_missing_values(self): | |||
expected = Series(True, index=[0], dtype="bool") | |||
tm.assert_series_equal(result, expected) | |||
|
|||
@pytest.mark.parametrize("any_int_dtype", ["int64"]) | |||
def test_constructor_int64_dtype(self, any_int_dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, pls just use the fixture itself, e.g. no parameterize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is causing Assertion Error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous code segment is leading to this issue, if we have only int64 there is no issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to match the expected value as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I think I have covered everything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shubham11941140 you are not using the fixtures pls do so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just remove the paramterize completely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint -> uint8, uint16, uint32, uint64 are failing due to internal code implementation. Do i fix this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback removed parametrization, now it should be ready.
@shubham11941140 you are touching some very particular code, i don't know how to proceed here. this PR keeps expanding in scope & code changes w/o test cases. |
I will explain it here in detail. It is the same test case in the
So, taking
I have written these 2 cases and they must be equal i.e. The issue is that when dtype=uint8 or for the matter of fact any dtype where If you look very carefully, However, there is no error in this case as strings can be implicitly cast to int by this function. This is done within the casted variable which is For this testcase as specified, I have created a try except block that prevents other tests from breaking and allows type casting of strings to unsigned integers without running into a In this manner, I am not running into an issue with the I hope this can now explain the code change I have done associated with the test case. If you are unable to understand anything further. I can schedule a call to explain it. |
@jreback do I need to give a more elaborate explanation? |
will take another look today. |
@jbrockmendel @jreback any update? |
The check on L2108 still needs to be changed to something meaningful. The check is about the casting not being lossy. |
How do I check whether it is a lossless change? Is there any construct? |
you need to construct a test explicitly for this |
@jreback @jbrockmendel . I have removed the obvious statement used for the return condition and added to the above condition. This is obviously checking the lossless condition and it is combined with the Overflow condition as previously written as if the user gives very large strings to be casted into small integers such as I think this will satisfy the requisite conditions now. |
pandas/core/dtypes/cast.py
Outdated
@@ -2074,7 +2083,7 @@ def maybe_cast_to_integer_array( | |||
if is_object_dtype(arr.dtype): | |||
raise ValueError("Trying to coerce float values to integers") | |||
|
|||
if casted.dtype < arr.dtype: | |||
if casted.dtype < arr.dtype or lib.infer_dtype(casted) < lib.infer_dtype(arr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lib.infer_dtype(casted) < lib.infer_dtype(arr)
im not sure what this is supposed to mean
raise OverflowError( | ||
"Trying to coerce negative values to unsigned integers" | ||
) | ||
except TypeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment here about what cases get here
|
lib.infer_dtype returns a string. Why is comparing two strings useful here? |
if the dtype is a string it is a situation of lossless cast, so that can be specified directly. |
@shubham11941140 if you want to merge master and update to comments |
Branch is updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shubham11941140 before you do any patches here, can you demonstrate failing cases in tests
I did not understand what do you want me to do? |
On running the query, it mentions a FutureWarning when comparing This kind of error is not coming from the changes I have made. |
The issue is that string comparison with integer is going past the limit of the NumPy comparison because of which it is failing. |
I have tested the build on my local machine and it is not failing, but 2 errors come here. |
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in main, address the comments in the code diff and we can reopen |
ValueError: values cannot be losslessly cast to int64
for integer strings #44923Added extra check before returning ValueError.