Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: using dtype='int64' argument of Series causes ValueError: values cannot be losslessly cast to int64 for integer strings #45017

Closed
wants to merge 15 commits into from

Conversation

shubham11941140
Copy link
Contributor

@shubham11941140 shubham11941140 commented Dec 22, 2021

Added extra check before returning ValueError.

@@ -2096,6 +2096,9 @@ def maybe_cast_to_integer_array(
)
return casted

if all(np.dtype(i) is dtype for i in casted):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is going to be super-slow

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can maybe update the comment on L2102 if we've found cases that get here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall any cases that reach the ValueError that were there. There was one case which the added code solved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not have a faster method to check whether all the cast for all the elements is performed correctly.

Checking the dtypes is one way but individually we cannot check all elements.

This is executed last, so in many cases we will not reach here. So a bit slow can do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah we need to be really sure about this check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the check as requested.

@shubham11941140
Copy link
Contributor Author

As dtype will be an integer dtype there is no need to explicitly check it we can check all integer dtypes.

@@ -2096,6 +2096,9 @@ def maybe_cast_to_integer_array(
)
return casted

if all(isinstance(i, (int, np.integer)) for i in casted):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can infer_dtype(casted) == 'integer' i think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@shubham11941140
Copy link
Contributor Author

@jbrockmendel @jreback any updates?

@@ -1810,6 +1810,18 @@ def test_constructor_bool_dtype_missing_values(self):
expected = Series(True, index=[0], dtype="bool")
tm.assert_series_equal(result, expected)

def test_constructor_int64_dtype(self):
# GH-44923
result = Series(["0", "1", "2"], dtype="int64")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you come up with a test case that goes through 2099-2100 but that shouldn't get cast? maybe ["0", "1", "1.1"]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Series(["0", "1", 0, 1], dtype="int64") this raises the specified ValueError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But shouldn't this get cast to int64 as all can be type casted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I add the above example and change the code to fit it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel there is no such case, so it is good to keep that line as it keeps the base covered.

@@ -2096,6 +2096,9 @@ def maybe_cast_to_integer_array(
)
return casted

if lib.infer_dtype(casted) == "integer":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this always holds. The condition we're interested in is whether the casting was lossy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but this test currently passes the extra test added.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do the added tests pass w/o this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes they do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, because this condition always holds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But before adding this check the testcase in the issue was not passing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But before adding this check the testcase in the issue was not passing.

That's a reason to add some check for this, but this particular check is not the right check. ATM this is equivalent to (but slower than) if True:

@shubham11941140
Copy link
Contributor Author

@jreback @jbrockmendel any update on this?

@jreback
Copy link
Contributor

jreback commented Dec 27, 2021

@jreback @jbrockmendel any update on this?

@shubham11941140 there is not need to ping. you need to answer @jbrockmendel last question. I guess you tests is not specific enough (if it already passed)

@shubham11941140
Copy link
Contributor Author

Actually I have added the discussion point to another test case, which might indicate this. I was asking to add another testcase which gives a lossless conversion.

@jreback
Copy link
Contributor

jreback commented Dec 27, 2021

Actually I have added the discussion point to another test case, which might indicate this. I was asking to add another testcase which gives a lossless conversion.

great

@shubham11941140
Copy link
Contributor Author

Do I add it?
The last testcase should be correct as the conversion is lossless including this one.

@jreback
Copy link
Contributor

jreback commented Dec 27, 2021

Do I add it? The last testcase should be correct as the conversion is lossless including this one.

does that test fail before your change? that's the key, does this replicate the original issue.

@shubham11941140
Copy link
Contributor Author

It was currently failing the test.
So I will add it

@@ -1810,6 +1810,18 @@ def test_constructor_bool_dtype_missing_values(self):
expected = Series(True, index=[0], dtype="bool")
tm.assert_series_equal(result, expected)

def test_constructor_int64_dtype(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls parameterize these. also add a case for uint64.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a case for uint64 will cause it to crash as this will implicitly type cast to int64 leading to an Assertion Error with uint64.

Parametrization is completed

@shubham11941140
Copy link
Contributor Author

@jbrockmendel any update on this? I think I have covered everything.

@@ -1810,6 +1810,19 @@ def test_constructor_bool_dtype_missing_values(self):
expected = Series(True, index=[0], dtype="bool")
tm.assert_series_equal(result, expected)

@pytest.mark.parametrize("int_dtype", ["int64"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use this fixture instead: any_int_dtype

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

expected = Series([-1, 0, 1, 2])
tm.assert_series_equal(result, expected)

def test_constructor_float64_dtype(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use any_float_dtype

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -2096,6 +2096,9 @@ def maybe_cast_to_integer_array(
)
return casted

if lib.infer_dtype(casted) == "integer":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do the added tests pass w/o this change?

@@ -1810,6 +1810,20 @@ def test_constructor_bool_dtype_missing_values(self):
expected = Series(True, index=[0], dtype="bool")
tm.assert_series_equal(result, expected)

@pytest.mark.parametrize("any_int_dtype", ["int64"])
def test_constructor_int64_dtype(self, any_int_dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, pls just use the fixture itself, e.g. no parameterize

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is causing Assertion Error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous code segment is leading to this issue, if we have only int64 there is no issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to match the expected value as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I think I have covered everything?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shubham11941140 you are not using the fixtures pls do so

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just remove the paramterize completely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uint -> uint8, uint16, uint32, uint64 are failing due to internal code implementation. Do i fix this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback removed parametrization, now it should be ready.

@jreback
Copy link
Contributor

jreback commented Jan 4, 2022

@jreback any update?

@shubham11941140 you are touching some very particular code, i don't know how to proceed here. this PR keeps expanding in scope & code changes w/o test cases.

@shubham11941140
Copy link
Contributor Author

I will explain it here in detail. It is the same test case in the test_constructors.py file

result = Series(["0", "1", "2"], dtype=any_int_dtype)
exp = Series([0, 1, 2], dtype=any_int_dtype)

any_int_dtype covers uint8 and every other unsigned integer dtype

So, taking uint8:

result = Series(["0", "1", "2"], dtype=uint8)
exp = Series([0, 1, 2], dtype=uint8)

I have written these 2 cases and they must be equal i.e. tm.assert_frame_equal must hold.

The issue is that when dtype=uint8 or for the matter of fact any dtype where is_unsigned_integer_dtype(dtype) holds, we reach the condition of (arr < 0).any().

If you look very carefully, arr = ["0", "1", "2"], our np.ndarray, that contains strings. The function (arr < 0).any() is a numpy function that DOES NOT ALLOW the comparison of strings to integers (As "0" is a str and 0 is int) and leads to a TypeError.

However, there is no error in this case as strings can be implicitly cast to int by this function.

This is done within the casted variable which is casted = [0, 1, 2] which is then casted into uint dtype. If we apply the function (casted < 0).any(), there should not be any error for the above specified testcase.

For this testcase as specified, I have created a try except block that prevents other tests from breaking and allows type casting of strings to unsigned integers without running into a TypeError of the NumPy function.

In this manner, I am not running into an issue with the unsigned_integer_dtype and can perform the implicit cast.

I hope this can now explain the code change I have done associated with the test case. If you are unable to understand anything further. I can schedule a call to explain it.

@jreback

@shubham11941140
Copy link
Contributor Author

@jreback do I need to give a more elaborate explanation?

@jbrockmendel
Copy link
Member

will take another look today.

@shubham11941140
Copy link
Contributor Author

@jbrockmendel @jreback any update?

@jbrockmendel
Copy link
Member

any update?

The check on L2108 still needs to be changed to something meaningful. The check is about the casting not being lossy.

@shubham11941140
Copy link
Contributor Author

How do I check whether it is a lossless change? Is there any construct?

@jreback
Copy link
Contributor

jreback commented Jan 16, 2022

How do I check whether it is a lossless change? Is there any construct?

you need to construct a test explicitly for this

@shubham11941140
Copy link
Contributor Author

@jreback @jbrockmendel . I have removed the obvious statement used for the return condition and added to the above condition.

This is obviously checking the lossless condition and it is combined with the Overflow condition as previously written as if the user gives very large strings to be casted into small integers such as int8, we will not run into a similar issue again.

I think this will satisfy the requisite conditions now.

@@ -2074,7 +2083,7 @@ def maybe_cast_to_integer_array(
if is_object_dtype(arr.dtype):
raise ValueError("Trying to coerce float values to integers")

if casted.dtype < arr.dtype:
if casted.dtype < arr.dtype or lib.infer_dtype(casted) < lib.infer_dtype(arr):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lib.infer_dtype(casted) < lib.infer_dtype(arr) im not sure what this is supposed to mean

raise OverflowError(
"Trying to coerce negative values to unsigned integers"
)
except TypeError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here about what cases get here

@shubham11941140
Copy link
Contributor Author

lib.infer_dtype(casted) < lib.infer_dtype(arr) it is the same like the extension of casted.dtype < arr.dtype to inferred dtypes as they are not properly initialized in the input. However they check the same as the condition of the lossless casting as the warning message means.

@jbrockmendel
Copy link
Member

lib.infer_dtype(casted) < lib.infer_dtype(arr) it is the same like the extension of casted.dtype < arr.dtype to inferred dtypes as they are not properly initialized in the input. However they check the same as the condition of the lossless casting as the warning message means.

lib.infer_dtype returns a string. Why is comparing two strings useful here?

@shubham11941140
Copy link
Contributor Author

if the dtype is a string it is a situation of lossless cast, so that can be specified directly.

@jreback
Copy link
Contributor

jreback commented Mar 6, 2022

@shubham11941140 if you want to merge master and update to comments

@shubham11941140
Copy link
Contributor Author

Branch is updated.

@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Mar 7, 2022
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shubham11941140 before you do any patches here, can you demonstrate failing cases in tests

@shubham11941140
Copy link
Contributor Author

I did not understand what do you want me to do?

@shubham11941140
Copy link
Contributor Author

On running the query, it mentions a FutureWarning when comparing return bool(asarray(a1 == a2).all()).

This kind of error is not coming from the changes I have made.

@shubham11941140
Copy link
Contributor Author

The issue is that string comparison with integer is going past the limit of the NumPy comparison because of which it is failing.

@shubham11941140
Copy link
Contributor Author

image

The strange part is that this should be raising a Warning with the np.array_equal comparison.

Very strange is that numpy dev is raising an error which is not happening on my machine.

@shubham11941140
Copy link
Contributor Author

I have tested the build on my local machine and it is not failing, but 2 errors come here.

@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in main, address the comments in the code diff and we can reopen

@mroeschke mroeschke closed this May 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
4 participants