BUG: using dtype='int64' argument of Series causes ValueError: values cannot be losslessly cast to int64 for integer strings #45017

shubham11941140 · 2021-12-22T19:59:38Z

closes BUG: using dtype='int64' argument of Series causes ValueError: values cannot be losslessly cast to int64 for integer strings #44923
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

Added extra check before returning ValueError.

jbrockmendel · 2021-12-22T21:58:16Z

pandas/core/dtypes/cast.py

@@ -2096,6 +2096,9 @@ def maybe_cast_to_integer_array(
        )
        return casted

+    if all(np.dtype(i) is dtype for i in casted):


this is going to be super-slow

can maybe update the comment on L2102 if we've found cases that get here

I don't recall any existing cases that reach the ValueError. There was one case, which the added code solves.

I do not have a faster way to check whether the cast was performed correctly for every element.

Checking the dtypes is one way, but we cannot check every element individually.

This check runs last, so in many cases we will not reach it; being a bit slow should be acceptable.

yeah we need to be really sure about this check

Changed the check as requested.

shubham11941140 · 2021-12-23T06:18:20Z

Since dtype will always be an integer dtype, there is no need to check against it explicitly; we can check for any integer type instead.

jreback · 2021-12-23T16:31:52Z

pandas/core/dtypes/cast.py

@@ -2096,6 +2096,9 @@ def maybe_cast_to_integer_array(
        )
        return casted

+    if all(isinstance(i, (int, np.integer)) for i in casted):


you can use infer_dtype(casted) == 'integer' i think

shubham11941140 · 2021-12-24T17:25:13Z

@jbrockmendel @jreback any updates?

jbrockmendel · 2021-12-24T17:33:06Z

pandas/tests/series/test_constructors.py

@@ -1810,6 +1810,18 @@ def test_constructor_bool_dtype_missing_values(self):
        expected = Series(True, index=[0], dtype="bool")
        tm.assert_series_equal(result, expected)

+    def test_constructor_int64_dtype(self):
+        # GH-44923
+        result = Series(["0", "1", "2"], dtype="int64")


can you come up with a test case that goes through 2099-2100 but that shouldn't get cast? maybe ["0", "1", "1.1"]?

Series(["0", "1", 0, 1], dtype="int64") raises the specified ValueError.

But shouldn't this be cast to int64, since every element can be cast?

Should I add the above example and change the code to fit it?

@jbrockmendel there is no such case, so it is good to keep that line as it keeps the base covered.

jbrockmendel · 2021-12-24T17:33:53Z

pandas/core/dtypes/cast.py

@@ -2096,6 +2096,9 @@ def maybe_cast_to_integer_array(
        )
        return casted

+    if lib.infer_dtype(casted) == "integer":


i think this always holds. The condition we're interested in is whether the casting was lossy.

Agreed, but this test currently passes the extra test added.

do the added tests pass w/o this change?

Yes they do.

Right, because this condition always holds.

But before adding this check the testcase in the issue was not passing.

> But before adding this check the testcase in the issue was not passing.

That's a reason to add some check for this, but this particular check is not the right check. ATM this is equivalent to (but slower than) if True:
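The point about the check always holding can be illustrated with a minimal NumPy-only sketch. The helper name `looks_integer` is hypothetical and only mirrors the spirit of the proposed condition; it is not the code from the PR. Once an array has been cast to an integer dtype, every element is an integer by construction, so the check passes whether or not information was lost.

```python
import numpy as np

def looks_integer(casted):
    # Element-wise type check in the spirit of the proposed condition
    # (hypothetical helper, not the pandas implementation).
    return all(isinstance(v, (int, np.integer)) for v in casted)

# Lossless case: integer strings convert cleanly.
from_strings = np.array(["0", "1", "2"], dtype=object).astype("int64")

# Lossy case: astype silently truncates the fractional parts.
floats = np.array([1.5, 2.7])
from_floats = floats.astype("int64")

print(looks_integer(from_strings))           # the check passes
print(looks_integer(from_floats))            # it passes here too, despite data loss
print(np.array_equal(floats, from_floats))   # comparing against the input does notice
```

A check that is meant to detect lossy casting has to look at the input as well as the result; inspecting the result alone cannot distinguish the two cases above.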

shubham11941140 · 2021-12-27T15:07:31Z

@jreback @jbrockmendel any update on this?

jreback · 2021-12-27T15:09:01Z

> @jreback @jbrockmendel any update on this?

@shubham11941140 there is no need to ping. you need to answer @jbrockmendel's last question. I guess your test is not specific enough (if it already passed)

shubham11941140 · 2021-12-27T15:15:04Z

Actually I have added the discussion point to another test case, which might indicate this. I was asking to add another testcase which gives a lossless conversion.

jreback · 2021-12-27T15:16:47Z

> Actually I have added the discussion point to another test case, which might indicate this. I was asking to add another testcase which gives a lossless conversion.

great

shubham11941140 · 2021-12-27T15:17:31Z

Do I add it?
The last testcase should be correct as the conversion is lossless including this one.

jreback · 2021-12-27T15:21:57Z

> Do I add it? The last testcase should be correct as the conversion is lossless including this one.

does that test fail before your change? that's the key, does this replicate the original issue.

shubham11941140 · 2021-12-27T15:25:27Z

Yes, that test was failing before my change, so I will add it.

jreback · 2021-12-27T15:30:32Z

pandas/tests/series/test_constructors.py

@@ -1810,6 +1810,18 @@ def test_constructor_bool_dtype_missing_values(self):
        expected = Series(True, index=[0], dtype="bool")
        tm.assert_series_equal(result, expected)

+    def test_constructor_int64_dtype(self):


pls parameterize these. also add a case for uint64.

Adding a case for uint64 makes the test fail: the values are implicitly cast to int64, which leads to an AssertionError against uint64.

Parametrization is completed

shubham11941140 · 2021-12-30T15:21:59Z

@jbrockmendel any update on this? I think I have covered everything.

jreback · 2021-12-31T16:14:59Z

pandas/tests/series/test_constructors.py

@@ -1810,6 +1810,19 @@ def test_constructor_bool_dtype_missing_values(self):
        expected = Series(True, index=[0], dtype="bool")
        tm.assert_series_equal(result, expected)

+    @pytest.mark.parametrize("int_dtype", ["int64"])


use this fixture instead: any_int_dtype

jreback · 2021-12-31T16:18:47Z

pandas/tests/series/test_constructors.py

+        expected = Series([-1, 0, 1, 2])
+        tm.assert_series_equal(result, expected)
+
+    def test_constructor_float64_dtype(self):


use any_float_dtype

jreback · 2021-12-31T16:19:03Z

pandas/core/dtypes/cast.py

@@ -2096,6 +2096,9 @@ def maybe_cast_to_integer_array(
        )
        return casted

+    if lib.infer_dtype(casted) == "integer":


do the added tests pass w/o this change?

jreback · 2021-12-31T16:44:01Z

pandas/tests/series/test_constructors.py

@@ -1810,6 +1810,20 @@ def test_constructor_bool_dtype_missing_values(self):
        expected = Series(True, index=[0], dtype="bool")
        tm.assert_series_equal(result, expected)

+    @pytest.mark.parametrize("any_int_dtype", ["int64"])
+    def test_constructor_int64_dtype(self, any_int_dtype):


no, pls just use the fixture itself, e.g. no parameterize

This is causing an AssertionError.

The previous code segment is leading to this issue, if we have only int64 there is no issue.

you need to match the expected value as well

@jreback I think I have covered everything?

@shubham11941140 you are not using the fixtures pls do so

just remove the parametrize completely

uint -> uint8, uint16, uint32, uint64 are failing due to internal code implementation. Do i fix this?

@jreback removed parametrization, now it should be ready.

jreback · 2022-01-04T16:52:14Z

> @jreback any update?

@shubham11941140 you are touching some very particular code, i don't know how to proceed here. this PR keeps expanding in scope & code changes w/o test cases.

shubham11941140 · 2022-01-04T17:20:21Z

I will explain it here in detail. It is the same test case as in the test_constructors.py file:

result = Series(["0", "1", "2"], dtype=any_int_dtype)
exp = Series([0, 1, 2], dtype=any_int_dtype)

any_int_dtype covers uint8 and every other unsigned integer dtype.

So, taking uint8:

result = Series(["0", "1", "2"], dtype="uint8")
exp = Series([0, 1, 2], dtype="uint8")

I have written these two cases and they must be equal, i.e. tm.assert_series_equal must hold.

The issue is that when dtype is uint8, or for that matter any dtype for which is_unsigned_integer_dtype(dtype) holds, we reach the condition (arr < 0).any().

Look carefully: arr = ["0", "1", "2"] is our np.ndarray, and it contains strings. The expression (arr < 0).any() is a NumPy comparison that DOES NOT ALLOW comparing strings to integers ("0" is a str and 0 is an int), so it raises a TypeError.

However, there should be no error in this case, since the strings can be implicitly cast to int.

The result of that cast is already held in the casted variable, casted = [0, 1, 2], which is then cast to the uint dtype. If we apply (casted < 0).any() instead, there should not be any error for the test case above.

For this test case, I have added a try/except block that keeps other tests from breaking and allows strings to be cast to unsigned integers without running into the TypeError from the NumPy comparison.

This way I avoid the issue with unsigned integer dtypes and can perform the implicit cast.

I hope this explains the code change and the associated test case. If anything is still unclear, I can schedule a call to explain it.
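The behaviour described above can be reproduced with NumPy alone. This is a sketch under assumptions: the array is built with an explicit object dtype, and `has_negative` is a hypothetical helper that mirrors the described try/except fallback, not the code in the PR.

```python
import numpy as np

def has_negative(arr, casted):
    """Hypothetical helper: look for negatives, falling back to the cast result."""
    try:
        return bool((arr < 0).any())
    except TypeError:
        # Reached when arr holds values (such as str) that cannot be ordered
        # against the integer 0; the already-cast values can be compared instead.
        return bool((casted < 0).any())

arr = np.array(["0", "1", "2"], dtype=object)
casted = arr.astype("int64")

try:
    arr < 0
    raised = False
except TypeError:
    # Comparing a str element with an int is not supported.
    raised = True

print(raised)
print(has_negative(arr, casted))
```

Note that the fallback only answers the "are there negatives" question; it says nothing about whether the cast from strings was lossless.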

@jreback

shubham11941140 · 2022-01-05T16:03:54Z

@jreback do I need to give a more elaborate explanation?

jbrockmendel · 2022-01-05T16:30:18Z

will take another look today.

shubham11941140 · 2022-01-06T15:38:37Z

@jbrockmendel @jreback any update?

jbrockmendel · 2022-01-06T16:37:38Z

> any update?

The check on L2108 still needs to be changed to something meaningful. The check is about the casting not being lossy.

shubham11941140 · 2022-01-06T16:40:54Z

How do I check whether it is a lossless change? Is there any construct?

jreback · 2022-01-16T17:35:47Z

> How do I check whether it is a lossless change? Is there any construct?

you need to construct a test explicitly for this
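One way to make "lossless" concrete before writing such a test is to spell out which inputs should and should not cast. The sketch below is an illustration only: `to_exact_int` and `is_lossless_int_cast` are hypothetical helpers, not pandas functions, and treating mixed input such as ["0", "1", 0, 1] as lossless is an assumption, since the thread leaves that question open.

```python
import numpy as np

def to_exact_int(value):
    """Return value as a Python int, or raise ValueError if information would be lost."""
    if isinstance(value, str):
        return int(value)  # int("1.1") raises ValueError
    try:
        as_int = int(value)  # int(nan) raises ValueError, int(inf) raises OverflowError
    except OverflowError as err:
        raise ValueError(str(err)) from err
    if as_int != value:
        raise ValueError(f"{value!r} has a fractional part")
    return as_int

def is_lossless_int_cast(values, dtype):
    """Hypothetical helper: True if every value fits the integer dtype exactly."""
    info = np.iinfo(np.dtype(dtype))
    try:
        return all(info.min <= to_exact_int(v) <= info.max for v in values)
    except (TypeError, ValueError):
        return False

cases = [
    (["0", "1", "2"], "int64"),    # integer strings
    (["0", "1", "1.1"], "int64"),  # one string with a fractional part
    ([1, 200, 923442], "int8"),    # values outside the int8 range
    (["-1"], "uint8"),             # negative value for an unsigned dtype
]
for values, dtype in cases:
    print(values, dtype, is_lossless_int_cast(values, dtype))
```

Each tuple in `cases` maps naturally onto one parametrized test case: the first should construct successfully, the others should raise.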

shubham11941140 · 2022-01-17T14:33:59Z

@jreback @jbrockmendel I have removed the trivially true check used for the return condition and extended the condition above it instead.

This now checks the lossless condition, and it is combined with the existing overflow condition, so if the user passes very large strings to be cast to a small integer dtype such as int8, we will not run into a similar issue again.

I think this will satisfy the requisite conditions now.

jbrockmendel · 2022-01-24T18:10:18Z

pandas/core/dtypes/cast.py

@@ -2074,7 +2083,7 @@ def maybe_cast_to_integer_array(
    if is_object_dtype(arr.dtype):
        raise ValueError("Trying to coerce float values to integers")

-    if casted.dtype < arr.dtype:
+    if casted.dtype < arr.dtype or lib.infer_dtype(casted) < lib.infer_dtype(arr):


lib.infer_dtype(casted) < lib.infer_dtype(arr) im not sure what this is supposed to mean

jbrockmendel · 2022-01-24T18:13:01Z

pandas/core/dtypes/cast.py

+                raise OverflowError(
+                    "Trying to coerce negative values to unsigned integers"
+                )
+        except TypeError:


can you add a comment here about what cases get here

shubham11941140 · 2022-01-24T18:16:10Z

lib.infer_dtype(casted) < lib.infer_dtype(arr) is the extension of casted.dtype < arr.dtype to inferred dtypes, since the dtypes are not properly initialized in the input. It checks the same lossless-casting condition that the warning message refers to.

jbrockmendel · 2022-01-24T21:31:42Z

> lib.infer_dtype(casted) < lib.infer_dtype(arr) it is the same like the extension of casted.dtype < arr.dtype to inferred dtypes as they are not properly initialized in the input. However they check the same as the condition of the lossless casting as the warning message means.

lib.infer_dtype returns a string. Why is comparing two strings useful here?
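To make the concern concrete: comparing two label strings with `<` is a plain lexicographic comparison. The sketch below uses a few of the labels that infer_dtype is documented to return; the selection is illustrative, and `flagged` is a hypothetical name.

```python
# A few of the string labels infer_dtype can return (illustrative selection).
labels = ["boolean", "floating", "integer", "mixed", "mixed-integer", "string"]

# For an integer result, the proposed condition reduces to "integer" < label,
# which is decided purely by alphabetical order.
flagged = {label: "integer" < label for label in labels}

for label, result in flagged.items():
    print(f'"integer" < "{label}": {result}')
```

Under this ordering, "string" and "mixed" sort after "integer" while "floating" sorts before it, so float input, the case where an integer cast can actually drop information, would not trigger the condition.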

shubham11941140 · 2022-02-11T15:53:17Z

If the dtype is a string, it is a case of a lossless cast, so that can be checked directly.

jreback · 2022-03-06T23:27:05Z

@shubham11941140 if you want to continue, please merge master and address the comments

shubham11941140 · 2022-03-07T07:08:04Z

Branch is updated.

jreback

@shubham11941140 before you do any patches here, can you demonstrate failing cases in tests

shubham11941140 · 2022-03-07T13:51:45Z

I did not understand. What do you want me to do?

shubham11941140 · 2022-03-07T14:18:12Z

On running the tests, a FutureWarning is reported at the comparison return bool(asarray(a1 == a2).all()).

This kind of error is not coming from the changes I have made.

shubham11941140 · 2022-03-07T14:32:06Z

The issue is that comparing strings with integers goes beyond what the NumPy comparison supports, which is why it is failing.
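A small NumPy-only sketch of the string-versus-integer comparison, assuming an object-dtype input array; the variable names are illustrative. Element-wise equality between "0" and 0 is False, so np.array_equal reports the input and the cast result as different even when the cast itself succeeded.

```python
import numpy as np

arr = np.array(["0", "1", "2"], dtype=object)
casted = arr.astype("int64")

# Element-wise, "0" == 0 is False, so the arrays never compare equal.
same_as_input = np.array_equal(arr, casted)

# Converting the integers back to text and comparing recovers the original values.
round_trip_matches = [str(v) for v in casted.tolist()] == arr.tolist()

print(same_as_input)
print(round_trip_matches)
```

An equality test between the raw input and the cast result therefore cannot be used on its own to decide whether integer strings were cast losslessly.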

shubham11941140 · 2022-03-07T15:12:05Z

The strange part is that the np.array_equal comparison should only be raising a warning.

Even stranger, the numpy dev build is raising an error that does not happen on my machine.

shubham11941140 · 2022-03-07T15:15:25Z

I have tested the build on my local machine and it does not fail there, but two errors show up here.

mroeschke · 2022-05-07T02:52:21Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in main, address the comments in the code diff and we can reopen

shubham11941140 added 2 commits December 23, 2021 01:25

Valueerror

7952825

precommit

5d0362e

jbrockmendel reviewed Dec 22, 2021


shubham11941140 requested a review from jbrockmendel December 23, 2021 03:07

shubham11941140 added 2 commits December 23, 2021 08:41

Merge branch 'master' of https://github.com/pandas-dev/pandas into b7

01af816

Optimization

730d5a9

jreback requested changes Dec 23, 2021


shubham11941140 added 2 commits December 24, 2021 01:33

infer_dtype added

204c3c9

precommit

814e5ff

shubham11941140 requested a review from jreback December 23, 2021 20:06

jbrockmendel reviewed Dec 24, 2021


shubham11941140 requested a review from jbrockmendel December 24, 2021 17:45

jreback reviewed Dec 27, 2021


Parametrised

a76a34a

shubham11941140 requested a review from jreback December 27, 2021 16:14

jreback requested changes Dec 31, 2021


changed to any_dtype

8603236

shubham11941140 requested a review from jreback December 31, 2021 16:37

jreback reviewed Dec 31, 2021


removed obvious line

eed91cb

jbrockmendel reviewed Jan 24, 2022


shubham11941140 requested a review from jbrockmendel January 24, 2022 18:20

Checking string dtype directly

ae6ea47

Merge branch 'main' into b7

1247f0e

jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Mar 7, 2022

jreback requested changes Mar 7, 2022


shubham11941140 requested a review from jreback March 7, 2022 13:51

mroeschke closed this May 7, 2022
