Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: REGR: setting numeric value in Categorical Series with enlargement raise internal error #47751

Closed
wants to merge 12 commits into from

Conversation

CloseChoice
Copy link
Member

@CloseChoice CloseChoice commented Jul 16, 2022

@CloseChoice CloseChoice changed the title 2022 07 16 regr 47677 FIX: REGR: setting numeric value in Categorical Series with enlargement raise internal error Jul 16, 2022
@CloseChoice CloseChoice requested a review from phofl July 16, 2022 16:49
Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when a string is used for enlarging? Like a

@CloseChoice
Copy link
Member Author

CloseChoice commented Jul 16, 2022

What happens when a string is used for enlarging? Like a

The series is also cast to object, which might not be the behaviour we want. But this also happens previous to the regression (tested on v1.4.3).

@phofl
Copy link
Member

phofl commented Jul 16, 2022

Yeah this is also a bug then. Could you check if the categorical dtypes match and then return the dtype accordingly?

@CloseChoice
Copy link
Member Author

Yeah this is also a bug then. Could you check if the categorical dtypes match and then return the dtype accordingly?

How should we tackle that problem? If the enlarging value can be cast to categorical we should end up with a categorical series (so enlarging with 0 as in the issue also yields a categorical series)? Or only if the enlarging value is already one of the categories?

@simonjayhawkins simonjayhawkins added this to the 1.5 milestone Jul 16, 2022
@@ -908,7 +908,7 @@ Indexing
- Bug in :meth:`NDFrame.xs`, :meth:`DataFrame.iterrows`, :meth:`DataFrame.loc` and :meth:`DataFrame.iloc` not always propagating metadata (:issue:`28283`)
- Bug in :meth:`DataFrame.sum` min_count changes dtype if input contains NaNs (:issue:`46947`)
- Bug in :class:`IntervalTree` that lead to an infinite recursion. (:issue:`46658`)
-
- Bug in :meth:`DataFrame.loc` when creating a new element on a :class:`Series` with dtype :class:`CategoricalDtype` (:issue:`47677`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

release note not needed if only a fix for issue on main. If changing behavior from 1.4.3, as suggested in #47751 (comment) then will need a release note covering just that change.

@phofl
Copy link
Member

phofl commented Jul 16, 2022

How does concat behave in that case?

@CloseChoice
Copy link
Member Author

CloseChoice commented Jul 17, 2022

How does concat behave in that case?

Like this

import pandas as pd
s = pd.Series(["a", "b", "c"], dtype="category")
t = pd.concat([s, pd.Series(["a"], index=[3])])  # dtype: object
t2 = pd.concat([s, pd.Series(["a"], index=[3], dtype="category")])  # dtype: object

I would consider this fine for t but not for t2. Note that I checked this for v1.4.3 and my fix and the behaviour is the same for both.

@phofl
Copy link
Member

phofl commented Jul 17, 2022

Yeah I think t2 is off

@phofl
Copy link
Member

phofl commented Jul 17, 2022

Could you try t2 with categories specified as a,b, c?

@CloseChoice
Copy link
Member Author

CloseChoice commented Jul 17, 2022

Could you try t2 with categories specified as a,b, c?

This is interesting:

t = pd.concat([s, pd.Series(["a"])]) # dtype object
t2 = pd.concat([s, pd.Series(["a"], dtype="category")])  # dtype object
t3 = pd.concat([s, pd.Series(["a", "b"], dtype="category")]) # dtype object
t4 = pd.concat([s, pd.Series(["a", "b", "c"], dtype="category")])  # dtype category
t5 = pd.concat([s, pd.Series(["a", "b", "a", "b", "c"], dtype="category")]) #dtype category
t6 = pd.concat([s, pd.Series(["a", "b", "a", "b", "a"], dtype="category")]) #dtype object
t7 = pd.concat([s, pd.Series(["a", "b", "d"], dtype="category")])  # dtype object

Only if the categories are match exactly we preserve the category type. This looks really buggy. But this is not a regression, checked it on v1.4.3, main and my fix and the behaviour is identical on all. Should we create a seperate issue for this?

@phofl
Copy link
Member

phofl commented Jul 17, 2022

No this make sense, the dtypes are not equal with different categories. But enlargement with a scalar is a different case, we should preserve categorical dtype there. We have to create a series with the correct categories in that case

@CloseChoice
Copy link
Member Author

CloseChoice commented Jul 17, 2022

No this make sense, the dtypes are not equal with different categories. But enlargement with a scalar is a different case, we should preserve categorical dtype there. We have to create a series with the correct categories in that case

I updated the PR to preserve the categorical dtype if the enlarging element is already in the categories. I needed a special case
in _setitem_with_indexer_missing for this since we don't want to change concat_compat for this.

Please note that in

s = pd.DataFrame([["a", "b"], ["c", "k"]], dtype="category")
s.loc[3] = "a"

both axis are cast to object. This behaviour was not changed. But to be consistent I would expect, that the first column should stay categorical while the second should change.

@@ -2119,9 +2120,16 @@ def _setitem_with_indexer_missing(self, indexer, value):
new_values = Series([value], dtype=new_dtype)._values

if len(self.obj._values):
# GH#47677 handle enlarging with a scalar as a special case
Copy link
Member

@phofl phofl Jul 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the right place for this. You have to create new_values correctly instead of avoiding concat_compat

Copy link
Member Author

@CloseChoice CloseChoice Jul 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do, here is the value of new_values

(Pdb++) new_values
['a']
Categories (1, object): ['a']

The point of avoiding concat_compat is, that in there we explicitly check if the dtypes are the same, and because the categories aren't the same the concatenated series is cast to object. If we'd change that then this will also have an effect on pd.concat([s, pd.Series(["a", "b"], dtype="category")]) or we create a very specific special case in concat_compat and check for length 1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is not what I was talking about. But lets start over:

If we want to handle this in maybe_promote, we should return the dtype we get as input, not the string "categories". This loses all information we already got. No need to special case afterwards then.

If we decide against handling this in maybe_promote, we have to handle this earlier, e.g. in line 2120 at the latest, not here. tolist() is never a good idea, if you want to keep the dtype, this looses all precision information, e.g.

result = Series([1, 2, 3], dtype=pd.CategoricalDtype(categories=pd.Index([1, 2, 3], dtype="Int64")))
result.loc[3] = 2

this is converted into CategoricalDtype with int64 not Int64, same would go for int32. It would get converted into int64 too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot. Of course you're right, just returning the given dtype works fine. I updated the PR, handling this in maybe_promote feels right for me, checking for dtypes and returning them is done there a lot.

@phofl
Copy link
Member

phofl commented Jul 18, 2022

DataFrame cases don't work yet, this is correct. But this has multiple issues still, so lets focus on Series here

@phofl
Copy link
Member

phofl commented Jul 18, 2022

Could you also add a test where we enlarge with nan? This is not part of the categories but should probably work? Not sure

@CloseChoice
Copy link
Member Author

Could you also add a test where we enlarge with nan? This is not part of the categories but should probably work? Not sure

Yep, it works in the sense that after enlarging with nan the series has dtype object. But I think that's fine. I added the test

@phofl
Copy link
Member

phofl commented Jul 18, 2022

I was referring to the opposite.

result = Series([1, 2, 3], dtype=pd.CategoricalDtype(categories=pd.Index([1, 2, 3])))
result.loc[1] = np.nan  # np.nan and pd.NA

this keeps categorical dtype, e.g. enlargement should too. Could you add tests for enlarging and setting into it to test that they are consistent?

Could you also add tests so that the integer dtype is kept like above? e.g. Int64 stays Int64, Int32 stays Int32 and so on. (we have fixtures for that)

@CloseChoice
Copy link
Member Author

you also add tests so that the integer dtype is kept like above? e.g. Int64 stays Int64, Int32 stays Int32 and so on. (we have fixtures for that)

Found one fixture in tests/groupby/test_function.py called dtypes_for_minmax but that doesn't look for reusing in other files. Also didn't check for np.int32 and np.float32 since

pd.CategoricalDtype(pd.array([1, 2, 3], dtype=np.int32)).__dict__
>>> {'_categories': Int64Index([1, 2, 3], dtype='int64'), '_ordered': False}

these are cast implicitly.

@phofl
Copy link
Member

phofl commented Jul 18, 2022

Its called something like any_numeric_dtype

@CloseChoice
Copy link
Member Author

Its called something like any_numeric_dtype

used any_numeric_ea_dtype for this, seems like the most generic fixture available for this case

@phofl
Copy link
Member

phofl commented Jul 19, 2022

I think there exists one that includes numpy dtypes, but this would work too

@CloseChoice
Copy link
Member Author

CloseChoice commented Jul 19, 2022

I think there exists one that includes numpy dtypes, but this would work too

besides that I still don't find a unified extension array + numpy dtype fixture it seems like this wouldn't work for np.int32 and np.float32 since these are implicitly cast to their 64 bit counterparts

pd.CategoricalDtype(np.array([1, 2, 3], dtype=np.int32)).__dict__
>>> {'_categories': Int64Index([1, 2, 3], dtype='int64'), '_ordered': False}

@phofl
Copy link
Member

phofl commented Jul 19, 2022

Could you try Index([1, 2, 3], dtype=int32), this should work

@CloseChoice
Copy link
Member Author

CloseChoice commented Jul 19, 2022

Index([1, 2, 3], dtype=int32)

Same issue:

pd.Index([1, 2, 3], dtype='int32')
>>> Int64Index([1, 2, 3], dtype='int64')

@phofl
Copy link
Member

phofl commented Jul 19, 2022

Hm ok, then ea dtypes only

@phofl phofl closed this Jul 19, 2022
@CloseChoice
Copy link
Member Author

Any specific reasons for closing and not merging?

@CloseChoice CloseChoice reopened this Jul 19, 2022
@phofl
Copy link
Member

phofl commented Jul 19, 2022

Äh no, sorry for that. Pressed the wrong button

@mroeschke mroeschke added Indexing Related to indexing on series/frames, not to indexes themselves Categorical Categorical Data Type labels Jul 22, 2022
@phofl phofl self-requested a review August 9, 2022 07:12
@@ -646,6 +648,12 @@ def _maybe_promote(dtype: np.dtype, fill_value=np.nan):

return np.dtype("object"), fill_value

elif isinstance(dtype, CategoricalDtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this before the isna check and add the Categorical handling of nan values into this block?

Keeps the Categorical specific code together

Ohterwise this looks good

@phofl phofl mentioned this pull request Aug 15, 2022
@phofl phofl added the Blocker for rc Blocking issue or pull request for release candidate label Aug 15, 2022
@phofl
Copy link
Member

phofl commented Aug 16, 2022

merged #48106, thx @CloseChoice

@phofl phofl closed this Aug 16, 2022
@phofl phofl removed the Blocker for rc Blocking issue or pull request for release candidate label Aug 16, 2022
@phofl phofl removed this from the 1.5 milestone Aug 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

REGR: setting numeric value in Categorical Series with enlargement raise internal error
4 participants