FIX: REGR: setting numeric value in Categorical Series with enlargement raises internal error #47751
Conversation
What happens when a string is used for enlarging?

The series is also cast to `object`.

Yeah, this is also a bug then. Could you check if the categorical dtypes match and then return the dtype accordingly?

How should we tackle that problem? If the enlarging value can be cast to categorical we should end up with a categorical series (so enlarging with …).
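A minimal sketch of the behaviour this PR is about, assuming a pandas version with the fix merged (>= 1.5): enlarging a categorical Series via `.loc` with a value that is already one of the categories should keep the categorical dtype.

```python
import pandas as pd

# Enlarge a categorical Series with a value that is already a category.
# With the fix merged here this preserves the categorical dtype instead
# of raising an internal error or silently casting to object.
s = pd.Series(["a", "b", "c"], dtype="category")
s.loc[3] = "a"

print(s.dtype)  # category
print(len(s))   # 4
```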
doc/source/whatsnew/v1.5.0.rst (outdated diff):

```diff
@@ -908,7 +908,7 @@ Indexing
 - Bug in :meth:`NDFrame.xs`, :meth:`DataFrame.iterrows`, :meth:`DataFrame.loc` and :meth:`DataFrame.iloc` not always propagating metadata (:issue:`28283`)
 - Bug in :meth:`DataFrame.sum` min_count changes dtype if input contains NaNs (:issue:`46947`)
 - Bug in :class:`IntervalTree` that lead to an infinite recursion. (:issue:`46658`)
--
+- Bug in :meth:`DataFrame.loc` when creating a new element on a :class:`Series` with dtype :class:`CategoricalDtype` (:issue:`47677`)
```
A release note is not needed if this only fixes an issue on main. If changing behavior from 1.4.3, as suggested in #47751 (comment), then we will need a release note covering just that change.
How does concat behave in that case?
Like this:

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")
t = pd.concat([s, pd.Series(["a"], index=[3])])                     # dtype: object
t2 = pd.concat([s, pd.Series(["a"], index=[3], dtype="category")])  # dtype: object
```

I would consider this fine for `concat`.
Yeah, I think t2 is off.

Could you try t2 with categories specified as a, b, c?
This is interesting:

```python
t = pd.concat([s, pd.Series(["a"])])                                          # dtype: object
t2 = pd.concat([s, pd.Series(["a"], dtype="category")])                       # dtype: object
t3 = pd.concat([s, pd.Series(["a", "b"], dtype="category")])                  # dtype: object
t4 = pd.concat([s, pd.Series(["a", "b", "c"], dtype="category")])             # dtype: category
t5 = pd.concat([s, pd.Series(["a", "b", "a", "b", "c"], dtype="category")])   # dtype: category
t6 = pd.concat([s, pd.Series(["a", "b", "a", "b", "a"], dtype="category")])   # dtype: object
t7 = pd.concat([s, pd.Series(["a", "b", "d"], dtype="category")])             # dtype: object
```

Only if the categories match exactly do we preserve the categorical dtype. This looks really buggy, but it is not a regression: I checked v1.4.3, main, and my fix, and the behaviour is identical on all of them. Should we create a separate issue for this?
No, this makes sense: the dtypes are not equal when the categories differ. But enlargement with a scalar is a different case; we should preserve the categorical dtype there. We have to create a series with the correct categories in that case.
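For reference, a quick check of how `CategoricalDtype` equality behaves, which explains the concat results above: unordered categorical dtypes compare equal exactly when they have the same set of categories.

```python
import pandas as pd

a = pd.CategoricalDtype(["a", "b", "c"])
b = pd.CategoricalDtype(["c", "b", "a"])  # same categories, different order
c = pd.CategoricalDtype(["a", "b"])       # subset of the categories

print(a == b)  # True: unordered dtypes compare equal on the category set
print(a == c)  # False: different categories, so the dtypes are not equal
```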
I updated the PR to preserve the categorical dtype if the enlarging element is already in the categories. I needed a special case. Please note that in

```python
s = pd.DataFrame([["a", "b"], ["c", "k"]], dtype="category")
s.loc[3] = "a"
```

both axes are cast to `object`.
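As a minimal, self-contained reproduction of that DataFrame case (the integer column labels 0 and 1 are just the pandas defaults):

```python
import pandas as pd

# Both columns start out categorical; enlarging with a scalar row is the
# DataFrame variant of the Series case discussed in this PR. Note we do
# not assert the resulting dtypes here, since that behaviour is exactly
# what is under discussion and differs between pandas versions.
df = pd.DataFrame([["a", "b"], ["c", "k"]], dtype="category")
df.loc[3] = "a"  # enlarge along the row axis with a scalar

print(df.shape)   # (3, 2)
print(df.dtypes)
```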
pandas/core/indexing.py (outdated diff):

```diff
@@ -2119,9 +2120,16 @@ def _setitem_with_indexer_missing(self, indexer, value):
             new_values = Series([value], dtype=new_dtype)._values

+            if len(self.obj._values):
+                # GH#47677 handle enlarging with a scalar as a special case
```
This is not the right place for this. You have to create new_values correctly instead of avoiding concat_compat
I do, here is the value of `new_values`:

```
(Pdb++) new_values
['a']
Categories (1, object): ['a']
```

The point of avoiding `concat_compat` is that in there we explicitly check whether the dtypes are the same, and because the categories aren't the same the concatenated series is cast to object. If we changed that, it would also have an effect on

```python
pd.concat([s, pd.Series(["a", "b"], dtype="category")])
```

or we create a very specific special case in `concat_compat` and check for length 1.
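The `concat` behaviour referred to here is easy to verify: the categorical dtype survives concatenation only when both operands have equal categorical dtypes.

```python
import pandas as pd

s1 = pd.Series(["a", "b", "c"], dtype="category")

# Identical categories on both sides: the categorical dtype is preserved.
same = pd.concat([s1, pd.Series(["a", "b", "c"], dtype="category")])

# Different category sets: the result is upcast to object.
diff = pd.concat([s1, pd.Series(["a", "b"], dtype="category")])

print(same.dtype)  # category
print(diff.dtype)  # object
```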
No, this is not what I was talking about. But let's start over:

If we want to handle this in `maybe_promote`, we should return the dtype we get as input, not the string "categories". That loses all information we already got, and there is no need to special-case afterwards then.

If we decide against handling this in `maybe_promote`, we have to handle it earlier, e.g. in line 2120 at the latest, not here. `tolist()` is never a good idea if you want to keep the dtype; it loses all precision information, e.g.

```python
result = Series([1, 2, 3], dtype=pd.CategoricalDtype(categories=pd.Index([1, 2, 3], dtype="Int64")))
result.loc[3] = 2
```

This is converted into a CategoricalDtype with int64, not Int64. The same would go for int32; it would get converted into int64 too.
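The precision-loss concern with `tolist()` is easy to demonstrate with the nullable Int64 extension dtype: round-tripping through a plain Python list discards the extension dtype, and re-inference falls back to a numpy integer dtype.

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype="Int64")  # nullable extension dtype

# tolist() yields plain Python ints; building a new Series from them
# re-infers a numpy dtype, so the Int64 extension dtype is lost.
roundtrip = pd.Series(s.tolist())

print(s.dtype)          # Int64
print(roundtrip.dtype)  # a plain numpy integer dtype, not Int64
```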
Thanks a lot. Of course you're right; just returning the given dtype works fine. I updated the PR. Handling this in `maybe_promote` feels right to me; checking for dtypes and returning them is done there a lot.

DataFrame cases don't work yet, this is correct. But those have multiple issues still, so let's focus on Series here.
Could you also add a test where we enlarge with nan? This is not part of the categories but should probably work? Not sure.

Yep, it works in the sense that after enlarging with nan the series has dtype object. But I think that's fine. I added the test.
I was referring to the opposite: this keeps the categorical dtype, so enlargement should too. Could you add tests for enlarging and for setting into it, to check that they are consistent? Could you also add tests so that the integer dtype is kept like above? E.g. Int64 stays Int64, Int32 stays Int32, and so on (we have fixtures for that).
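A sketch of the kind of consistency test being asked for, assuming a pandas version with the fix merged (>= 1.5); the helper name and the parametrized values are illustrative, not the tests actually added to the PR:

```python
import pandas as pd

def check_enlarge_preserves_category(values, new):
    # Enlarging with a value that is already a category should keep the
    # categorical dtype (hypothetical helper for illustration only).
    s = pd.Series(values, dtype="category")
    s.loc[len(s)] = new
    assert isinstance(s.dtype, pd.CategoricalDtype)
    return s

# Both the string and the numeric case from this PR's discussion:
check_enlarge_preserves_category(["a", "b", "c"], "a")
check_enlarge_preserves_category([1, 2, 3], 2)
```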
Found one fixture in … But note that in

```python
>>> pd.CategoricalDtype(pd.array([1, 2, 3], dtype=np.int32)).__dict__
{'_categories': Int64Index([1, 2, 3], dtype='int64'), '_ordered': False}
```

these are cast implicitly.
It's called something like `any_numeric_dtype`.

Used it.

I think there exists one that also includes numpy dtypes, but this would work too.
Besides that, I still can't find a unified extension array + numpy dtype fixture. It seems like this wouldn't work either:

```python
>>> pd.CategoricalDtype(np.array([1, 2, 3], dtype=np.int32)).__dict__
{'_categories': Int64Index([1, 2, 3], dtype='int64'), '_ordered': False}
```
Could you try `Index([1, 2, 3], dtype="int32")`? This should work.

Same issue:

```python
>>> pd.Index([1, 2, 3], dtype='int32')
Int64Index([1, 2, 3], dtype='int64')
```

Hm ok, then EA dtypes only.
Any specific reasons for closing and not merging? |
Ah no, sorry for that, pressed the wrong button.
```diff
@@ -646,6 +648,12 @@ def _maybe_promote(dtype: np.dtype, fill_value=np.nan):

         return np.dtype("object"), fill_value

+    elif isinstance(dtype, CategoricalDtype):
```
Could you move this before the `isna` check and add the Categorical handling of nan values into this block? That keeps the Categorical-specific code together. Otherwise this looks good.
merged #48106, thx @CloseChoice |