FIX: REGR: setting numeric value in Categorical Series with enlargement raise internal error #47751

CloseChoice · 2022-07-16T16:48:54Z

closes REGR: setting numeric value in Categorical Series with enlargement raise internal error #47677
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/v1.5.0.rst file if fixing a bug or adding a new feature.


phofl

What happens when a string is used for enlarging? Like a

CloseChoice · 2022-07-16T17:09:11Z

What happens when a string is used for enlarging? Like a

The series is also cast to object, which might not be the behaviour we want. But this also happened before the regression (tested on v1.4.3).

phofl · 2022-07-16T17:17:55Z

Yeah this is also a bug then. Could you check if the categorical dtypes match and then return the dtype accordingly?

CloseChoice · 2022-07-16T17:22:11Z

Yeah this is also a bug then. Could you check if the categorical dtypes match and then return the dtype accordingly?

How should we tackle that problem? If the enlarging value can be cast to categorical we should end up with a categorical series (so enlarging with 0 as in the issue also yields a categorical series)? Or only if the enlarging value is already one of the categories?

simonjayhawkins · 2022-07-16T17:42:57Z

doc/source/whatsnew/v1.5.0.rst

@@ -908,7 +908,7 @@ Indexing
 - Bug in :meth:`NDFrame.xs`, :meth:`DataFrame.iterrows`, :meth:`DataFrame.loc` and :meth:`DataFrame.iloc` not always propagating metadata (:issue:`28283`)
 - Bug in :meth:`DataFrame.sum` min_count changes dtype if input contains NaNs (:issue:`46947`)
 - Bug in :class:`IntervalTree` that lead to an infinite recursion. (:issue:`46658`)
-
+- Bug in :meth:`DataFrame.loc` when creating a new element on a :class:`Series` with dtype :class:`CategoricalDtype` (:issue:`47677`)


A release note is not needed if this only fixes an issue on main. If it changes behavior from 1.4.3, as suggested in #47751 (comment), then it will need a release note covering just that change.

phofl · 2022-07-16T17:47:36Z

How does concat behave in that case?

CloseChoice · 2022-07-17T08:44:54Z

How does concat behave in that case?

Like this

import pandas as pd
s = pd.Series(["a", "b", "c"], dtype="category")
t = pd.concat([s, pd.Series(["a"], index=[3])])  # dtype: object
t2 = pd.concat([s, pd.Series(["a"], index=[3], dtype="category")])  # dtype: object

I would consider this fine for t but not for t2. Note that I checked this for v1.4.3 and my fix and the behaviour is the same for both.

phofl · 2022-07-17T10:49:46Z

Yeah I think t2 is off

phofl · 2022-07-17T11:44:19Z

Could you try t2 with categories specified as a, b, c?

CloseChoice · 2022-07-17T13:19:55Z

Could you try t2 with categories specified as a, b, c?

This is interesting:

t = pd.concat([s, pd.Series(["a"])])  # dtype: object
t2 = pd.concat([s, pd.Series(["a"], dtype="category")])  # dtype: object
t3 = pd.concat([s, pd.Series(["a", "b"], dtype="category")])  # dtype: object
t4 = pd.concat([s, pd.Series(["a", "b", "c"], dtype="category")])  # dtype: category
t5 = pd.concat([s, pd.Series(["a", "b", "a", "b", "c"], dtype="category")])  # dtype: category
t6 = pd.concat([s, pd.Series(["a", "b", "a", "b", "a"], dtype="category")])  # dtype: object
t7 = pd.concat([s, pd.Series(["a", "b", "d"], dtype="category")])  # dtype: object

Only if the categories match exactly do we preserve the categorical dtype. This looks really buggy. But this is not a regression: I checked it on v1.4.3, main and my fix, and the behaviour is identical on all three. Should we create a separate issue for this?

phofl · 2022-07-17T13:21:47Z

No, this makes sense: the dtypes are not equal when the categories differ. But enlargement with a scalar is a different case; we should preserve the categorical dtype there. We have to create a series with the correct categories in that case.
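As an illustration of the point about dtype equality, here is a minimal sketch using only public pandas API. The variable names (`same`, `subset`, `kept`, `lost`) are made up for this example and do not come from the PR.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], dtype="category")

same = pd.CategoricalDtype(["a", "b", "c"])
subset = pd.CategoricalDtype(["a", "b"])

# Two categorical dtypes are equal only when their categories match.
print(s.dtype == same)    # True
print(s.dtype == subset)  # False

# concat keeps the categorical dtype only when both inputs share it.
kept = pd.concat([s, pd.Series(["a"], index=[3], dtype=same)])
lost = pd.concat([s, pd.Series(["a"], index=[3], dtype=subset)])
print(kept.dtype)  # category
print(lost.dtype)  # object
```

This matches the t3 and t4 cases listed earlier: a value that is already a category is not enough on its own, the whole dtype has to be equal.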

CloseChoice · 2022-07-17T15:07:31Z

No, this makes sense: the dtypes are not equal when the categories differ. But enlargement with a scalar is a different case; we should preserve the categorical dtype there. We have to create a series with the correct categories in that case.

I updated the PR to preserve the categorical dtype if the enlarging element is already in the categories. I needed a special case in _setitem_with_indexer_missing for this since we don't want to change concat_compat for this.

Please note that in

s = pd.DataFrame([["a", "b"], ["c", "k"]], dtype="category")
s.loc[3] = "a"

both columns are cast to object. This behaviour was not changed. But to be consistent, I would expect that the first column stays categorical while the second changes.

phofl · 2022-07-17T17:51:59Z

pandas/core/indexing.py

@@ -2119,9 +2120,16 @@ def _setitem_with_indexer_missing(self, indexer, value):
            new_values = Series([value], dtype=new_dtype)._values

            if len(self.obj._values):
+                # GH#47677 handle enlarging with a scalar as a special case


This is not the right place for this. You have to create new_values correctly instead of avoiding concat_compat

I do, here is the value of new_values

(Pdb++) new_values
['a']
Categories (1, object): ['a']

The point of avoiding concat_compat is that it explicitly checks whether the dtypes are the same, and because the categories aren't the same, the concatenated series is cast to object. If we changed that, it would also affect pd.concat([s, pd.Series(["a", "b"], dtype="category")]), or we would have to create a very specific special case in concat_compat and check for length 1.

No, this is not what I was talking about. But let's start over:

If we want to handle this in maybe_promote, we should return the dtype we get as input, not the string "categories". This loses all information we already got. No need to special case afterwards then.

If we decide against handling this in maybe_promote, we have to handle this earlier, e.g. in line 2120 at the latest, not here. tolist() is never a good idea if you want to keep the dtype: it loses all precision information, e.g.

result = Series([1, 2, 3], dtype=pd.CategoricalDtype(categories=pd.Index([1, 2, 3], dtype="Int64")))
result.loc[3] = 2

This is converted into a CategoricalDtype with int64, not Int64. The same would go for int32: it would get converted into int64 too.
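The tolist() concern can be reproduced outside the indexing code. The following is a small sketch with a nullable Int32 array; the names `arr`, `direct` and `round_tripped` are illustrative only.

```python
import pandas as pd

arr = pd.array([1, 2, 3], dtype="Int32")

# Passing the array itself keeps the nullable 32-bit dtype.
direct = pd.Series(arr)

# Going through a Python list discards the dtype; pandas re-infers it
# from plain Python ints.
round_tripped = pd.Series(arr.tolist())

print(direct.dtype)         # Int32
print(round_tripped.dtype)  # int64
```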

Thanks a lot. Of course you're right: just returning the given dtype works fine. I updated the PR; handling this in maybe_promote feels right to me, since checking dtypes and returning them is done there a lot.
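To make the "return the dtype we got as input" idea concrete, here is a rough sketch. `promote_for_enlargement` is a hypothetical stand-in written for this note; it is not pandas' internal maybe_promote and covers only the categorical case discussed in this thread.

```python
import numpy as np
import pandas as pd


def promote_for_enlargement(dtype, value):
    """Hypothetical helper, not pandas' internal maybe_promote."""
    if isinstance(dtype, pd.CategoricalDtype):
        # A missing value or an existing category fits the dtype unchanged,
        # so hand back the very same dtype object (categories included).
        if pd.isna(value) or value in dtype.categories:
            return dtype
        # Anything else cannot be represented, so fall back to object.
        return np.dtype("object")
    # Other dtypes are out of scope for this sketch.
    return dtype


s = pd.Series(["a", "b", "c"], dtype="category")

for value in ["a", np.nan, 0]:
    new_dtype = promote_for_enlargement(s.dtype, value)
    new_values = pd.Series([value], index=[3], dtype=new_dtype)
    print(repr(value), pd.concat([s, new_values]).dtype)
```

Because the returned dtype carries the full set of categories, the one-element series built from it has a dtype equal to the original, so no special case is needed when concatenating afterwards.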

phofl · 2022-07-18T00:47:39Z

DataFrame cases don't work yet, this is correct. But this still has multiple issues, so let's focus on Series here.

phofl · 2022-07-18T01:24:57Z

Could you also add a test where we enlarge with nan? This is not part of the categories but should probably work? Not sure

CloseChoice · 2022-07-18T06:17:57Z

Could you also add a test where we enlarge with nan? This is not part of the categories but should probably work? Not sure

Yep, it works in the sense that after enlarging with nan the series has dtype object. But I think that's fine. I added the test

phofl · 2022-07-18T10:09:44Z

I was referring to the opposite.

result = Series([1, 2, 3], dtype=pd.CategoricalDtype(categories=pd.Index([1, 2, 3])))
result.loc[1] = np.nan  # np.nan and pd.NA

This keeps the categorical dtype, so enlargement should too. Could you add tests for enlarging and setting into it to test that they are consistent?

Could you also add tests so that the integer dtype is kept like above? e.g. Int64 stays Int64, Int32 stays Int32 and so on. (we have fixtures for that)
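A side-by-side sketch of the two operations being compared, using only public API. The names `set_into` and `enlarged` are illustrative. The dtype after enlargement depends on the pandas version in use, so no expected output is given for it.

```python
import numpy as np
import pandas as pd

dtype = pd.CategoricalDtype(categories=[1, 2, 3])

# Setting NaN at an existing label keeps the categorical dtype.
set_into = pd.Series([1, 2, 3], dtype=dtype)
set_into.loc[1] = np.nan
print(set_into.dtype == dtype)  # True

# Enlarging with NaN at a new label. With the behaviour requested above,
# this should report the same dtype as the setting case.
enlarged = pd.Series([1, 2, 3], dtype=dtype)
enlarged.loc[3] = np.nan
print(enlarged.dtype)
```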

CloseChoice · 2022-07-18T16:14:03Z

Could you also add tests so that the integer dtype is kept like above? e.g. Int64 stays Int64, Int32 stays Int32 and so on. (we have fixtures for that)

I found one fixture in tests/groupby/test_function.py called dtypes_for_minmax, but it doesn't look reusable in other files. I also didn't check np.int32 and np.float32 since

pd.CategoricalDtype(pd.array([1, 2, 3], dtype=np.int32)).__dict__
>>> {'_categories': Int64Index([1, 2, 3], dtype='int64'), '_ordered': False}

these are cast implicitly.

phofl · 2022-07-18T16:41:46Z

It's called something like any_numeric_dtype

CloseChoice · 2022-07-19T13:23:20Z

It's called something like any_numeric_dtype

used any_numeric_ea_dtype for this, seems like the most generic fixture available for this case

phofl · 2022-07-19T13:33:32Z

I think there exists one that includes numpy dtypes, but this would work too

CloseChoice · 2022-07-19T13:49:36Z

I think there exists one that includes numpy dtypes, but this would work too

Besides the fact that I still can't find a unified extension array + numpy dtype fixture, it seems like this wouldn't work for np.int32 and np.float32, since these are implicitly cast to their 64-bit counterparts:

pd.CategoricalDtype(np.array([1, 2, 3], dtype=np.int32)).__dict__
>>> {'_categories': Int64Index([1, 2, 3], dtype='int64'), '_ordered': False}

phofl · 2022-07-19T13:53:00Z

Could you try Index([1, 2, 3], dtype=int32), this should work

CloseChoice · 2022-07-19T13:58:40Z

Index([1, 2, 3], dtype=int32)

Same issue:

pd.Index([1, 2, 3], dtype='int32')
>>> Int64Index([1, 2, 3], dtype='int64')

phofl · 2022-07-19T13:59:33Z

Hm ok, then ea dtypes only

CloseChoice · 2022-07-19T14:03:52Z

Any specific reasons for closing and not merging?

phofl · 2022-07-19T14:18:16Z

Uh, no, sorry about that. Pressed the wrong button.


phofl · 2022-08-12T15:24:15Z

pandas/core/dtypes/cast.py

@@ -646,6 +648,12 @@ def _maybe_promote(dtype: np.dtype, fill_value=np.nan):

        return np.dtype("object"), fill_value

+    elif isinstance(dtype, CategoricalDtype):


Could you move this before the isna check and add the Categorical handling of nan values into this block?

Keeps the Categorical specific code together

Otherwise this looks good

phofl · 2022-08-16T21:24:23Z

merged #48106, thx @CloseChoice

CloseChoice added 2 commits July 16, 2022 18:44

fix regression when loc is used to create a new element on an categorical series

f520eae

add whatsnew

4edc956

CloseChoice changed the title ~~2022 07 16 regr 47677~~ FIX: REGR: setting numeric value in Categorical Series with enlargement raise internal error Jul 16, 2022

CloseChoice requested a review from phofl July 16, 2022 16:49

phofl reviewed Jul 16, 2022


simonjayhawkins added this to the 1.5 milestone Jul 16, 2022

simonjayhawkins reviewed Jul 16, 2022


fix enlarging by scalar

a0e2934

phofl reviewed Jul 17, 2022


update due to PR discussions

5acbf42

remove unnecessary comment

59538cd

CloseChoice added 2 commits July 18, 2022 17:09

WIP: fix nan for enlarging

efdcf1c

add tests; fix nan in _maybe_promote

83f75ce

remove unnecessary statement

7bfb9b7

use any_numeric_ea_dtype for tests

d25f7eb

phofl closed this Jul 19, 2022

CloseChoice reopened this Jul 19, 2022

CloseChoice added 2 commits July 19, 2022 18:12

Merge branch 'main' of github.com:pandas-dev/pandas into 2022-07-16-REGR-47677

d15ce50

Merge branch 'main' of github.com:pandas-dev/pandas into 2022-07-16-REGR-47677

c3a7109

mroeschke added the Indexing and Categorical labels Jul 22, 2022

Merge branch 'main' of github.com:pandas-dev/pandas into 2022-07-16-REGR-47677

476fe52

phofl self-requested a review August 9, 2022 07:12

phofl reviewed Aug 12, 2022


phofl mentioned this pull request Aug 15, 2022

RLS: 1.5 #45223

Closed

phofl added the Blocker for rc label Aug 15, 2022

phofl mentioned this pull request Aug 16, 2022

REGR: setting numeric value in Categorical Series with enlargement raise internal error #48106

Merged

5 tasks

phofl closed this Aug 16, 2022

phofl removed the Blocker for rc label Aug 16, 2022

phofl removed this from the 1.5 milestone Aug 16, 2022
