BUG: Categorical.unique should keep dtype unchanged #38140

topper-123 · 2020-11-28T19:59:54Z

closes Any categorical dtype object's .unique() changes categories #18291
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

We want to keep the same dtype as in the original after applying unique. For example:

>>> dtype = pd.CategoricalDtype(['very bad', 'bad', 'neutral', 'good', 'very good'], ordered=True)
>>> cat = pd.Categorical(['good', 'good', 'bad', 'bad'], dtype=dtype)
>>> cat
['good', 'good', 'bad', 'bad']
Categories (5, object): ['very bad' < 'bad' < 'neutral' < 'good' < 'very good']
>>> cat.unique()
['good', 'bad']
Categories (2, object): ['bad' < 'good']  # master
Categories (5, object): ['very bad' < 'bad' < 'neutral' < 'good' < 'very good']  # this PR
>>> cat.unique().dtype == cat.dtype
False  # master
True  # this PR
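
For anyone who wants to check this locally, the example above can be turned into a small script. This is only a sketch and assumes a pandas version that includes this change (1.3 or later); on older versions the dtype comparison prints False.

```python
import pandas as pd

dtype = pd.CategoricalDtype(
    ["very bad", "bad", "neutral", "good", "very good"], ordered=True
)
cat = pd.Categorical(["good", "good", "bad", "bad"], dtype=dtype)

uniques = cat.unique()

# The unique values come back in order of first appearance.
print(list(uniques))

# Assumption: pandas >= 1.3. There, the five categories and their order
# are kept, so the dtypes compare equal.
print(uniques.dtype == cat.dtype)
```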

topper-123 · 2020-11-28T20:02:50Z

pandas/core/groupby/categorical.py

+    take_codes = cat.codes[cat.codes != -1]
+    if cat.ordered:
+        take_codes = np.sort(take_codes)
+    cat = cat.set_categories(cat.categories.take(take_codes))


Moved from Categorical.unique. This keeps groupbys working unchanged.
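
To make it easier to see what the moved block does, here is a rough standalone sketch. The helper name `observed_categories_only` is hypothetical and not part of pandas; it only wraps the four lines from the diff after a call to `unique()`, which is what the block appears to rely on.

```python
import numpy as np
import pandas as pd


def observed_categories_only(c: pd.Categorical) -> pd.Categorical:
    """Hypothetical helper mirroring the block moved into groupby."""
    cat = c.unique()
    # A code of -1 marks a missing value; it points at no category, so drop it.
    take_codes = cat.codes[cat.codes != -1]
    if cat.ordered:
        # Ordered categoricals keep the original relative category order.
        take_codes = np.sort(take_codes)
    return cat.set_categories(cat.categories.take(take_codes))


unordered = pd.Categorical(["c", None, "b", "b"], categories=["a", "b", "c"])
ordered = pd.Categorical(["c", "b", "b"], categories=["a", "b", "c"], ordered=True)

# Unordered: categories follow order of appearance.
print(list(observed_categories_only(unordered).categories))
# Ordered: categories follow the original category order.
print(list(observed_categories_only(ordered).categories))
```

In both cases the unobserved category 'a' is dropped, which is the old `unique()` behaviour that groupby depended on.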

What changes/breaks if you don't include this?

Normally, a categorical's categories don't include any NaNs, so I don't fully understand the comment.

Also, if we don't want to drop unobserved categories in unique, don't we want to do the same change in groupby?

The comment refers to nans in cat (a Categorical), not its categories.

This section is not optimal and is just there to keep the same behaviour as previously in groupbys. This can be changed in follow-ups. I don't think this can be untangled without breaking behaviour, e.g. this method of removing unused categories differs from using remove_unused_categories.

I think we should just accept this smelly bit here and fix this whole function in follow-ups.

But can you explain (or show with an example) what behaviour in groupby would change if this code was not included here?

This section ... is just to keep the same behaviour as previously in groupbys. This can be changed in follow-ups. I don't think this can be untangled without breaking behaviour

Yes, but we are breaking the behaviour of unique() on purpose, so it might be that we want to make the exact same break in groupby?

But can you explain (or show with an example) what behaviour in groupby would change if this code was not included here?

I suppose it is related to the categories of the resulting key index/column after grouping on a categorical:

In [45]: df = pd.DataFrame({"key": pd.Categorical(["c", "b", "b"], categories=["a", "b", "c"]), "values": range(3)})

In [46]: df.groupby("key").sum().index
Out[46]: CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, name='key', dtype='category')

In [47]: df.groupby("key", sort=False).sum().index
Out[47]: CategoricalIndex(['c', 'b', 'a'], categories=['c', 'b', 'a'], ordered=False, name='key', dtype='category')

In [51]: df = pd.DataFrame({"key": pd.Categorical(["c", "b", "b"], categories=["a", "b", "c"], ordered=True), "values": range(3)})

In [52]: df.groupby("key").sum().index
Out[52]: CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=True, name='key', dtype='category')

In [53]: df.groupby("key", sort=False).sum().index
Out[53]: CategoricalIndex(['b', 'c', 'a'], categories=['b', 'c', 'a'], ordered=True, name='key', dtype='category')

So it seems we have a very similar issue here about the order of the resulting categories of the dtype which is not preserved (since it relied on the behaviour of unique before)

hmm i think @jorisvandenbossche is right here. if you don't add the code here, how much to update the groupby tests?

doc/source/whatsnew/v1.2.0.rst

topper-123 · 2020-11-30T20:26:38Z

Any comments, @jreback , @jbrockmendel & @gfyoung ?

jreback

I like this, nicely simplifies logic, and as stated in the OP. This follows our current conventions. Since we aren't actually breaking any downstream logic (e.g. groupby), this is totally fine.

jreback · 2020-12-02T02:20:42Z

cc @jorisvandenbossche @TomAugspurger any objections. I actually think we should include this in the RC, cc @simonjayhawkins

jreback · 2020-12-02T02:20:59Z

cc @toobaz from the OP

pandas/core/arrays/categorical.py

topper-123 · 2020-12-02T19:44:52Z

Updated.

jreback · 2020-12-02T21:09:12Z

doc/source/whatsnew/v1.2.0.rst

@@ -523,6 +523,7 @@ Categorical
 - :meth:`Categorical.fillna` will always return a copy, validate a passed fill value regardless of whether there are any NAs to fill, and disallow an ``NaT`` as a fill value for numeric categories (:issue:`36530`)
 - Bug in :meth:`Categorical.__setitem__` that incorrectly raised when trying to set a tuple value (:issue:`20439`)
 - Bug in :meth:`CategoricalIndex.equals` incorrectly casting non-category entries to ``np.nan`` (:issue:`37667`)
+- Bug in :meth:`Categorical.unique`, where the dtype changed in the unique array if there were unused categories in the original array (:issue:`38140`).


i would make a sub-section about this, showing before & after; this is a non-trivial change to absorb.

I agree, we can move it to the "notable bug fixes" section, with an example and showing how you can get back the old behaviour.

jorisvandenbossche

I am fully on board with the actual change (it's really strange behaviour), but I think we need to think a bit more about how to do this change. Because it's not a bug fix, this was intentional, documented behaviour. And especially in the case of an ordered categorical, it's potentially breaking.
So IMO we don't need to "hurry" it into 1.2 (it's a longstanding issue, I don't think there is a reason why it is now pressing or blocking for 1.2?). We can also discuss a bit more and merge it early in the 1.3 cycle.

jreback · 2020-12-02T23:29:13Z

I am fully on board with the actual change (it's really strange behaviour), but I think we need to think a bit more about how to do this change. Because it's not a bug fix, this was intentional, documented behaviour. And especially in the case of an ordered categorical, it's potentially breaking.
So IMO we don't need to "hurry" it into 1.2 (it's a longstanding issue, I don't think there is a reason why it is now pressing or blocking for 1.2?). We can also discuss a bit more and merge it early in the 1.3 cycle.

I don't think it's possible to actually deprecate this. Further this is on an actual Categorical only. I honestly don't think this is that big of a deal. am ok with releasing it here with a sub-section.

jorisvandenbossche · 2020-12-03T07:27:04Z

Further this is on an actual Categorical only.

No, the Series unique method dispatches to the underlying EA, so Categorical.unique in this case, so changing this also affects the Series behaviour.
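
A quick way to see the dispatch is to call `unique()` on a Series backed by a categorical. This is a sketch; the dtype comparison assumes a pandas version with this PR applied (1.3 or later).

```python
import pandas as pd

ser = pd.Series(
    pd.Categorical(["b", "b", "a"], categories=["a", "b", "c"], ordered=True)
)

result = ser.unique()

# Series.unique hands back the extension array's own result,
# so for categorical data this is a Categorical, not a NumPy array.
print(type(result).__name__)

# Assumption: pandas >= 1.3, where the unused category 'c' is kept.
print(result.dtype == ser.dtype)
```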

jreback · 2020-12-03T14:21:09Z

I see @topper-123 you did have to change a couple of series tests.

@jorisvandenbossche how would you deprecate this behavior?

topper-123 · 2020-12-03T16:23:32Z

I think the current behaviour is buggy, because users wouldn't expect arr.unique to change the dtype (even though it's documented here, as you also say). I think the current behaviour stems from before we had categorical dtypes, so the concept of dtype equality for categoricals didn't really exist back then.

A deprecation cycle would require an extra parameter on Series, Categorical and CategoricalIndex. Can we postpone this to 1.3, but treat it as a bug? That would give potential downstream problems more time to be fixed.
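
To illustrate why a deprecation cycle is awkward, here is a purely hypothetical sketch of what a keyword-based transition could look like. Neither the `keep_dtype` keyword nor the `unique_with_deprecation` helper exists in pandas; the sketch also assumes pandas 1.3 or later, and it only approximates the old result via `remove_unused_categories`.

```python
import warnings

import pandas as pd


def unique_with_deprecation(cat: pd.Categorical, keep_dtype=None) -> pd.Categorical:
    """Hypothetical wrapper; ``keep_dtype`` is not a real pandas keyword."""
    if keep_dtype is None:
        # No explicit choice: warn and fall back to the old-style result.
        warnings.warn(
            "unique() will keep unused categories in a future version; "
            "pass keep_dtype=True to opt in now.",
            FutureWarning,
            stacklevel=2,
        )
        keep_dtype = False
    uniques = cat.unique()
    if keep_dtype:
        return uniques
    # Approximation of the old behaviour: drop the unobserved categories.
    return uniques.remove_unused_categories()
```

The same keyword would have to be added to Series and CategoricalIndex as well, and removed again once the transition is over.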

jreback · 2020-12-07T13:46:28Z

@jorisvandenbossche unless you can conceive of a good way to cleanly deprecate this I am +1 to merging here. I agree it's a change, but the prior behavior was just grandfathered in w/o consideration. This makes me ok with just changing it.

jorisvandenbossche · 2020-12-07T21:12:56Z

I think the current behaviour stems from before we had categorical dtypes

This specific unique behaviour was explicitly implemented for categorical, I think: #10508

But anyway, it's indeed true that it's not that easy to deprecate .. As @topper-123 says, it would require an extra keyword in the different unique methods, specifically for this, and then afterwards we will also want to get rid of that keyword again ..
The other alternative is to keep it as a breaking change for 2.0.

I actually first thought this also affected the order of the actual returned values (which would be a much bigger change), but apparently it's only the order of the categories of the returned categorical (in addition to dropping the unobserved ones). So in that case, I am fine treating it as a bug fix. But still a slight preference to keep it for 1.3.

For people wanting the old behaviour, we should probably point to remove_unused_categories
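
As a sketch of that suggestion, chaining `remove_unused_categories` after `unique()` drops the unobserved categories again. Note that the remaining categories keep their original relative order, so for an unordered categorical this may not match the old order-of-appearance result exactly.

```python
import pandas as pd

cat = pd.Categorical(
    ["good", "good", "bad", "bad"],
    categories=["very bad", "bad", "neutral", "good", "very good"],
    ordered=True,
)

# Drop the categories that do not occur in the unique values.
trimmed = cat.unique().remove_unused_categories()

print(list(trimmed))
print(list(trimmed.categories))
```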

jorisvandenbossche · 2020-12-07T21:14:12Z

doc/source/whatsnew/v1.2.0.rst

I agree, we can move it to the "notable bug fixes" section, with an example and showing how you can get back the old behaviour.

jorisvandenbossche · 2020-12-07T21:18:36Z

pandas/core/groupby/categorical.py

+    take_codes = cat.codes[cat.codes != -1]
+    if cat.ordered:
+        take_codes = np.sort(take_codes)
+    cat = cat.set_categories(cat.categories.take(take_codes))


What changes/breaks if you don't include this?

Normally, a categorical's categories don't include any NaNs, so I don't fully understand the comment.

Also, if we don't want to drop unobserved categories in unique, don't we want to do the same change in groupby?

jbrockmendel · 2021-03-19T15:46:18Z

merge this as is (fixes only the Categorical.unique issue)

i think this is worthwhile

topper-123 · 2021-03-27T14:08:33Z

@jreback & @jorisvandenbossche, do you agree?

jreback

ok i think we are prob ok to merge this and open a followup issue for the groupby. @topper-123 can you merge master.

topper-123 · 2021-04-04T02:02:09Z

Rebased.

topper-123 · 2021-04-16T10:53:46Z

Gentle ping.

jreback · 2021-04-16T12:50:31Z

@topper-123 one more rebase pls, ping on green

topper-123 · 2021-04-16T17:38:56Z

Ok, thanks, I've just rebased.

jreback · 2021-04-16T17:43:36Z

thanks @topper-123

topper-123 changed the title ~~ENH: Categorical.unique can keep same dtype~~ BUG: Categorical.unique should keep dtype unchanged Nov 28, 2020

topper-123 commented Nov 28, 2020

gfyoung added the API Design, Bug, and Categorical labels Nov 28, 2020

gfyoung reviewed Nov 28, 2020

doc/source/whatsnew/v1.2.0.rst (outdated)

topper-123 force-pushed the Categorical.unique_unchanged_dtype branch 5 times, most recently from 5c5dded to cba219c Compare November 29, 2020 08:31

jreback approved these changes Dec 2, 2020

jreback added this to the 1.2 milestone Dec 2, 2020

jreback added the Blocker label Dec 2, 2020

jbrockmendel reviewed Dec 2, 2020

pandas/core/arrays/categorical.py (outdated)

jreback requested changes Dec 2, 2020

jorisvandenbossche requested changes Dec 2, 2020

simonjayhawkins mentioned this pull request Dec 7, 2020

RLS: 1.2 #37784

Closed

jorisvandenbossche removed the Blocker label Dec 7, 2020

jorisvandenbossche requested changes Dec 7, 2020

jreback approved these changes Apr 2, 2021

topper-123 force-pushed the Categorical.unique_unchanged_dtype branch from e5f9fcd to 8ec98d5 Compare April 4, 2021 00:37

topper-123 added 14 commits April 16, 2021 15:25

ENH: Categorical.unique can keep same dtype

6de5608

fixes

b0aed5c

fix doc string

9135f45

fix doc strings

8fcf4e1

fix categorical tests

356267b

fix test failure

1c8f4f9

fix value_count test

f31837c

values_count fix

e261f3c

update

a9859b6

fixes

9e29a11

Use series in whatsnew example

5ed054c

Update version in docs to v1.3.0

f68a38b

diff from rebase

a5e5096

isort cleanup

0616c20

topper-123 force-pushed the Categorical.unique_unchanged_dtype branch from 8ec98d5 to 0616c20 Compare April 16, 2021 14:27

jreback merged commit ab622f2 into pandas-dev:master Apr 16, 2021

topper-123 deleted the Categorical.unique_unchanged_dtype branch April 16, 2021 18:00

jbrockmendel mentioned this pull request Apr 17, 2021

CLN: remove CategoricalIndex.unique #40995

Merged

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request Apr 21, 2021

BUG: Categorical.unique should keep dtype unchanged (pandas-dev#38140)

ded8fc8

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021

BUG: Categorical.unique should keep dtype unchanged (pandas-dev#38140)

4479cfe

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

BUG: Categorical.unique should keep dtype unchanged (pandas-dev#38140)

3d4158c
