BUG: Categorical.unique should keep dtype unchanged #38140

Merged

@topper-123 topper-123 commented Nov 28, 2020

We want to keep the same dtype as in the original after applying unique. For example:

>>> dtype = pd.CategoricalDtype(['very bad', 'bad', 'neutral', 'good', 'very good'], ordered=True)
>>> cat = pd.Categorical(['good', 'good', 'bad', 'bad'], dtype=dtype)
>>> cat
['good', 'good', 'bad', 'bad']
Categories (5, object): ['very bad' < 'bad' < 'neutral' < 'good' < 'very good']
>>> cat.unique()
['good', 'bad']
Categories (2, object): ['bad' < 'good']
>>> cat.unique().dtype == cat.dtype
False  # master
True  # this PR

@topper-123 topper-123 changed the title ENH: Categorical.unique can keep same dtype BUG: Categorical.unique should keep dtype unchanged Nov 28, 2020
take_codes = cat.codes[cat.codes != -1]
if cat.ordered:
    take_codes = np.sort(take_codes)
cat = cat.set_categories(cat.categories.take(take_codes))
Contributor Author

Moved from Categorical.unique. This keeps groupbys working unchanged.

Member

What changes/breaks if you don't include this?

Normally, the categories of a categorical don't include any NaNs, so I don't fully understand the comment.

Also, if we don't want to drop unobserved categories in unique, don't we want to do the same change in groupby?

Contributor Author

The comment refers to nans in cat (a Categorical), not its categories.

This section is not optimal and is just to keep the same behaviour as previously in groupbys. This can be changed in follow-ups. I don't think this can be untangled without breaking behaviour, e.g. this method of removing unused categories is different from using remove_unused_categories.

I think we should just accept this smelly bit here and fix this whole function in follow-ups.
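
To make that difference concrete, here is a minimal sketch, assuming pandas 1.3 or later (where `unique` keeps the dtype). The variable names `uniques`, `by_codes` and `by_removal` are illustrative only, and the `take_codes` lines mirror the snippet under review rather than the full groupby code path.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["c", "b", "b"], categories=["a", "b", "c"])
uniques = cat.unique()

# Approach from the reviewed snippet: rebuild the categories from the
# observed codes, skipping the -1 code that marks missing values.
take_codes = uniques.codes[uniques.codes != -1]
if uniques.ordered:
    take_codes = np.sort(take_codes)
by_codes = uniques.set_categories(uniques.categories.take(take_codes))

# Alternative: drop unused categories, keeping the original category order.
by_removal = uniques.remove_unused_categories()

print(list(by_codes.categories))    # ['c', 'b'] (order of appearance)
print(list(by_removal.categories))  # ['b', 'c'] (original category order)
```

For an unordered categorical the two approaches give the same set of categories but in a different order, which is why swapping one for the other could change groupby results.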

Member

But can you explain (or show with an example) what behaviour in groupby would change if this code was not included here?

This section ... is just to keep the same behaviour as previously in groupbys. This can be changed in follow-ups. I don't think this can be untangled without breaking behaviour

Yes, but we are breaking the behaviour of unique() on purpose, so it might be that we want to make the exact same break in groupby?

Member

But can you explain (or show with an example) what behaviour in groupby would change if this code was not included here?

I suppose it is related to the categories of the resulting key index/column after grouping on a categorical:

In [45]: df = pd.DataFrame({"key": pd.Categorical(["c", "b", "b"], categories=["a", "b", "c"]), "values": range(3)})

In [46]: df.groupby("key").sum().index
Out[46]: CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, name='key', dtype='category')

In [47]: df.groupby("key", sort=False).sum().index
Out[47]: CategoricalIndex(['c', 'b', 'a'], categories=['c', 'b', 'a'], ordered=False, name='key', dtype='category')

In [51]: df = pd.DataFrame({"key": pd.Categorical(["c", "b", "b"], categories=["a", "b", "c"],  ordered=True), "values": range(3)})

In [52]: df.groupby("key").sum().index
Out[52]: CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=True, name='key', dtype='category')

In [53]: df.groupby("key", sort=False).sum().index
Out[53]: CategoricalIndex(['b', 'c', 'a'], categories=['b', 'c', 'a'], ordered=True, name='key', dtype='category')

So it seems we have a very similar issue here about the order of the resulting categories of the dtype which is not preserved (since it relied on the behaviour of unique before)

Contributor

Hmm, I think @jorisvandenbossche is right here. If you don't add the code here, how much would you have to update the groupby tests?

@gfyoung gfyoung added API Design Bug Categorical Categorical Data Type labels Nov 28, 2020
@topper-123 topper-123 force-pushed the Categorical.unique_unchanged_dtype branch 5 times, most recently from 5c5dded to cba219c Compare November 29, 2020 08:31
@topper-123
Contributor Author

Any comments, @jreback , @jbrockmendel & @gfyoung ?

Contributor

@jreback jreback left a comment

I like this: it nicely simplifies the logic and, as stated in the OP, follows our current conventions. Since we aren't actually breaking any downstream logic (e.g. groupby), this is totally fine.

@jreback jreback added this to the 1.2 milestone Dec 2, 2020
@jreback
Contributor

jreback commented Dec 2, 2020

cc @jorisvandenbossche @TomAugspurger any objections? I actually think we should include this in the RC, cc @simonjayhawkins

@jreback jreback added the Blocker Blocking issue or pull request for an upcoming release label Dec 2, 2020
@jreback
Contributor

jreback commented Dec 2, 2020

cc @toobaz from the OP

@topper-123
Contributor Author

Updated.

@@ -523,6 +523,7 @@ Categorical
- :meth:`Categorical.fillna` will always return a copy, validate a passed fill value regardless of whether there are any NAs to fill, and disallow an ``NaT`` as a fill value for numeric categories (:issue:`36530`)
- Bug in :meth:`Categorical.__setitem__` that incorrectly raised when trying to set a tuple value (:issue:`20439`)
- Bug in :meth:`CategoricalIndex.equals` incorrectly casting non-category entries to ``np.nan`` (:issue:`37667`)
- Bug in :meth:`Categorical.unique`, where the dtype changed in the unique array if there were unused categories in the original array (:issue:`38140`).
Contributor

I would make a sub-section about this, showing before & after; this is a non-trivial change to absorb.

Member

I agree, we can move it to the "notable bug fixes" section, with an example and a note showing how to get the old behaviour back.

Member

@jorisvandenbossche jorisvandenbossche left a comment

I am fully on board with the actual change (it's really strange behaviour), but I think we need to think a bit more about how to do this change. Because it's not a bug fix, this was intentional, documented behaviour. And especially in the case of an ordered categorical, it's potentially breaking.
So IMO we don't need to "hurry" it into 1.2 (it's a longstanding issue, I don't think there is a reason why it is now pressing or blocking for 1.2?). We can also discuss a bit more and merge it early in the 1.3 cycle.

@jreback
Contributor

jreback commented Dec 2, 2020

I am fully on board with the actual change (it's really strange behaviour), but I think we need to think a bit more about how to do this change. Because it's not a bug fix, this was intentional, documented behaviour. And especially in the case of an ordered categorical, it's potentially breaking.
So IMO we don't need to "hurry" it into 1.2 (it's a longstanding issue, I don't think there is a reason why it is now pressing or blocking for 1.2?). We can also discuss a bit more and merge it early in the 1.3 cycle.

I don't think it's possible to actually deprecate this. Further this is on an actual Categorical only. I honestly don't think this is that big of a deal. I am ok with releasing it here with a sub-section.

@jorisvandenbossche
Member

Further this is on an actual Categorical only.

No, the Series unique method dispatches to the underlying EA, so Categorical.unique in this case, so changing this also affects the Series behaviour.
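
A minimal sketch of that dispatch, assuming pandas 1.3 or later (where this PR's change is in place). The names `ser` and `result` are illustrative only.

```python
import pandas as pd

dtype = pd.CategoricalDtype(["a", "b", "c"], ordered=True)
ser = pd.Series(["c", "b", "b"], dtype=dtype)

# Series.unique hands off to the underlying extension array,
# so for categorical data the result is a Categorical.
result = ser.unique()

print(type(result).__name__)      # Categorical
print(result.dtype == ser.dtype)  # True on pandas >= 1.3
```

Because the result comes from `Categorical.unique`, any change to the categories it returns is visible to every caller of `Series.unique` on categorical data.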

@jreback
Contributor

jreback commented Dec 3, 2020

I see @topper-123 you did have to change a couple of series tests.

@jorisvandenbossche how would you deprecate this behavior?

@topper-123
Contributor Author

I think the current behaviour is buggy, because users wouldn't expect arr.unique to change dtype (even though it's documented here, as you also say). I think the current behaviour stems from before we had categorical dtypes, so the concept of dtype equality for categoricals didn't really exist back then.

A deprecation cycle would require an extra parameter on Series, Categorical and CategoricalIndex. Can we postpone this to 1.3, but treat it as a bug? That would give more time to fix potential downstream problems.
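
For readers unfamiliar with dtype equality for categoricals, a small sketch of how `CategoricalDtype` comparison behaves may help. The variable names are illustrative only.

```python
import pandas as pd

unordered_ab = pd.CategoricalDtype(["a", "b"])
unordered_ba = pd.CategoricalDtype(["b", "a"])
ordered_ab = pd.CategoricalDtype(["a", "b"], ordered=True)
ordered_ba = pd.CategoricalDtype(["b", "a"], ordered=True)
subset = pd.CategoricalDtype(["a"])

# Unordered dtypes compare equal when they hold the same set of categories.
print(unordered_ab == unordered_ba)  # True

# Ordered dtypes must also list the categories in the same order.
print(ordered_ab == ordered_ba)      # False

# Dropping a category always yields a different dtype.
print(unordered_ab == subset)        # False
```

The last comparison is the case at issue: a `unique` that drops unused categories returns a dtype that no longer equals the original.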

@jreback
Contributor

jreback commented Dec 7, 2020

@jorisvandenbossche unless you can conceive of a good way to cleanly deprecate this, I am +1 to merging here. I agree it's a change, but the prior behavior was just grandfathered in w/o consideration. This makes me ok with just changing it.

@simonjayhawkins simonjayhawkins mentioned this pull request Dec 7, 2020
@jorisvandenbossche
Member

I think the current behaviour stems from before we had categorical dtypes

This specific unique behaviour was explicitly implemented for categorical, I think: #10508

But anyway, it's indeed true that it's not that easy to deprecate .. As @topper-123 says, it would require an extra keyword in the different unique methods, specifically for this, and then afterwards we will also want to get rid of that keyword again ..
The other alternative is to keep it as a breaking change for 2.0.

I actually first thought this also affected the order of the actual returned values (which would be a much bigger change), but apparently it's only the order of the categories of the returned categorical (in addition to dropping the unobserved ones). So in that case, I am fine treating it as a bug fix. But still a slight preference to keep it for 1.3.

For people wanting the old behaviour, we should probably point to remove_unused_categories
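
A minimal sketch of that workaround, assuming pandas 1.3 or later and reusing the example from the PR description. The names `uniques` and `trimmed` are illustrative only.

```python
import pandas as pd

dtype = pd.CategoricalDtype(
    ["very bad", "bad", "neutral", "good", "very good"], ordered=True
)
cat = pd.Categorical(["good", "good", "bad", "bad"], dtype=dtype)

# New behaviour: unique() keeps the full dtype, unused categories included.
uniques = cat.unique()

# Dropping the unused categories afterwards approximates the old result.
trimmed = uniques.remove_unused_categories()

print(uniques.dtype == cat.dtype)  # True on pandas >= 1.3
print(list(trimmed.categories))    # ['bad', 'good']
print(trimmed.ordered)             # True
```

For ordered categoricals this matches the old output. For unordered ones the old code listed categories in order of appearance, so the category order may differ.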

@jorisvandenbossche jorisvandenbossche removed the Blocker Blocking issue or pull request for an upcoming release label Dec 7, 2020

@jbrockmendel
Member

merge this as is (fixes only the Categorical.unique issue)

i think this is worthwhile

@topper-123
Contributor Author

@jreback & @jorisvandenbossche, do you agree?

Contributor

@jreback jreback left a comment

OK, I think we are probably fine to merge this and open a follow-up issue for the groupby. @topper-123 can you merge master?

@topper-123 topper-123 force-pushed the Categorical.unique_unchanged_dtype branch from e5f9fcd to 8ec98d5 Compare April 4, 2021 00:37
@topper-123
Contributor Author

topper-123 commented Apr 4, 2021

Rebased.

@topper-123
Contributor Author

Gentle ping.

@jreback
Contributor

jreback commented Apr 16, 2021

@topper-123 one more rebase pls, ping on green

@topper-123 topper-123 force-pushed the Categorical.unique_unchanged_dtype branch from 8ec98d5 to 0616c20 Compare April 16, 2021 14:27
@topper-123
Contributor Author

Ok, thanks, I've just rebased.

@jreback jreback merged commit ab622f2 into pandas-dev:master Apr 16, 2021
@jreback
Contributor

jreback commented Apr 16, 2021

thanks @topper-123

@topper-123 topper-123 deleted the Categorical.unique_unchanged_dtype branch April 16, 2021 18:00
yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request Apr 21, 2021
yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Successfully merging this pull request may close these issues.

Any categorical dtype object's .unique() changes categories