-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Categorical.unique should keep dtype unchanged #38140
BUG: Categorical.unique should keep dtype unchanged #38140
Conversation
take_codes = cat.codes[cat.codes != -1] | ||
if cat.ordered: | ||
take_codes = np.sort(take_codes) | ||
cat = cat.set_categories(cat.categories.take(take_codes)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved from Categorical.unique
. This keeps groupbys working unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What changes/breaks if you don't include this?
Normally, the categorical its categories doesn't include any NaNs, so I don't fully understand the comment.
Also, if we don't want to drop unobserved categories in unique, don't we want to do the same change in groupby?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment refers to nans in cat
(a Categorical
), not its categories.
This section is not optimal and is just to keep the same behaviour as previously in groupbys. This can be changed in follow-ups. I don't think this can be untangled without breaking behaviour, e.g. this method of removing unused categories is different than using remove_unused categories
.
I think we should just accept this smelly bit here here and fix this whole function in followups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But can you explain (or show with an example) what behaviour in groupby would change if this code was not included here?
This section ... is just to keep the same behaviour as previously in groupbys. This can be changed in follow-ups. I don't think this can be untangled without breaking behaviour
Yes, but we are breaking the behaviour of unique()
on purpose, so it might be that we want to make the exact same break in groupby?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But can you explain (or show with an example) what behaviour in groupby would change if this code was not included here?
I suppose it is related to the categories of the resulting key index/column after grouping on a categorical:
In [45]: df = pd.DataFrame({"key": pd.Categorical(["c", "b", "b"], categories=["a", "b", "c"]), "values": range(3)})
In [46]: df.groupby("key").sum().index
Out[46]: CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, name='key', dtype='category')
In [47]: df.groupby("key", sort=False).sum().index
Out[47]: CategoricalIndex(['c', 'b', 'a'], categories=['c', 'b', 'a'], ordered=False, name='key', dtype='category')
In [51]: df = pd.DataFrame({"key": pd.Categorical(["c", "b", "b"], categories=["a", "b", "c"], ordered=True), "values": range(3)})
In [52]: df.groupby("key").sum().index
Out[52]: CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=True, name='key', dtype='category')
In [53]: df.groupby("key", sort=False).sum().index
Out[53]: CategoricalIndex(['b', 'c', 'a'], categories=['b', 'c', 'a'], ordered=True, name='key', dtype='category')
So it seems we have a very similar issue here about the order of the resulting categories of the dtype which is not preserved (since it relied on the behaviour of unique
before)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm i think @jorisvandenbossche is right here. if you don't add the code here, how much to update the groupby tests?
5c5dded
to
cba219c
Compare
Any comments, @jreback , @jbrockmendel & @gfyoung ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this, nicely simplifies logic, and as stated in the OP. This follows our current conventions. Since we aren't actually breaking any downstream logic (e.g. groupby), this is totally fine.
cc @jorisvandenbossche @TomAugspurger any objections. I actually think we should include this in the RC, cc @simonjayhawkins |
cc @toobaz from the OP |
Updated. |
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -523,6 +523,7 @@ Categorical | |||
- :meth:`Categorical.fillna` will always return a copy, validate a passed fill value regardless of whether there are any NAs to fill, and disallow an ``NaT`` as a fill value for numeric categories (:issue:`36530`) | |||
- Bug in :meth:`Categorical.__setitem__` that incorrectly raised when trying to set a tuple value (:issue:`20439`) | |||
- Bug in :meth:`CategoricalIndex.equals` incorrectly casting non-category entries to ``np.nan`` (:issue:`37667`) | |||
- Bug in :meth:`Categorical.unique`, where the dtype changed in the unique array if there were unused categories in the original array (:issue:`38140`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would make a sub-section abou tthis, showing before & after; this is a non-trivial change to absorb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, we can move it to the "notable bug fixes" section, with an example and showing how you can get back at the old behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fully on board with the actual change (it's really strange behaviour), but I think we need to think a bit more about how to do this change. Because it's not a bug fix, this was intentional, documented behaviour. And especially in the case of an ordered categorical, it's potentially breaking.
So IMO we don't need to "hurry" it into 1.2 (it's a longstanding issue, I don't think there is a reason why it is now pressing or blocking for 1.2?). We can also discuss a bit more and merge it early in the 1.3 cycle.
I don't think its possible to actually deprecate this. Further this is on an actual Categorical only. I honestly don't think this is that big of a deal. am ok with releasing it here with a sub-section. |
No, the Series unique method dispatches to the underlying EA, so Categorical.unique in this case, so changing this also affects the Series behaviour. |
I see @topper-123 you did have to change a couple of series tests. @jorisvandenbossche how would you deprecate this behavior? |
I think the current behaviour is buggy behaviour, because users wouldn't expect A deprecation cycle would require a extra parameter on both |
@jorisvandenbossche unless you can conceive of a good way to cleanly deprecate this I am +1 to merging here. I agree its a change, but the prior behavior was just grandfathered in w/o consideration. This makes me ok with just changing it. |
This specific But anyway, it's indeed true that it's not that easy to deprecate .. As @topper-123 says, it would require an extra keyword in the different I actually first thought this also affected the order of the actual returned values (which would be a much bigger change), but apparently it's only the order of the categories of the returned categorical (in addition to dropping the unobserved ones). So in that case, I am fine treating it as a bug fix. But still a slight preference to keep it for 1.3. For people wanting the old behaviour, we should probably point to |
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -523,6 +523,7 @@ Categorical | |||
- :meth:`Categorical.fillna` will always return a copy, validate a passed fill value regardless of whether there are any NAs to fill, and disallow an ``NaT`` as a fill value for numeric categories (:issue:`36530`) | |||
- Bug in :meth:`Categorical.__setitem__` that incorrectly raised when trying to set a tuple value (:issue:`20439`) | |||
- Bug in :meth:`CategoricalIndex.equals` incorrectly casting non-category entries to ``np.nan`` (:issue:`37667`) | |||
- Bug in :meth:`Categorical.unique`, where the dtype changed in the unique array if there were unused categories in the original array (:issue:`38140`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, we can move it to the "notable bug fixes" section, with an example and showing how you can get back at the old behaviour.
take_codes = cat.codes[cat.codes != -1] | ||
if cat.ordered: | ||
take_codes = np.sort(take_codes) | ||
cat = cat.set_categories(cat.categories.take(take_codes)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What changes/breaks if you don't include this?
Normally, the categorical its categories doesn't include any NaNs, so I don't fully understand the comment.
Also, if we don't want to drop unobserved categories in unique, don't we want to do the same change in groupby?
i think this is worthwhile |
@jreback & @jorisvandenbossche, do you agree? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok i think we are prob ok to merge this and open a followup issue for the groupby. @topper-123 can you merge master.
e5f9fcd
to
8ec98d5
Compare
Rebased. |
Gentle ping. |
@topper-123 one more rebase pls, ping on green |
8ec98d5
to
0616c20
Compare
Ok, thanks, I've just rebased. |
thanks @topper-123 |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
We want to keep the same dtype as in the original after applying
unique
. For example: