API: Categorical.unique() should not drop unused categories #21648
Comments
Look at past issues (please link them if you wish to reopen the discussion); this was heavily discussed.
Ok, found it here: #8559. I had searched for "categorical.unique", which doesn't appear in that issue... Anyway, I'm not sure I understand the reasoning there, e.g. this
is not true (see my example above). Is the issue about plotting? I.e. unused categories are plotted, and you'd often want to not plot them? That's reasonable, though in other instances you'd want to plot unused categories (e.g. to show that no observation occurred). Most often you'd probably want to not plot unused categories, I guess. If the issue is about plotting, this should IMO ideally be handled at the plotting level/library, though I can see that that would be a bigger operation and that it may not be worth changing this at the pandas level.
What should actually happen is that .unique() (and some other methods) gain an observed= kwarg.
I don't think that would work, as Series.unique(observed=True) wouldn't work/make sense if the underlying array is not a Categorical. I think the reasonable answer to finding unique values for Categoricals is to show all categories as well, i.e.:

    [good, bad]
    Categories (5, object): [very good < good < neutral < bad < very bad]

is more informative than

    [good, bad]
    Categories (2, object): [good < bad]

So I don't quite understand the motivation for the current implementation (unless there is a specific reason, e.g. plotting).
@jankatins what's the reasoning for dropping unused categories in your PR #8937? Do you have an argument for not keeping the categories? @jorisvandenbossche, you commented in #8559:
Do you have a reason for this? I don't really understand why these should be dropped. In R, they're not dropped, AFAICT:

    > f <- factor(c('a', 'a', 'b', 'b'))[1:2]
    > f
    a a
    Levels: 'a' 'b'
    > unique(f)
    a
    Levels: 'a' 'b'

Notice how the levels (categories) have not been reduced.

EDIT: Changed the R example; the earlier one was wrong.
The discussion happened here: #8559 (comment) + #8559 (comment). There the argument was that we follow R's handling in these cases (see the first link).
Ok, but it seems that pandas does not do it the same way as R, right? (See my example above.) In the discussion you also say
So to follow R, we should have:

    >>> cat.unique()
    [good, bad]
    Categories (5, object): [very good < good < neutral < bad < very bad]

Correct?
Ah yes, I see that now. The relevant PR is actually #10508. From that PR I saw referenced #18291, an issue that proposes the same changes as I propose + a few more details. I agree with the sentiment in #18291 and this issue is more or less a duplicate of that one. So I'll close this later today, unless someone thinks otherwise.
Closing as a duplicate of #18291
Currently, Categorical.unique and CategoricalIndex.unique drop unused categories. So, .unique() both uniquifies and drops unused categories (it does two things in one operation).

Often, even if you want to uniquify values, you still want to control whether unused categories are dropped. So Categorical/CategoricalIndex.unique should IMO keep all categories, and categories should be dropped in a separate action; that would be a better API.

If you want to drop unused categories, you should do it explicitly, like so: cat.unique().remove_unused_categories().

The proposed API is also faster, as dropping unused categories requires recoding the categories/codes, which is potentially expensive.
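A minimal sketch of the two separate steps described above, not the code from the original issue. The example data and the variable names (cat, kept, dropped) are assumptions for illustration, reusing the category names from the discussion. Since the behavior of .unique() itself may differ between pandas versions, the "keep all categories" step is written with Categorical.from_codes on the deduplicated codes, so it does not depend on what .unique() does with unused categories.

```python
import pandas as pd

# Hypothetical example data, reusing the category names from the discussion.
cat = pd.Categorical(
    ["good", "bad", "good"],
    categories=["very good", "good", "neutral", "bad", "very bad"],
    ordered=True,
)

# Step 1 (unique values, all categories kept): deduplicate the integer codes
# and rebuild a Categorical with the original dtype. No recoding is needed,
# because the codes still refer to the original categories.
kept = pd.Categorical.from_codes(pd.unique(cat.codes), dtype=cat.dtype)

# Step 2 (explicit and optional): drop the categories that are not used.
# This works whether or not .unique() has already dropped them.
dropped = cat.unique().remove_unused_categories()

print(kept)
print(dropped)
```

Here kept still carries all five categories, while dropped only carries the two that actually occur, which is the distinction the proposal wants the caller to control.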