API: Categorical.unique() should not drop unused categories #21648

Closed
topper-123 opened this issue Jun 26, 2018 · 10 comments
Labels
API Design Categorical Categorical Data Type

Comments

@topper-123
Contributor

topper-123 commented Jun 26, 2018

Currently, Categorical.unique and CategoricalIndex.unique drop unused categories:

>>> categories = ['very good', 'good', 'neutral', 'bad', 'very bad']
>>> cat = pd.Categorical(['good','good', 'bad', 'bad'], categories=categories, ordered=True)
>>> cat
[good, good, bad, bad]
Categories (5, object): [very good < good < neutral < bad < very bad]
>>> cat.unique()
[good, bad]
Categories (2, object): [good < bad]  # unused categories dropped

So .unique() both deduplicates the values and drops unused categories (it does two things in one operation).

Often, even if you want unique values, you still want to control whether unused categories are dropped. So Categorical/CategoricalIndex.unique should IMO keep all categories, and categories should be dropped in a separate action. This would be a better API:

>>> cat.unique()
[good, bad]
Categories (5, object): [very good < good < neutral < bad < very bad]    # unused not dropped

If you want to drop unused categories, you should do it explicitly like so: cat.unique().remove_unused_categories().

The proposed API is also faster, as dropping unused categories requires recoding the categories/codes, which is potentially expensive.
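
A minimal sketch of the proposed behaviour, assuming pandas is imported as pd. Note that unique_keep_categories is a hypothetical helper, not a pandas method: it deduplicates the integer codes and rebuilds the result with the original dtype, so nothing is recoded until unused categories are dropped explicitly.

```python
import pandas as pd

categories = ['very good', 'good', 'neutral', 'bad', 'very bad']
cat = pd.Categorical(['good', 'good', 'bad', 'bad'],
                     categories=categories, ordered=True)


def unique_keep_categories(values):
    """Hypothetical helper (not a pandas API): unique values, full dtype kept."""
    # Deduplicate the integer codes in order of first appearance.
    # A code of -1 marks a missing value and is passed through unchanged.
    unique_codes = pd.unique(values.codes)
    # Rebuild with the original dtype: no category is dropped, no code remapped.
    return pd.Categorical.from_codes(unique_codes, dtype=values.dtype)


kept = unique_keep_categories(cat)          # all five categories retained
dropped = kept.remove_unused_categories()   # explicit second step, recodes
```

Here kept.codes stays [1, 3] against the original five categories, while the explicit remove_unused_categories() step remaps the codes to [0, 1] against ['good', 'bad'].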

@jreback
Contributor

jreback commented Jun 26, 2018

look at past issues - pls link them if you wish to reopen the discussion

this was heavily discussed

@topper-123
Contributor Author

Ok, found it here: #8559. I had searched for "categorical.unique", which doesn't appear in that issue...

Anyway, I'm not sure I understand the reasoning there, e.g. this

So df_quotes[outlier_filter].symbol.unique() is equivalent to df_quotes.symbol.cat.categories.

is not true (see my example above).

Is the issue about plotting? I.e. unused categories are plotted, and you'd often want to not plot unused categories? That's reasonable, though in other instances you would want to plot unused categories (e.g. to show that no observation occurred). Most often you'd probably want to not plot unused categories, I guess.

If the issue is about plotting, this should IMO ideally be handled at the plotting level/library, though I can see that that would be a bigger operation and it may not be worth changing this at the pandas level.

@jreback
Contributor

jreback commented Jun 27, 2018

what actually should happen is that .unique() (and some other methods) should gain observed= kwargs

@gfyoung gfyoung added Categorical Categorical Data Type API Design labels Jun 27, 2018
@topper-123
Contributor Author

I don't think that would work, as Series.unique(observed=True) wouldn't work/make sense if the underlying array is not a Categorical.
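
To make the objection concrete, here is a rough sketch of what an observed= keyword might look like at the Series level. Everything in it is an assumption for illustration: series_unique and its observed parameter are hypothetical and not part of pandas, and raising TypeError for non-categorical data is just one possible choice.

```python
import pandas as pd


def series_unique(series, observed=None):
    """Hypothetical wrapper (not a pandas API) around a Series' unique values."""
    if not isinstance(series.dtype, pd.CategoricalDtype):
        # For non-categorical data there are no categories to keep or drop,
        # so the keyword has no sensible meaning here.
        if observed is not None:
            raise TypeError("observed= only applies to categorical data")
        return series.unique()
    values = series.array  # the underlying Categorical
    # Deduplicate the codes and keep the full dtype (all categories).
    result = pd.Categorical.from_codes(pd.unique(values.codes),
                                       dtype=values.dtype)
    # Only drop unused categories when explicitly asked to.
    return result.remove_unused_categories() if observed else result
```

The branch on the dtype is the awkward part: the same keyword is meaningful for one kind of Series and has to be rejected (or silently ignored) for every other kind.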

I think the reasonable answer to finding unique values for Categoricals is to show categories also, i.e.:

[good, bad]
Categories (5, object): [very good < good < neutral < bad < very bad]

is more informative than

[good, bad]
Categories (2, object): [good < bad]

So I don't quite understand the motivation for the current implementation (unless there is a specific reason, e.g. plotting).

@topper-123
Contributor Author

topper-123 commented Jul 16, 2018

@jankatins what's the reasoning for dropping unused categories in your PR #8937? Do you have an argument for not keeping the categories?

@jorisvandenbossche , you commented in #8559:

For the unique case I also think we should only return the categories that occur in the series (or return a Categorical)

Do you have a reason for this? I don't really understand why these should be dropped.

In R, they're not dropped AFAICT:

> f <- factor(c('a', 'a', 'b', 'b'))[1:2]
> f
a a
Levels:
    'a' 'b' 
> unique(f)
a 
Levels:
    'a' 'b' 

Notice how the levels (categories) have not been reduced.

EDIT: Changed the R example, it was showing a wrong example

@jankatins
Contributor

jankatins commented Jul 17, 2018

The discussion happened here: #8559 (comment) + #8559 (comment)

There the argument was that we follow R's handling in these cases (see the first link).

@topper-123
Contributor Author

Ok, but it seems that pandas does not do it the same way as R, right? (See my example above.) In the discussion you also say

Interestingly unique returns a factor (with all levels, but only the "used" levels as values) when the input is a factor:

So to follow R, we should have:

>>> cat.unique()
[good, bad]
Categories (5, object): [very good < good < neutral < bad < very bad]

Correct?

@jankatins
Contributor

I would usually say yes (I'm usually arguing that Categorical should be optimized for Likert scales and not as a pd.String replacement). If I read the commit from #8937 right, then at that time unique() was returning a numpy array and not a Categorical (8290a4d).

@topper-123
Contributor Author

Ah yes, I see that now. The relevant PR is actually #10508. From that PR I saw #18291 referenced, an issue that proposes the same changes as I do, plus a few more details.

I agree with the sentiment in #18291, and this issue is more or less a duplicate of that one. So I'll close this later today, unless someone thinks otherwise.

@topper-123
Contributor Author

Closing as a duplicate of #18291
