API: Categorical.unique() should not drop unused categories #21648

topper-123 · 2018-06-26T23:24:29Z

Currently, Categorical.unique and CategoricalIndex.unique drop unused categories:

>>> categories = ['very good', 'good', 'neutral', 'bad', 'very bad']
>>> cat = pd.Categorical(['good','good', 'bad', 'bad'], categories=categories, ordered=True)
>>> cat
[good, good, bad, bad]
Categories (5, object): [very good < good < neutral < bad < very bad]
>>> cat.unique()
[good, bad]
Categories (2, object): [good < bad]  # unused categories dropped

So .unique() both deduplicates the values and drops unused categories, i.e. it does two things in one operation.

Often, even if you want to deduplicate values, you still want to control whether unused categories are dropped. So Categorical/CategoricalIndex.unique should IMO keep all categories, and categories should be dropped in a separate action. This would be a better API:

>>> cat.unique()
[good, bad]
Categories (5, object): [very good < good < neutral < bad < very bad]    # unused not dropped

If you want to drop unused categories, you should do it explicitly like so: cat.unique().remove_unused_categories().
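A minimal sketch of the proposed two-step workflow, written so that it does not depend on what .unique() itself returns in a given pandas version. The helper name unique_keep_categories is hypothetical and not part of pandas; it deduplicates the integer codes and rebuilds the Categorical with the original dtype, and dropping unused categories then becomes a separate, explicit call.

```python
import pandas as pd


def unique_keep_categories(cat: pd.Categorical) -> pd.Categorical:
    """Hypothetical helper (not a pandas API): unique values in order of
    appearance, keeping every category and the ordered flag of the input."""
    unique_codes = pd.unique(cat.codes)  # deduplicate the integer codes only
    return pd.Categorical.from_codes(unique_codes, dtype=cat.dtype)


categories = ['very good', 'good', 'neutral', 'bad', 'very bad']
cat = pd.Categorical(['good', 'good', 'bad', 'bad'],
                     categories=categories, ordered=True)

kept = unique_keep_categories(cat)           # values deduplicated, 5 categories
trimmed = kept.remove_unused_categories()    # explicit second step, 2 categories

print(list(kept), list(kept.categories))
print(list(trimmed), list(trimmed.categories))
```

Keeping the two steps apart leaves the choice of whether to trim categories with the caller.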

The proposed API is also faster, as dropping unused categories requires recoding the categories/codes, which is potentially expensive.
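To make the recoding step concrete, here is a small sketch that reuses the example data from this issue. It only shows what changes in the integer codes when unused categories are removed; it is not a benchmark and makes no claim about how large the cost is in practice.

```python
import pandas as pd

categories = ['very good', 'good', 'neutral', 'bad', 'very bad']
cat = pd.Categorical(['good', 'good', 'bad', 'bad'],
                     categories=categories, ordered=True)

# Codes are positions into the full list of five categories.
codes_before = cat.codes.tolist()

# Removing unused categories shrinks the category list, so every code
# has to be remapped to a position in the new, shorter list.
trimmed = cat.remove_unused_categories()
codes_after = trimmed.codes.tolist()

print(codes_before)  # [1, 1, 3, 3]
print(codes_after)   # [0, 0, 1, 1]
```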


jreback · 2018-06-26T23:31:09Z

look at past issues - pls link them if you wish to reopen the discussion

this was heavily discussed

topper-123 · 2018-06-27T00:22:16Z

Ok, found it here #8559. I had looked for the word "categorical.unique", which isn't found in that issue...

Anyway, I'm not sure I understand the reasoning there, e.g. this

So df_quotes[outlier_filter].symbol.unique() is equivalent to df_quotes.symbol.cat.categories.

is not true (see my example above).

Is the issue about plotting? I.e. unused categories are plotted, and you'd often want to not plot unused categories? That's reasonable, though in other instances you would want to plot unused categories (e.g. to show that no observation occurred). Most often you'd probably want to not plot unused categories, I guess.

If the issue is about plotting this should IMO ideally be handled at the plotting level/library, though I can see that that would be a bigger operation and may not be worth it to change this at the pandas level.

jreback · 2018-06-27T00:46:15Z

what actually should happen is that .unique() (and some other methods) should gain observed= kwargs

topper-123 · 2018-06-28T20:40:41Z

I don't think that would work, as Series.unique(observed=True) wouldn't work/make sense if the underlying array is not a Categorical.
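The concern can be illustrated with a sketch of a free function that takes an observed flag. The function unique_values and its observed parameter are hypothetical; neither is part of pandas. The flag only has a meaning on the categorical branch, and for any other dtype it has to be silently ignored, which is the awkwardness described above.

```python
import pandas as pd


def unique_values(ser: pd.Series, observed: bool = True):
    """Hypothetical helper (not a pandas API) mimicking an observed= keyword."""
    values = ser.unique()
    if isinstance(ser.dtype, pd.CategoricalDtype):
        if observed:
            # Keep only categories that actually occur in the data.
            return values.remove_unused_categories()
        # Keep the full set of categories from the original dtype.
        return values.set_categories(ser.cat.categories, ordered=ser.cat.ordered)
    # Non-categorical data: `observed` has no meaning and is ignored.
    return values


categories = ['very good', 'good', 'neutral', 'bad', 'very bad']
ser = pd.Series(pd.Categorical(['good', 'good', 'bad', 'bad'],
                               categories=categories, ordered=True))

print(list(unique_values(ser, observed=True).categories))
print(list(unique_values(ser, observed=False).categories))
print(unique_values(pd.Series([1, 1, 2]), observed=True).tolist())
```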

I think the reasonable answer to finding unique values for Categoricals is to show categories also, i.e.:

[good, bad]
Categories (5, object): [very good < good < neutral < bad < very bad]

is more informative than

[good, bad]
Categories (2, object): [good < bad]

So I don't quite understand the motivation for the current implementation (unless there is a specific reason, e.g. plotting).

topper-123 · 2018-07-16T23:22:55Z

@jankatins what's the reasoning for dropping unused categories in your PR #8937? Do you have an argument for not keeping the categories?

@jorisvandenbossche , you commented in #8559:

For the unique case I also think we should only return the categories that occur in the series (or return a Categorical)

Do you have a reason for this? I don't really understand why these should be dropped.

In R, they're not dropped AFAICT:

> f <- factor(c('a', 'a', 'b', 'b'))[1:2]
> f
a a
Levels:
    'a' 'b' 
> unique(f)
a 
Levels:
    'a' 'b'

Notice how the levels (categories) have not been reduced.

EDIT: Changed the R example, it was showing a wrong example

jankatins · 2018-07-17T10:02:44Z

The discussion happened here: #8559 (comment) + #8559 (comment)

There the argument was that we follow R's handling in these cases (see the first link).

topper-123 · 2018-07-18T16:45:08Z

Ok, but it seems that pandas does not do it the same way as R, right? (See my example above.) In the discussion you also say

Interestingly unique returns a factor (with all levels, but only the "used" levels as values) when the input is a factor:

So to follow R, we should have:

>>> cat.unique()
[good, bad]
Categories (5, object): [very good < good < neutral < bad < very bad]

Correct?

jankatins · 2018-07-18T17:13:17Z

I would usually say yes (I'm usually arguing that Categorical should be optimized for Likert scales and not as a pd.String replacement). If I read the commit from #8937 right, then at that time unique() was returning a numpy array and not a Categorical (8290a4d)

topper-123 · 2018-07-18T21:58:10Z

Ah yes, I see that now. The relevant PR is actually #10508. From that PR I saw referenced #18291, an issue that proposes the same changes as I propose + a few more details.

I agree with the sentiment in #18291 and this issue is more or less a duplicate of that one. So I'll close this later today, unless someone thinks otherwise.

topper-123 · 2018-07-18T23:13:24Z

Closing as a duplicate of #18291

gfyoung added Categorical (Categorical Data Type) and API Design labels Jun 27, 2018

topper-123 closed this as completed Jul 18, 2018
