Minimum of ordered categorical data in Panda DataFrames #25299

Closed
Guillaume1801 opened this issue Feb 13, 2019 · 7 comments
Labels
Categorical (Categorical Data Type), Regression (Functionality that used to work in a prior pandas version)
Comments

@Guillaume1801

I have a pandas DataFrame with one Series containing ordered Categorical data. Some values of this Series may be missing (NaN). I want to get the minimum without taking NaNs into account, but I get strange results.

Code:

import pandas as pd

# "a" is not among the categories, so it is stored as a missing value (NaN)
raw_cat = pd.Categorical(["a", "b", "c", "a"],
                         categories=["b", "c", "d"],
                         ordered=True)
s = pd.Series(raw_cat)
raw_cat.min(numeric_only=True), s.min(numeric_only=True)

Output:

('b', nan)

Expected output:

('b', 'b')

I am getting the desired output when running this code with pandas 0.23.4 but not with pandas 0.24.0 and above.

Is this an issue or a misunderstanding? Thank you for your help.
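
A possible workaround that avoids the keyword altogether is to drop the missing values before taking the minimum. This is only a sketch and not something proposed in the thread: it assumes that discarding NaNs up front is acceptable, and it writes the missing entries as None instead of relying on "a" being outside the categories. The name `smallest` is just for illustration.

```python
import pandas as pd

# Same data as in the report, with the missing entries written explicitly as None
raw_cat = pd.Categorical([None, "b", "c", None],
                         categories=["b", "c", "d"],
                         ordered=True)
s = pd.Series(raw_cat)

# Drop the missing values first, then take the minimum in category order
smallest = s.dropna().min()
print(smallest)
```

Because the NaNs are removed before `min` is called, the result should not depend on how a given pandas version forwards `numeric_only`.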

@jorisvandenbossche added the Regression and Categorical labels Feb 13, 2019
@jorisvandenbossche added this to the 0.24.2 milestone Feb 13, 2019
@jorisvandenbossche
Member

Thanks for the report! I can confirm this regression.

@jorisvandenbossche
Member

So it seems that the numeric_only keyword is no longer properly passed through to the Categorical.min implementation. Investigation welcome!
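
To illustrate what a keyword that is not passed through looks like, here is a toy model in plain Python. It is not pandas' source code: `ToyCategorical`, `ForwardingSeries`, and `DroppingSeries` are hypothetical names, and using -1 as the code for a missing value is an assumption chosen to reproduce the outputs reported above.

```python
import math


class ToyCategorical:
    """Toy ordered categorical: values outside `categories` get code -1 (missing)."""

    def __init__(self, values, categories):
        self.categories = list(categories)
        self.codes = [
            self.categories.index(v) if v in self.categories else -1
            for v in values
        ]

    def min(self, numeric_only=None):
        # Only skip missing codes when the caller asked for it
        codes = [c for c in self.codes if c != -1] if numeric_only else self.codes
        pointer = min(codes, default=-1)
        return math.nan if pointer == -1 else self.categories[pointer]


class ForwardingSeries:
    """Wrapper that hands every keyword on to the underlying categorical."""

    def __init__(self, cat):
        self.cat = cat

    def min(self, **kwargs):
        return self.cat.min(**kwargs)


class DroppingSeries(ForwardingSeries):
    """Wrapper that loses `numeric_only` on the way down."""

    def min(self, **kwargs):
        kwargs.pop("numeric_only", None)  # the keyword never reaches the categorical
        return self.cat.min(**kwargs)


cat = ToyCategorical(["a", "b", "c", "a"], ["b", "c", "d"])
forwarded = ForwardingSeries(cat).min(numeric_only=True)
dropped = DroppingSeries(cat).min(numeric_only=True)
print(cat.min(numeric_only=True), forwarded, dropped)
```

In this toy model the forwarding wrapper agrees with the categorical, while the dropping wrapper falls back to the default and returns NaN, which matches the shape of the reported regression.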

@arnov
Contributor

arnov commented Feb 13, 2019

I dove into this a bit, but shouldn't the argument be skipna? Because I am unsure what numeric_only would mean for a categorical series.

@Guillaume1801
Author

I agree with you! It makes no sense to use this argument when the argument used to remove NaNs in all other pandas methods is skipna.

@arnov
Contributor

arnov commented Feb 13, 2019

To add to the confusion, Categorical supports the dropna argument in the mode method, while it seems to be skipna in a lot of other places.

@jorisvandenbossche
Member

@arnov I actually thought exactly the same when answering on this issue, so I opened #25303 (but forgot to link to it here).

So I agree that skipna is more logical, but I don't think we can simply change it as you did in #25304; we will have to deprecate the keyword and behaviour first.

Short term, I think it would be good to "just" fix it using numeric_only (so we can include this for 0.24.2), and then for 0.25.0 we could think about deprecating it. But let's first discuss that in #25303
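
For context, deprecating a keyword usually means accepting both spellings for a while and warning on the old one. The sketch below shows that pattern in plain Python. It is an illustration only, not the change made in pandas: `ordered_min`, the `_UNSET` sentinel, the default `skipna=True`, and the warning text are all assumptions.

```python
import math
import warnings

_UNSET = object()  # sentinel to detect whether the old keyword was passed at all


def ordered_min(values, categories, skipna=True, numeric_only=_UNSET):
    """Minimum of `values` in the order given by `categories` (hypothetical helper)."""
    if numeric_only is not _UNSET:
        warnings.warn(
            "'numeric_only' is deprecated; use 'skipna' instead",
            FutureWarning,
            stacklevel=2,
        )
        skipna = bool(numeric_only)  # keep the old behaviour during the transition

    rank = {c: i for i, c in enumerate(categories)}
    codes = [rank.get(v, -1) for v in values]  # -1 marks a missing value
    if skipna:
        codes = [c for c in codes if c != -1]
    pointer = min(codes, default=-1)
    return math.nan if pointer == -1 else categories[pointer]


values, categories = ["a", "b", "c", "a"], ["b", "c", "d"]

# The old spelling still works but emits a FutureWarning
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = ordered_min(values, categories, numeric_only=True)

new_style = ordered_min(values, categories, skipna=True)
keep_nan = ordered_min(values, categories, skipna=False)
print(old_style, new_style, keep_nan)
```

Callers of the old keyword keep getting the same result during the transition, and the warning tells them which keyword to move to.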

@jreback modified the milestones: 0.24.2, 0.25.0 Feb 16, 2019
@jorisvandenbossche modified the milestones: 0.25.0, 0.24.2 Feb 16, 2019
@jreback
Contributor

jreback commented Feb 16, 2019

closed by #25304

@jreback closed this as completed Feb 16, 2019