Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/DEPR: Change default skipna behaviour + deprecate numeric_only in Categorical.min and max #27929

Merged
merged 15 commits into from
Dec 2, 2019

Conversation

makbigc
Copy link
Contributor

@makbigc makbigc commented Aug 15, 2019

doc/source/whatsnew/v0.25.1.rst Outdated Show resolved Hide resolved
@@ -38,29 +38,38 @@ def test_min_max(self):
cat = Categorical(
[np.nan, "b", "c", np.nan], categories=["d", "c", "b", "a"], ordered=True
)
_min = cat.min()
_max = cat.max()
_min = cat.min(skipna=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize this test on skipna=True/False

@jreback jreback added Categorical Categorical Data Type Deprecate Functionality to remove in pandas labels Aug 15, 2019
@jreback jreback added this to the 1.0 milestone Aug 16, 2019
@makbigc
Copy link
Contributor Author

makbigc commented Aug 22, 2019

@jreback Anything else? Please tell me.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions

  1. Did the default change? numeric_only=None seems to be functionally equivalent to skipna=False.
  2. What's the reason for the change of the implementation? It's not clear to me if this is going to have a performance impact.

@@ -1028,7 +1028,7 @@ def test_min_max(self):
)
_min = cat.min()
_max = cat.max()
assert np.isnan(_min)
assert _min == "c"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why idd this change?

Copy link
Contributor Author

@makbigc makbigc Aug 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer to the following comment.

@makbigc
Copy link
Contributor Author

makbigc commented Aug 26, 2019

The default behavior will change when Categorical contains NA values. In the present code, numeric_only is None by default. No matter NA value is involved, the min operation is carried over the entire categorical, i.e., the else clause.

if numeric_only:
good = self._codes != -1
pointer = self._codes[good].min(**kwargs)
else:
pointer = self._codes.min(**kwargs)

The Categorical.min returns nan but Categorical.max doesn't if nan is contained. -1 stands for nan which is usually the minimum in Categorical._code

In [22]: from pandas import Categorical

In [23]: cat = Categorical([np.nan, 1, 2, np.nan], ordered=True)

In [24]: cat.min()
Out[24]: nan

In [25]: cat.max()
Out[25]: 2

In this PR, the default behavior of min and max is to drop NA values in advance, i.e., skipna=True.
cat.min() should return 1 and cat.max() return 2.

@TomAugspurger
Copy link
Contributor

OK, that behavior looks pretty buggy. But I'm not sure if we should be just changing the default output of .min() or .max().... @jorisvandenbossche do you have thoughts here?

Given that users will need to update their code anyway to use the new argument, I think that we should try to get the correct behavior when skipna=True, while preserving the buggy bheavior with skipna=False. How much of a hassle will that be?

I also think the error message can be improved.

In [6]: c.max(numeric_only=False)
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: FutureWarning: the 'numeric_only' keyword is deprecated, use 'skipna' instead
  #!/Users/taugspurger/Envs/pandas-dev/bin/python
Out[6]: nan

Reading that, it seems like I just need to replace numeric_only with skipna. But I also need to invert the value. The warning should indicate that.

@makbigc
Copy link
Contributor Author

makbigc commented Sep 16, 2019

@jorisvandenbossche Would you tell us your thought about Tom's suggestion? That is keeping the buggy behaviour when skipna=False, while having a desired behaviour when skipna=True.

@jreback
Copy link
Contributor

jreback commented Oct 6, 2019

@makbigc can you merge master.

@jorisvandenbossche can you respond to questions here: #27929 (comment)

@jorisvandenbossche
Copy link
Member

Focusing on the default behaviour for a moment, so when no arguments are specified (and not how to handle the numeric_only keyword): what default behaviour do we want?

Currently, Series.min (and also Categorical.min) returns the "wrong" thing:

In [28]: cat = pd.Categorical([1, 2, np.nan], ordered=True) 

In [29]: pd.Series(cat).min()
Out[29]: nan

To be consistent with the rest of pandas, this result should be 1 instead of nan (since we have a default skipna=True).

I think we agree that we want that correct behaviour long term?

Question is then how to get there:

  • breaking change: start skipping NaNs by default in 1.0
  • first introduce a warning that in the future this will skip NaNs by default. We could only raise this warning if we detect that there are actually NaNs present (so a case where the result would change). And in this case, the user could silence the warning by specifying explicitly skipna=True

Personally, I might have a slight preference to actually do a breaking change on this for 1.0 for the default behaviour (we still need to deprecate the numeric_only keyword when specified, that's a separate thing). But, it's certainly possible to do it with a deprecation (we will only need to change skipna=True into skipna=None to know when it was explicitly specified by the user).

@makbigc
Copy link
Contributor Author

makbigc commented Oct 7, 2019

@jorisvandenbossche Thanks for your detail reply.

If I take the first approach (i.e., breaking change), what I should do is:

  1. Remove the deprecation message that the numeric_only is replaced with skipna
  2. In whatsnew, state explicitly the previous behaviour and the changed behaviour

Anything else? Please tell me.

@jorisvandenbossche
Copy link
Member

Let's wait a bit to see what others think about the default behaviour.
But in any case, we want to keep the deprecation of numeric_only keyword (passing it should raise a warning)

@jreback
Copy link
Contributor

jreback commented Oct 18, 2019

I agree we should deprecate numeric_only; I think we need to default skipna=None and then warn; changing this in the future.

Though a breaking change is simpler.

@jorisvandenbossche
Copy link
Member

@TomAugspurger thoughts on #27929 (comment) ?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 20, 2019 via email

@jorisvandenbossche
Copy link
Member

Sorry, in the linked comment I ask several questions. So what is the "that seems good" exactly answering to?

@TomAugspurger
Copy link
Contributor

Sorry, in the linked comment I ask several questions. So what is the "that seems good" exactly answering to?

Your summary at the end. Breaking change for default + a deprecation saying that numeric_only will be removed entirely.

Personally, I might have a slight preference to actually do a breaking change on this for 1.0 for the default behaviour (we still need to deprecate the numeric_only keyword when specified, that's a separate thing). But, it's certainly possible to do it with a deprecation (we will only need to change skipna=True into skipna=None to know when it was explicitly specified by the user).

@jreback
Copy link
Contributor

jreback commented Nov 20, 2019

can you merge master.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, now that we have agreement on the way forward (breaking change for default behaviour + deprecate numeric_only), can you

  • Add a section to the whatsnew for 1.0.0 in the API breaking changes section about this?
  • Update to already use skipna=True as the new default? (I think there is then no need to first have skipna as None as default and raise a warning for that?)

@@ -2193,7 +2193,8 @@ def _reduce(self, name, axis=0, **kwargs):
raise TypeError(msg.format(op=name))
return func(**kwargs)

def min(self, numeric_only=None, **kwargs):
@deprecate_kwarg(old_arg_name="numeric_only", new_arg_name="skipna")
def min(self, skipna=None, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be skipna=True ?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update!
A few more comments, should be almost good now

By default :meth:`Categorical.min` and :meth:`Categorical.max` return the min and the max respectively
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When :class:`Categorical` contains ``np.nan``, :meth:`Categorical.min` and :meth:`Categorical.max`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When :class:`Categorical` contains ``np.nan``, :meth:`Categorical.min` and :meth:`Categorical.max`
When :class:`Categorical` contains ``np.nan``, :meth:`Categorical.min`

It was only min that returned NaN (can you update the title as well?)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When :class:`Categorical` contains ``np.nan``, :meth:`Categorical.min` and :meth:`Categorical.max`
no longer return ``np.nan`` by default.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add something like "to honor the default of skipna=True" to make it clear that this change makes it consistent with the rest of pandas

"The default value of skipna will be changed to "
"True in the future version."
)
warn(msg, FutureWarning, stacklevel=2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove this if skipna is None block with the warning I think?

if skipna:
pointer = self._codes[good].max(**kwargs)
else:
if skipna is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

[np.nan, 1, 2, np.nan], categories=[5, 4, 3, 2, 1], ordered=True
)
with tm.assert_produces_warning(
expected_warning=FutureWarning, check_stacklevel=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it work without check_stacklevel=False ?

doc/source/whatsnew/v1.0.0.rst Outdated Show resolved Hide resolved
pandas/core/arrays/categorical.py Outdated Show resolved Hide resolved
_max = cat.max(numeric_only=True)
assert _max == "b"
if skipna is False:
assert np.isnan(_min)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use isna/notna

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.isnan is a more strict / correct test in this case, since we are actually returning NaN (and not None, NA or NaT)

if skipna:
pointer = self._codes[good].min(**kwargs)
else:
return np.nan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not correct for i8 types, which should be pd.NaT. how to fix this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could check the categories.dtype.na_value if it exists. But since this is the current behaviour, it's not critical to fix in this PR I think.

@makbigc
Copy link
Contributor Author

makbigc commented Nov 27, 2019

It is strange that the failed tests don't call categorical.min or categorical.max.

@jorisvandenbossche
Copy link
Member

@makbigc those are indeed unrelated, so you can ignore those for now (they are being fixed in #29877)

@jorisvandenbossche
Copy link
Member

@makbigc I merged that other PR. So if you merge latest master in this branch, the error should be solved.

@makbigc
Copy link
Contributor Author

makbigc commented Dec 2, 2019

@jreback anything else? Please tell me. #27801 is pending for it

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one remaining comment about removing the kwargs (and added two wording suggestions you can commit)

pandas/core/arrays/categorical.py Outdated Show resolved Hide resolved
doc/source/whatsnew/v1.0.0.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v1.0.0.rst Show resolved Hide resolved
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks all good now, thanks @makbigc for keeping on! (as this took some time ..)

@jorisvandenbossche jorisvandenbossche changed the title DEPR: Deprecate numeric_only parameter in Categorical.min and max API/DEPR: Change default skipna behaviour + deprecate numeric_only in Categorical.min and max Dec 2, 2019
@jorisvandenbossche jorisvandenbossche merged commit 37526c1 into pandas-dev:master Dec 2, 2019
@jorisvandenbossche
Copy link
Member

@jreback I opened an issue for your remaining comment about always returning NaN regardless of the dtype: #29962

keechongtan added a commit to keechongtan/pandas that referenced this pull request Dec 2, 2019
…ndexing-1row-df

* upstream/master: (49 commits)
  repr() (pandas-dev#29959)
  DOC : Typo fix in userguide/Styling (pandas-dev#29956)
  CLN: small things in pytables (pandas-dev#29958)
  API/DEPR: Change default skipna behaviour + deprecate numeric_only in Categorical.min and max (pandas-dev#27929)
  DEPR: DTI/TDI/PI constructor arguments (pandas-dev#29930)
  CLN: fix pytables passing too many kwargs (pandas-dev#29951)
  Typing (pandas-dev#29947)
  repr() (pandas-dev#29948)
  repr() (pandas-dev#29950)
  Added space at the end of the sentence (pandas-dev#29949)
  ENH: add NA scalar for missing value indicator, use in StringArray. (pandas-dev#29597)
  CLN: BlockManager.apply (pandas-dev#29825)
  TST: add test for rolling max/min/mean with DatetimeIndex over different frequencies (pandas-dev#29932)
  CLN: explicit signature for to_hdf (pandas-dev#29939)
  CLN: make kwargs explicit for pytables read_ methods (pandas-dev#29935)
  Convert core/indexes/base.py to f-strings (pandas-dev#29903)
  DEPR: dropna multiple axes, fillna int for td64, from_codes with floats, Series.nonzero (pandas-dev#29875)
  CLN: make kwargs explicit in pytables constructors (pandas-dev#29936)
  DEPR: tz_convert in the Timestamp constructor raises (pandas-dev#29929)
  STY: F-strings and repr (pandas-dev#29938)
  ...
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Categorical Categorical Data Type Deprecate Functionality to remove in pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: What is the rationale for numeric_only of Categorical reductions?
4 participants