API/DEPR: Change default skipna behaviour + deprecate numeric_only in Categorical.min and max #27929

makbigc · 2019-08-15T10:33:36Z

- closes #25303 (API: What is the rationale for numeric_only of Categorical reductions?) and follows up on #27801 (ENH: Add argmax and argmin to ExtensionArray) (comment)
- 1 test added
- passes black pandas
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry

doc/source/whatsnew/v0.25.1.rst

jreback · 2019-08-15T12:16:09Z

pandas/tests/arrays/categorical/test_analytics.py

@@ -38,29 +38,38 @@ def test_min_max(self):
        cat = Categorical(
            [np.nan, "b", "c", np.nan], categories=["d", "c", "b", "a"], ordered=True
        )
-        _min = cat.min()
-        _max = cat.max()
+        _min = cat.min(skipna=False)


can you parameterize this test on skipna=True/False

makbigc · 2019-08-22T08:17:25Z

@jreback Anything else? Please tell me.

TomAugspurger

Two questions:

1. Did the default change? numeric_only=None seems to be functionally equivalent to skipna=False.
2. What's the reason for the change of the implementation? It's not clear to me if this is going to have a performance impact.

TomAugspurger · 2019-08-23T18:07:10Z

pandas/tests/reductions/test_reductions.py

@@ -1028,7 +1028,7 @@ def test_min_max(self):
        )
        _min = cat.min()
        _max = cat.max()
-        assert np.isnan(_min)
+        assert _min == "c"


Why did this change?

Please refer to the following comment.

makbigc · 2019-08-26T10:50:11Z

The default behavior will change when the Categorical contains NA values. In the current code, numeric_only is None by default, so regardless of whether NA values are present, the min operation is carried out over all of the codes, i.e., the else clause.

pandas/pandas/core/arrays/categorical.py

Lines 2236 to 2240 in ea60c19

    
        if numeric_only:
            good = self._codes != -1
            pointer = self._codes[good].min(**kwargs)
        else:
            pointer = self._codes.min(**kwargs)

When the Categorical contains nan, Categorical.min returns nan but Categorical.max does not. The code -1 stands for nan, and it is always the smallest value in Categorical._codes when an NA is present.

In [22]: from pandas import Categorical

In [23]: cat = Categorical([np.nan, 1, 2, np.nan], ordered=True)

In [24]: cat.min()
Out[24]: nan

In [25]: cat.max()
Out[25]: 2

In this PR, the default behavior of min and max is to drop NA values in advance, i.e., skipna=True.
cat.min() should return 1 and cat.max() return 2.
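The difference between the old default and the proposed one can be illustrated with a toy model of the codes array. This is a pure-Python sketch, not the pandas implementation: `old_reduce`, `new_reduce`, and `NA_CODE` are hypothetical names, and returning NaN when every value is missing is an assumption made for the sketch.

```python
import math

NA_CODE = -1  # in the codes array, -1 marks a missing value


def old_reduce(codes, categories, op):
    """Model of the old default: reduce over all codes, including -1."""
    pointer = op(codes)
    return math.nan if pointer == NA_CODE else categories[pointer]


def new_reduce(codes, categories, op, skipna=True):
    """Model of the proposed behaviour: drop missing codes when skipna=True."""
    good = [code for code in codes if code != NA_CODE]
    if len(good) < len(codes) and not skipna:
        return math.nan  # a missing value is present and was not skipped
    if not good:
        return math.nan  # assumption: nothing left to reduce over
    return categories[op(good)]


# Models Categorical([nan, 1, 2, nan], ordered=True): categories are [1, 2]
categories = [1, 2]
codes = [NA_CODE, 0, 1, NA_CODE]

print(old_reduce(codes, categories, min))  # nan, because -1 is the smallest code
print(old_reduce(codes, categories, max))  # 2
print(new_reduce(codes, categories, min))  # 1
print(new_reduce(codes, categories, max))  # 2
```

In this model only the minimum is affected by the old default, because -1 can never be the largest code unless every value is missing.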

TomAugspurger · 2019-08-26T13:52:19Z

OK, that behavior looks pretty buggy. But I'm not sure if we should be just changing the default output of .min() or .max().... @jorisvandenbossche do you have thoughts here?

Given that users will need to update their code anyway to use the new argument, I think that we should try to get the correct behavior when skipna=True, while preserving the buggy behavior with skipna=False. How much of a hassle will that be?

I also think the error message can be improved.

In [6]: c.max(numeric_only=False)
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: FutureWarning: the 'numeric_only' keyword is deprecated, use 'skipna' instead
  #!/Users/taugspurger/Envs/pandas-dev/bin/python
Out[6]: nan

Reading that, it seems like I just need to replace numeric_only with skipna. But I also need to invert the value. The warning should indicate that.

makbigc · 2019-09-16T07:51:36Z

@jorisvandenbossche Could you share your thoughts on Tom's suggestion? That is, keeping the buggy behaviour when skipna=False while getting the desired behaviour when skipna=True.

jreback · 2019-10-06T23:43:54Z

@makbigc can you merge master.

@jorisvandenbossche can you respond to questions here: #27929 (comment)

jorisvandenbossche · 2019-10-07T08:11:23Z

Focusing on the default behaviour for a moment, so when no arguments are specified (and not how to handle the numeric_only keyword): what default behaviour do we want?

Currently, Series.min (and also Categorical.min) returns the "wrong" thing:

In [28]: cat = pd.Categorical([1, 2, np.nan], ordered=True) 

In [29]: pd.Series(cat).min()
Out[29]: nan

To be consistent with the rest of pandas, this result should be 1 instead of nan (since we have a default skipna=True).

I think we agree that we want that correct behaviour long term?

Question is then how to get there:

- breaking change: start skipping NaNs by default in 1.0
- first introduce a warning that in the future this will skip NaNs by default. We could only raise this warning if we detect that there are actually NaNs present (so a case where the result would change). And in this case, the user could silence the warning by specifying explicitly skipna=True

Personally, I might have a slight preference to actually do a breaking change on this for 1.0 for the default behaviour (we still need to deprecate the numeric_only keyword when specified, that's a separate thing). But, it's certainly possible to do it with a deprecation (we will only need to change skipna=True into skipna=None to know when it was explicitly specified by the user).
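The deprecation route described above (a `skipna=None` default that warns only when NaNs are present) could look roughly like the following. This is a minimal pure-Python sketch, not pandas code: `min_during_transition` is a hypothetical function, `None` stands in for a missing value, and the warning text is made up for illustration.

```python
import math
import warnings


def min_during_transition(values, skipna=None):
    """Toy minimum where None stands in for a missing value."""
    has_na = any(value is None for value in values)
    if skipna is None:
        # The caller did not choose, so keep the old result for now and
        # warn only when the future default would change that result.
        if has_na:
            warnings.warn(
                "The default value of skipna will change to True in a "
                "future version; pass skipna explicitly to silence this.",
                FutureWarning,
                stacklevel=2,
            )
        skipna = False
    if has_na and not skipna:
        return math.nan
    present = [value for value in values if value is not None]
    return min(present) if present else math.nan


print(min_during_transition([1, 2, None], skipna=True))  # 1, no warning
```

The `None` sentinel is what lets the function tell "the caller relied on the default" apart from "the caller asked for this explicitly", which a plain `skipna=True` default cannot do.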

makbigc · 2019-10-07T11:49:01Z

@jorisvandenbossche Thanks for your detailed reply.

If I take the first approach (i.e., breaking change), what I should do is:

- Remove the deprecation message that the numeric_only is replaced with skipna
- In whatsnew, state explicitly the previous behaviour and the changed behaviour

Anything else? Please tell me.

jorisvandenbossche · 2019-10-07T12:11:37Z

Let's wait a bit to see what others think about the default behaviour.
But in any case, we want to keep the deprecation of numeric_only keyword (passing it should raise a warning)

jreback · 2019-10-18T21:40:59Z

I agree we should deprecate numeric_only. I think we need to default to skipna=None and warn, then change the default in a future version.

Though a breaking change is simpler.

jorisvandenbossche · 2019-10-20T12:35:41Z

@TomAugspurger thoughts on #27929 (comment) ?

TomAugspurger · 2019-10-20T12:38:47Z

That seems good.


jorisvandenbossche · 2019-10-20T12:59:52Z

Sorry, in the linked comment I ask several questions. So what is the "that seems good" exactly answering to?

TomAugspurger · 2019-11-12T20:54:39Z

> Sorry, in the linked comment I ask several questions. So what is the "that seems good" exactly answering to?

Your summary at the end. Breaking change for default + a deprecation saying that numeric_only will be removed entirely.

> Personally, I might have a slight preference to actually do a breaking change on this for 1.0 for the default behaviour (we still need to deprecate the numeric_only keyword when specified, that's a separate thing). But, it's certainly possible to do it with a deprecation (we will only need to change skipna=True into skipna=None to know when it was explicitly specified by the user).

jreback · 2019-11-20T13:47:35Z

can you merge master.

jorisvandenbossche

OK, now that we have agreement on the way forward (breaking change for default behaviour + deprecate numeric_only), can you

- Add a section to the whatsnew for 1.0.0 in the API breaking changes section about this?
- Update to already use skipna=True as the new default? (I think there is then no need to first have skipna as None as default and raise a warning for that?)

jorisvandenbossche · 2019-11-21T13:13:39Z

pandas/core/arrays/categorical.py

@@ -2193,7 +2193,8 @@ def _reduce(self, name, axis=0, **kwargs):
            raise TypeError(msg.format(op=name))
        return func(**kwargs)

-    def min(self, numeric_only=None, **kwargs):
+    @deprecate_kwarg(old_arg_name="numeric_only", new_arg_name="skipna")
+    def min(self, skipna=None, **kwargs):


This can be skipna=True ?
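For readers unfamiliar with the pattern in the diff above, a keyword-renaming decorator can be sketched as follows. This is a simplified stand-in written for illustration, not the pandas `deprecate_kwarg` implementation: `rename_kwarg` and `smallest` are hypothetical names, and `None` stands in for a missing value.

```python
import functools
import warnings


def rename_kwarg(old_name, new_name):
    """Hypothetical helper: forward a deprecated keyword to its new name."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if old_name in kwargs:
                if new_name in kwargs:
                    raise TypeError(
                        f"cannot pass both {old_name!r} and {new_name!r}"
                    )
                warnings.warn(
                    f"the {old_name!r} keyword is deprecated, "
                    f"use {new_name!r} instead",
                    FutureWarning,
                    stacklevel=2,
                )
                # The old value is forwarded unchanged under the new name.
                kwargs[new_name] = kwargs.pop(old_name)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@rename_kwarg("numeric_only", "skipna")
def smallest(values, skipna=True):
    """Toy reduction where None stands in for a missing value."""
    if not skipna and any(value is None for value in values):
        return None
    return min(value for value in values if value is not None)
```

With a concrete default such as `skipna=True` in the wrapped signature, callers who pass neither keyword get the new behaviour without any warning, while callers who still pass the old keyword are warned.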

jorisvandenbossche

Thanks for the update!
A few more comments, should be almost good now

jorisvandenbossche · 2019-11-23T09:08:16Z

doc/source/whatsnew/v1.0.0.rst

+By default :meth:`Categorical.min` and :meth:`Categorical.max` return the min and the max respectively
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When :class:`Categorical` contains ``np.nan``, :meth:`Categorical.min` and :meth:`Categorical.max`


Suggested change

-When :class:`Categorical` contains ``np.nan``, :meth:`Categorical.min` and :meth:`Categorical.max`
+When :class:`Categorical` contains ``np.nan``, :meth:`Categorical.min`

It was only min that returned NaN (can you update the title as well?)

jorisvandenbossche · 2019-11-23T09:10:54Z

doc/source/whatsnew/v1.0.0.rst

+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When :class:`Categorical` contains ``np.nan``, :meth:`Categorical.min` and :meth:`Categorical.max`
+no longer return ``np.nan`` by default.


Maybe add something like "to honor the default of skipna=True" to make it clear that this change makes it consistent with the rest of pandas

jorisvandenbossche · 2019-11-23T09:11:41Z

pandas/core/arrays/categorical.py

+                        "The default value of skipna will be changed to "
+                        "True in the future version."
+                    )
+                    warn(msg, FutureWarning, stacklevel=2)


You can remove this "if skipna is None" block with the warning, I think?

jorisvandenbossche · 2019-11-23T09:12:34Z

pandas/core/arrays/categorical.py

+            if skipna:
+                pointer = self._codes[good].max(**kwargs)
+            else:
+                if skipna is None:


jorisvandenbossche · 2019-11-23T09:14:02Z

pandas/tests/arrays/categorical/test_analytics.py

+            [np.nan, 1, 2, np.nan], categories=[5, 4, 3, 2, 1], ordered=True
+        )
+        with tm.assert_produces_warning(
+            expected_warning=FutureWarning, check_stacklevel=False


Does it work without check_stacklevel=False ?

doc/source/whatsnew/v1.0.0.rst

pandas/core/arrays/categorical.py

jreback · 2019-11-25T23:06:24Z

pandas/tests/arrays/categorical/test_analytics.py

-        _max = cat.max(numeric_only=True)
-        assert _max == "b"
+        if skipna is False:
+            assert np.isnan(_min)


use isna/notna

np.isnan is a more strict / correct test in this case, since we are actually returning NaN (and not None, NA or NaT)
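The distinction between a strict NaN check and a permissive missing-value check can be sketched in plain Python. Both helpers below are hypothetical stand-ins rather than pandas or NumPy functions; the sketch relies only on the fact that `math.isnan` raises `TypeError` for `None`.

```python
import math


def permissive_isna(value):
    """Hypothetical permissive check: accepts None as well as float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def strict_isnan(value):
    """True only for a float NaN; None is rejected rather than accepted."""
    try:
        return math.isnan(value)
    except TypeError:
        return False


print(permissive_isna(None))       # True
print(strict_isnan(None))          # False
print(strict_isnan(float("nan")))  # True
```

A test written with the strict check would fail if the method started returning `None` instead of NaN, which a permissive check would silently accept.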

jreback · 2019-11-25T23:06:51Z

pandas/core/arrays/categorical.py

+            if skipna:
+                pointer = self._codes[good].min(**kwargs)
+            else:
+                return np.nan


this is not correct for i8 types, which should be pd.NaT. how to fix this?

We could check the categories.dtype.na_value if it exists. But since this is the current behaviour, it's not critical to fix in this PR I think.
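The "use the dtype's na_value if it exists" idea could be sketched as below. Everything here is a hypothetical stand-in: the two dtype classes are toy objects, and the string "NaT" is only a placeholder for a datetime-like missing scalar, not the real `pd.NaT`.

```python
import math


class FloatLikeDtype:
    """Hypothetical dtype without an na_value attribute."""


class DatetimeLikeDtype:
    """Hypothetical dtype that declares its own missing-value scalar."""

    na_value = "NaT"  # placeholder string standing in for pd.NaT


def missing_scalar(dtype):
    """Return the dtype's na_value if it defines one, else fall back to NaN."""
    return getattr(dtype, "na_value", math.nan)


print(missing_scalar(DatetimeLikeDtype()))  # NaT
print(missing_scalar(FloatLikeDtype()))     # nan
```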

makbigc · 2019-11-27T08:42:01Z

It is strange that the failed tests don't call categorical.min or categorical.max.

jorisvandenbossche · 2019-11-27T09:41:12Z

@makbigc those are indeed unrelated, so you can ignore those for now (they are being fixed in #29877)

jorisvandenbossche · 2019-11-27T09:51:46Z

@makbigc I merged that other PR. So if you merge latest master in this branch, the error should be solved.

makbigc · 2019-12-02T09:23:44Z

@jreback Anything else? Please let me know. #27801 is waiting on this PR.

jorisvandenbossche

Just one remaining comment about removing the kwargs (and added two wording suggestions you can commit)

pandas/core/arrays/categorical.py

doc/source/whatsnew/v1.0.0.rst

jorisvandenbossche

Looks all good now, thanks @makbigc for keeping on! (as this took some time ..)

jorisvandenbossche · 2019-12-02T10:47:34Z

@jreback I opened an issue for your remaining comment about always returning NaN regardless of the dtype: #29962


jreback requested changes Aug 15, 2019


jreback added the Categorical (Categorical Data Type) and Deprecate (Functionality to remove in pandas) labels Aug 15, 2019

jsexauer mentioned this pull request Aug 15, 2019

DEPR: Clean up list of deprecations from prior versions #6581

Closed

1 task

jreback added this to the 1.0 milestone Aug 16, 2019

TomAugspurger reviewed Aug 23, 2019


makbigc mentioned this pull request Sep 10, 2019

ENH: Add argmax and argmin to ExtensionArray #27801

Merged

makbigc added 3 commits October 7, 2019 19:22

Deprecate numeric_only parameter in Categorical.min and max

9fab462

Amend after 1st review

4bfe686

Parametrize test_min_max_skipna in reductions/test_reductions.py

e4d5c38

makbigc added 2 commits October 30, 2019 21:49

merge for update

765e506

Set skipna=None by default and add future warning

300a862

makbigc force-pushed the depr-25303 branch from c3d5f22 to 300a862 Compare November 1, 2019 13:22

Fix black issue and the test error

095e6da

merge for update

f094635

jorisvandenbossche reviewed Nov 21, 2019


makbigc added 2 commits November 22, 2019 00:02

Set skipna=True and add section in API breaking

7d73163

merge for update

f9d9bc5

jorisvandenbossche reviewed Nov 23, 2019


Modify after review

6a184c8

jreback requested changes Nov 25, 2019


merge for solving conflict

ac89bcd

makbigc force-pushed the depr-25303 branch from ce7f0da to ac89bcd Compare November 27, 2019 02:07

makbigc added 2 commits November 27, 2019 22:58

merge for update

9691c44

merge for update

d281650

jorisvandenbossche reviewed Dec 2, 2019


makbigc added 2 commits December 2, 2019 17:34

Remove kwarg in categorical.min and max

23ffd16

Change wording in v1.0.0.rst

260201c

jorisvandenbossche approved these changes Dec 2, 2019


jorisvandenbossche changed the title ~~DEPR: Deprecate numeric_only parameter in Categorical.min and max~~ API/DEPR: Change default skipna behaviour + deprecate numeric_only in Categorical.min and max Dec 2, 2019

jorisvandenbossche added the API Design label Dec 2, 2019

jorisvandenbossche merged commit 37526c1 into pandas-dev:master Dec 2, 2019

jorisvandenbossche mentioned this pull request Dec 2, 2019

API: return "correct" missing value scalar from Categorical? #29962

Open

jorisvandenbossche mentioned this pull request Dec 12, 2019

DEPR: log of deprecations in 1.x (to be removed in 2.0) #30228

Closed

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

API/DEPR: Change default skipna behaviour + deprecate numeric_only in…

72930e8

… Categorical.min and max (pandas-dev#27929)

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

API/DEPR: Change default skipna behaviour + deprecate numeric_only in…

a3095e3

… Categorical.min and max (pandas-dev#27929)

simonjayhawkins mentioned this pull request Mar 29, 2020

REGR: unhelpful error message with np.min on unordered Categorical #33115

Closed

This was referenced May 6, 2020

BUG: maximum of pd.Series([np.nan],dtype=ordered_category) raise #33450

Closed

Backport PR #33513 on branch 1.0.x (BUG: Fix Categorical.min / max bug) #34022

Merged

mroeschke mentioned this pull request Sep 27, 2022

DEPR: Enforce deprecation of Categorical.min/max(numeric_only) #48821

Merged

1 task
