DEPR: DataFrameGroupBy numeric_only defaulting to True #46072

Closed
Tracked by #46560 ...
rhshadrach opened this issue Feb 19, 2022 · 6 comments · Fixed by #47025
Labels
Deprecate Functionality to remove in pandas Groupby Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply

Comments

@rhshadrach
Member

rhshadrach commented Feb 19, 2022

Context

A summary of this behavior and the consensus thus far that DataFrameGroupBy will have numeric_only default to False in 2.0 can be found here: #42395 (comment).

In #41475, the silent dropping of nuisance columns was deprecated.

In #43154, the behavior was changed so that when a DataFrame has numeric_only unspecified and subsetting to numeric only columns would leave the DataFrame empty, internally pandas treats numeric_only as False.
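
As a rough illustration of the fallback described for #43154, the sketch below builds a frame whose only non-key column holds strings and calls sum() without passing numeric_only. The frame and its column names are made up for illustration. On 1.4.x the fallback should keep column C, and under the planned 2.0 default of False the same column is expected to survive.

```python
import pandas as pd

# Hypothetical frame: "A" is the grouping key, "C" holds strings stored as object dtype.
df = pd.DataFrame({"A": [1, 1], "C": pd.Series(["2", "2"], dtype=object)})

# numeric_only is deliberately left unspecified here.
result = df.groupby("A").sum()
fallback_columns = result.columns.tolist()
print(fallback_columns)
```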

Even though there is consensus that numeric_only should default to False, because of the above changes I wanted to make sure there is a consensus on how to go about doing so before proceeding.

For the discussion below, it is useful to have three types of columns in mind:

  • Numeric: Columns that remain in the input when numeric_only=True.
  • Nonnumeric, can agg: Columns that do not remain in the input when numeric_only=True but can still be successfully aggregated; e.g. strings with sum.
  • Nonnumeric, can't agg: Columns that do not remain in the input when numeric_only=True and cannot be successfully aggregated; e.g. a column whose values are the Python object type, with sum.
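
A minimal sketch of the three column types, assuming a small made-up frame with one column of each kind (the names A through D are only labels for this example). Passing numeric_only=True explicitly should keep just the numeric column.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 1],            # grouping key
        "B": [1, 1],            # numeric
        "C": ["2", "2"],        # nonnumeric, can agg (strings can be summed)
        "D": [object, object],  # nonnumeric, can't agg
    }
)

# With numeric_only=True, only the numeric column remains in the input.
kept = df.groupby("A").sum(numeric_only=True).columns.tolist()
print(kept)
```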

Code

To investigate this on 1.4.x, I have been using the following code. It uses .sum(), but any reduction or transform, whether specified as a string or a callable, should behave the same way (though that is not the case today). This includes apply and axis=1 (for which you may want to tilt your head 90 degrees to the left).

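# Written against pandas 1.4.x; lib.no_default is pandas' internal "unspecified" sentinel.
import itertools as it
import warnings

import pandas as pd
from pandas._libs import lib

numeric = [1, 1]
nonnumeric_noagg = [object, object]  # values that cannot be summed
nonnumeric_agg = ["2", "2"]  # strings, which sum can concatenate
# Try every combination of column types with every numeric_only setting.
for has_numeric, has_nonnumeric_agg, has_nonnumeric_noagg in it.product([True, False], repeat=3):
    for numeric_only in [True, False, lib.no_default]:
        print(has_numeric, has_nonnumeric_agg, has_nonnumeric_noagg, numeric_only)
        df = pd.DataFrame({"A": [1, 1]})
        if has_numeric:
            df["B"] = numeric
        if has_nonnumeric_agg:
            df["C"] = nonnumeric_agg
        if has_nonnumeric_noagg:
            df["D"] = nonnumeric_noagg
        warning_msg = ""
        try:
            with warnings.catch_warnings(record=True) as w:
                # Record every warning, even one that was already shown earlier.
                warnings.simplefilter("always")
                result = df.groupby("A").sum(numeric_only=numeric_only)
                if len(w) > 0:
                    assert len(w) == 1
                    assert issubclass(w[-1].category, FutureWarning)
                    warning_msg = str(w[-1].message)
        except TypeError:
            print("  TypeError")
        else:
            print("  Columns:", result.columns.tolist(), "Warning:", warning_msg[:20])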

Current and Future behavior

numeric_only=True

Current behavior appears entirely correct and will go unchanged in 1.5/2.0. In particular, when there are no numeric columns in the input, the output is empty as well.
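
To make the empty-output case concrete, here is a small sketch assuming a made-up frame with a grouping key and a single string column. With numeric_only=True there is nothing numeric left to aggregate, so the result should have no columns.

```python
import pandas as pd

# Hypothetical frame with no numeric columns besides the grouping key "A".
df = pd.DataFrame({"A": [1, 1], "C": ["2", "2"]})

result = df.groupby("A").sum(numeric_only=True)
remaining = result.columns.tolist()
print(remaining)
```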

numeric_only=False

Current behavior appears entirely correct, in that if there are to be any behavior changes in 2.0, we already emit the appropriate FutureWarning today. The only case where there will be a behavior change from 1.4.x to 2.0 is if the frame contains a nonnumeric column that can't be aggregated. 1.4.x drops the column whereas 2.0 will raise a TypeError.
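
The difference between versions can be probed with a sketch like the one below. The helper try_sum is hypothetical and exists only for this example; it reports either the resulting columns or that a TypeError was raised, so the same snippet can be run on 1.4.x (where the column is expected to be dropped) and on 2.0 (where a TypeError is expected).

```python
import warnings

import pandas as pd

# Hypothetical frame: "B" is numeric, "D" holds values that cannot be summed.
df = pd.DataFrame({"A": [1, 1], "B": [1, 1], "D": [object, object]})


def try_sum(frame, **kwargs):
    """Return the result columns, or the string "TypeError" if the reduction raises."""
    try:
        with warnings.catch_warnings():
            # Silence the deprecation warning that 1.4.x emits in this case.
            warnings.simplefilter("ignore", FutureWarning)
            return frame.groupby("A").sum(**kwargs).columns.tolist()
    except TypeError:
        return "TypeError"


outcome = try_sum(df, numeric_only=False)
print(outcome)
```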

numeric_only unspecified (lib.no_default)

I'll refer to the columns as in the code above:

  • B: Numeric column
  • C: Nonnumeric column that can be aggregated
  • D: Nonnumeric column that cannot be aggregated
  1. Columns ['B', 'C', 'D']

    • In 1.4.x we get column B and no warning is raised. In 2.0 this will raise a TypeError.
    • We should emit a warning in 1.5 about numeric_only defaulting to False in 2.0.
  2. Columns ['B', 'C']

    • In 1.4.x we get column B and no warning is raised. In 2.0 we will get both B and C in the result.
    • We should emit a warning in 1.5 about numeric_only defaulting to False in 2.0.
  3. Columns ['B', 'D']

    • In 1.4.x we get column B and no warning is raised. In 2.0 we will raise a TypeError.
    • We should emit a warning in 1.5 about numeric_only defaulting to False in 2.0.
  4. Columns ['C', 'D']

    • In 2.0 we will raise a TypeError, and 1.4.x currently warns that this will happen.
    • No change.
  5. Columns ['C']

    • In 1.4.x we get column C and no warning is raised. This is the correct result on 2.0, but in my opinion is not the correct result on 1.4.x where we should be treating numeric_only as True.
    • No change. It is not worth it to change behavior and raise a FutureWarning that the behavior will go back to what it is now.
  6. Columns ['D']

    • On 1.4.x this returns an empty result and warns that it will raise a TypeError. In 2.0, it will raise a TypeError.
    • No change.
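
The six cases above can be condensed into two rules. This is only my reading of the list, written as plain Python with made-up function names; it does not call pandas. In 2.0 an unspecified numeric_only raises whenever D is present and otherwise returns every column, and a new 1.5 warning is proposed exactly when B is present alongside at least one nonnumeric column.

```python
def expected_in_2_0(columns):
    """Outcome in 2.0 when numeric_only is unspecified (so it defaults to False)."""
    return "TypeError" if "D" in columns else list(columns)


def needs_new_warning_in_1_5(columns):
    """True when 1.4.x silently returns only B although other columns are present."""
    return "B" in columns and len(columns) > 1


cases = [["B", "C", "D"], ["B", "C"], ["B", "D"], ["C", "D"], ["C"], ["D"]]
summary = {
    tuple(cols): (expected_in_2_0(cols), needs_new_warning_in_1_5(cols))
    for cols in cases
}
for cols, (result, warn) in summary.items():
    print(cols, "->", result, "| new 1.5 warning:", warn)
```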

cc @jreback @jbrockmendel @jorisvandenbossche @simonjayhawkins @Dr-Irv

@rhshadrach rhshadrach added Groupby Deprecate Functionality to remove in pandas Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply labels Feb 19, 2022
@rhshadrach rhshadrach added this to the 1.5 milestone Feb 19, 2022
@Dr-Irv
Contributor

Dr-Irv commented Feb 19, 2022

Nice analysis. The case that bit my team was (5) above - we had a DF with a single column that contained lists, and used sum(). Between 1.2 and 1.3, the behavior was changed, but then reverted back to the 1.2.5 behavior in some 1.3.x version because of the issue I raised and others may have raised as well.

One thing to consider - the parameter numeric_only is also used in non-groupby operations (e.g. DataFrame.sum()). So I hope someone has looked at whether the meaning and defaults in both of those cases will be consistent for 2.0.

@rhshadrach
Member Author

"So I hope someone has looked at whether the meaning and defaults in both of those cases will be consistent for 2.0"

Good call - I do plan to look into this as well, but would be delighted if someone else wants to :)

@rhshadrach rhshadrach changed the title DEPR: DataFrameGroupBy numeric_only=True DEPR: DataFrameGroupBy numeric_only defaulting to True Feb 19, 2022
@jbrockmendel
Member

@rhshadrach finally caught up on the thread, not clear what input you're looking for

@rhshadrach
Member Author

@jbrockmendel - In the bottom half (points 1-6) I've outlined what the behavior is in 1.4 and will be in 2.0 (first bullet point) and what the behavior should be in 1.5 (second bullet point). Just looking for a thumbs up or down (and why).

@jbrockmendel
Member

In 2.0 we're going to have numeric_only=False be the default, and when the user specifies numeric_only=True we're actually going to fully respect that, right? And be consistent across the board? If so, then thumbs up.

carlescn added a commit to carlescn/MSc_bioinformatics_thesis that referenced this issue Jun 12, 2023
From Pandas changelog:
"Changed default of numeric_only in various DataFrameGroupBy methods; all methods now default to numeric_only=False (GH46072)."
See pandas-dev/pandas#46072
@travisturenne

Why was this changed? numeric_only=False should be the edge case not the default.
