
BUG: DataFrame.groupby drops timedelta column in v1.3.0 #42395

Closed
3 tasks done
venaturum opened this issue Jul 6, 2021 · 33 comments · Fixed by #43154
Assignees
Labels
Bug Groupby Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@venaturum
Contributor

venaturum commented Jul 6, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
df = pd.DataFrame(
    {
        "duration":[pd.Timedelta(i, unit="days") for i in range(1,6)],
        "value":[1,0,1,2,1]
    }
)

df.groupby("value").sum()

results in:

Empty DataFrame
Columns: []
Index: [0, 1, 2]

The same behaviour occurs with sum, mean, median, etc.

Note the following gives close to the expected result:

df.groupby("value")["duration"].sum()
value
0   2 days
1   9 days
2   4 days
Name: duration, dtype: timedelta64[ns]

Also, this problem may be unique to timedeltas, as the following similar example has no problem:

df = pd.DataFrame(
    {
        "duration":[i for i in range(1,6)],
        "value":[1,0,1,2,1]
    }
)

df.groupby("value").sum()

gives:

       duration
value
0             2
1             9
2             4

Problem description

In v1.3.0 (and master, commit dad3e7), a timedelta column goes missing when performing a groupby operation on a DataFrame.
This behaviour does not occur in pandas 1.2.5.

Expected Output

The expected output is given by pandas 1.2.5:

      duration
value
0       2 days
1       9 days
2       4 days

It is a DataFrame with one column, "duration", indexed by the groupby keys.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f00ed8f
python : 3.7.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.3.0
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@venaturum venaturum added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 6, 2021
@jreback jreback added Regression Functionality that used to work in a prior pandas version Groupby and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 6, 2021
@jreback jreback added this to the 1.3.1 milestone Jul 6, 2021
@venaturum
Contributor Author

venaturum commented Jul 6, 2021

Note that using

df.groupby("value").sum(numeric_only=False)

gives the desired result.
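To make the workaround concrete, here is a self-contained sketch that rebuilds the DataFrame from the report and passes numeric_only=False explicitly:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "duration": [pd.Timedelta(i, unit="days") for i in range(1, 6)],
        "value": [1, 0, 1, 2, 1],
    }
)

# Passing numeric_only=False keeps the timedelta column in the result,
# matching the 1.2.5 output shown in "Expected Output".
result = df.groupby("value").sum(numeric_only=False)
print(result)
```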

@jmcomie
Contributor

jmcomie commented Jul 7, 2021

take

@jmcomie
Contributor

jmcomie commented Jul 8, 2021

From git bisect, the first commit since v1.2.5 exhibiting a repro is 6b94e24.

commit 6b94e2431536eac4d943e55064d26deaac29c92b
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Jun 2 08:14:08 2021 -0700

    BUG: DataFrameGroupBy with numeric_only and empty non-numeric data (#41706)

I'm reviewing that code change and calls below it. My early thought from looking at _reduce and elsewhere is that the numeric_only implementations are a touch more scattered/ad-hoc than they could be. Will keep exploring and update with progress.

@jmcomie
Contributor

jmcomie commented Jul 9, 2021

I'm reviewing the cython -> python aggregation fallback behavior and specifically how 6b94e24 affects our test case. Once I understand the specific error sequence, and details like how numeric_only is shepherded, I should have a good idea of where to pursue a clean fix. If it's just incidental that the cython code can't process these timedelta aggregations, then that strikes me as the place to make a change, rather than restoring the catch-all fallback behavior.

@jmcomie
Contributor

jmcomie commented Jul 9, 2021

Update: The behavior is consistent with the documentation if my understanding is correct, since the pandas.core.groupby.GroupBy.sum numeric_only kwarg defaults to True. https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.sum.html#
The behavior prior to 6b94e24 was not consistent with the spec.

When sum is called with numeric_only False or None, our repro/test-case behaves now as it did before. Datetime and timedelta are expressly classified as non-numeric in the docs and multiple places in the code. See core/internals/blocks.py:class DatetimeLikeBlock and core/dtypes/common.py:def is_numeric_dtype for examples of this non-numeric determination.

Perhaps it's worth contemplating a longer-term change to the API to broaden the numeric_only classification to include datetime and timedelta since they can be operated on arithmetically, but that's beyond the scope of this issue imo.
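The non-numeric classification described above can be checked with the public pandas.api.types helpers (a small illustrative check, not the internal code paths the comment cites):

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Datetime and timedelta dtypes are classified as non-numeric...
assert is_numeric_dtype(np.dtype("int64"))
assert is_numeric_dtype(np.dtype("float64"))
assert not is_numeric_dtype(np.dtype("timedelta64[ns]"))
assert not is_numeric_dtype(np.dtype("datetime64[ns]"))

# ...even though timedeltas support arithmetic reductions such as sum.
s = pd.Series(pd.to_timedelta(["1 days", "2 days"]))
assert s.sum() == pd.Timedelta(days=3)
```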

@jmcomie
Contributor

jmcomie commented Jul 9, 2021

@jreback I defer to you as to whether a code fix should still be pursued here.

@venaturum
Contributor Author

Good work @jmcomie. To add some things to the discussion:

  1. df.sum() sums the timedelta column

  2. df.groupby("value").apply(pd.DataFrame.sum) sums the timedelta column (but also the groupby column)

  3. The results for

My 2 cents: ideally, any method X that works as DataFrame.X and as df.groupby(...).apply(DataFrame.X) should be consistent with df.groupby(...).X.
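A sketch of the consistency point: DataFrame.sum and a per-column groupby reduction both handle the timedelta column; it is the plain groupby reduction whose behavior varies between the versions discussed here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "duration": [pd.Timedelta(i, unit="days") for i in range(1, 6)],
        "value": [1, 0, 1, 2, 1],
    }
)

# DataFrame.sum includes the timedelta column...
totals = df.sum()
assert totals["duration"] == pd.Timedelta(days=15)

# ...and so does an explicit per-column groupby reduction.
per_group = df.groupby("value")["duration"].sum()
assert per_group[1] == pd.Timedelta(days=9)

# Whether df.groupby("value").sum() matches depends on the pandas version,
# which is the inconsistency under discussion in this thread.
print(df.groupby("value").sum())
```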

@jmcomie
Contributor

jmcomie commented Jul 13, 2021

@venaturum what strikes me as a clean and useful way to implement that consistency is to follow through on the warning added in aa3bfc4 and eliminate the numeric_only=None (two-stage) behavior, and default it to False everywhere:

FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.

I'm just not seeing a scenario where dropping columns by default is better than throwing an error to the user to let them take action on it. Perhaps a new top-level function to produce the numeric subset of a DataFrame could be the right balance and better support the cases where an external reduction/aggregation is used. @jbrockmendel: is this what's been planned?

@jbrockmendel
Member

is this what's been planned?

numeric_only=None is deprecated, so will be removed in 2.0. Not sure if that's what you're asking here

@jmcomie
Contributor

jmcomie commented Jul 13, 2021

is this what's been planned?

numeric_only=None is deprecated, so will be removed in 2.0. Not sure if that's what you're asking here

Removing numeric_only=None makes sense. Further, it seems that in order to categorically avoid the kind of column dropping mentioned in commit aa3bfc4, numeric_only will have to default to False everywhere. Do I have it right?

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jul 14, 2021
@simonjayhawkins
Member

From git bisect, the first commit since v1.2.5 exhibiting a repro is 6b94e24.

#41706 (to link issues in Github)

>>> pd.__version__
'1.2.5'
>>> 
>>> df.sum()
duration    15 days 00:00:00
value                      5
dtype: object
>>> 
>>> df.groupby("value").sum()
      duration
value         
0       2 days
1       9 days
2       4 days
>>> 
>>> pd.__version__
'1.3.0'
>>> 
>>> df.sum()
duration    15 days 00:00:00
value                      5
dtype: object
>>> 
>>> df.groupby("value").sum()
Empty DataFrame
Columns: []
Index: [0, 1, 2]

Removing numeric_only=None makes sense. Further, it seems that in order to categorically avoid the kind of column dropping mentioned in commit aa3bfc4, numeric_only will have to default to False everywhere. Do I have it right?

I think this discussion may not be relevant here.

Looking at the output from 1.2.5 and 1.3.0 above, there is no FutureWarning for df.sum() in 1.3.0, and the behavior of df.sum() and df.groupby("value").sum() is now inconsistent.

The solution here is to revert to 1.2.5 behavior for 1.3.x imo.

cc @jbrockmendel

@jmcomie
Contributor

jmcomie commented Jul 14, 2021

@simonjayhawkins it's because of the different default value of numeric_only in DataFrame/NDFrame.sum and GroupBy.sum that the behavior is different. And that default value is unchanged between 1.2.5 and 1.3.0 in these functions, best I can tell.

@simonjayhawkins
Member

To be clear, imo the FutureWarning and the discussion of future behavior are not relevant, because there is a change in actual behavior in a minor release, and behavior should only change in a major release.

So as far as I am concerned, the correct behavior is beside the point: the behavior in 1.2.x, whether correct or not, should be maintained until a future release.

@jmcomie
Contributor

jmcomie commented Jul 14, 2021

I guess my disposition is to view the v1.3.0 behavior as a bug fix of the following test case:

import pandas as pd
df = pd.DataFrame(
    {
        "duration":[pd.Timedelta(i, unit="days") for i in range(1,6)],
        "value":[1,0,1,2,1]
    }
)
df.groupby("value").sum(numeric_only=True)

And to change it is to regress on that. But the dimension of user impact is important, and I can see going either way here.
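Whichever default wins, explicitly selecting the column sidesteps the numeric_only question entirely and behaves the same on 1.2.x and 1.3.x. A minimal sketch; to_frame() restores the DataFrame shape from the expected output:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "duration": [pd.Timedelta(i, unit="days") for i in range(1, 6)],
        "value": [1, 0, 1, 2, 1],
    }
)

# Selecting the column before reducing avoids the numeric_only pruning.
result = df.groupby("value")["duration"].sum().to_frame()
print(result)
```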

@simonjayhawkins
Member

To keep the fix in #41706 (if it is accepted that the change in behavior is a bug fix and does not need a deprecation), could we maybe make the default for numeric_only lib.no_default? Would that make it easier to restore the behavior when numeric_only is not specified, while keeping the fix for when numeric_only=True is specified?

@jbrockmendel
Member

Further, it seems that in order to categorically avoid the kind of column dropping mentioned in commit aa3bfc4, numeric_only will have to default to False everywhere. Do I have it right?

I don't know if it has been explicitly discussed, but that makes sense to me.

to keep the fix in #41706, if accepted that the change in behavior is a bug fix and does not need a deprecation

I'm not clear on the problem here. #41706 was specifically about empty cases, which the examples in this thread are not.

@rhshadrach
Member

+1 on defaulting numeric_only to False. To have code

df[["a", "b"]].groupby("a").sum()

not result in summing b is surprising.

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.1, 1.3.2 Jul 24, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.2, 1.3.3 Aug 15, 2021
@Dr-Irv
Contributor

Dr-Irv commented Aug 24, 2021

I think it is worth discussing how we want to resolve this. At this comment #43154 (comment) , @jbrockmendel wrote:

Changing the default is something we can do, but that is an API change, not a bugfix, and would require a deprecation cycle.

I think there is an open question of how we want to resolve this. Here's a summary of how I see things based on discussions here, in #43108 (now closed) and in the PR #43154.

As I see it, I think there are a few options for resolving this for 1.3.x:

  1. Change the default to numeric_only=False, revert back to the 1.2.x behavior, and treat this as a documentation error in 1.2.
  2. Follow the 1.3.x new behavior, that corresponds to the docs in 1.2 (numeric_only=True), and fix the 1.3 documentation to not say numeric_only=<no_default> .
  3. Use a different behavior that is based on the data types as suggested in BUG: Regression from 1.2.5 to 1.3.x: groupby using sum on DataFrame containing lists fails #43108

Independent of the above, we need to also decide what we want the behavior (and docs) to be for 2.0, so that we may need to create a FutureWarning indicating a deprecation/change of behavior.

Personally, I would vote for (1), chalk this up to a documentation issue in 1.2, and not have any change in behavior for 2.0.

@jbrockmendel
Member

but the behavior in 1.2 and prior corresponded to DataFrame.groupby(numeric_only=False)

That's accurate for the OP example, but not in the general case. The code was pretty tangled. It only behaved as if numeric_only was False if there were no numeric columns. So if you add another numeric column to the OP example df["value2"] = df["value"] then df.groupby("value").sum() did drop the non-numeric column, and the result in 1.2.4 matches what we get on master.

@Dr-Irv
Contributor

Dr-Irv commented Aug 24, 2021

That's accurate for the OP example, but not in the general case. The code was pretty tangled. It only behaved as if numeric_only was False if there were no numeric columns.

Thanks for pointing that out. So this suggests the following option for 1.3.x:
4. If numeric_only=True but all columns are non-numeric, ignore the numeric_only setting. Document this as the behavior, and then code prior to 1.2 that had only non-numeric columns continues to work, and the behavior is more intuitive.

@kurchi1205
Contributor

Yeah, I was suggesting that.

@jbrockmendel
Member

If numeric_only=True but all columns are non-numeric, ignore the numeric_only setting

Not in the case where the user explicitly passed it though right?

and the behavior is more intuitive

That's appreciably more complicated than the current meaning of numeric_only, also doesn't match what it means in DataFrame reductions.

@kurchi1205
Contributor

This should happen only when numeric_only is left at its default: if all the columns of the DataFrame are non-numeric, then the default should behave as False; in all other cases it should behave as True.

@Dr-Irv
Contributor

Dr-Irv commented Aug 24, 2021

If numeric_only=True but all columns are non-numeric, ignore the numeric_only setting

Not in the case where the user explicitly passed it though right?

and the behavior is more intuitive

That's appreciably more complicated than the current meaning of numeric_only, also doesn't match what it means in DataFrame reductions.

So here is another proposal:
5. numeric_only=None is the default. It has the following consequences:

  • All numeric columns - all get reduced
  • Mix of numeric and non-numeric columns - only numeric get reduced
  • All non-numeric columns - all get reduced

If numeric_only=True, then only reduce numeric columns. If numeric_only=False, reduce all columns.

Therefore, the default numeric_only=None preserves the 1.2.x behavior.
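A rough sketch of the column selection that proposal (5) describes; columns_to_reduce is a hypothetical illustrative helper, not pandas internals:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def columns_to_reduce(df: pd.DataFrame, numeric_only=None) -> list:
    """Illustrative sketch of proposal (5): which columns a groupby
    reduction would keep for a given numeric_only setting."""
    numeric = [c for c in df.columns if is_numeric_dtype(df[c])]
    if numeric_only is True:
        return numeric                  # only numeric columns
    if numeric_only is False:
        return list(df.columns)         # all columns
    # numeric_only=None (default): a mixed frame keeps only numeric columns;
    # all-numeric and all-non-numeric frames keep everything (1.2.x behavior).
    if numeric and len(numeric) < len(df.columns):
        return numeric
    return list(df.columns)

df = pd.DataFrame({"duration": pd.to_timedelta([1, 2], unit="D"), "x": [1, 2]})
print(columns_to_reduce(df, None))                    # mixed frame
print(columns_to_reduce(df.drop(columns="x"), None))  # all non-numeric
```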

@kurchi1205
Contributor

@Dr-Irv Will that solve the above issue?

@kurchi1205
Contributor

I was suggesting: why don't we check only the columns that have to be aggregated, rather than all the columns of the DataFrame?
Suppose, in the above issue, we leave out the group key (i.e. value) and check only the columns to be aggregated (i.e. duration), then set the numeric_only value accordingly; that should solve the issue.

@Dr-Irv
Contributor

Dr-Irv commented Aug 24, 2021

@Dr-Irv Will that solve the above issue ?

I think we need to get a group of us to agree on what behavior we want implemented to address the issue reported here and in #43108 . That agreement should come from @simonjayhawkins @jbrockmendel and possibly others.

@simonjayhawkins
Member

@Dr-Irv your proposal is more eloquent than my comments in #42395 (comment) and #42395 (comment)

I'm fine with restoring the old behavior, keeping the bugfix for when numeric_only=True is specified by the user, and adding a warning that the behavior will change in the future, either that numeric_only is True so that the columns will be dropped or that the default is changed as suggested in #42395 (comment)

on this last point, numeric_only=False as a default makes sense but must also be consistent with the non-groupby cases.

@Dr-Irv
Contributor

Dr-Irv commented Aug 24, 2021

@Dr-Irv your proposal is more eloquent than my comments in #42395 (comment) and #42395 (comment)

I'm fine with restoring the old behavior, keeping the bugfix for when numeric_only=True is specified by the user, and adding a warning that the behavior will change in the future, either that numeric_only is True so that the columns will be dropped or that the default is changed as suggested in #42395 (comment)

on this last point, numeric_only=False as a default makes sense but must also be consistent with the non-groupby cases.

By "your proposal", do you mean my (5) here: #42395 (comment) ? Or one of the other 4? (first 3 in #42395 (comment)), fourth one here: #42395 (comment)

@jbrockmendel
Member

on this last point, numeric_only=False as a default makes sense but must also be consistent with the non-groupby cases.

The DataFrame reductions all default to numeric_only=None, which is deprecated, but I don't think we've decided what the future default will be. I'm fine with False.

@rhshadrach
Member

I'm a bit late to the recent parts of the discussion, but I just spent some time piecing them together from various issues/PRs, and wanted to post a summary. Apologies in advance if any of this is outdated/mistaken.

The existing behavior is nicely summarized here (where the second link corrects the understanding of a case in the first link):

Now the way forward seems to have a consensus:

There is a reference implementation here: #43154 (comment)

Where I think there is some disagreement is in whether there should be a warning in the patch for 1.3.x from @simonjayhawkins:

I don't want to see any changes for now other than restoring 1.2 behavior for the case where the user does not explicitly pass numeric_only. i.e. no existing tests changes, no new warnings, no changes to keyword defaults, so that we can backport and close the regression issue.

I'm +1 on removing the warning for 1.3.x and only having a FutureWarning in 1.4 for the pending behavior change. @jbrockmendel - any thoughts here?

This implementation looks correct to me - in particular the use of self._obj_with_exclusions. In the example:

df = pd.DataFrame({'a': ['x', 'x'], 'b': [2, 3], 'c': ['z', 'w']})
print(df.groupby('a')[['c']].sum())

self.obj has columns a, b, and c whereas self._obj_with_exclusions has a single column c. The result of the above code on 1.2.x is

    c
a    
x  zw

regardless of the passed value of numeric_only (including not passing in any value at all). To retain the default behavior in this example, we want to use self._obj_with_exclusions.

@simonjayhawkins
Member

Where I think there is some disagreement is in whether there should be a warning in the patch for 1.3.x from @simonjayhawkins:

I don't want to see any changes for now other than restoring 1.2 behavior for the case where the user does not explicitly pass numeric_only. i.e. no existing tests changes, no new warnings, no changes to keyword defaults, so that we can backport and close the regression issue.

I'm +1 on removing the warning for 1.3.x and only having a FutureWarning in 1.4 for the pending behavior change. @jbrockmendel - any thoughts here?

I could back down on that point to a certain extent.

From the policy, https://pandas.pydata.org/pandas-docs/dev/development/policies.html#version-policy:

We will not introduce new deprecations in patch releases.

That statement is very explicit, but where we are fixing the regressed behavior, we could maybe add a warning for the changed cases only (that we are reverting to 1.2.x behavior)?

I think we may be able to justify that as part of the regression fix if the deprecation/future warning is only given for those cases for now. We would still need to add that warning in 1.4 for the other cases, so it could also be better to wait and do them all at the same time.

@jbrockmendel
Member

I'm +1 on removing the warning for 1.3.x and only having a FutureWarning in 1.4 for the pending behavior change. @jbrockmendel - any thoughts here?

I'm fine with that.
