BUG: DataFrame.groupby drops timedelta column in v1.3.0 #42395
Comments
note using … gives the desired result
take
From git bisect, the first commit since v1.2.5 exhibiting a repro is 6b94e24.
I'm reviewing that code change and calls below it. My early thought from looking at _reduce and elsewhere is that the numeric_only implementations are a touch more scattered/ad-hoc than they could be. Will keep exploring and update with progress.
I'm reviewing the cython -> python aggregation fallback behavior and specifically how 6b94e24 affects our test case. Once I understand the specific error sequence, and details like how numeric_only is shepherded, I should have a good idea of where to pursue a clean fix. If it's just incidental that the cython code can't process these timedelta aggregations, then that strikes me as the place to make a change, rather than restoring the catch-all fallback behavior.
Update: The behavior is consistent with the documentation if my understanding is correct, since the pandas.core.groupby.GroupBy.sum numeric_only kwarg defaults to True (https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.sum.html#). When sum is called with numeric_only False or None, our repro/test-case behaves now as it did before. Datetime and timedelta are expressly classified as non-numeric in the docs and multiple places in the code. See core/internals/blocks.py:class DatetimeLikeBlock and core/dtypes/common.py:def is_numeric_dtype for examples of this non-numeric determination. Perhaps it's worth contemplating a longer-term change to the API to broaden the numeric_only classification to include datetime and timedelta since they can be operated on arithmetically, but that's beyond the scope of this issue imo.
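To make that concrete, here is a minimal sketch modelled on this report (the frame construction and the "key"/"duration" column names are assumptions, not the original snippet):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "duration": pd.to_timedelta([1, 2, 3], unit="s"),
})

# With the GroupBy.sum default (numeric_only=True), the timedelta column is
# excluded, so on 1.3.0 the grouped result comes back without it.
print(df.groupby("key").sum())

# Passing numeric_only=False (or None) keeps the timedelta column, matching
# the 1.2.5 behaviour described in this issue.
print(df.groupby("key").sum(numeric_only=False))
```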
@jreback I defer to you as to whether a code fix should still be pursued here.
Good work @jmcomie, to add some things to the discussion:
My 2 cents is that ideally any method X that works with …
@venaturum what strikes me as a clean and useful way to implement that consistency is to follow through on the warning added in aa3bfc4 and eliminate the numeric_only=None (two-stage) behavior, and default it to False everywhere:
I'm just not seeing a scenario where dropping columns by default is better than throwing an error to the user to let them take action on it. Perhaps a new top-level function to produce the numeric subset of a DataFrame could be the right balance and better support the cases where an external reduction/aggregation is used. @jbrockmendel: is this what's been planned?
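For reference, a numeric subset can already be pulled out with DataFrame.select_dtypes; a rough sketch of the kind of helper being floated (the name numeric_subset is hypothetical, not pandas API):

```python
import pandas as pd

def numeric_subset(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper, not part of pandas: keep only the columns that
    # the numeric_only machinery would consider numeric.
    return df.select_dtypes(include="number")

df = pd.DataFrame({
    "key": ["a", "b"],
    "duration": pd.to_timedelta([1, 2], unit="s"),
    "value": [1.0, 2.0],
})
print(numeric_subset(df).columns.tolist())  # ['value'] (the timedelta is excluded)
```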
numeric_only=None is deprecated, so will be removed in 2.0. Not sure if that's what you're asking here.
Removing numeric_only=None makes sense. Further, it seems that in order to categorically avoid the kind of column dropping mentioned in commit aa3bfc4, numeric_only will have to default to False everywhere. Do I have it right?
#41706 (to link issues in Github)
I think this discussion may not be relevant here. Looking at the output from 1.2.5 and 1.3.0 above, there is no … The solution here is to revert to 1.2.5 behavior for 1.3.x imo.
@simonjayhawkins it's because of the different default value of numeric_only in DataFrame/NDFrame.sum and GroupBy.sum that the behavior is different. And that default value is unchanged between 1.2.5 and 1.3.0 in these functions, best I can tell.
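A small sketch of that asymmetry, assuming a lone timedelta column like the one in the report:

```python
import pandas as pd

df = pd.DataFrame({"duration": pd.to_timedelta([1, 2, 3], unit="s")})

# DataFrame.sum defaults to numeric_only=None, so the timedelta column is
# attempted and a total is returned.
print(df.sum())  # duration   0 days 00:00:06

# GroupBy.sum defaults to numeric_only=True, which is why the same column
# disappears from df.groupby(...).sum() on 1.3.0.
```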
tbc, imo the … So as far as I am concerned the correct behavior is not relevant, and the behavior in 1.2.x, whether correct or not, should be maintained until a future release.
I guess my disposition is to view the v1.3.0 behavior as a bug fix of the following test case:
And to change it is to regress on that. But the dimension of user impact is important, and I can see going either way here.
To keep the fix in #41706, if it is accepted that the change in behavior is a bug fix and does not need a deprecation, could we maybe make the default for …
I don't know if it has been explicitly discussed, but that makes sense to me.
I'm not clear on the problem here. #41706 was specifically about empty cases, which the examples in this thread are not.
+1 on defaulting … not result in summing …
I think it is worth discussing how we want to resolve this. At this comment #43154 (comment), @jbrockmendel wrote:
I think there is an open question of how we want to resolve this. Here's a summary of how I see things based on discussions here, in #43108 (now closed) and in the PR #43154.
As I see it, I think there are a few options for resolving this for 1.3.x:
Independent of the above, we also need to decide what we want the behavior (and docs) to be for 2.0, so we may need to create a … Personally, I would vote for (1), chalk this up to a documentation issue in 1.2, and not have any change in behavior for 2.0.
That's accurate for the OP example, but not in the general case. The code was pretty tangled. It only behaved as if numeric_only was False if there were no numeric columns. So if you add another numeric column to the OP example …
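A sketch of that general case (the extra "value" column is an assumption added for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "duration": pd.to_timedelta([1, 2, 3], unit="s"),
    "value": [1.0, 2.0, 3.0],  # an extra numeric column
})

# Once a numeric column is present, only "value" survives the aggregation and
# "duration" is silently dropped; per the observation above this was already
# the behaviour on 1.2.x, not just on 1.3.0.
print(df.groupby("key").sum())
```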
Thanks for pointing that out. So this suggests the following option for 1.3.x: |
yeah I was suggesting that …
Not in the case where the user explicitly passed it though right?
That's appreciably more complicated than the current meaning of numeric_only, and it also doesn't match what it means in DataFrame reductions.
This should happen only when numeric_only is set to the default. If all the columns of the dataframe are non-numeric, then the default value should be False; in all other cases it should be True.
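A rough sketch of that proposal (the helper below is hypothetical, not pandas API):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def default_numeric_only(df: pd.DataFrame) -> bool:
    # Hypothetical: when the user does not pass numeric_only explicitly, the
    # effective default would be False only if the frame has no numeric
    # columns at all, and True otherwise.
    return any(is_numeric_dtype(dtype) for dtype in df.dtypes)

df = pd.DataFrame({
    "key": ["a", "b"],
    "duration": pd.to_timedelta([1, 2], unit="s"),
})
print(default_numeric_only(df))  # False: nothing numeric, so nothing would be dropped
```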
So here is another proposal:
If … Therefore, the default …
@Dr-Irv Will that solve the above issue?
I was suggesting: why don't we check only the columns that have to be aggregated, and not all the columns of the dataframe?
I think we need to get a group of us to agree on what behavior we want implemented to address the issue reported here and in #43108. That agreement should come from @simonjayhawkins @jbrockmendel and possibly others.
@Dr-Irv your proposal is more eloquent than my comments in #42395 (comment) and #42395 (comment). I'm fine with restoring the old behavior, keeping the bugfix for when … On this last point, numeric_only=False as a default makes sense but must also be consistent with the non-groupby cases.
By "your proposal", do you mean my (5) here: #42395 (comment)? Or one of the other 4 (the first 3 in #42395 (comment), the fourth one here: #42395 (comment))?
The DataFrame reductions all default to numeric_only=None, which is deprecated, but I don't think we've decided what the future default will be. I'm fine with False.
I'm a bit late to the recent parts of the discussion, but I just spent some time piecing them together from various issues/PRs, and wanted to post a summary. Apologies in advance if any of this is outdated/mistaken. The existing behavior is nicely summarized here (where the second link corrects the understanding of a case in the first link):
Now the way forward seems to have a consensus:
There is a reference implementation here: #43154 (comment). Where I think there is some disagreement is whether there should be a warning in the patch for 1.3.x from @simonjayhawkins:
I'm +1 on removing the warning for 1.3.x and only having a FutureWarning in 1.4 for the pending behavior change. @jbrockmendel - any thoughts here? This implementation looks correct to me - in particular the use of …
self.obj has columns a, b, and c whereas self._obj_with_exclusions has a single column c. The result of the above code on 1.2.x is …
regardless of the passed value of …
I could back down on that point to a certain extent. From the policy, https://pandas.pydata.org/pandas-docs/dev/development/policies.html#version-policy
That statement is very explicit, but where we are fixing the regressed behavior, we could maybe add a warning for the changed cases only (that we are reverting to 1.2.x behavior)? I think we may be able to justify that as part of the regression fix if the deprecation/future warning is only given for those cases for now. We would still need to then add that warning in 1.4 for the other cases, so it could also be better to wait and do them all at the same time.
I'm fine with that.
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
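The original snippet was not preserved in this capture; a minimal reproduction consistent with the description (the grouping key name and the data values are assumptions, the timedelta "duration" column comes from the report) would be:

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "duration": pd.to_timedelta([1, 2, 3], unit="s"),
})

# pandas 1.2.5: returns one "duration" column summed per group.
# pandas 1.3.0: the timedelta column is dropped, leaving an empty frame
# indexed by "key".
print(df.groupby("key").sum())
```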
results in:
The same behaviour occurs whether it's sum, mean, median, etc.
Note the following gives close to the expected result:
Also, this problem may be unique to Timedeltas, as the following similar example has no problem:
gives:
Problem description
In v1.3.0 (and master, commit dad3e7), when performing a groupby operation on a dataframe, a timedelta column goes missing.
This behaviour does not occur in pandas 1.2.5.
Expected Output
The expected output is given by pandas 1.2.5:
It is a dataframe with one column, "duration", indexed by the groupby keys.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : f00ed8f
python : 3.7.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.3.0
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None