BUG/REG: RollingGroupby MultiIndex levels dropped #38737
Conversation
looks like pls check that perf is not regressed
For the relevant ASV
("val1", "val1", "val1", "val1"),
("val2", "val2", "val2", "val2"),
],
names=["idx1", "idx2", "idx1", "idx2"],
Are we sure this is the behaviour we want?
It might be the most consistent, but it's also kind of useless to repeat those index levels. So we should at least have a way to avoid getting that?
I would say the consistency is more maintainable on our end compared to additionally including logic to de-duplicate index levels given some condition.
I would prefer the user explicitly call droplevel.
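A minimal sketch of that suggestion (not from the PR itself): given a result whose index repeats the original levels as group keys, the user can drop the prepended levels explicitly. Dropping by position rather than by name, since duplicated names are ambiguous.

```python
import pandas as pd

# Hypothetical series mirroring the duplicated-level result discussed above:
# the group keys repeat the original index levels.
idx = pd.MultiIndex.from_tuples(
    [
        ("val1", "val1", "val1", "val1"),
        ("val2", "val2", "val2", "val2"),
    ],
    names=["idx1", "idx2", "idx1", "idx2"],
)
s = pd.Series([1.0, 3.0], index=idx)

# Drop the first two (group-key) levels by position; droplevel by name
# would be ambiguous here because the names are duplicated.
deduped = s.droplevel([0, 1])
print(list(deduped.index.names))
```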
hmm
In [134]: df.groupby(level=0).transform('max')
Out[134]:
Max Speed
Animal Type
Falcon Captive 390.0
Wild 390.0
Parrot Captive 30.0
Wild 30.0
so i think we are doing some magic in groupby for this. These are conceptually similar operations.
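The "magic" being referred to can be checked directly: transform broadcasts the per-group result back to the caller's original index, so no group-key levels are prepended. A runnable version of the session above:

```python
import pandas as pd

arrays = [
    ["Falcon", "Falcon", "Parrot", "Parrot"],
    ["Captive", "Wild", "Captive", "Wild"],
]
index = pd.MultiIndex.from_arrays(arrays, names=("Animal", "Type"))
df = pd.DataFrame({"Max Speed": [390.0, 350.0, 30.0, 20.0]}, index=index)

out = df.groupby(level=0).transform("max")
# transform keeps the input index unchanged: same levels, same order.
print(out.index.equals(df.index))
```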
Before indexers were implemented for groupby().rolling(), this was the result:
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '1.0.5'
In [3]: from pandas import *
In [4]: arrays = [
...: ["Falcon", "Falcon", "Parrot", "Parrot"],
...: ["Captive", "Wild", "Captive", "Wild"],
...: ]
...: index = MultiIndex.from_arrays(arrays, names=("Animal", "Type"))
...: df = DataFrame({"Max Speed": [390.0, 350.0, 30.0, 20.0]}, index=index)
...: result = df.groupby(level=0)["Max Speed"].rolling(2).sum()
In [5]: result
Out[5]:
Animal Animal Type
Falcon Falcon Captive NaN
Wild 740.0
Parrot Parrot Captive NaN
Wild 50.0
Name: Max Speed, dtype: float64
which I think we should be trying to match. Though I'm not sure if we have solid conventions of the resulting index when using groupby.
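Worth noting: whatever index convention ends up being chosen, the rolling values themselves are stable across versions; only the index layout differs. A sketch of that invariant, reusing the example above:

```python
import pandas as pd

arrays = [
    ["Falcon", "Falcon", "Parrot", "Parrot"],
    ["Captive", "Wild", "Captive", "Wild"],
]
index = pd.MultiIndex.from_arrays(arrays, names=("Animal", "Type"))
df = pd.DataFrame({"Max Speed": [390.0, 350.0, 30.0, 20.0]}, index=index)

result = df.groupby(level=0)["Max Speed"].rolling(2).sum()
# The first entry of each group is NaN (window not full yet); the
# computed sums are the same regardless of the index convention.
print(result.dropna().tolist())
```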
yep i see that. ok i think that we should revert for 1.2.x and then decide for 1.3. i am leaning towards having the same as groupby here.
Okay once this is merged in I can create another issue to discuss what the index behavior should be for groupby rolling with duplicate index levels.
When comparing to 1.0.5 behaviour, we also had:
In [5]: s.groupby(["idx1", "idx2"], group_keys=False).rolling(1).mean()
Out[5]:
idx1 idx2
val1 val1 1.0
val1 2.0
val2 val2 3.0
dtype: float64
In [9]: pd.__version__
Out[9]: '1.0.5'
So this PR then deviates from that for this case.
(I know the influence of group_keys=False has been considered a bug, but we could also reconsider that, since the above seems to actually give the desired result?)
The group_keys result difference I think is due to the implementation change in groupby().rolling(). Before, groupby().rolling() under the hood was groupby().apply(lambda x: x.rolling()...) and therefore group_keys impacted the result (since the argument is only applicable for groupby().apply()). After groupby().rolling() moved away from using apply, group_keys didn't impact the result.
So IMO, group_keys shouldn't have ever really influenced the result, since groupby().apply() was never called directly by the user.
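A sketch of that old code path (an approximation, not the actual 1.0.5 implementation): when rolling is routed through apply, group_keys=False is what suppresses the prepended group-key levels, leaving the original index intact.

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [["val1", "val1", "val2"], ["val1", "val1", "val2"]],
    names=["idx1", "idx2"],
)
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# Rough stand-in for the old implementation: rolling via apply.
# group_keys=False means apply does not prepend the group keys,
# so the result keeps the caller's index.
old_style = s.groupby(["idx1", "idx2"], group_keys=False).apply(
    lambda x: x.rolling(1).mean()
)
print(old_style.index.equals(s.index))
```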
can you add testing for group_keys in any event?
Yes that's parameterized here already: https://github.com/pandas-dev/pandas/pull/38737/files#diff-e338c43cbd06b849f3d6a1b97cd787a48770c616b627f33eb20f67a6fc56b116R559
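The property those parameterized tests cover can be sketched directly (a simplified version, not the test from the PR): after the implementation change, group_keys has no effect on groupby().rolling().

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [["val1", "val1", "val2"], ["val1", "val1", "val2"]],
    names=["idx1", "idx2"],
)
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# Both spellings of group_keys yield identical results for rolling,
# since the argument only applies to groupby().apply().
with_keys = s.groupby(["idx1", "idx2"], group_keys=True).rolling(1).mean()
without_keys = s.groupby(["idx1", "idx2"], group_keys=False).rolling(1).mean()
print(with_keys.equals(without_keys))
```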
for a future issue (these are various return values now, using 1.0.5)
thanks @mroeschke pls open an issue for 1.3 to discuss what we should do with this.
@meeseeksdev backport 1.2.x
You could also interpret that as "because of how it was implemented, I don't think we should hurry in adding this fix to 1.2.x. Let's first properly discuss what behaviour we want (because we could end up going back again to the previous behaviour ..)"
Exploring the return value across the recent pandas versions for one of the cases: https://nbviewer.jupyter.org/gist/jorisvandenbossche/500ff1d082c288c378dc1972e24e6b4e
…#38784) Co-authored-by: Matthew Roeschke <emailformattr@gmail.com>
…dex levels dropped (pandas-dev#38737)"
Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
…#38737)" (pandas-dev#39191) This reverts commit a37f1a4.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This was accidentally caused by #37661 where I thought #37641 was a bug when it was actually the correct behavior.
Therefore, I had to change the behavior of some prior tests.