
Rolling groupby should not maintain the by column in the resulting DataFrame #14013

Closed
chrisaycock opened this issue Aug 16, 2016 · 12 comments · Fixed by #20984
Labels: API Design, Bug, Groupby, Reshaping

@chrisaycock
Contributor

I found another oddity while digging through #13966.

Begin with the initial DataFrame in that issue:

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

Save the grouping:

In [215]: g = df.groupby('A')

Compute the rolling sum:

In [216]: r = g.rolling(4)

In [217]: r.sum()
Out[217]:
         A      B
A
1 0    NaN    NaN
  1    NaN    NaN
  2    NaN    NaN
  3    4.0    6.0
  4    4.0   10.0
  5    4.0   14.0
  6    4.0   18.0
  7    4.0   22.0
  8    4.0   26.0
  9    4.0   30.0
...    ...    ...
2 30   8.0  114.0
  31   8.0  118.0
3 32   NaN    NaN
  33   NaN    NaN
  34   NaN    NaN
  35  12.0  134.0
  36  12.0  138.0
  37  12.0  142.0
  38  12.0  146.0
  39  12.0  150.0

[40 rows x 2 columns]

It maintains the by column (A)! That column should not be in the resulting DataFrame.

It gets weirder if I compute the sum over the entire grouping and then re-do the rolling calculation. Now the by column is gone, as expected:

In [218]: g.sum()
Out[218]:
     B
A
1  190
2  306
3  284

In [219]: r.sum()
Out[219]:
          B
A
1 0     NaN
  1     NaN
  2     NaN
  3     6.0
  4    10.0
  5    14.0
  6    18.0
  7    22.0
  8    26.0
  9    30.0
...     ...
2 30  114.0
  31  118.0
3 32    NaN
  33    NaN
  34    NaN
  35  134.0
  36  138.0
  37  142.0
  38  146.0
  39  150.0

[40 rows x 1 columns]

So the grouping summation has some sort of side effect.

@chrisaycock
Contributor Author

A little note while digging through more code: _convert_grouper in groupby.py has:

    if isinstance(grouper, dict):
        ...
    elif isinstance(grouper, Series):
        ...
    elif isinstance(grouper, (list, Series, Index, np.ndarray)):
        ...
    else:
        ...

The grouper is compared to Series twice, so the Series entry in the third isinstance check is dead code. I will fix this when I clean up everything.

@chrisaycock
Contributor Author

chrisaycock commented Aug 16, 2016

I can fix the issue if I set the group selection:

g._set_group_selection()

I think we need this function at the start of .rolling().

Seems similar to #12839
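(For anyone who does not want to touch the private _set_group_selection API: selecting the value column before rolling has the same effect through public methods. A minimal sketch on the frame from the top of this thread:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

# Selecting column B before rolling keeps the grouping column A out of
# the computation entirely, so it cannot appear in the result
r = df.groupby('A')['B'].rolling(4).sum()
```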

@jreback
Contributor

jreback commented Aug 17, 2016

This is defined behavior, in that it is identical to .apply on the groupby.

In [10]: df.groupby('A').rolling(4).sum()
Out[10]: 
         A      B
A                
1 0    NaN    NaN
  1    NaN    NaN
  2    NaN    NaN
  3    4.0    6.0
  4    4.0   10.0
...    ...    ...
3 35  12.0  134.0
  36  12.0  138.0
  37  12.0  142.0
  38  12.0  146.0
  39  12.0  150.0

[40 rows x 2 columns]

In [11]: df.groupby('A').rolling(4).apply(lambda x: x.sum())
Out[11]: 
         A      B
A                
1 0    NaN    NaN
  1    NaN    NaN
  2    NaN    NaN
  3    4.0    6.0
  4    4.0   10.0
...    ...    ...
3 35  12.0  134.0
  36  12.0  138.0
  37  12.0  142.0
  38  12.0  146.0
  39  12.0  150.0

[40 rows x 2 columns]

you can look back at the issues, IIRC @jorisvandenbossche and I had a long conversation about this.

@chrisaycock
Contributor Author

Hmm:

In [617]: df.groupby('A').sum()
Out[617]:
     B
A
1  190
2  306
3  284

In [618]: df.groupby('A').apply(lambda x: x.sum())
Out[618]:
    A    B
A
1  20  190
2  24  306
3  24  284

In addition to .rolling() and .apply(), .ohlc() and .expanding() keep the by column following a .groupby().
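(A sketch to reproduce the .expanding() case on the same frame; on the 0.18-era pandas discussed here, column A showed up in the output, while later versions exclude the grouping column:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

# Expanding sum per group; whether column A appears in the result
# depends on the pandas version
exp = df.groupby('A').expanding().sum()
```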

@jreback
Contributor

jreback commented Sep 1, 2016

On reread, this should be consistent, so marking as a bug.
It probably should not include the grouping column/level, even though .apply does.

@ohadle

ohadle commented Feb 15, 2017

A similar thing happens with index levels.

from pandas import DataFrame, Timestamp
import numpy as np

c = DataFrame({u'ul_payload': {('a', Timestamp('2016-11-01 06:00:00')): 4,
                               ('a', Timestamp('2016-11-01 06:15:00')): 5,
                               ('a', Timestamp('2016-11-01 07:15:00')): 6,
                               ('a', Timestamp('2016-11-01 07:30:00')): 7,
                               ('a', Timestamp('2016-11-01 07:45:00')): 8,
                               ('a', Timestamp('2016-11-01 09:00:00')): 9},
               u'dl_payload': {('a', Timestamp('2016-11-01 06:00:00')): 14,
                               ('a', Timestamp('2016-11-01 06:15:00')): 15,
                               ('a', Timestamp('2016-11-01 07:15:00')): 16,
                               ('a', Timestamp('2016-11-01 07:30:00')): 17,
                               ('a', Timestamp('2016-11-01 07:45:00')): 18,
                               ('a', Timestamp('2016-11-01 09:00:00')): 19}})

In [27]: c
Out[27]:
                       dl_payload  ul_payload
a 2016-11-01 06:00:00          14           4
  2016-11-01 06:15:00          15           5
  2016-11-01 07:15:00          16           6
  2016-11-01 07:30:00          17           7
  2016-11-01 07:45:00          18           8
  2016-11-01 09:00:00          19           9

In [29]: c.groupby(level=0).rolling(window=3).agg(np.sum)
Out[29]:
                         dl_payload  ul_payload
a a 2016-11-01 06:00:00         NaN         NaN
    2016-11-01 06:15:00         NaN         NaN
    2016-11-01 07:15:00        45.0        15.0
    2016-11-01 07:30:00        48.0        18.0
    2016-11-01 07:45:00        51.0        21.0
    2016-11-01 09:00:00        54.0        24.0

But not with group_keys=False:

In [48]: c.groupby(level=0, group_keys=False).rolling(window=3).agg(np.sum)
Out[48]:
                       dl_payload  ul_payload
a 2016-11-01 06:00:00         NaN         NaN
  2016-11-01 06:15:00         NaN         NaN
  2016-11-01 07:15:00        45.0        15.0
  2016-11-01 07:30:00        48.0        18.0
  2016-11-01 07:45:00        51.0        21.0
  2016-11-01 09:00:00        54.0        24.0

@jreback jreback modified the milestones: Next Minor Release, 0.20.0 Mar 29, 2017
@jreback jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017
@jreback jreback modified the milestones: Next Major Release, 0.23.0 May 9, 2018
@vss888

vss888 commented Jul 2, 2019

Why is the issue closed? The problem is still there (pandas 0.24.2).

@jreback
Contributor

jreback commented Jul 2, 2019

This is closed in 0.25, coming soon.

@chiachong

Still the same problem in 0.25.

Workaround:
df.groupby('A').rolling(4).sum().reset_index(level=0, drop=True)
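(The workaround in a self-contained, runnable form, using the example frame from the top of the thread:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

# Dropping index level 0 (the duplicated A level) realigns the result
# with df's original RangeIndex
result = df.groupby('A').rolling(4).sum().reset_index(level=0, drop=True)
```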

@andreas-vester

The problem still exists in v1.0.1

@davidgilbertson

davidgilbertson commented Jul 27, 2023

Still an issue in v2. So for future travellers...

.reset_index() is a good workaround, as mentioned above. You can also use .tolist() if you're just assigning back to a new column of the original DataFrame and the sort order isn't being changed by groupby.

df["RollingMean"] = df.groupby(["A", "B"]).rolling(2).Value.mean().reset_index(level=[0, 1], drop=True)

# OR, if your data is already sorted by A, B
df["RollingMean"] = df.groupby(["A", "B"]).rolling(2).Value.mean().tolist()
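(A self-contained version of the assign-back pattern; the frame and the column names A, B, Value here are made up for illustration:)

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y', 'y'],
                   'B': [1, 1, 2, 2],
                   'Value': [10.0, 20.0, 30.0, 40.0]})

# The rolling result carries the two group levels plus the original
# index; dropping the group levels aligns it with df again
rolled = df.groupby(['A', 'B'])['Value'].rolling(2).mean()
df['RollingMean'] = rolled.reset_index(level=[0, 1], drop=True)
```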

@asaph-marvin

Still an issue in v2.2.1:

df = pd.DataFrame(data=range(12), index=pd.MultiIndex.from_product(
    [['one', 'two', 'three'], ['a', 'b', 'c', 'd']]), columns=['vals'])
df
Out[108]: 
         vals
one   a     0
      b     1
      c     2
      d     3
two   a     4
      b     5
      c     6
      d     7
three a     8
      b     9
      c    10
      d    11

regular .sum() output looks fine:

df.groupby(level=0).sum()
Out[103]: 
       vals
one       6
three    38
two      22

with .rolling() the groupby index level is duplicated:

df.groupby(level=0).rolling(3).sum()
Out[104]: 
               vals
one   one   a   NaN
            b   NaN
            c   3.0
            d   6.0
three three a   NaN
            b   NaN
            c  27.0
            d  30.0
two   two   a   NaN
            b   NaN
            c  15.0
            d  18.0

Is there a fix planned for this issue?

From the discussion above, it's not clear whether this is intended behavior or a bug. If it's intended, can someone explain why?
As a user, my expectation is to get something like this:

df.groupby(level=0).rolling(3).sum().droplevel(0)
Out[110]: 
         vals
one   a   NaN
      b   NaN
      c   3.0
      d   6.0
three a   NaN
      b   NaN
      c  27.0
      d  30.0
two   a   NaN
      b   NaN
      c  15.0
      d  18.0

but without having to do .droplevel()
