
Rolling groupby should not maintain the by column in the resulting DataFrame #14013

Closed
chrisaycock opened this issue Aug 16, 2016 · 12 comments · Fixed by #20984
Labels: API Design, Bug, Groupby, Reshaping

@chrisaycock
Contributor

I found another oddity while digging through #13966.

Begin with the initial DataFrame in that issue:

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

Save the grouping:

In [215]: g = df.groupby('A')

Compute the rolling sum:

In [216]: r = g.rolling(4)

In [217]: r.sum()
Out[217]:
         A      B
A
1 0    NaN    NaN
  1    NaN    NaN
  2    NaN    NaN
  3    4.0    6.0
  4    4.0   10.0
  5    4.0   14.0
  6    4.0   18.0
  7    4.0   22.0
  8    4.0   26.0
  9    4.0   30.0
...    ...    ...
2 30   8.0  114.0
  31   8.0  118.0
3 32   NaN    NaN
  33   NaN    NaN
  34   NaN    NaN
  35  12.0  134.0
  36  12.0  138.0
  37  12.0  142.0
  38  12.0  146.0
  39  12.0  150.0

[40 rows x 2 columns]

It maintains the by column (A)! That column should not be in the resulting DataFrame.

It gets weirder if I compute the sum over the entire grouping and then re-do the rolling calculation. Now the by column is gone, as expected:

In [218]: g.sum()
Out[218]:
     B
A
1  190
2  306
3  284

In [219]: r.sum()
Out[219]:
          B
A
1 0     NaN
  1     NaN
  2     NaN
  3     6.0
  4    10.0
  5    14.0
  6    18.0
  7    22.0
  8    26.0
  9    30.0
...     ...
2 30  114.0
  31  118.0
3 32    NaN
  33    NaN
  34    NaN
  35  134.0
  36  138.0
  37  142.0
  38  146.0
  39  150.0

[40 rows x 1 columns]

So the grouping summation has some sort of side effect.

@chrisaycock
Contributor Author

A little note while digging through more code: _convert_grouper in groupby.py has:

    if isinstance(grouper, dict):
        ...
    elif isinstance(grouper, Series):
        ...
    elif isinstance(grouper, (list, Series, Index, np.ndarray)):
        ...
    else:
        ...

The grouper is compared to Series twice, so the Series entry in the third isinstance check is dead code. I will fix this when I clean up everything.

@chrisaycock
Contributor Author

chrisaycock commented Aug 16, 2016

I can fix the issue if I set the group selection:

g._set_group_selection()

I think we need this function at the start of .rolling().

Seems similar to #12839
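(For anyone who does not want to touch the private _set_group_selection API: selecting the value column before rolling has the same effect through public methods. A minimal sketch on the frame from the top of this thread:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

# Selecting column B before rolling keeps the grouping column A out of
# the computation entirely, so it cannot appear in the result
r = df.groupby('A')['B'].rolling(4).sum()
```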

@jreback
Contributor

jreback commented Aug 17, 2016

This is defined behavior, in that it is identical to .apply on the groupby.

In [10]: df.groupby('A').rolling(4).sum()
Out[10]: 
         A      B
A                
1 0    NaN    NaN
  1    NaN    NaN
  2    NaN    NaN
  3    4.0    6.0
  4    4.0   10.0
...    ...    ...
3 35  12.0  134.0
  36  12.0  138.0
  37  12.0  142.0
  38  12.0  146.0
  39  12.0  150.0

[40 rows x 2 columns]

In [11]: df.groupby('A').rolling(4).apply(lambda x: x.sum())
Out[11]: 
         A      B
A                
1 0    NaN    NaN
  1    NaN    NaN
  2    NaN    NaN
  3    4.0    6.0
  4    4.0   10.0
...    ...    ...
3 35  12.0  134.0
  36  12.0  138.0
  37  12.0  142.0
  38  12.0  146.0
  39  12.0  150.0

[40 rows x 2 columns]

you can look back at the issues, IIRC @jorisvandenbossche and I had a long conversation about this.

@chrisaycock
Contributor Author

Hmm:

In [617]: df.groupby('A').sum()
Out[617]:
     B
A
1  190
2  306
3  284

In [618]: df.groupby('A').apply(lambda x: x.sum())
Out[618]:
    A    B
A
1  20  190
2  24  306
3  24  284

In addition to .rolling() and .apply(), .ohlc() and .expanding() keep the by column following a .groupby().
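(A sketch to reproduce the .expanding() case on the same frame; on the 0.18-era pandas discussed here, column A showed up in the output, while later versions exclude the grouping column:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

# Expanding sum per group; whether column A appears in the result
# depends on the pandas version
exp = df.groupby('A').expanding().sum()
```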

@jreback
Contributor

jreback commented Sep 1, 2016

On reread, this should be consistent, so marking as a bug.
It probably should not include the grouping column/level, even though .apply does.

@ohadle

ohadle commented Feb 15, 2017

A similar thing happens with index levels.

from pandas import DataFrame, Timestamp
import numpy as np

c = DataFrame({u'ul_payload': {('a', Timestamp('2016-11-01 06:00:00')): 4,
                               ('a', Timestamp('2016-11-01 06:15:00')): 5,
                               ('a', Timestamp('2016-11-01 07:15:00')): 6,
                               ('a', Timestamp('2016-11-01 07:30:00')): 7,
                               ('a', Timestamp('2016-11-01 07:45:00')): 8,
                               ('a', Timestamp('2016-11-01 09:00:00')): 9},
               u'dl_payload': {('a', Timestamp('2016-11-01 06:00:00')): 14,
                               ('a', Timestamp('2016-11-01 06:15:00')): 15,
                               ('a', Timestamp('2016-11-01 07:15:00')): 16,
                               ('a', Timestamp('2016-11-01 07:30:00')): 17,
                               ('a', Timestamp('2016-11-01 07:45:00')): 18,
                               ('a', Timestamp('2016-11-01 09:00:00')): 19}})

In [27]: c
Out[27]:
                       dl_payload  ul_payload
a 2016-11-01 06:00:00          14           4
  2016-11-01 06:15:00          15           5
  2016-11-01 07:15:00          16           6
  2016-11-01 07:30:00          17           7
  2016-11-01 07:45:00          18           8
  2016-11-01 09:00:00          19           9

In [29]: c.groupby(level=0).rolling(window=3).agg(np.sum)
Out[29]:
                         dl_payload  ul_payload
a a 2016-11-01 06:00:00         NaN         NaN
    2016-11-01 06:15:00         NaN         NaN
    2016-11-01 07:15:00        45.0        15.0
    2016-11-01 07:30:00        48.0        18.0
    2016-11-01 07:45:00        51.0        21.0
    2016-11-01 09:00:00        54.0        24.0

But not with group_keys=False:

In [48]: c.groupby(level=0, group_keys=False).rolling(window=3).agg(np.sum)
Out[48]:
                       dl_payload  ul_payload
a 2016-11-01 06:00:00         NaN         NaN
  2016-11-01 06:15:00         NaN         NaN
  2016-11-01 07:15:00        45.0        15.0
  2016-11-01 07:30:00        48.0        18.0
  2016-11-01 07:45:00        51.0        21.0
  2016-11-01 09:00:00        54.0        24.0

@jreback jreback modified the milestones: Next Minor Release, 0.20.0 Mar 29, 2017
@jreback jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017
@jreback jreback modified the milestones: Next Major Release, 0.23.0 May 9, 2018
@vss888

vss888 commented Jul 2, 2019

Why is the issue closed? The problem is still there (pandas 0.24.2).

@jreback
Contributor

jreback commented Jul 2, 2019

This is closed in 0.25, coming soon.

@chiachong

Still the same problem in 0.25.

Workaround:
df.groupby('A').rolling(4).sum().reset_index(level=0, drop=True)
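(The workaround in a self-contained, runnable form, using the example frame from the top of the thread:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

# Dropping index level 0 (the duplicated A level) realigns the result
# with df's original RangeIndex
result = df.groupby('A').rolling(4).sum().reset_index(level=0, drop=True)
```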

@andreas-vester

The problem still exists in v1.0.1

@davidgilbertson

davidgilbertson commented Jul 27, 2023

Still an issue in v2. So for future travellers...

.reset_index() is a good workaround, as mentioned above. You can also use .tolist() if you're just assigning back to a new column of the original DataFrame and the sort order isn't being changed by groupby.

df["RollingMean"] = df.groupby(["A", "B"]).rolling(2).Value.mean().reset_index(level=[0, 1], drop=True)

# OR, if your data is already sorted by A, B
df["RollingMean"] = df.groupby(["A", "B"]).rolling(2).Value.mean().tolist()
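(A self-contained version of the assign-back pattern; the frame and the column names A, B, Value here are made up for illustration:)

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y', 'y'],
                   'B': [1, 1, 2, 2],
                   'Value': [10.0, 20.0, 30.0, 40.0]})

# The rolling result carries the two group levels plus the original
# index; dropping the group levels aligns it with df again
rolled = df.groupby(['A', 'B'])['Value'].rolling(2).mean()
df['RollingMean'] = rolled.reset_index(level=[0, 1], drop=True)
```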

@asaph-marvin

Still an issue in v2.2.1:

df = pd.DataFrame(data=range(12), index=pd.MultiIndex.from_product(
    [['one', 'two', 'three'], ['a', 'b', 'c', 'd']]), columns=['vals'])
df
Out[108]: 
         vals
one   a     0
      b     1
      c     2
      d     3
two   a     4
      b     5
      c     6
      d     7
three a     8
      b     9
      c    10
      d    11

regular .sum() output looks fine:

df.groupby(level=0).sum()
Out[103]: 
       vals
one       6
three    38
two      22

with .rolling() the groupby index level is duplicated:

df.groupby(level=0).rolling(3).sum()
Out[104]: 
               vals
one   one   a   NaN
            b   NaN
            c   3.0
            d   6.0
three three a   NaN
            b   NaN
            c  27.0
            d  30.0
two   two   a   NaN
            b   NaN
            c  15.0
            d  18.0

Is there a fix planned for this issue?

From the discussion above, it's not clear whether this is intended behavior or a bug. If it's intended, can someone explain why?
As a user, my expectation is to get something like this:

df.groupby(level=0).rolling(3).sum().droplevel(0)
Out[110]: 
         vals
one   a   NaN
      b   NaN
      c   3.0
      d   6.0
three a   NaN
      b   NaN
      c  27.0
      d  30.0
two   a   NaN
      b   NaN
      c  15.0
      d  18.0

but without having to do .droplevel()
