BUG: wrong output index mapping of results when groupby both scan and reduction aggregation are passed #40695

karthikeyann · 2021-03-30T20:36:06Z

Edit (#40695 (comment)): This issue is about groupby(...).transform not accepting list-likes or dict-likes.

Code Sample, a copy-pastable example

In [1]: import pandas as pd


In [2]: pdf = pd.DataFrame({'a' : [2, 2, 1, 1, 6], 'b' : [2, 2, 1, 1, 3]})


In [3]: pdf.groupby('a').agg(['sum'])
Out[3]: 
    b
  sum
a    
1   2
2   4
6   3

In [4]: pdf.groupby('a').agg(['cumsum'])
Out[4]: 
       b
  cumsum
0      2
1      4
2      1
3      2
4      3

In [5]: pdf.groupby('a').agg(['cumsum', 'sum'])
Out[5]: 
       b     
  cumsum  sum
0    2.0  NaN
1    4.0  2.0
2    1.0  4.0
3    2.0  NaN
4    3.0  NaN
6    NaN  3.0

Problem description

When both scan aggregrations (cumsum, cummin, cummax, cumcount) and any of reduction aggregartions (sum, min, max, prod, ...), are passed to groupby.agg , output value mapping to index in the result is wrong.

scan aggregations produce index as RangeIndex (0 to N-1).
reduction aggregations usually map to grouped keys. In the given example, it maps to values of keys in resulting index.

Expected Output

Common method to align the output should be decided and implemented.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : f2c8480
python : 3.7.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-137-generic
Version : #141-Ubuntu SMP Fri Feb 19 13:46:27 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.3
numpy : 1.19.2
pytz : 2020.4
dateutil : 2.8.1
pip : 21.0.1
setuptools : 49.6.0.post20201009
Cython : 0.29.22
pytest : 6.2.2
hypothesis : 6.8.1
sphinx : 3.5.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.21.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : 0.8.7
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.52.0

The text was updated successfully, but these errors were encountered:

rhshadrach · 2021-03-30T20:53:35Z

Thanks for the report!

Common method to align the output should be decided and implemented.

Do you have something in mind here? In pandas, agg is ideally meant to for reducers. One should use cumsum (et al) with transform instead of agg. I think this is a duplicate of #35725. The way I envision resolving that issue in this case would be to have the cumsum result entries be Series themselves, representing the cumsum applied to the group. While this does give the alignment (in particular, the resulting index in your example would be [1, 2, 6]), I don't think the result is particularly useful.

A possible alternative, which I think you may want here, is to treat both operations as transformers. You can do this today, but it has to be as two separate ops joined together by concat:

s1 = df.groupby(...).transform('sum')
s2 = df.groupby(...).transform('cumsum')
result = pd.concat([s1, s2], axis=1)

I think groupby(...).transform should accept lists, but currently it does not.

rhshadrach · 2021-03-30T21:01:24Z

With #35725 giving users the option to make all arguments reduce (agg), I would propose to make this issue about transform accepting list-like and dict-like arguments. In this way, users can opt for either alignment (reduce or transform) as they desire.

karthikeyann · 2021-03-31T14:14:43Z

Output mapping for transform looks better. Broadcasts for reducers.
Making transform accept list-like, dict-like arguments would be nice and would be good solution for this issue.

Also, agg could be disallowed to accept both reducers and transform like operations in single list/dict.

rhshadrach · 2021-04-01T02:00:10Z

Thanks @karthikeyann for the response, I've updated the OP. This looks doable similar to the way groupby.agg works, but some care will need to be taken. If a user supplies a function that is not a transformer, it seems to me we should make the assumption that it is a reducer and whatever value such a function produces will then be broadcasted to the entire group. This will likely produce undesirable results if the user supplies e.g. a filter, but it would at least be predictable.

rhshadrach · 2021-04-01T02:03:12Z

Another option would be to always align the result. In this way, a filter would then leave null values in the result. If the index is completely mismatched, the result would then be a column of all null values.

karthikeyann · 2021-04-16T20:02:31Z

Whichever option all users prefer, is good. I don't have any preference as long as it does not allow wrong mapping in the output.

karthikeyann added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 30, 2021

karthikeyann mentioned this issue Mar 30, 2021

Add groupby scan aggregation to cudf rapidsai/cudf#7759

Merged

3 tasks

rhshadrach added Apply Apply, Aggregate, Transform, Map Groupby labels Mar 30, 2021

mroeschke added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: wrong output index mapping of results when groupby both scan and reduction aggregation are passed #40695

BUG: wrong output index mapping of results when groupby both scan and reduction aggregation are passed #40695

karthikeyann commented Mar 30, 2021 •

edited by rhshadrach

Loading

INSTALLED VERSIONS

rhshadrach commented Mar 30, 2021

rhshadrach commented Mar 30, 2021

karthikeyann commented Mar 31, 2021

rhshadrach commented Apr 1, 2021

rhshadrach commented Apr 1, 2021 •

edited

Loading

karthikeyann commented Apr 16, 2021

BUG: wrong output index mapping of results when groupby both scan and reduction aggregation are passed #40695

BUG: wrong output index mapping of results when groupby both scan and reduction aggregation are passed #40695

Comments

karthikeyann commented Mar 30, 2021 • edited by rhshadrach Loading

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

rhshadrach commented Mar 30, 2021

rhshadrach commented Mar 30, 2021

karthikeyann commented Mar 31, 2021

rhshadrach commented Apr 1, 2021

rhshadrach commented Apr 1, 2021 • edited Loading

karthikeyann commented Apr 16, 2021

karthikeyann commented Mar 30, 2021 •

edited by rhshadrach

Loading

Output of `pd.show_versions()`

rhshadrach commented Apr 1, 2021 •

edited

Loading