Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: wrong output index mapping of results when groupby both scan and reduction aggregation are passed #40695

Open
karthikeyann opened this issue Mar 30, 2021 · 6 comments
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby Needs Discussion Requires discussion from core team before further action

Comments

@karthikeyann
Copy link

karthikeyann commented Mar 30, 2021

Edit (#40695 (comment)): This issue is about groupby(...).transform not accepting list-likes or dict-likes.

Code Sample, a copy-pastable example

In [1]: import pandas as pd


In [2]: pdf = pd.DataFrame({'a' : [2, 2, 1, 1, 6], 'b' : [2, 2, 1, 1, 3]})


In [3]: pdf.groupby('a').agg(['sum'])
Out[3]: 
    b
  sum
a    
1   2
2   4
6   3

In [4]: pdf.groupby('a').agg(['cumsum'])
Out[4]: 
       b
  cumsum
0      2
1      4
2      1
3      2
4      3

In [5]: pdf.groupby('a').agg(['cumsum', 'sum'])
Out[5]: 
       b     
  cumsum  sum
0    2.0  NaN
1    4.0  2.0
2    1.0  4.0
3    2.0  NaN
4    3.0  NaN
6    NaN  3.0

Problem description

When both scan aggregrations (cumsum, cummin, cummax, cumcount) and any of reduction aggregartions (sum, min, max, prod, ...), are passed to groupby.agg , output value mapping to index in the result is wrong.

scan aggregations produce index as RangeIndex (0 to N-1).
reduction aggregations usually map to grouped keys. In the given example, it maps to values of keys in resulting index.

Expected Output

Common method to align the output should be decided and implemented.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f2c8480
python : 3.7.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-137-generic
Version : #141-Ubuntu SMP Fri Feb 19 13:46:27 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.3
numpy : 1.19.2
pytz : 2020.4
dateutil : 2.8.1
pip : 21.0.1
setuptools : 49.6.0.post20201009
Cython : 0.29.22
pytest : 6.2.2
hypothesis : 6.8.1
sphinx : 3.5.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.21.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : 0.8.7
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.52.0

@karthikeyann karthikeyann added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 30, 2021
@rhshadrach
Copy link
Member

Thanks for the report!

Common method to align the output should be decided and implemented.

Do you have something in mind here? In pandas, agg is ideally meant to for reducers. One should use cumsum (et al) with transform instead of agg. I think this is a duplicate of #35725. The way I envision resolving that issue in this case would be to have the cumsum result entries be Series themselves, representing the cumsum applied to the group. While this does give the alignment (in particular, the resulting index in your example would be [1, 2, 6]), I don't think the result is particularly useful.

A possible alternative, which I think you may want here, is to treat both operations as transformers. You can do this today, but it has to be as two separate ops joined together by concat:

s1 = df.groupby(...).transform('sum')
s2 = df.groupby(...).transform('cumsum')
result = pd.concat([s1, s2], axis=1)

I think groupby(...).transform should accept lists, but currently it does not.

@rhshadrach rhshadrach added Apply Apply, Aggregate, Transform, Map Groupby labels Mar 30, 2021
@rhshadrach
Copy link
Member

With #35725 giving users the option to make all arguments reduce (agg), I would propose to make this issue about transform accepting list-like and dict-like arguments. In this way, users can opt for either alignment (reduce or transform) as they desire.

@karthikeyann
Copy link
Author

Output mapping for transform looks better. Broadcasts for reducers.
Making transform accept list-like, dict-like arguments would be nice and would be good solution for this issue.

Also, agg could be disallowed to accept both reducers and transform like operations in single list/dict.

@rhshadrach
Copy link
Member

Thanks @karthikeyann for the response, I've updated the OP. This looks doable similar to the way groupby.agg works, but some care will need to be taken. If a user supplies a function that is not a transformer, it seems to me we should make the assumption that it is a reducer and whatever value such a function produces will then be broadcasted to the entire group. This will likely produce undesirable results if the user supplies e.g. a filter, but it would at least be predictable.

@rhshadrach
Copy link
Member

rhshadrach commented Apr 1, 2021

Another option would be to always align the result. In this way, a filter would then leave null values in the result. If the index is completely mismatched, the result would then be a column of all null values.

@karthikeyann
Copy link
Author

Whichever option all users prefer, is good. I don't have any preference as long as it does not allow wrong mapping in the output.

@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

3 participants