BUG: groupby.apply() unexpectedly returns DataFrame when Series is explicitly specified #35782

Closed
2 of 3 tasks
kartiksubbarao opened this issue Aug 18, 2020 · 3 comments
Labels: Apply (Apply, Aggregate, Transform, Map), Bug, Duplicate Report (Duplicate issue or pull request), Groupby

Comments

@kartiksubbarao

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

def calcscore(df):
    return pd.Series([[','.join(df.item), df.score.sum()]], index=pd.Index([df.index[0]]))

df = pd.DataFrame(columns=['time', 'item', 'param', 'score'],
        data=[[1, 'N1', 10, 100],
              [1, 'N2', 20, 150],
              [2, 'N3', 70, 300]])

for p in [5, 50]:
    s = df[df.param > p].groupby('time', group_keys=False).apply(calcscore)
    print(type(s))
    print(s.to_list())

Problem description

I'm looping over a DataFrame, filtering it by varying thresholds on the param column and returning aggregated data from each filtered result (the actual code is more complex). I'm intentionally returning a Series from the applied function, since I've benchmarked it to be faster than returning a DataFrame. The problem is that groupby.apply() sometimes returns a DataFrame, even though the applied function explicitly returns a Series. This seems to happen when the result has only one row. Here is the unexpected output that I'm seeing:

<class 'pandas.core.series.Series'>
[['N1,N2', 250], ['N3', 300]]
<class 'pandas.core.frame.DataFrame'>
Traceback (most recent call last):

  File "/tmp/applytest.py", line 22, in <module>
    print(s.to_list())

  File "/usr/lib64/python3.8/site-packages/pandas/core/generic.py", line 5130, in __getattr__
    return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'to_list'

Expected Output

<class 'pandas.core.series.Series'>
[['N1,N2', 250], ['N3', 300]]
<class 'pandas.core.series.Series'>
[['N3', 300]]

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d9fff27
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.7.11-200.fc32.x86_64
Version : #1 SMP Wed Jul 29 17:15:52 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.6.0
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : 3.1.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.12.0
pandas_datareader: 0.9.0
bs4 : 4.9.1
bottleneck : None
fsspec : None
fastparquet : 0.4.1
gcsfs : None
matplotlib : 3.3.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : 0.50.1

kartiksubbarao added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Aug 18, 2020
@andrewkoo

I took a look at the source and found that the error may be related to the call to unstack in _wrap_applied_output (pandas/core/groupby/generic.py, DataFrameGroupBy._wrap_applied_output). concat does indeed return a Series, but unstack converts that Series to a DataFrame.

Removing the call to unstack seems to result in the expected output for this case, but I'm not sure if that would break other cases...

def _wrap_applied_output(self, keys, values, not_indexed_same=False):
            ...
                if self.axis == 0 and isinstance(v, ABCSeries):
                    if (
                        isinstance(v.index, MultiIndex)
                        or key_index is None
                        or isinstance(key_index, MultiIndex)
                    ):
                        ...
                    else:
                        # GH5788 instead of stacking; concat gets the
                        # dtypes correct
                        from pandas.core.reshape.concat import concat
                        result = concat(
                            values,
                            keys=key_index,
                            names=key_index.names,
                            axis=self.axis,
                        ).unstack()
                        result.columns = index
            ...

Not quite sure how we want to fix the bug (additional checks, etc.), but this is my first time contributing so would love to help any way I can!
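
To illustrate the concat-then-unstack step in isolation, here is a minimal sketch that runs outside of groupby. The names piece and key_index are made up for this example; piece only mimics what calcscore returns for the single remaining group, and this is not the actual pandas code path.

```python
import pandas as pd

# Hypothetical stand-in for the one Series that calcscore returns
# when only the time == 2 group survives the filter.
piece = pd.Series([["N3", 300]], index=pd.Index([2]))

# Hypothetical stand-in for the group keys used by the wrapping code.
key_index = pd.Index([2], name="time")

# Same shape of call as in the excerpt: concat keeps the result a Series,
# but gives it a two-level MultiIndex (group key, original index).
stacked = pd.concat([piece], keys=key_index, names=key_index.names, axis=0)
print(type(stacked))

# Unstacking the inner level pivots it into columns, producing a DataFrame.
unstacked = stacked.unstack()
print(type(unstacked))
```

With a single piece the frame has one row and one column, which matches the shape of the unexpected result in the report.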

@rhshadrach
Member

I believe this is a duplicate of #31063.

The issue is that apply transforms the result differently depending on whether all the returned indices are the same or some of them differ. This leads to different behavior when there is a single row vs multiple rows. Certainly this is something that needs to be improved on - PRs always welcome!
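
Until that is resolved, one possible workaround is to skip apply's result wrapping and concatenate the per-group results by hand. The sketch below is only an illustration under that assumption: apply_as_series is a hypothetical helper, not a pandas API, and it has not been benchmarked against groupby.apply.

```python
import pandas as pd

def calcscore(df):
    # Same aggregation as in the report, using bracket access for the columns.
    return pd.Series([[",".join(df["item"]), df["score"].sum()]],
                     index=pd.Index([df.index[0]]))

def apply_as_series(frame, by, func):
    # Hypothetical helper: call func on each group and concatenate the
    # returned Series directly, so the result is always a Series.
    # pd.concat raises ValueError if the filtered frame has no groups.
    pieces = [func(group) for _, group in frame.groupby(by)]
    return pd.concat(pieces)

df = pd.DataFrame(columns=["time", "item", "param", "score"],
                  data=[[1, "N1", 10, 100],
                        [1, "N2", 20, 150],
                        [2, "N3", 70, 300]])

for p in [5, 50]:
    s = apply_as_series(df[df.param > p], "time", calcscore)
    print(type(s))
    print(s.to_list())
```

Because the pieces are concatenated the same way for one group or many, the return type no longer depends on how many rows survive the filter.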

jbrockmendel added the Apply (Apply, Aggregate, Transform, Map) and Groupby labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label on Sep 3, 2020
@rhshadrach
Member

Closing as a duplicate

rhshadrach added the Duplicate Report (Duplicate issue or pull request) label on Apr 23, 2023