BUG: groupby.apply() unexpectedly returns DataFrame when Series is explicitly specified #35782

kartiksubbarao · 2020-08-18T01:55:34Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd

def calcscore(df):
    return pd.Series([[','.join(df.item), df.score.sum()]], index=pd.Index([df.index[0]]))

df = pd.DataFrame(columns=['time', 'item', 'param', 'score'],
        data=[[1, 'N1', 10, 100],
              [1, 'N2', 20, 150],
              [2, 'N3', 70, 300]])

for p in [5, 50]:
    s = df[df.param > p].groupby('time', group_keys=False).apply(calcscore)
    print(type(s))
    print(s.to_list())

Problem description

I'm looping through a dataframe, selectively filtering by varying values for a param column and returning aggregated data based on that filtering (the actual code is more complex). I'm intentionally returning a Series object from the applied function, since I've benchmarked it to be faster than returning a DataFrame. The problem is that groupby.apply() sometimes returns a DataFrame, even though the applied function explicitly returns a Series. This seems to happen when the combined result has only one row (in the example, when the filter leaves a single group). Here is the unexpected output that I'm seeing:

<class 'pandas.core.series.Series'>
[['N1,N2', 250], ['N3', 300]]
<class 'pandas.core.frame.DataFrame'>
Traceback (most recent call last):

  File "/tmp/applytest.py", line 22, in <module>
    print(s.to_list())

  File "/usr/lib64/python3.8/site-packages/pandas/core/generic.py", line 5130, in __getattr__
    return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'to_list'

Expected Output

<class 'pandas.core.series.Series'>
[['N1,N2', 250], ['N3', 300]]
<class 'pandas.core.series.Series'>
[['N3', 300]]

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : d9fff27
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.7.11-200.fc32.x86_64
Version : #1 SMP Wed Jul 29 17:15:52 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.6.0
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : 3.1.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.12.0
pandas_datareader: 0.9.0
bs4 : 4.9.1
bottleneck : None
fsspec : None
fastparquet : 0.4.1
gcsfs : None
matplotlib : 3.3.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : 0.50.1

andrewkoo · 2020-08-18T18:19:54Z

I took a look at the source and found that the error may be related to the call to unstack in _wrap_applied_output (pandas/core/groupby/generic.py, DataFrameGroupBy._wrap_applied_output). concat does indeed return a Series, but unstack then converts that Series to a DataFrame.

Removing the call to unstack seems to result in the expected output for this case, but I'm not sure if that would break other cases...

def _wrap_applied_output(self, keys, values, not_indexed_same=False):
            ...
                if self.axis == 0 and isinstance(v, ABCSeries):
                    if (
                        isinstance(v.index, MultiIndex)
                        or key_index is None
                        or isinstance(key_index, MultiIndex)
                    ):
                        ...
                    else:
                        # GH5788 instead of stacking; concat gets the
                        # dtypes correct
                        from pandas.core.reshape.concat import concat
                        result = concat(
                            values,
                            keys=key_index,
                            names=key_index.names,
                            axis=self.axis,
                        ).unstack()
                        result.columns = index
            ...
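
To illustrate the conversion in isolation, here is a minimal sketch that does not go through groupby at all. The sample Series and the key `2` are made up for the illustration, and it only mirrors the `concat(...).unstack()` shape of the excerpt above rather than the full logic of `_wrap_applied_output`:

```python
import pandas as pd

# Hypothetical stand-in for the single value returned by the applied function.
piece = pd.Series([300], index=[2])

# concat with keys yields a Series whose index is a two-level MultiIndex.
stacked = pd.concat([piece], keys=[2], names=['time'])

# unstack pivots the inner index level into columns, producing a DataFrame.
unstacked = stacked.unstack()

print(type(stacked).__name__)
print(type(unstacked).__name__)
```

The concatenated object is still a Series at this point; it only becomes a DataFrame once unstack moves the inner index level into the columns.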

Not quite sure how we want to fix the bug (additional checks, etc.), but this is my first time contributing so would love to help any way I can!

rhshadrach · 2020-08-25T21:03:43Z

I believe this is a duplicate of #31063.

The issue is that apply transforms the result differently depending on whether the indices of the returned objects are all the same or some of them differ. This leads to different behavior when there is a single row vs multiple rows. Certainly this is something that needs to be improved on - PRs always welcome!
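
In the meantime, one possible way to sidestep the inconsistency is to skip groupby.apply() and assemble the Series by hand, so the return type does not depend on how many groups survive the filter. This is only a sketch under assumptions: `collect_scores` is a hypothetical helper, not part of pandas, and it reuses the sample data and the first-row index label from the original report. It has not been benchmarked against the apply-based version.

```python
import pandas as pd

def collect_scores(frame):
    """Hypothetical helper: build the per-group result without groupby.apply()."""
    labels, rows = [], []
    for _, group in frame.groupby('time'):
        # Keep the label of the group's first row, as the original calcscore did.
        labels.append(group.index[0])
        rows.append([','.join(group['item']), int(group['score'].sum())])
    # A list of lists becomes a Series whose elements are the inner lists.
    return pd.Series(rows, index=labels)

df = pd.DataFrame(columns=['time', 'item', 'param', 'score'],
                  data=[[1, 'N1', 10, 100],
                        [1, 'N2', 20, 150],
                        [2, 'N3', 70, 300]])

results = {p: collect_scores(df[df['param'] > p]) for p in [5, 50]}
for p, s in results.items():
    print(type(s))
    print(s.tolist())
```

Because the Series is constructed explicitly, both iterations return a Series, including the case where only one group remains.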

rhshadrach · 2023-04-23T16:22:28Z

Closing as a duplicate

kartiksubbarao added the Bug and Needs Triage labels Aug 18, 2020

jbrockmendel added the Apply and Groupby labels and removed the Needs Triage label Sep 3, 2020

rhshadrach closed this as completed Apr 23, 2023

rhshadrach added the Duplicate Report label Apr 23, 2023
