I'm looping through a dataframe, selectively filtering by varying values for a param column and returning aggregated data based on that filtering (the actual code is more complex). I'm intentionally returning a Series object from the apply function since I've benchmarked it to be faster than returning a DataFrame. The problem is that sometimes, a DataFrame is unexpectedly returned by the apply function instead of the Series object that I explicitly returned. This seems to happen when the Series object has only one row. Here is the unexpected output that I'm seeing:
<class 'pandas.core.series.Series'>
[['N1,N2', 250], ['N3', 300]]
<class 'pandas.core.frame.DataFrame'>
Traceback (most recent call last):
  File "/tmp/applytest.py", line 22, in <module>
    print(s.to_list())
  File "/usr/lib64/python3.8/site-packages/pandas/core/generic.py", line 5130, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'to_list'
I took a look at the source and found that the error may be related to the call to unstack in DataFrameGroupBy._wrap_applied_output (pandas/core/groupby/generic.py). concat does indeed return a Series, but the chained unstack call converts that Series to a DataFrame.
Removing the call to unstack seems to result in the expected output for this case, but I'm not sure if that would break other cases...
def _wrap_applied_output(self, keys, values, not_indexed_same=False):
    ...
    if self.axis == 0 and isinstance(v, ABCSeries):
        if (
            isinstance(v.index, MultiIndex)
            or key_index is None
            or isinstance(key_index, MultiIndex)
        ):
            ...
        else:
            # GH5788 instead of stacking; concat gets the
            # dtypes correct
            from pandas.core.reshape.concat import concat

            result = concat(
                values,
                keys=key_index,
                names=key_index.names,
                axis=self.axis,
            ).unstack()
            result.columns = index
    ...
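To see the effect of a trailing unstack() in isolation, here is a minimal sketch outside of groupby. The two input Series and the keys "a" and "b" are made up for illustration and are not taken from the original script; the point is only that concat with keys yields a Series with a MultiIndex, and unstack() then turns it into a DataFrame.

```python
import pandas as pd

# Two hypothetical per-group results that share the same single-label index.
values = [
    pd.Series([250], index=["N1,N2"]),
    pd.Series([300], index=["N1,N2"]),
]

# concat with keys stacks the pieces into one Series with a two-level MultiIndex.
combined = pd.concat(values, keys=["a", "b"])

# unstack() pivots the inner index level into columns, producing a DataFrame.
wide = combined.unstack()

print(type(combined).__name__)
print(type(wide).__name__)
```

This only mirrors the concat(...).unstack() pattern from the excerpt; it does not go through the other branches of _wrap_applied_output.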
Not quite sure how we want to fix the bug (additional checks, etc.), but this is my first time contributing so would love to help any way I can!
The issue is that apply attempts to transform the result differently in the cases where all the indices are the same or some are different. This leads to different behavior when there is a single row vs multiple rows. Certainly this is something that needs to be improved on - PRs always welcome!
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Problem description
I'm looping through a dataframe, selectively filtering by varying values for a param column and returning aggregated data based on that filtering (the actual code is more complex). I'm intentionally returning a Series object from the apply function since I've benchmarked it to be faster than returning a DataFrame. The problem is that sometimes, a DataFrame is unexpectedly returned by the apply function instead of the Series object that I explicitly returned. This seems to happen when the Series object has only one row. Here is the unexpected output that I'm seeing:
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit : d9fff27
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.7.11-200.fc32.x86_64
Version : #1 SMP Wed Jul 29 17:15:52 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.0
pip : 19.3.1
setuptools : 41.6.0
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : 3.1.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.12.0
pandas_datareader: 0.9.0
bs4 : 4.9.1
bottleneck : None
fsspec : None
fastparquet : 0.4.1
gcsfs : None
matplotlib : 3.3.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : 0.50.1