BUG: 1.3.0 DataFrame.agg over categorical columns with non-unique index returns wrong size result #42380

ivirshup · 2021-07-05T04:31:10Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame({"a": list("abcde"), "b": list("abcde")}, index=list("aabbc"), dtype="category")
df.agg("-".join, axis=1)

Using pandas 1.2.5:

a    a-a
a    b-b
b    c-c
b    d-d
c    e-e
dtype: object

Using pandas 1.3.0:

a    b-b
b    d-d
c    e-e
dtype: object

It does not look like this is an issue if I use df.apply instead of df.agg.
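A minimal sketch of that comparison, assuming the same frame as in the code sample above. The names `applied` and `aggregated` are only for illustration. On an affected version the `agg` result is expected to come back shorter than the input, while `apply` keeps one value per row.

```python
import pandas as pd

# Same frame as in the report: categorical columns, non-unique index.
df = pd.DataFrame(
    {"a": list("abcde"), "b": list("abcde")},
    index=list("aabbc"),
    dtype="category",
)

# Row-wise join through both entry points.
applied = df.apply("-".join, axis=1)
aggregated = df.agg("-".join, axis=1)

# A row-wise operation should return one value per input row.
print("rows in input:     ", len(df))
print("rows from df.apply:", len(applied))
print("rows from df.agg:  ", len(aggregated))

# The index should also survive unchanged, duplicates included.
print("apply keeps index:", applied.index.equals(df.index))
print("agg keeps index:  ", aggregated.index.equals(df.index))
```

Comparing lengths and indexes, rather than eyeballing the printed Series, makes the difference between the two calls easy to spot on any given pandas version.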

Problem description

When a row-wise aggregation is run on a dataframe with categorical columns and a non-unique index, the result has the wrong length.

It's weird that the output isn't the right length. Since I'm computing a value per row, I expect the same number of rows in the output as in the input. It's especially weird that this only happens if the columns are categorical.

That is, the same code without dtype="category":

df = pd.DataFrame({"a": list("abcde"), "b": list("abcde")}, index=list("aabbc"))
df.agg("-".join, axis=1)

a    a-a
a    b-b
b    c-c
b    d-d
c    e-e
dtype: object

in both versions.

Expected Output

I would expect the same output between versions. The result given by 1.2.5 seems more correct to me at the moment.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : f00ed8f47020034e752baf0250483053340971b0
python           : 3.8.10.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 20.5.0
Version          : Darwin Kernel Version 20.5.0: Sat May  8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.3.0
numpy            : 1.21.0
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 21.1.3
setuptools       : 56.0.0
Cython           : 0.29.23
pytest           : 6.2.4
hypothesis       : None
sphinx           : 4.0.2
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.6.3
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.23.1
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : None
fsspec           : 2021.06.0
fastparquet      : 0.4.1
gcsfs            : None
matplotlib       : 3.4.2
numexpr          : 2.7.2
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 4.0.1
pyxlsb           : None
s3fs             : 0.4.2
scipy            : 1.7.0
sqlalchemy       : 1.3.18
tables           : 3.6.1
tabulate         : 0.8.7
xarray           : 0.18.2
xlrd             : 1.2.0
xlwt             : None
numba            : 0.53.1

rhshadrach · 2021-07-07T03:35:53Z

On 1.3, agg uses df.T in this code path. It appears to me that the transpose is incorrect on master in this case:

df = pd.DataFrame({"a": list("abcde"), "b": list("abcde")}, index=list("aabbc"), dtype="category")
print(df.T)

gives

   a  b  c
a  b  d  e
b  b  d  e

The transpose behaves the same way on 1.2.x, but agg is not affected there.
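A quick way to check the transpose on a given version is to compare shapes instead of reading the printed frame. This is only a sketch. The cast to `object` is an assumption about how to sidestep the extension-dtype transpose path, based on the observation in the report that non-categorical frames are not affected.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": list("abcde"), "b": list("abcde")},
    index=list("aabbc"),
    dtype="category",
)

# A correct transpose swaps the two dimensions: (5, 2) becomes (2, 5).
expected_shape = (df.shape[1], df.shape[0])
print("categorical transpose ok:", df.T.shape == expected_shape)

# Assumption: casting to object first avoids the categorical-specific path.
obj_t = df.astype(object).T
print("object transpose ok:     ", obj_t.shape == expected_shape)
print("columns after transpose: ", list(obj_t.columns))
```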

rhshadrach · 2021-07-08T17:13:20Z

I believe this may be due to #30091, which introduced using a dictionary when computing the transpose in a certain case.
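To illustrate why a dictionary would be a problem here, the sketch below uses plain Python only. It is not the pandas implementation, and the names `labels`, `rows`, and `collapsed` are made up for the example. Keying a dict by non-unique labels keeps only the last value per label, which would match the three surviving columns in the transpose shown above.

```python
# Index labels and the value each row holds in column "a" of the example frame.
labels = list("aabbc")
rows = list("abcde")

# Building a dict keyed by label silently overwrites earlier duplicates.
collapsed = dict(zip(labels, rows))

print(collapsed)
print("entries kept:", len(collapsed), "of", len(rows))
```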

cc @TomAugspurger

simonjayhawkins · 2021-07-10T11:05:55Z

> I believe this may be due to #30091, which introduced using a dictionary when computing the transpose in a certain case.

The regression reported in the OP is due to:

first bad commit: [b92526b] CLN: Don't modify state in FrameApply.agg (#40428)

ivirshup added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels Jul 5, 2021

ivirshup changed the title from "BUG: 1.3.0 bug with aggregations over categorical values with non-unique index" to "BUG: 1.3.0 DataFrame.agg over categorical columns with non-unique index returns wrong size result" Jul 5, 2021

ivirshup mentioned this issue Jul 5, 2021

Pandas 1.3.0 compatibility scverse/scanpy#1917

Closed

rhshadrach added the Apply (Apply, Aggregate, Transform, Map) and Regression (functionality that used to work in a prior pandas version) labels and removed the Bug and Needs Triage labels Jul 7, 2021

rhshadrach added this to the 1.3.1 milestone Jul 7, 2021

rhshadrach mentioned this issue Jul 8, 2021

REGR: DataFrame.agg with axis=1, EA dtype, and duplicate index #42449

Merged

jreback closed this as completed in #42449 Jul 9, 2021

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jul 10, 2021

code sample for pandas-dev#42380

d5209f4
