Odd behaviour of groupby-agg with certain boolean columns and multi-index #21240

Khris777 · 2018-05-29T10:22:11Z

Problem Description

I stumbled over some very specific off behaviour when trying to do a multi-index groupby over several boolean columns using agg with a dictionary to apply a function to each column that would count the number of True-values by using the boolean series as a boolean index on itself and getting the length of that.

Example code:

NN = 1
l0 = NN*['a','a','b','b','b','b','b','b','b']
l1 = NN*['x','y','x','x','x','y','y','z','z']
l2 = NN*[True,True,False,True,False,False,False,True,True]
l3 = NN*[True,False,False,True,False,True,False,True,False]
l4 = NN*[False,False,True,True,True,True,False,True,False]

df = pd.DataFrame({"C0":l0,'C1':l1,'C2':l2,'C3':l3,'C4':l4})
print(df.groupby(["C0","C1"])[["C2","C3","C4"]]\
.agg({"C2":lambda x: len(x[x]),"C3":lambda x: len(x[x]),"C4":lambda x: len(x[x])}))

The result I'm getting is:

	   C2     C3  C4
C0 C1               
a  x    1   True   0
   y    1  False   0
b  x    1   True   2
   y    0   True   1
   z    2   True   1

As you can see, the second column isn't properly aggregated for some reason.

If I set NN = 10, I'm getting a proper result:

	   C2  C3  C4
C0 C1            
a  x   10  10   0
   y   10   0   0
b  x   10  10  30
   y    0  10  10
   z   20  10  10

The problem only happens when using a multi-index, only with the C3 column, only when NN=1 and only when using the agg function (it works fine with apply on the single column), so it can be reduced to this call:

print(df.groupby(["C0","C1"])["C3"].agg(lambda x: len(x[x])))

It seems like the specific combination of the multi-index and the boolean values causes some odd behaviour.

EDIT: I did not write the C3 data specifically to cause the error, I just played with random combinations around until the error appeared, so I have no clue, why exactly that combination causes the problem.

In my real application this happens to two of three columns in a dataframe with 570 rows, so it's not limited to very short dataframes like in the example.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

WillAyd · 2018-05-29T11:04:20Z

Do you have any way to simplify this example?

FWIW you could just sum your boolean columns to get the same result

Khris777 · 2018-05-29T11:23:08Z

This simplified example still shows the errorneous behaviour:

df = pd.DataFrame({"C0":['a','a'],'C1':['x','y'],'C3':[True,False]})
print(df.groupby(["C0","C1"])["C3"].agg(lambda x: len(x[x])))

Whereas using apply shows the correct behaviour:

print(df.groupby(["C0","C1"])["C3"].apply(lambda x: len(x[x])))

WillAyd · 2018-05-29T11:42:02Z

OK thanks for the last example. I suppose the difference is that agg is casting the result of your anonymous function back to boolean when the result is zeros / ones and apply is not. This also explains why none of the other columns were cast back to boolean (they had more than just 0s and 1s) and also why having N != 1 would not cast back to boolean either.

There are a variety of ways to handle this from the end user perspective, whether it be through explicit typing or again using something like sum which would be more idiomatic and performant than what you are trying to do anyway.

That said, investigation and PRs to align the casting rules for apply and agg are always welcome!

jreback · 2018-05-29T11:47:07Z

xref #14873 is the root issue here.

WillAyd added the Needs Info Clarification about behavior needed to assess issue label May 29, 2018

WillAyd added Apply Apply, Aggregate, Transform, Map Dtype Conversions Unexpected or buggy dtype conversions and removed Needs Info Clarification about behavior needed to assess issue labels May 29, 2018

TrigonaMinima mentioned this issue Feb 13, 2019

BUG: groupby.agg coercing booleans #14873

Closed

jbrockmendel added the Groupby label Oct 17, 2019

mroeschke added the Bug label Jun 28, 2020

rhshadrach mentioned this issue Apr 5, 2021

BUG: groupby.agg/transform casts UDF results #40790

Merged

5 tasks

jreback added this to the 1.3 milestone Apr 30, 2021

jreback closed this as completed in #40790 May 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Odd behaviour of groupby-agg with certain boolean columns and multi-index #21240

Odd behaviour of groupby-agg with certain boolean columns and multi-index #21240

Khris777 commented May 29, 2018 •

edited

Loading

INSTALLED VERSIONS

WillAyd commented May 29, 2018

Khris777 commented May 29, 2018 •

edited

Loading

WillAyd commented May 29, 2018 •

edited

Loading

jreback commented May 29, 2018

Odd behaviour of groupby-agg with certain boolean columns and multi-index #21240

Odd behaviour of groupby-agg with certain boolean columns and multi-index #21240

Comments

Khris777 commented May 29, 2018 • edited Loading

Problem Description

Output of pd.show_versions()

INSTALLED VERSIONS

WillAyd commented May 29, 2018

Khris777 commented May 29, 2018 • edited Loading

WillAyd commented May 29, 2018 • edited Loading

jreback commented May 29, 2018

Khris777 commented May 29, 2018 •

edited

Loading

Output of `pd.show_versions()`

Khris777 commented May 29, 2018 •

edited

Loading

WillAyd commented May 29, 2018 •

edited

Loading