Aggregation (sum) fails for a grouping along a multi-level column in pandas >=0.25.x #29772

normanius · 2019-11-21T12:49:33Z

Code Sample

Find the dataset bug.csv here

import pandas as pd

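df = pd.read_csv("bug.csv", header=[0, 1], index_col=[0])  # two header rows -> MultiIndex columns
g = df.groupby(level="property", axis=1)  # group columns by the "property" level
thisFails = g.sum()  # raises KeyError: 'a' on pandas 0.25.3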
print(thisFails)

The bug may also apply to multi-level row indices (not just column headers); I haven't checked that, though.
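Since bug.csv is not attached here, a small in-memory stand-in can be used to try the same `read_csv` call. This is only a sketch: the values and the outer level name `group` are made up, and only the names `property` and `dataset` are taken from the expected output in this report. The aggregation is written through a transpose because `axis=1` grouping is deprecated in recent pandas releases; whether that form also avoids the error on 0.25.x has not been checked.

```python
import io

import pandas as pd

# Hypothetical stand-in for bug.csv: two header rows (the column levels),
# then a row that carries only the index name, then the data rows.
csv_text = """group,x,x,y,y
property,a,b,a,c
dataset,,,,
case 0,1.0,2.0,3.0,4.0
case 1,0.0,1.0,0.0,1.0
"""

df = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=[0])

# The two header rows become the levels of a column MultiIndex.
print(list(df.columns.names))
print(df.index.name)

# Sum all columns that share a label in the "property" level.
# Grouping the transposed frame along its rows and transposing back
# gives the same layout as grouping the columns directly.
summed = df.T.groupby(level="property").sum().T
print(summed)
```

Here the two `a` columns are added together, while `b` and `c` each pass through unchanged.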

Problem description

Calling sum() on a groupby object g raises an exception when g was created from a DataFrame with a multi-level header and the grouping is along one of the column levels.

The code worked on pandas 0.23.4 and 0.24.2, but it fails on pandas 0.25.x (0.25.3 at the time of writing).

On failure, the following exception occurs:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'a'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bug.py", line 5, in <module>
    thisFails = g.sum()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 1382, in f
    result[col] = self._try_cast(result[col], self.obj[col])
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 2994, in __getitem__
    return self._getitem_multilevel(key)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 3043, in _getitem_multilevel
    loc = self.columns.get_loc(key)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 2674, in get_loc
    loc = self._get_level_indexer(key, level=0)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 2939, in _get_level_indexer
    code = level_index.get_loc(key)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'a'

Expected Output

property    a    b    c    d    e    f
dataset
case 0    0.0  2.0  0.0  0.0  1.0  0.0
case 1    3.0  3.0  1.0  1.0  0.0  1.0
case 2    1.0  1.0  0.0  0.0  0.0  1.0
case 3    2.0  3.0  0.0  0.0  0.0  0.0
case 4    2.0  0.0  0.0  0.0  1.0  0.0
case 5    2.0  0.0  0.0  0.0  1.0  0.0
case 6    2.0  0.0  0.0  0.0  1.0  0.0
case 7    0.0  0.0  0.0  0.0  1.0  0.0
case 8    0.0  0.0  0.0  0.0  0.0  1.0
case 9    3.0  1.0  0.0  1.0  0.0  0.0

Output of `pd.show_versions()`


INSTALLED VERSIONS

commit : None

pandas : 0.25.3
numpy : 1.17.2
pytz : 2018.4
dateutil : 2.7.2
pip : 19.3.1
setuptools : 41.6.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 7.1.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.0
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None


jorisvandenbossche · 2019-11-24T16:31:30Z

@normanius thanks for the report! This seems to work again on the latest master. I am not sure whether a specific fix went in for this, so it would still be good to add your example as a test case to ensure it keeps working.
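A regression test along these lines could look roughly like the sketch below. It is not the test that was eventually merged. The helper `sum_by_column_level`, the test name, the level name `group` and the data are all hypothetical. The helper tries the `axis=1` call from the report first and falls back to grouping the transposed frame in case the installed pandas no longer accepts the `axis` argument.

```python
import warnings

import pandas as pd
import pandas.testing as tm


def sum_by_column_level(df, level):
    """Hypothetical helper: sum the columns that share a label in `level`."""
    try:
        with warnings.catch_warnings():
            # Newer pandas releases warn that axis=1 grouping is deprecated.
            warnings.simplefilter("ignore", FutureWarning)
            return df.groupby(level=level, axis=1).sum()
    except TypeError:
        # The axis argument is not accepted: group the transposed frame instead.
        return df.T.groupby(level=level).sum().T


def test_groupby_sum_on_multilevel_columns():
    # Made-up frame with a two-level header, similar in shape to the report.
    columns = pd.MultiIndex.from_tuples(
        [("x", "a"), ("x", "b"), ("y", "a"), ("y", "c")],
        names=["group", "property"],
    )
    index = pd.Index(["case 0", "case 1"], name="dataset")
    df = pd.DataFrame(
        [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]],
        index=index,
        columns=columns,
    )

    result = sum_by_column_level(df, "property")

    # Expected sums written out by hand: a = x.a + y.a, b = x.b, c = y.c.
    expected = pd.DataFrame(
        [[4.0, 2.0, 4.0], [0.0, 1.0, 1.0]],
        index=index,
        columns=pd.Index(["a", "b", "c"], name="property"),
    )
    tm.assert_frame_equal(result, expected)


test_groupby_sum_on_multilevel_columns()
```

Comparing against a hand-written expected frame keeps the test independent of the groupby code path it is meant to protect.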

prakhar987 · 2019-11-25T15:22:15Z

Working on this

normanius changed the title ~~Aggregation (sum) fails for a grouping along a column level starting from pandas 0.25.x~~ Aggregation (sum) fails for a grouping along a multi-column level in pandas >=0.25.x Nov 21, 2019

jorisvandenbossche added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Nov 24, 2019

jorisvandenbossche added this to the Contributions Welcome milestone Nov 24, 2019

normanius changed the title ~~Aggregation (sum) fails for a grouping along a multi-column level in pandas >=0.25.x~~ Aggregation (sum) fails for a grouping along a multi-level column in pandas >=0.25.x Nov 25, 2019

prakhar987 added a commit to prakhar987/pandas that referenced this issue Nov 26, 2019

TST added test for groupby agg on mulitlevel column (pandas-dev#29772)

5a5aabd

prakhar987 mentioned this issue Nov 26, 2019

TST added test for groupby agg on mulitlevel column (#29772) #29866

Merged

jreback modified the milestones: Contributions Welcome, 1.0 Nov 27, 2019

simonjayhawkins closed this as completed in #29866 Nov 27, 2019

simonjayhawkins pushed a commit that referenced this issue Nov 27, 2019

TST added test for groupby agg on mulitlevel column (#29772) (#29866)

d7328d3

proost pushed a commit to proost/pandas that referenced this issue Dec 19, 2019

TST added test for groupby agg on mulitlevel column (pandas-dev#29772) (pandas-dev#29866)

4259dc1

proost pushed a commit to proost/pandas that referenced this issue Dec 19, 2019

TST added test for groupby agg on mulitlevel column (pandas-dev#29772) (pandas-dev#29866)

766151e
