Aggregation (sum) fails for a grouping along a multi-level column in pandas >=0.25.x #29772

Closed
normanius opened this issue Nov 21, 2019 · 2 comments · Fixed by #29866
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

normanius commented Nov 21, 2019

Code Sample

Find the dataset bug.csv here

import pandas as pd

df = pd.read_csv("bug.csv", header=[0,1], index_col=[0])
g = df.groupby(level="property", axis=1)
thisFails = g.sum()
print(thisFails)

The bug possibly applies to multi-level row indices as well (not just column headers), but I haven't checked.
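For readers without access to bug.csv, the following is a minimal sketch of the same kind of frame and the result the report expects. The data values and the outer level name "sample" are made up for illustration; only the level name "property" and the index name "dataset" come from the report. The call in the report is df.groupby(level="property", axis=1).sum(). Because later pandas releases deprecated the axis argument of groupby, the sketch groups on the transposed frame instead, which should give the same sums without going through the axis=1 path.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for bug.csv: values and the "sample" level are made up.
columns = pd.MultiIndex.from_product(
    [["s1", "s2"], ["a", "b", "c"]], names=["sample", "property"]
)
index = pd.Index(["case 0", "case 1"], name="dataset")
df = pd.DataFrame(
    np.arange(12, dtype=float).reshape(2, 6), index=index, columns=columns
)

# Sum along the "property" column level: transpose so the column levels
# become row levels, group on the row index, then transpose back.
result = df.T.groupby(level="property").sum().T
print(result)
```

Each output column holds the sum of all input columns that share the same "property" label, one row per case.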

Problem description

Calling sum() on a groupby object g raises a KeyError if g was created from a df with a multi-level header by grouping along one of the column levels (axis=1).

The code used to work for pandas 0.24.2 and 0.23.4. But it fails for pandas 0.25.x (0.25.3 as of writing this).

On failure, the following exception occurs:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'a'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bug.py", line 5, in <module>
    thisFails = g.sum()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 1382, in f
    result[col] = self._try_cast(result[col], self.obj[col])
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 2994, in __getitem__
    return self._getitem_multilevel(key)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 3043, in _getitem_multilevel
    loc = self.columns.get_loc(key)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 2674, in get_loc
    loc = self._get_level_indexer(key, level=0)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 2939, in _get_level_indexer
    code = level_index.get_loc(key)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'a'

Expected Output

property    a    b    c    d    e    f
dataset
case 0    0.0  2.0  0.0  0.0  1.0  0.0
case 1    3.0  3.0  1.0  1.0  0.0  1.0
case 2    1.0  1.0  0.0  0.0  0.0  1.0
case 3    2.0  3.0  0.0  0.0  0.0  0.0
case 4    2.0  0.0  0.0  0.0  1.0  0.0
case 5    2.0  0.0  0.0  0.0  1.0  0.0
case 6    2.0  0.0  0.0  0.0  1.0  0.0
case 7    0.0  0.0  0.0  0.0  1.0  0.0
case 8    0.0  0.0  0.0  0.0  0.0  1.0
case 9    3.0  1.0  0.0  1.0  0.0  0.0

Output of pd.show_versions()


INSTALLED VERSIONS

commit : None

pandas : 0.25.3
numpy : 1.17.2
pytz : 2018.4
dateutil : 2.7.2
pip : 19.3.1
setuptools : 41.6.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 7.1.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.0
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@normanius normanius changed the title Aggregation (sum) fails for a grouping along a column level starting from pandas 0.25.x Aggregation (sum) fails for a grouping along a multi-column level in pandas >=0.25.x Nov 21, 2019
@jorisvandenbossche
Member

@normanius thanks for the report! This seems to work again on the latest master. I am not sure whether there was a specific fix for this, so it would still be good to add your example as a test case to ensure it keeps working.
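A regression test along these lines could look roughly like the sketch below. It is not the test that was eventually merged; the function name, the data, and the outer level name "group" are hypothetical. It also uses the transposed grouping as an assumed equivalent of the reported axis=1 call, so that it does not depend on the axis argument.

```python
import pandas as pd
import pandas.testing as tm


def test_sum_grouped_by_column_level():
    # Hypothetical data; "property" and "dataset" mirror the names in the report.
    columns = pd.MultiIndex.from_tuples(
        [("x", "a"), ("x", "b"), ("y", "a"), ("y", "b")],
        names=["group", "property"],
    )
    index = pd.Index(["case 0", "case 1"], name="dataset")
    df = pd.DataFrame(
        [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
        index=index,
        columns=columns,
    )

    # The report's call is df.groupby(level="property", axis=1).sum();
    # the transposed form is an assumed equivalent used for this sketch.
    result = df.T.groupby(level="property").sum().T

    # Expected sums worked out by hand: a = x/a + y/a, b = x/b + y/b.
    expected = pd.DataFrame(
        [[4.0, 6.0], [12.0, 14.0]],
        index=index,
        columns=pd.Index(["a", "b"], name="property"),
    )
    tm.assert_frame_equal(result, expected)
```

The function follows pytest naming, so it can be collected by pytest or called directly.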

@jorisvandenbossche jorisvandenbossche added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Nov 24, 2019
@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Nov 24, 2019
@prakhar987
Contributor

Working on this

@normanius normanius changed the title Aggregation (sum) fails for a grouping along a multi-column level in pandas >=0.25.x Aggregation (sum) fails for a grouping along a multi-level column in pandas >=0.25.x Nov 25, 2019
@jreback jreback modified the milestones: Contributions Welcome, 1.0 Nov 27, 2019
keechongtan added a commit to keechongtan/pandas that referenced this issue Nov 29, 2019
…ndexing-1row-df

* upstream/master: (32 commits)
  DEPR: Series.cat.categorical (pandas-dev#29914)
  DEPR: infer_dtype default for skipna is now True (pandas-dev#29876)
  Fix broken asv (pandas-dev#29906)
  DEPR: Remove weekday_name (pandas-dev#29831)
  Fix mypy errors for pandas\tests\series\test_operators.py (pandas-dev#29826)
  CI: Setting path only once in GitHub Actions (pandas-dev#29867)
  DEPR: passing td64 data to DTA or dt64 data to TDA (pandas-dev#29794)
  CLN: remove unsupported sparse code from io.pytables (pandas-dev#29863)
  x.__class__ TO type(x) (pandas-dev#29889)
  DEPR: ftype, ftypes (pandas-dev#29895)
  REF: use named funcs instead of lambdas (pandas-dev#29841)
  Correct type inference for UInt64Index during access (pandas-dev#29420)
  CLN: follow-up to 29725 (pandas-dev#29890)
  CLN: trim unnecessary code in indexing tests (pandas-dev#29845)
  TST added test for groupby agg on mulitlevel column (pandas-dev#29772) (pandas-dev#29866)
  mypy fix (pandas-dev#29891)
  Typing annotations (pandas-dev#29850)
  Fix mypy error in pandas/tests.indexes.test_base.py (pandas-dev#29188)
  CLN: remove never-used kwargs, make kwargs explicit (pandas-dev#29873)
  TYP: Added typing to __eq__ functions (pandas-dev#29818)
  ...

4 participants