
BUG: aggregation on ordered categorical column drops grouping index or crashes, depending on context #27800

Closed
kpflugshaupt opened this issue Aug 7, 2019 · 14 comments · Fixed by #35630
Labels
Categorical Categorical Data Type good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions

Comments

@kpflugshaupt
Contributor

Code Sample

Build the example data frame:

df = pd.DataFrame({
    'nr': [1, 2, 3, 4, 5, 6, 7, 8],
    'cat_ord': list('aabbccdd'),
    'cat': list('aaaabbbb')
})
df = df.astype({'cat': 'category', 'cat_ord': 'category'})
df['cat_ord'] = df['cat_ord'].cat.as_ordered()

When grouping, single aggregations on a numeric column work:

df.groupby('cat').agg({'nr': 'min'})
     nr
cat
a     1
b     5

Single aggregations on an ordered categorical column work, but drop the grouping index:

df.groupby('cat').agg({'cat_ord': 'min'})
   cat_ord
0        a
1        c

Combined single aggregations on a numeric and an ordered categorical column work:

df.groupby('cat').agg({'nr': 'min', 'cat_ord': 'min'})
     nr cat_ord
cat
a     1       a
b     5       c

Multiple aggregations on an ordered categorical column work, but drop the grouping index:

df.groupby('cat').agg({'cat_ord': ['min', 'max']})
  cat_ord
      min max
0       a   b
1       c   d

Combined aggregations on a numeric (single) and an ordered categorical column (multiple) fail with a TypeError:

df.groupby('cat').agg({'nr': 'min', 'cat_ord': ['min', 'max']})
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-148-b446de510106> in <module>
----> 1 df.groupby('cat').agg({'nr': 'min', 'cat_ord': ['min', 'max']})

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\generic.py in aggregate(self, arg, *args, **kwargs)
   1453     @Appender(_shared_docs["aggregate"])
   1454     def aggregate(self, arg=None, *args, **kwargs):
-> 1455         return super().aggregate(arg, *args, **kwargs)
   1456 
   1457     agg = aggregate

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, *args, **kwargs)
    227         func = _maybe_mangle_lambdas(func)
    228 
--> 229         result, how = self._aggregate(func, _level=_level, *args, **kwargs)
    230         if how is None:
    231             return result

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\base.py in _aggregate(self, arg, *args, **kwargs)
    528                 # return a MI DataFrame
    529 
--> 530                 return concat([result[k] for k in keys], keys=keys, axis=1), True
    531 
    532             elif isinstance(self, ABCSeries) and is_any_series():

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    256     )
    257 
--> 258     return op.get_result()
    259 
    260 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in get_result(self)
    466                     obj_labels = mgr.axes[ax]
    467                     if not new_labels.equals(obj_labels):
--> 468                         indexers[ax] = obj_labels.reindex(new_labels)[1]
    469 
    470                 mgrs_indexers.append((obj._data, indexers))

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\category.py in reindex(self, target, method, level, limit, tolerance)
    616                 # coerce to a regular index here!
    617                 result = Index(np.array(self), name=self.name)
--> 618                 new_target, indexer, _ = result._reindex_non_unique(np.array(target))
    619             else:
    620 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in _reindex_non_unique(self, target)
   3434 
   3435         target = ensure_index(target)
-> 3436         indexer, missing = self.get_indexer_non_unique(target)
   3437         check = indexer != -1
   3438         new_labels = self.take(indexer[check])

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_indexer_non_unique(self, target)
   4792             tgt_values = target._ndarray_values
   4793 
-> 4794         indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
   4795         return ensure_platform_int(indexer), missing
   4796 

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_indexer_non_unique()

TypeError: '<' not supported between instances of 'str' and 'int'

Combined aggregations on a numeric (multiple) and an ordered categorical column (single) also fail with the same TypeError:

df.groupby('cat').agg({'nr': ['min', 'max'], 'cat_ord': 'min'})
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-164-b1d70184bd81> in <module>
----> 1 df.groupby('cat').agg({'nr': ['min', 'max'], 'cat_ord': 'min'})

...

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_indexer_non_unique()

TypeError: '<' not supported between instances of 'str' and 'int'

Problem description

Aggregations on ordered categoricals drop the grouping index, or crash, as shown above.

This makes it hard to calculate combined aggregations over big data sets correctly and efficiently.

Expected Output

Aggregations on ordered categoricals should work as on non-categorical columns.
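For concreteness, a minimal sketch of the expected result for the combined call (values follow from the working cases above; the exact dtypes of the result are an assumption):

import pandas as pd

# Sketch of the frame the failing call should return: grouping index preserved,
# numeric and ordered-categorical aggregations side by side.
expected = pd.DataFrame(
    {("nr", "min"): [1, 5],
     ("cat_ord", "min"): ["a", "c"],
     ("cat_ord", "max"): ["b", "d"]},
    index=pd.CategoricalIndex(["a", "b"], name="cat"),
)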

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.0
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : 0.29.12
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.4
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.1
numexpr : 2.6.9
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : 0.11.1
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : 1.3.5
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8

@TomAugspurger TomAugspurger added Categorical Categorical Data Type Groupby labels Aug 7, 2019
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Aug 7, 2019
@TomAugspurger
Contributor

Thanks for the report. Are you interested in debugging to see where things go wrong?

@bongolegend
Contributor

bongolegend commented Aug 13, 2019

@kpflugshaupt I ran your code samples on pandas 0.24.2 and did not get any errors. Some later update must have broken this. I can look into it.

@TomAugspurger is there an easy way to see which changes between 0.24.2 and 0.25.0 could have affected this functionality? The whatsnew docs (which could mean a lot of combing)?

@kpflugshaupt
Contributor Author

Hi @ncernek ,

I also cannot recall ever seeing this before 0.25.0, so the upgrade has probably introduced it.

As to how to tackle this: given time, I would debug the call and either catch where it goes wrong (comparing against a working call), or list all the files visited and cross-reference them with the whatsnew file.

Thanks for looking into this! I may also have a go, but not in the next few days -- too much work.

Cheers
Kaspar

@TomAugspurger
Contributor

You can also use git bisect to find the faulty commit. You might be able to look through PRs in the git log that mention groupby or categorical to narrow down the bisect range.
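A minimal sketch of that bisect workflow, assuming the failing case from this report (the script name and the surrounding git commands are illustrative, and each bisect step would also need the pandas C extensions rebuilt):

# Check script for `git bisect run`: exits non-zero when the bug reproduces.
# Illustrative usage:
#   git bisect start
#   git bisect bad v0.25.0
#   git bisect good v0.24.2
#   git bisect run python bisect_check.py
import sys

import pandas as pd

df = pd.DataFrame({
    'nr': [1, 2, 3, 4, 5, 6, 7, 8],
    'cat_ord': list('aabbccdd'),
    'cat': list('aaaabbbb'),
})
df = df.astype({'cat': 'category', 'cat_ord': 'category'})
df['cat_ord'] = df['cat_ord'].cat.as_ordered()

try:
    df.groupby('cat').agg({'nr': 'min', 'cat_ord': ['min', 'max']})
except TypeError:
    sys.exit(1)  # bad commit: the aggregation raises
sys.exit(0)      # good commit: the aggregation succeeds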

@bongolegend
Contributor

I went the git bisect route. I wrote a single failing test for this condition:

df.groupby("cat").agg({"nr": "min", "cat_ord": ["min", "max"]})

Result of git bisect

d0292fe1e12f1d460e52df4da4250bef32324579 is the first bad commit
commit d0292fe1e12f1d460e52df4da4250bef32324579
Author: Jeff Reback <jeff@reback.net>
Date:   Thu Jun 27 18:13:27 2019 -0500

    BUG: preserve categorical & sparse types when grouping / pivot (#27071)

Link to that PR: #27071.

It looks like a lot of thinking went into that PR; maybe those of you who worked on it (e.g. @jreback) want to look into this bug further?

@TomAugspurger
Contributor

Thanks @ncernek. At least for this specific one, it comes from how concat handles a mix of categorical and non-categorical indexes:

In [2]: a = pd.DataFrame({"A": [1, 2]})

In [3]: b = pd.DataFrame({"B": [3, 4]}, index=pd.CategoricalIndex(['a', 'b']))

In [4]: pd.concat([a, b], axis=1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-a98a78ec2995> in <module>
----> 1 pd.concat([a, b], axis=1)

~/sandbox/pandas/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    256     )
    257
--> 258     return op.get_result()
    259
    260

~/sandbox/pandas/pandas/core/reshape/concat.py in get_result(self)
    466                     obj_labels = mgr.axes[ax]
    467                     if not new_labels.equals(obj_labels):
--> 468                         indexers[ax] = obj_labels.reindex(new_labels)[1]
    469
    470                 mgrs_indexers.append((obj._data, indexers))

~/sandbox/pandas/pandas/core/indexes/category.py in reindex(self, target, method, level, limit, tolerance)
    615                 # coerce to a regular index here!
    616                 result = Index(np.array(self), name=self.name)
--> 617                 new_target, indexer, _ = result._reindex_non_unique(np.array(target))
    618             else:
    619

~/sandbox/pandas/pandas/core/indexes/base.py in _reindex_non_unique(self, target)
   3388
   3389         target = ensure_index(target)
-> 3390         indexer, missing = self.get_indexer_non_unique(target)
   3391         check = indexer != -1
   3392         new_labels = self.take(indexer[check])

~/sandbox/pandas/pandas/core/indexes/base.py in get_indexer_non_unique(self, target)
   4751             tgt_values = target._ndarray_values
   4752
-> 4753         indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
   4754         return ensure_platform_int(indexer), missing
   4755

~/sandbox/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_indexer_non_unique()
    305             # increasing, then use binary search for each starget
    306             for starget in stargets:
--> 307                 start = values.searchsorted(starget, side='left')
    308                 end = values.searchsorted(starget, side='right')
    309                 if start != end:

TypeError: '<' not supported between instances of 'str' and 'int'

I think that should coerce both to object-dtype index before going too far.
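As a user-level illustration of that idea (a sketch of a workaround, not the eventual fix), coercing the categorical index to object dtype before the concat avoids the mixed-type reindexing path:

import pandas as pd

a = pd.DataFrame({"A": [1, 2]})
b = pd.DataFrame({"B": [3, 4]}, index=pd.CategoricalIndex(["a", "b"]))

# Coerce the categorical index to a plain object-dtype Index up front, so both
# frames go through the ordinary (non-categorical) reindexing path in concat.
b.index = b.index.astype(object)
result = pd.concat([a, b], axis=1)  # no longer hits the CategoricalIndex reindex error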

@TomAugspurger TomAugspurger added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Aug 14, 2019
@BaruchYoussin

Here is another bug, which is probably another face of the same issue:

Aggregating a categorical column in a groupby yields a Categorical object rather than a Series.

Here is the code:

df = pd.DataFrame({
    'nr': [1, 2, 3, 4, 5, 6, 7, 8],
    'cat_ord': list('aabbccdd'),
    'cat': list('aaaabbbb')
})
df = df.astype({'cat': 'category', 'cat_ord': 'category'})
df['cat_ord'] = df['cat_ord'].cat.as_ordered()
df

   nr cat_ord cat
0   1       a   a
1   2       a   a
2   3       b   a
3   4       b   a

df.groupby('cat_ord')['cat'].first()

[a, a, b, b]
Categories (2, object): [a, b]

type(df.groupby('cat_ord')['cat'].first())

<class 'pandas.core.arrays.categorical.Categorical'>

@BaruchYoussin

I suspect that this bug was introduced by the following change in 0.25.

In pandas 0.24, aggregation in a groupby dropped the categorical dtype, replacing it with object (i.e. plain strings).

Version 0.25 introduced additional code to restore the original dtype. It seems to me that this code is to blame for the bug.

I also forgot to mention that aggregating categorical columns takes a huge amount of time on large datasets, far more than for other column types. For this reason I bypass it entirely in my code, aggregating string columns instead, despite the memory cost.
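A minimal sketch of that kind of bypass (illustrative only; the column names and dtype round-trip follow the example frame from the top of this issue):

import pandas as pd

# Rebuild the example frame from the top of this issue.
df = pd.DataFrame({'nr': [1, 2, 3, 4, 5, 6, 7, 8],
                   'cat_ord': list('aabbccdd'),
                   'cat': list('aaaabbbb')})
df = df.astype({'cat': 'category', 'cat_ord': 'category'})
df['cat_ord'] = df['cat_ord'].cat.as_ordered()

# Bypass sketch: aggregate an object-dtype copy of the ordered categorical
# column, then restore the original dtype on the result columns.
cat_dtype = df['cat_ord'].dtype                        # the original ordered CategoricalDtype
tmp = df.assign(cat_ord=df['cat_ord'].astype(object))  # aggregate as plain strings
agg = tmp.groupby('cat').agg({'nr': 'min', 'cat_ord': ['min', 'max']})
agg[('cat_ord', 'min')] = agg[('cat_ord', 'min')].astype(cat_dtype)
agg[('cat_ord', 'max')] = agg[('cat_ord', 'max')].astype(cat_dtype)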

@mroeschke
Member

Looks like the examples work on master. Could use tests.

In [125]: df = pd.DataFrame({
     ...:     'nr': [1,2,3,4,5,6,7,8],
     ...:     'cat_ord': list('aabbccdd'),
     ...:     'cat':list('aaaabbbb')
     ...: })
     ...: df = df.astype({'cat': 'category', 'cat_ord': 'category'})
     ...: df['cat_ord'] = df['cat_ord'].cat.as_ordered()

In [126]: df.groupby('cat').agg({'nr': 'min', 'cat_ord': ['min', 'max']})
     ...:
Out[126]:
     nr cat_ord
    min     min max
cat
a     1       a   b
b     5       c   d

In [127]: df.groupby('cat').agg({'nr': ['min', 'max'], 'cat_ord': 'min'})
     ...:
Out[127]:
     nr     cat_ord
    min max     min
cat
a     1   4       a
b     5   8       c

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Categorical Categorical Data Type Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jun 28, 2020
@kpflugshaupt
Contributor Author

This seems to be solved in the current production release.

In my testing, all cases from the initial report came out correctly, with no regressions.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : German_Switzerland.1252
pandas : 1.0.5
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 46.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : 1.2.9
numba : None

Well done, whoever fixed this!

@mroeschke: Should I close this, or would you rather wait for more positive test results? (I don't have an installation of the development branch, so I cannot test on that.)

@BaruchYoussin

Works fine on 1.0.4.

@mroeschke
Member

@kpflugshaupt we'll want to add a regression test to ensure that this issue doesn't occur again. Are you interested in submitting a PR with that test?
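A regression test along those lines might look roughly like the following sketch (the test name and assertions are illustrative; the test eventually merged via #35630 may differ):

import pandas as pd


def test_agg_ordered_categorical_keeps_grouping_index():
    # Sketch of a regression test for GH #27800: combining a numeric aggregation
    # with aggregations on an ordered categorical column should not raise, and
    # the grouping index must not be dropped.
    df = pd.DataFrame({
        "nr": [1, 2, 3, 4, 5, 6, 7, 8],
        "cat_ord": list("aabbccdd"),
        "cat": list("aaaabbbb"),
    })
    df = df.astype({"cat": "category", "cat_ord": "category"})
    df["cat_ord"] = df["cat_ord"].cat.as_ordered()

    result = df.groupby("cat").agg({"nr": "min", "cat_ord": ["min", "max"]})

    assert result.index.name == "cat"
    assert list(result.index) == ["a", "b"]
    assert list(result[("nr", "min")]) == [1, 5]
    assert list(result[("cat_ord", "min")]) == ["a", "c"]
    assert list(result[("cat_ord", "max")]) == ["b", "d"]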

@kpflugshaupt
Contributor Author

kpflugshaupt commented Jul 7, 2020

@kpflugshaupt we'll want to add a regression test to ensure that this issue doesn't occur again. Are you interested in submitting a PR with that test?

@mroeschke Not sure I'll have the time -- work & family tend to get in the way. Possibly in the week starting July 20th. I'll try!

@mathurk1
Contributor

mathurk1 commented Aug 3, 2020

take

@mathurk1 mathurk1 removed their assignment Aug 7, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.2 Aug 10, 2020
@jreback jreback added Categorical Categorical Data Type Groupby labels Aug 10, 2020