
BUG: aggregation on ordered categorical column drops grouping index or crashes, depending on context #27800

Closed
kpflugshaupt opened this issue Aug 7, 2019 · 14 comments · Fixed by #35630
Labels
Categorical Categorical Data Type good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions

Comments

@kpflugshaupt
Contributor

Code Sample

Build the example data frame:

df = pd.DataFrame({
    'nr': [1, 2, 3, 4, 5, 6, 7, 8],
    'cat_ord': list('aabbccdd'),
    'cat': list('aaaabbbb')
})
df = df.astype({'cat': 'category', 'cat_ord': 'category'})
df['cat_ord'] = df['cat_ord'].cat.as_ordered()

When grouping, single aggregations on a numeric column work:

df.groupby('cat').agg({'nr': 'min'})
     nr
cat
a     1
b     5

Single aggregations on an ordered categorical column work, but drop the grouping index:

df.groupby('cat').agg({'cat_ord': 'min'})
   cat_ord
0        a
1        c

Combined single aggregations on a numeric and an ordered categorical column work:

df.groupby('cat').agg({'nr': 'min', 'cat_ord': 'min'})
     nr cat_ord
cat
a     1       a
b     5       c

Multiple aggregations on an ordered categorical column work, but drop the grouping index:

df.groupby('cat').agg({'cat_ord': ['min', 'max']})
  cat_ord
      min max
0       a   b
1       c   d

Combined aggregations on a numeric (single) and an ordered categorical column (multiple) fail with a TypeError:

df.groupby('cat').agg({'nr': 'min', 'cat_ord': ['min', 'max']})
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-148-b446de510106> in <module>
----> 1 df.groupby('cat').agg({'nr': 'min', 'cat_ord': ['min', 'max']})

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\generic.py in aggregate(self, arg, *args, **kwargs)
   1453     @Appender(_shared_docs["aggregate"])
   1454     def aggregate(self, arg=None, *args, **kwargs):
-> 1455         return super().aggregate(arg, *args, **kwargs)
   1456 
   1457     agg = aggregate

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, *args, **kwargs)
    227         func = _maybe_mangle_lambdas(func)
    228 
--> 229         result, how = self._aggregate(func, _level=_level, *args, **kwargs)
    230         if how is None:
    231             return result

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\base.py in _aggregate(self, arg, *args, **kwargs)
    528                 # return a MI DataFrame
    529 
--> 530                 return concat([result[k] for k in keys], keys=keys, axis=1), True
    531 
    532             elif isinstance(self, ABCSeries) and is_any_series():

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    256     )
    257 
--> 258     return op.get_result()
    259 
    260 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in get_result(self)
    466                     obj_labels = mgr.axes[ax]
    467                     if not new_labels.equals(obj_labels):
--> 468                         indexers[ax] = obj_labels.reindex(new_labels)[1]
    469 
    470                 mgrs_indexers.append((obj._data, indexers))

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\category.py in reindex(self, target, method, level, limit, tolerance)
    616                 # coerce to a regular index here!
    617                 result = Index(np.array(self), name=self.name)
--> 618                 new_target, indexer, _ = result._reindex_non_unique(np.array(target))
    619             else:
    620 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in _reindex_non_unique(self, target)
   3434 
   3435         target = ensure_index(target)
-> 3436         indexer, missing = self.get_indexer_non_unique(target)
   3437         check = indexer != -1
   3438         new_labels = self.take(indexer[check])

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_indexer_non_unique(self, target)
   4792             tgt_values = target._ndarray_values
   4793 
-> 4794         indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
   4795         return ensure_platform_int(indexer), missing
   4796 

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_indexer_non_unique()

TypeError: '<' not supported between instances of 'str' and 'int'

Combined aggregations on a numeric (multiple) and an ordered categorical column (single) also fail with the same TypeError:

df.groupby('cat').agg({'nr': ['min', 'max'], 'cat_ord': 'min'})
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-164-b1d70184bd81> in <module>
----> 1 df.groupby('cat').agg({'nr': ['min', 'max'], 'cat_ord': 'min'})

...

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_indexer_non_unique()

TypeError: '<' not supported between instances of 'str' and 'int'

Problem description

Aggregations on ordered categoricals drop the grouping index, or crash, as shown above.

This makes it hard to calculate combined aggregations over big data sets correctly and efficiently.

Expected Output

Aggregations on ordered categoricals should work as on non-categorical columns.
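For concreteness, a minimal sketch of the expected result for the combined call (values follow from the working cases above; the exact dtypes of the result are an assumption):

import pandas as pd

# Sketch of the frame the failing call should return: grouping index preserved,
# numeric and ordered-categorical aggregations side by side.
expected = pd.DataFrame(
    {("nr", "min"): [1, 5],
     ("cat_ord", "min"): ["a", "c"],
     ("cat_ord", "max"): ["b", "d"]},
    index=pd.CategoricalIndex(["a", "b"], name="cat"),
)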

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.0
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : 0.29.12
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.4
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.1
numexpr : 2.6.9
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : 0.11.1
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : 1.3.5
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8

@TomAugspurger TomAugspurger added Categorical Categorical Data Type Groupby labels Aug 7, 2019
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Aug 7, 2019
@TomAugspurger
Contributor

Thanks for the report. Are you interested in debugging to see where things go wrong?

@bongolegend
Contributor

bongolegend commented Aug 13, 2019

@kpflugshaupt I ran your code samples on pandas 0.24.2 and did not get any errors. Some later update must have broken this. I can look into it.

@TomAugspurger is there an easy way to see which changes between 0.24.2 and 0.25.0 could have affected this functionality? The whatsnew docs (which could mean a lot of combing)?

@kpflugshaupt
Contributor Author

Hi @ncernek ,

I also cannot recall ever seeing this before 0.25.0, so the upgrade has probably introduced it.

As to how to tackle this: given time, I would debug the call and either catch where it goes wrong (comparing against a working call), or list all the files visited and cross-reference them with the whatsnew file.

Thanks for looking into this! I may also have a go, but not in the next few days -- too much work.

Cheers
Kaspar

@TomAugspurger
Contributor

You can also use git bisect to find the faulty commit. You might be able to look through PRs in the git log that mention groupby or categorical to narrow down the bisect range.
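A minimal sketch of that bisect workflow, assuming the failing case from this report (the script name and the surrounding git commands are illustrative, and each bisect step would also need the pandas C extensions rebuilt):

# Check script for `git bisect run`: exits non-zero when the bug reproduces.
# Illustrative usage:
#   git bisect start
#   git bisect bad v0.25.0
#   git bisect good v0.24.2
#   git bisect run python bisect_check.py
import sys

import pandas as pd

df = pd.DataFrame({
    'nr': [1, 2, 3, 4, 5, 6, 7, 8],
    'cat_ord': list('aabbccdd'),
    'cat': list('aaaabbbb'),
})
df = df.astype({'cat': 'category', 'cat_ord': 'category'})
df['cat_ord'] = df['cat_ord'].cat.as_ordered()

try:
    df.groupby('cat').agg({'nr': 'min', 'cat_ord': ['min', 'max']})
except TypeError:
    sys.exit(1)  # bad commit: the aggregation raises
sys.exit(0)      # good commit: the aggregation succeeds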

@bongolegend
Contributor

I went the git bisect route. I wrote a single failing test for this condition:

df.groupby("cat").agg({"nr": "min", "cat_ord": ["min", "max"]})

Result of git bisect

d0292fe1e12f1d460e52df4da4250bef32324579 is the first bad commit
commit d0292fe1e12f1d460e52df4da4250bef32324579
Author: Jeff Reback <jeff@reback.net>
Date:   Thu Jun 27 18:13:27 2019 -0500

    BUG: preserve categorical & sparse types when grouping / pivot (#27071)

Link to that PR: #27071.

It looks like a lot of thinking went into that PR; maybe those of you who worked on it (e.g. @jreback) want to look into this bug further?

@TomAugspurger
Contributor

Thanks @ncernek. At least for this specific one, it comes from how concat handles a mix of categorical and non-categorical indexes:

In [2]: a = pd.DataFrame({"A": [1, 2]})

In [3]: b = pd.DataFrame({"B": [3, 4]}, index=pd.CategoricalIndex(['a', 'b']))

In [4]: pd.concat([a, b], axis=1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-a98a78ec2995> in <module>
----> 1 pd.concat([a, b], axis=1)

~/sandbox/pandas/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    256     )
    257
--> 258     return op.get_result()
    259
    260

~/sandbox/pandas/pandas/core/reshape/concat.py in get_result(self)
    466                     obj_labels = mgr.axes[ax]
    467                     if not new_labels.equals(obj_labels):
--> 468                         indexers[ax] = obj_labels.reindex(new_labels)[1]
    469
    470                 mgrs_indexers.append((obj._data, indexers))

~/sandbox/pandas/pandas/core/indexes/category.py in reindex(self, target, method, level, limit, tolerance)
    615                 # coerce to a regular index here!
    616                 result = Index(np.array(self), name=self.name)
--> 617                 new_target, indexer, _ = result._reindex_non_unique(np.array(target))
    618             else:
    619

~/sandbox/pandas/pandas/core/indexes/base.py in _reindex_non_unique(self, target)
   3388
   3389         target = ensure_index(target)
-> 3390         indexer, missing = self.get_indexer_non_unique(target)
   3391         check = indexer != -1
   3392         new_labels = self.take(indexer[check])

~/sandbox/pandas/pandas/core/indexes/base.py in get_indexer_non_unique(self, target)
   4751             tgt_values = target._ndarray_values
   4752
-> 4753         indexer, missing = self._engine.get_indexer_non_unique(tgt_values)
   4754         return ensure_platform_int(indexer), missing
   4755

~/sandbox/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_indexer_non_unique()
    305             # increasing, then use binary search for each starget
    306             for starget in stargets:
--> 307                 start = values.searchsorted(starget, side='left')
    308                 end = values.searchsorted(starget, side='right')
    309                 if start != end:

TypeError: '<' not supported between instances of 'str' and 'int'

I think that should coerce both to object-dtype index before going too far.
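As a user-level illustration of that idea (a sketch of a workaround, not the eventual fix), coercing the categorical index to object dtype before the concat avoids the mixed-type reindexing path:

import pandas as pd

a = pd.DataFrame({"A": [1, 2]})
b = pd.DataFrame({"B": [3, 4]}, index=pd.CategoricalIndex(["a", "b"]))

# Coerce the categorical index to a plain object-dtype Index up front, so both
# frames go through the ordinary (non-categorical) reindexing path in concat.
b.index = b.index.astype(object)
result = pd.concat([a, b], axis=1)  # no longer hits the CategoricalIndex reindex error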

@TomAugspurger TomAugspurger added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Aug 14, 2019
@BaruchYoussin

Here is another bug, which is probably another face of the same issue:

Aggregating a categorical column in a groupby yields a Categorical object rather than a Series.

Here is the code:

df = pd.DataFrame({
    'nr': [1, 2, 3, 4, 5, 6, 7, 8],
    'cat_ord': list('aabbccdd'),
    'cat': list('aaaabbbb')
})
df = df.astype({'cat': 'category', 'cat_ord': 'category'})
df['cat_ord'] = df['cat_ord'].cat.as_ordered()
df

   nr cat_ord cat
0   1       a   a
1   2       a   a
2   3       b   a
3   4       b   a

df.groupby('cat_ord')['cat'].first()

[a, a, b, b]
Categories (2, object): [a, b]

type(df.groupby('cat_ord')['cat'].first())

<class 'pandas.core.arrays.categorical.Categorical'>

@BaruchYoussin

I suspect that this bug was introduced by the following change in 0.25.

In pandas 0.24, aggregation in a groupby dropped the categorical dtype, replacing it with object (i.e. plain strings).

Version 0.25 introduced additional code to restore the original dtype. It seems to me that this code is to blame for the bug.

I also forgot to mention that aggregating categorical columns takes a huge amount of time on large datasets, far more than for other column types. For this reason I bypass it entirely in my code, aggregating string columns instead, despite the memory cost.
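A minimal sketch of that kind of bypass (illustrative only; the column names and dtype round-trip follow the example frame from the top of this issue):

import pandas as pd

# Rebuild the example frame from the top of this issue.
df = pd.DataFrame({'nr': [1, 2, 3, 4, 5, 6, 7, 8],
                   'cat_ord': list('aabbccdd'),
                   'cat': list('aaaabbbb')})
df = df.astype({'cat': 'category', 'cat_ord': 'category'})
df['cat_ord'] = df['cat_ord'].cat.as_ordered()

# Bypass sketch: aggregate an object-dtype copy of the ordered categorical
# column, then restore the original dtype on the result columns.
cat_dtype = df['cat_ord'].dtype                        # the original ordered CategoricalDtype
tmp = df.assign(cat_ord=df['cat_ord'].astype(object))  # aggregate as plain strings
agg = tmp.groupby('cat').agg({'nr': 'min', 'cat_ord': ['min', 'max']})
agg[('cat_ord', 'min')] = agg[('cat_ord', 'min')].astype(cat_dtype)
agg[('cat_ord', 'max')] = agg[('cat_ord', 'max')].astype(cat_dtype)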

@mroeschke
Member

Looks like the examples work on master. Could use tests.

In [125]: df = pd.DataFrame({
     ...:     'nr': [1,2,3,4,5,6,7,8],
     ...:     'cat_ord': list('aabbccdd'),
     ...:     'cat':list('aaaabbbb')
     ...: })
     ...: df = df.astype({'cat': 'category', 'cat_ord': 'category'})
     ...: df['cat_ord'] = df['cat_ord'].cat.as_ordered()

In [126]: df.groupby('cat').agg({'nr': 'min', 'cat_ord': ['min', 'max']})
     ...:
Out[126]:
     nr cat_ord
    min     min max
cat
a     1       a   b
b     5       c   d

In [127]: df.groupby('cat').agg({'nr': ['min', 'max'], 'cat_ord': 'min'})
     ...:
Out[127]:
     nr     cat_ord
    min max     min
cat
a     1   4       a
b     5   8       c

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Categorical Categorical Data Type Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jun 28, 2020
@kpflugshaupt
Contributor Author

This seems to be solved in the current production release.

In my testing, all cases from the initial report came out correctly, with no regressions.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : German_Switzerland.1252
pandas : 1.0.5
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 46.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.9
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : 1.2.9
numba : None

Well done, whoever fixed this!

@mroeschke: Should I close this, or would you rather wait for more positive test results? (I don't have an installation of the development branch, so I cannot test on that.)

@BaruchYoussin

Works fine on 1.0.4.

@mroeschke
Member

@kpflugshaupt we'll want to add a regression test to ensure that this issue doesn't occur again. Are you interested in submitting a PR with that test?
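A regression test along those lines might look roughly like the following sketch (the test name and assertions are illustrative; the test eventually merged via #35630 may differ):

import pandas as pd


def test_agg_ordered_categorical_keeps_grouping_index():
    # Sketch of a regression test for GH #27800: combining a numeric aggregation
    # with aggregations on an ordered categorical column should not raise, and
    # the grouping index must not be dropped.
    df = pd.DataFrame({
        "nr": [1, 2, 3, 4, 5, 6, 7, 8],
        "cat_ord": list("aabbccdd"),
        "cat": list("aaaabbbb"),
    })
    df = df.astype({"cat": "category", "cat_ord": "category"})
    df["cat_ord"] = df["cat_ord"].cat.as_ordered()

    result = df.groupby("cat").agg({"nr": "min", "cat_ord": ["min", "max"]})

    assert result.index.name == "cat"
    assert list(result.index) == ["a", "b"]
    assert list(result[("nr", "min")]) == [1, 5]
    assert list(result[("cat_ord", "min")]) == ["a", "c"]
    assert list(result[("cat_ord", "max")]) == ["b", "d"]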

@kpflugshaupt
Contributor Author

kpflugshaupt commented Jul 7, 2020

@kpflugshaupt we'll want to add a regression test to ensure that this issue doesn't occur again. Are you interested in submitting a PR with that test?

@mroeschke Not sure I'll have the time -- work & family tend to get in the way. Possibly in the week starting July 20th. I'll try!

@mathurk1
Contributor

mathurk1 commented Aug 3, 2020

take

@mathurk1 mathurk1 removed their assignment Aug 7, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.2 Aug 10, 2020
@jreback jreback added Categorical Categorical Data Type Groupby labels Aug 10, 2020