Regression in 0.24: TypeError exception when using dropna on dataframe with categorical index #25087

hantoine · 2019-02-01T22:34:45Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

dd = pd.DataFrame(np.arange(10))
dd['x2'] = dd[0] * dd[0]
dd['q'] = pd.qcut(dd['x2'], 5)
dd.set_index('q', inplace=True)
dd.dropna()

Problem description

The call to dropna raised the following exception:

TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'

This seems to happen only with a categorical index for which the intervals are not all of the same length.

There was no issue in version 0.23.4, and the issue is not yet fixed on master.

Expected Output

No exception should be raised.
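The expected behaviour can be written as a small regression check. This is only a sketch built from the code sample above: the helper name `make_frame` and the test name are hypothetical, and the assertion assumes that `dropna()` should return the frame unchanged because it contains no missing values.

```python
import numpy as np
import pandas as pd


def make_frame():
    """Rebuild the frame from the report (hypothetical helper name)."""
    dd = pd.DataFrame(np.arange(10))
    dd["x2"] = dd[0] * dd[0]
    # qcut on squared values yields interval bins of unequal width
    dd["q"] = pd.qcut(dd["x2"], 5)
    return dd.set_index("q")


def test_dropna_with_interval_categorical_index():
    dd = make_frame()
    # On an affected version this call is where the TypeError is reported
    result = dd.dropna()
    # No missing values are present, so nothing should be dropped
    pd.testing.assert_frame_equal(result, dd)
```

On a version without the regression the test should pass silently; on an affected version it is expected to fail with the TypeError quoted in the problem description.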

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: 25ff472
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.19.15-300.fc29.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

pandas: 0.25.0.dev0+44.g25ff47292
pytest: None
pip: 19.0.1
setuptools: 40.6.3
Cython: 0.29.4
numpy: 1.16.1
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


rs2 · 2019-02-02T12:15:58Z

Referencing #24048 that changed the .dropna() callstack between 0.23.4 and 0.24.0.

rs2 · 2019-02-02T12:18:34Z

This is caused by https://github.com/pandas-dev/pandas/pull/24048/files#diff-1e79abbbdd150d4771b91ea60a4e1cc7R4592

rs2 · 2019-02-02T12:19:19Z

CC @jreback @makbigc

samuelsinayoko · 2019-02-02T16:38:37Z

I have a PR out for this: #25090.
I have added 3 tests and made them pass, but the changes to the interval range get_indexer are lacking. I would welcome some direction on that, as I'm not too familiar with get_indexer.

jorisvandenbossche · 2019-03-12T12:59:43Z

@jreback if you are changing milestones for regressions, can you leave them on 0.24.3 ?

jreback · 2019-03-12T13:10:38Z

@jorisvandenbossche we are highly unlikely to do a 0.24.3

jorisvandenbossche · 2019-03-12T13:25:45Z

That's not how I interpreted the discussion from the dev meeting (but we also didn't discuss it in detail ..). In any case, if you don't want to tag remaining regressions with 0.24.3, can you raise that in the 0.24.x release issue: #24949 ?

Froskekongen · 2019-04-08T07:42:36Z

I just wanted to check the status on this issue. Will there be a 0.24.3 release addressing this? Personally, I think it's an important issue to deal with for using machine learning with pandas. Are there other PRs that may fix the failing builds on windows/macOS that can be merged into #25090?

alexdevmotion · 2019-04-18T18:26:01Z

I'm running into the same error for this sequence:

import pandas as pd
display(pd.__version__)
binned_series_1 = pd.qcut([1,2,3], 2)
binned_series_2 = pd.qcut([4,5,6], 2)
ct = pd.crosstab(binned_series_1, binned_series_2)

'0.24.2'

TypeError Traceback (most recent call last)
in ()
2 binned_series_1 = pd.qcut([1,2,3], 2)
3 binned_series_2 = pd.qcut([4,5,6], 2)
----> 4 ct = pd.crosstab(binned_series_1, binned_series_2)

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\reshape\pivot.py in crosstab(index, columns, values, rownames, colnames, aggfunc, margins, margins_name, dropna, normalize)
519 table = df.pivot_table('dummy', index=rownames, columns=colnames,
520 margins=margins, margins_name=margins_name,
--> 521 dropna=dropna, **kwargs)
522
523 # Post-process

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\frame.py in pivot_table(self, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)
5757 aggfunc=aggfunc, fill_value=fill_value,
5758 margins=margins, dropna=dropna,
-> 5759 margins_name=margins_name)
5760
5761 def stack(self, level=-1, dropna=True):

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\reshape\pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)
145 # GH 15193 Make sure empty columns are removed if dropna=True
146 if isinstance(table, ABCDataFrame) and dropna:
--> 147 table = table.dropna(how='all', axis=1)
148
149 return table

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\frame.py in dropna(self, axis, how, thresh, subset, inplace)
4596 raise TypeError('must specify how or thresh')
4597
-> 4598 result = self.loc(axis=axis)[mask]
4599
4600 if inplace:

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
1498
1499 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1500 return self._getitem_axis(maybe_callable, axis=axis)
1501
1502 def _is_scalar_access(self, key):

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1857 axis = self.axis or 0
1858
-> 1859 if is_iterator(key):
1860 key = list(key)
1861

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\dtypes\inference.py in is_iterator(obj)
155 # Python 3 generators have
156 # __next__ instead of next
--> 157 return hasattr(obj, '__next__')
158
159

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5063 return object.__getattribute__(self, name)
5064 else:
-> 5065 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5066 return self[name]
5067 return object.__getattribute__(self, name)

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\indexes\base.py in _can_hold_identifiers_and_holds_name(self, name)
3983 """
3984 if self.is_object() or self.is_categorical():
-> 3985 return name in self
3986 return False
3987

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\indexes\category.py in __contains__(self, key)
325 return self.hasnans
326
--> 327 return contains(self, key, container=self._engine)
328
329 @Appender(_index_shared_docs['contains'] % _index_doc_kwargs)

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\arrays\categorical.py in contains(cat, key, container)
186 # can't be in container either.
187 try:
--> 188 loc = cat.categories.get_loc(key)
189 except KeyError:
190 return False

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\indexes\interval.py in get_loc(self, key, method)
768 key = self._maybe_cast_slice_bound(key, 'left', None)
769
--> 770 start, stop = self._find_non_overlapping_monotonic_bounds(key)
771
772 if start is None or stop is None:

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\indexes\interval.py in _find_non_overlapping_monotonic_bounds(self, key)
715 # scalar or index-like
716
--> 717 start = self._searchsorted_monotonic(key, 'left')
718 stop = self._searchsorted_monotonic(key, 'right')
719 return start, stop

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\indexes\interval.py in _searchsorted_monotonic(self, label, side, exclude_label)
679 label = _get_prev_label(label)
680
--> 681 return sub_idx._searchsorted_monotonic(label, side)
682
683 def _get_loc_only_exact_matches(self, key):

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\indexes\base.py in _searchsorted_monotonic(self, label, side)
4754 def _searchsorted_monotonic(self, label, side='left'):
4755 if self.is_monotonic_increasing:
-> 4756 return self.searchsorted(label, side=side)
4757 elif self.is_monotonic_decreasing:
4758 # np.searchsorted expects ascending sort order, have to reverse

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\base.py in searchsorted(self, value, side, sorter)
1499 def searchsorted(self, value, side='left', sorter=None):
1500 # needs coercion on the key (DatetimeIndex does already)
-> 1501 return self._values.searchsorted(value, side=side, sorter=sorter)
1502
1503 def drop_duplicates(self, keep='first', inplace=False):

TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'
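Reading the traceback bottom-up, the chain appears to be: `is_iterator` calls `hasattr(obj, '__next__')`, which falls through to `__getattr__`, which asks whether the string `'__next__'` is a label in an interval-based index, and that lookup fails with a TypeError that `hasattr` does not swallow. The sketch below reproduces that shape in plain Python. All class names here (`ToyIntervalIndex`, `SafeToyIntervalIndex`, `ToyFrame`) are hypothetical stand-ins, not pandas code, and the guarded variant is just one possible defence, not necessarily what any of the linked PRs does.

```python
import bisect


class ToyIntervalIndex:
    """Hypothetical stand-in for an index whose labels are numeric intervals."""

    def __init__(self, breaks):
        self.breaks = list(breaks)  # sorted floats, e.g. [0.0, 2.78, 3.14, 6.28]

    def __contains__(self, key):
        # Comparing a str key against float breaks raises TypeError,
        # loosely mirroring the failed searchsorted call in the traceback.
        pos = bisect.bisect_left(self.breaks, key)
        return 0 < pos < len(self.breaks)


class SafeToyIntervalIndex(ToyIntervalIndex):
    """Variant that treats keys of an incomparable type as 'not present'."""

    def __contains__(self, key):
        try:
            return super().__contains__(key)
        except TypeError:
            return False


class ToyFrame:
    """Hypothetical container that resolves unknown attributes as labels."""

    def __init__(self, index):
        self._index = index

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails
        if name in self._index:  # may raise TypeError for str names
            return name
        raise AttributeError(name)


def is_iterator(obj):
    # Same shape as the check shown in the traceback
    return hasattr(obj, "__next__")
```

With the unguarded index, `is_iterator(ToyFrame(ToyIntervalIndex([0.0, 2.78, 3.14, 6.28])))` raises a TypeError, because `hasattr` only suppresses AttributeError. With `SafeToyIntervalIndex` the same call returns `False`.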

jschendel · 2019-06-28T19:46:03Z

This will be fixed by #27100.

A more concise equivalent example for testing purposes:

In [1]: import pandas as pd; pd.__version__                                                                             
Out[1]: '0.24.2'

In [2]: idx = pd.CategoricalIndex(pd.IntervalIndex.from_breaks([0, 2.78, 3.14, 6.28]))                                  

In [3]: df = pd.DataFrame({'A': list('abc')}, index=idx)                                                                

In [4]: df                                                                                                              
Out[4]: 
              A
(0.0, 2.78]   a
(2.78, 3.14]  b
(3.14, 6.28]  c

In [5]: df.dropna()                                                                                                     
---------------------------------------------------------------------------
TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'
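Until a fixed release is available, a possible workaround is to drop rows by position rather than through label-based indexing. The sketch below reuses the index from the example above. The helper name `dropna_rows_positional` is hypothetical, and the assumption that positional selection avoids the failing label lookup has not been verified against 0.24.x.

```python
import numpy as np
import pandas as pd

idx = pd.CategoricalIndex(pd.IntervalIndex.from_breaks([0, 2.78, 3.14, 6.28]))
# One missing value so that there is a row to drop
df = pd.DataFrame({"A": ["a", None, "c"]}, index=idx)


def dropna_rows_positional(frame):
    """Drop rows containing any missing value using integer positions only."""
    # Plain boolean ndarray, so no index labels are involved in the selection
    keep = frame.notna().all(axis=1).to_numpy()
    return frame.iloc[np.flatnonzero(keep)]


result = dropna_rows_positional(df)
```

This only covers the default `how='any'`, `axis=0` case; other `dropna` options would need their own mask.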

jorisvandenbossche added the europandas2019 label Feb 2, 2019

rs2 added a commit to rs2/pandas that referenced this issue Feb 2, 2019

pandas-dev#25087: Fix .dropna() functionality for categorical indices

09a9f80

rs2 mentioned this issue Feb 2, 2019

DO NOT MERGE. BUG: Fix .dropna() functionality for categorical indices #25091

Closed

4 tasks

rs2 added a commit to rs2/pandas that referenced this issue Feb 2, 2019

pandas-dev#25087: Favor .loc instead of _take

e0c720d

samuelsinayoko mentioned this issue Feb 2, 2019

BUG: IntervalIndex.get_loc/get_indexer wrong return value / error #25090

Closed

4 tasks

jreback added Indexing Related to indexing on series/frames, not to indexes themselves Interval Interval data type labels Feb 2, 2019

hantoine mentioned this issue Feb 5, 2019

TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'` blue-yonder/tsfresh#485

Closed

jorisvandenbossche added this to the 0.24.2 milestone Feb 7, 2019

jorisvandenbossche removed the europandas2019 label Feb 7, 2019

jorisvandenbossche mentioned this issue Feb 7, 2019

MultiIndex.__contains__ try-catch exception types too narrow #24570

Closed

jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Feb 7, 2019

jreback mentioned this issue Feb 11, 2019

[BUG] exception handling of MultiIndex.__contains__ too narrow #25268

Merged

jreback modified the milestones: 0.24.2, 0.25.0 Mar 12, 2019

jorisvandenbossche modified the milestones: 0.25.0, 0.24.3 Mar 12, 2019

jreback modified the milestones: 0.24.3, Contributions Welcome Apr 20, 2019

jschendel mentioned this issue Jun 28, 2019

API: Implement new indexing behavior for intervals #27100

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 0.25.0 Jul 1, 2019

jreback closed this as completed in #27100 Jul 2, 2019

jnj16180340 mentioned this issue Jul 23, 2019

tsfresh seems to be incompatible with latest versions of pandas blue-yonder/tsfresh#528

Closed
