Regression in 0.24: TypeError exception when using dropna on dataframe with categorical index #25087

Closed
hantoine opened this issue Feb 1, 2019 · 10 comments · Fixed by #27100
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Interval Interval data type Regression Functionality that used to work in a prior pandas version

@hantoine

hantoine commented Feb 1, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

dd = pd.DataFrame(np.arange(10))
dd['x2'] = dd[0] * dd[0]
dd['q'] = pd.qcut(dd['x2'], 5)
dd.set_index('q', inplace=True)
dd.dropna()

Problem description

The call to dropna raised the following exception:

TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'

This seems to happen only with a categorical index for which the intervals are not all of the same length.
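
To illustrate the unequal interval lengths mentioned above, here is a minimal sketch that only inspects the bins `pd.qcut` produces for the same squared values. It does not call `dropna`, so it should run on any pandas version; the names `squares`, `binned` and `widths` are chosen for illustration and are not part of the original report.

```python
import numpy as np
import pandas as pd

# Same data as the report: squares of 0..9, split into 5 quantile bins.
squares = pd.Series(np.arange(10) ** 2)
binned = pd.qcut(squares, 5)

# The categories of the result form an IntervalIndex; `.length` gives each bin's width.
widths = np.asarray(binned.cat.categories.length)

print(binned.cat.categories)
print(widths)  # the widths differ because the data is not evenly spaced
```

Because quantile bins hold equal counts and not equal ranges, the bin edges fall at non-integer positions and the widths vary from bin to bin.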

The same code ran without error in version 0.23.4, and the issue is still present on master.

Expected Output

No exception should be raised.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 25ff472
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.19.15-300.fc29.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

pandas: 0.25.0.dev0+44.g25ff47292
pytest: None
pip: 19.0.1
setuptools: 40.6.3
Cython: 0.29.4
numpy: 1.16.1
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@rs2
Contributor

rs2 commented Feb 2, 2019

Referencing #24048, which changed the .dropna() call stack between 0.23.4 and 0.24.0.

@rs2
Contributor

rs2 commented Feb 2, 2019

CC @jreback @makbigc

@samuelsinayoko
Contributor

I have a PR out for this: #25090.
I have added 3 tests and made them pass, but the changes to the interval range get_indexer are still lacking. I would welcome some direction on that, as I'm not too familiar with get_indexer.

@jorisvandenbossche
Member

@jreback if you are changing milestones for regressions, can you leave them on 0.24.3?

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.25.0, 0.24.3 Mar 12, 2019
@jreback
Contributor

jreback commented Mar 12, 2019

@jorisvandenbossche we are highly unlikely to do a 0.24.3

@jorisvandenbossche
Member

That's not how I interpreted the discussion from the dev meeting (but we also didn't discuss it in detail). In any case, if you don't want to tag remaining regressions with 0.24.3, can you raise that in the 0.24.x release issue, #24949?

@Froskekongen

I just wanted to check the status of this issue. Will there be a 0.24.3 release addressing it? Personally, I think it's an important issue to deal with for using machine learning with pandas. Are there other PRs that may fix the failing builds on Windows/macOS that can be merged into #25090?

@alexdevmotion

I'm running into the same error for this sequence:

import pandas as pd
display(pd.__version__)
binned_series_1 = pd.qcut([1,2,3], 2)
binned_series_2 = pd.qcut([4,5,6], 2)
ct = pd.crosstab(binned_series_1, binned_series_2)

'0.24.2'


TypeError Traceback (most recent call last)
in ()
2 binned_series_1 = pd.qcut([1,2,3], 2)
3 binned_series_2 = pd.qcut([4,5,6], 2)
----> 4 ct = pd.crosstab(binned_series_1, binned_series_2)

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\reshape\pivot.py in crosstab(index, columns, values, rownames, colnames, aggfunc, margins, margins_name, dropna, normalize)
519 table = df.pivot_table('dummy', index=rownames, columns=colnames,
520 margins=margins, margins_name=margins_name,
--> 521 dropna=dropna, **kwargs)
522
523 # Post-process

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\frame.py in pivot_table(self, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)
5757 aggfunc=aggfunc, fill_value=fill_value,
5758 margins=margins, dropna=dropna,
-> 5759 margins_name=margins_name)
5760
5761 def stack(self, level=-1, dropna=True):

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\reshape\pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)
145 # GH 15193 Make sure empty columns are removed if dropna=True
146 if isinstance(table, ABCDataFrame) and dropna:
--> 147 table = table.dropna(how='all', axis=1)
148
149 return table

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\frame.py in dropna(self, axis, how, thresh, subset, inplace)
4596 raise TypeError('must specify how or thresh')
4597
-> 4598 result = self.loc(axis=axis)[mask]
4599
4600 if inplace:

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
1498
1499 maybe_callable = com.apply_if_callable(key, self.obj)
-> 1500 return self._getitem_axis(maybe_callable, axis=axis)
1501
1502 def _is_scalar_access(self, key):

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1857 axis = self.axis or 0
1858
-> 1859 if is_iterator(key):
1860 key = list(key)
1861

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\dtypes\inference.py in is_iterator(obj)
155 # Python 3 generators have
156 # __next__ instead of next
--> 157 return hasattr(obj, '__next__')
158
159

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5063 return object.__getattribute__(self, name)
5064 else:
-> 5065 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5066 return self[name]
5067 return object.__getattribute__(self, name)

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\indexes\base.py in _can_hold_identifiers_and_holds_name(self, name)
3983 """
3984 if self.is_object() or self.is_categorical():
-> 3985 return name in self
3986 return False
3987

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\indexes\category.py in __contains__(self, key)
325 return self.hasnans
326
--> 327 return contains(self, key, container=self._engine)
328
329 @Appender(_index_shared_docs['contains'] % _index_doc_kwargs)

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\arrays\categorical.py in contains(cat, key, container)
186 # can't be in container either.
187 try:
--> 188 loc = cat.categories.get_loc(key)
189 except KeyError:
190 return False

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\indexes\interval.py in get_loc(self, key, method)
768 key = self._maybe_cast_slice_bound(key, 'left', None)
769
--> 770 start, stop = self._find_non_overlapping_monotonic_bounds(key)
771
772 if start is None or stop is None:

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\indexes\interval.py in _find_non_overlapping_monotonic_bounds(self, key)
715 # scalar or index-like
716
--> 717 start = self._searchsorted_monotonic(key, 'left')
718 stop = self._searchsorted_monotonic(key, 'right')
719 return start, stop

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\indexes\interval.py in _searchsorted_monotonic(self, label, side, exclude_label)
679 label = _get_prev_label(label)
680
--> 681 return sub_idx._searchsorted_monotonic(label, side)
682
683 def _get_loc_only_exact_matches(self, key):

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\indexes\base.py in _searchsorted_monotonic(self, label, side)
4754 def _searchsorted_monotonic(self, label, side='left'):
4755 if self.is_monotonic_increasing:
-> 4756 return self.searchsorted(label, side=side)
4757 elif self.is_monotonic_decreasing:
4758 # np.searchsorted expects ascending sort order, have to reverse

~\AppData\Local\conda\conda\envs\main\lib\site-packages\pandas\core\base.py in searchsorted(self, value, side, sorter)
1499 def searchsorted(self, value, side='left', sorter=None):
1500 # needs coercion on the key (DatetimeIndex does already)
-> 1501 return self._values.searchsorted(value, side=side, sorter=sorter)
1502
1503 def drop_duplicates(self, keep='first', inplace=False):

TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'
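
Reading the traceback above from the bottom up, the failure appears to come from a plain membership test: `hasattr(obj, '__next__')` on the boolean mask ends up evaluating `'__next__' in index`, and the lookup of a string inside interval categories raises instead of answering "no". The sketch below isolates that check. `safe_contains` is a hypothetical helper written for this illustration, not a pandas function; on affected versions it swallows the `TypeError`, while on versions where the lookup already returns False it changes nothing.

```python
import pandas as pd


def safe_contains(index, key):
    """Hypothetical helper: membership test that treats a TypeError as 'not present'."""
    try:
        return key in index
    except TypeError:
        # On affected versions the string-vs-float comparison fails here.
        return False


idx = pd.CategoricalIndex(pd.IntervalIndex.from_breaks([0, 2.78, 3.14, 6.28]))

has_dunder = safe_contains(idx, "__next__")  # a string can never be one of the intervals
has_interval = safe_contains(idx, idx[0])    # an actual category is found

print(has_dunder, has_interval)
```

This only demonstrates where the exception originates; it is not a proposed patch.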

@jreback jreback modified the milestones: 0.24.3, Contributions Welcome Apr 20, 2019
@jschendel
Member

This will be fixed by #27100.

A more concise equivalent example for testing purposes:

In [1]: import pandas as pd; pd.__version__                                                                             
Out[1]: '0.24.2'

In [2]: idx = pd.CategoricalIndex(pd.IntervalIndex.from_breaks([0, 2.78, 3.14, 6.28]))                                  

In [3]: df = pd.DataFrame({'A': list('abc')}, index=idx)                                                                

In [4]: df                                                                                                              
Out[4]: 
              A
(0.0, 2.78]   a
(2.78, 3.14]  b
(3.14, 6.28]  c

In [5]: df.dropna()                                                                                                     
---------------------------------------------------------------------------
TypeError: Cannot cast array data from dtype('float64') to dtype('<U32') according to the rule 'safe'
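
The concise example above could be turned into a regression check along the following lines. This is a sketch under the assumption that `dropna` on a frame without missing values should return an equal frame; it is not the test added in #27100, and the function name is made up for illustration.

```python
import pandas as pd


def check_dropna_keeps_categorical_interval_index():
    # Categorical index whose categories are intervals of unequal length.
    idx = pd.CategoricalIndex(pd.IntervalIndex.from_breaks([0, 2.78, 3.14, 6.28]))
    df = pd.DataFrame({"A": list("abc")}, index=idx)

    # There are no missing values, so dropna should hand back an equal frame.
    result = df.dropna()
    pd.testing.assert_frame_equal(result, df)
    return result


result = check_dropna_keeps_categorical_interval_index()
print(result)
```

On an affected version (0.24.x) the call to `dropna` inside the function is expected to raise the `TypeError` shown above, so the check fails before reaching the comparison.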
