Inconsistent behaviour when calling apply() on a categorical column with missing data #20714

mojones · 2018-04-16T14:18:22Z

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series(['1-1','1-1',np.NaN], dtype='category')
>>> s1.apply(lambda x: x.split('-')[0])
0      1
1      1
2    NaN
dtype: category
Categories (1, object): [1]
>>> s2 = pd.Series(['1-1','1-2',np.NaN], dtype='category')
>>> s2.apply(lambda x: x.split('-')[0])
0    1
1    1
2    1
dtype: object

Problem description

In the above code, s1 shows the expected behaviour. We are trying to transform a categorical series by getting the part before the hyphen, and for rows where the original value is NaN the output is also NaN.

The series s2 shows the unexpected behaviour. It differs from s1 in only one respect: the middle value is '1-2' instead of '1-1'. The third value, which was NaN in the original series, now becomes '1' in the output rather than staying NaN. The dtype of the result is also object rather than category. It looks as though the NaN is somehow getting the applied value of the previous row.

Expected Output

0      1
1      1
2    NaN

Output of `pd.show_versions()`

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: 1.3.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

chris-b1 · 2018-04-16T20:32:05Z

Thanks for the report. This is partially a symptom of, or at least related to, #15706, but that is more of an API issue, whereas this is an actual bug.

If mapping the function over the categories produces non-unique values, we take from the mapped categories using the codes. We currently do that with np.take, which wraps around the -1 code used for missing values, so the missing row picks up the last mapped category. We should use our take_1d instead.

pandas/pandas/core/arrays/categorical.py

Line 1156 in 4a34497

return np.take(new_categories, self._codes)
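The wraparound described above can be reproduced with NumPy alone. The following is a minimal sketch, not the pandas code: the `new_categories` and `codes` arrays are hand-written stand-ins for what s2 would hold internally, and the "fixed" variant uses plain masking rather than the internal take_1d helper, whose exact signature is not shown in this thread.

```python
import numpy as np

# Assumed stand-ins for s2 from the report: the mapped categories
# ('1-1' -> '1', '1-2' -> '1') and the codes for ['1-1', '1-2', NaN].
new_categories = np.array(["1", "1"], dtype=object)
codes = np.array([0, 1, -1], dtype=np.int8)  # -1 marks the missing value

# np.take accepts negative indices, so -1 selects the last category
# instead of signalling "missing".
wrapped = np.take(new_categories, codes)
print(wrapped)

# A missing-aware take (illustrative only): index as before, then
# overwrite the positions whose code is -1 with NaN.
fixed = np.take(new_categories, codes)
fixed[codes == -1] = np.nan
print(fixed)
```

In this sketch the third element of `wrapped` is the last mapped category, which matches the symptom in the report, while `fixed` keeps NaN in that position.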

ladydata · 2018-04-18T03:41:47Z

Hey @chris-b1 can I work on this bug? I will aim to have it done before the next major release.
cc: @geoninja

chris-b1 · 2018-04-18T11:15:25Z

Yes, please do!

nprad · 2018-05-13T20:14:41Z

Hello @chris-b1. Is @ladydata still working on this issue? If not, can I take this issue up?

ladydata · 2018-05-14T04:17:32Z

Hi @nprad, I'm participating in a sprint tomorrow and I plan to work on this issue. Please don't be discouraged: there are more than 2000 issues waiting to be worked on, and you will surely find something else that interests you. Also, if for any reason I'm unable to work on this issue specifically, I will ping you, but as I said, it has been in my plan.

ladydata · 2018-05-16T04:39:05Z

Hello @chris-b1, I currently have the result below. Should s2 be of the category type without casting? If so, any suggestion on the best way to approach this?

>>> import pandas as pd
>>> import numpy as np

>>> s1 = pd.Series(['1-1','1-1',np.NaN], dtype='category')
>>> s1.apply(lambda x: x.split('-')[0])
0      1
1      1
2    NaN
dtype: category
Categories (1, object): [1]

>>> s2 = pd.Series(['1-1','1-2',np.NaN], dtype='category')
>>> s2.apply(lambda x: x.split('-')[0])
0      1
1      1
2    NaN
dtype: object

>>> s2.apply(lambda x: x.split('-')[0]).astype('category')
0      1
1      1
2    NaN
dtype: category
Categories (1, object): [1]

chris-b1 · 2018-05-16T13:56:12Z

Yes, that's consistent with the current API - there's certainly an argument for changing it (#15706), but to just fix the bug that behavior is fine.

In [4]: s1 = pd.Series(['1-1','1-2'], dtype='category')

In [5]: s1.apply(lambda x: x.split('-')[0])
Out[5]:
0    1
1    1
dtype: object
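The rule discussed in this exchange (the result stays categorical only while the mapped categories remain unique, otherwise it falls back to object, and missing values stay missing either way) can be modelled in a few lines of plain Python. This is a toy model for illustration only: `apply_to_categorical` is a hypothetical helper, not a pandas function, and it returns a dtype name as a string instead of building a real Series.

```python
def apply_to_categorical(categories, codes, func):
    """Toy model (not pandas code) of applying func to a categorical.

    categories: the unique values; codes: one index per row, -1 for missing.
    Returns (dtype_name, values).
    """
    # Apply func once per category rather than once per row.
    mapped = [func(c) for c in categories]
    # Materialise the rows, treating -1 as missing instead of "last item".
    values = [mapped[code] if code != -1 else float("nan") for code in codes]
    # Stay categorical only if the mapped categories are still unique.
    dtype = "category" if len(set(mapped)) == len(mapped) else "object"
    return dtype, values


def first_part(x):
    return x.split("-")[0]


# s1 from the report: one category, so the mapping stays unique.
dtype1, values1 = apply_to_categorical(["1-1"], [0, 0, -1], first_part)
# s2 from the report: two categories that both map to '1'.
dtype2, values2 = apply_to_categorical(["1-1", "1-2"], [0, 1, -1], first_part)
print(dtype1, values1)
print(dtype2, values2)
```

Under this model both series keep NaN in the last row, and only the dtype differs between them, which is the behaviour described as acceptable for the bug fix.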

ladydata · 2018-05-16T18:53:27Z

@chris-b1 great! I will write the tests based on that. I will also take a look at the older issue you mentioned.

manuhortet · 2018-07-13T09:23:41Z

Hey @ladydata, did you write those tests? If not, I'm happy to take on the task.

ladydata · 2018-07-13T22:52:19Z

hey @manuhortet, yes I did write a couple of tests but haven't completed the process. I will set a deadline to complete by the end of the month, and if it's not done by then, from August 1st it's all yours, sounds good?

manuhortet · 2018-08-01T07:30:18Z

Hi again @ladydata, should I take this already? 😄

ladydata · 2018-08-01T20:05:21Z

@manuhortet sure, go ahead!

chris-b1 added the Bug, Categorical, Effort Low, and good first issue labels Apr 16, 2018

chris-b1 added this to the Next Major Release milestone Apr 16, 2018

manuhortet added a commit to manuhortet/pandas that referenced this issue Aug 3, 2018

Inconsistent behaviour when calling apply() on a categorical column w…

b98ab11

…ith missing data pandas-dev#20714 SOLVED

manuhortet mentioned this issue Aug 3, 2018

Inconsistent behaviour when apply() used on categorical with NaN values #22191

Closed

4 tasks

batterseapower mentioned this issue Aug 28, 2018

Series.map on a categorical does not process missing values #22527

Closed

alimcmaster1 mentioned this issue Feb 2, 2019

TST: tests for categorical apply #25095

Merged

2 tasks

jreback modified the milestones: Contributions Welcome, 0.25.0 Feb 2, 2019

jreback closed this as completed in #25095 Feb 3, 2019
