Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH24241 make Categorical.map transform nans #24275

Merged
merged 5 commits into from
Dec 20, 2018

Conversation

JustinZhengBC
Copy link
Contributor

@JustinZhengBC JustinZhengBC commented Dec 13, 2018

Alters Categorical.map so that the mapper function is also applied to NaN values. The mentioned bug report brings up the example of calling apply(pd.isna) on a categorical series, arguing that the values and dtype of the returned list should be consistent with non-categorical series. This PR makes the values consistent. The discrepancies in dtypes are present in categorical series without NaN's and consistent with documentation.

@pep8speaks
Copy link

Hello @JustinZhengBC! Thanks for submitting the PR.

@codecov
Copy link

codecov bot commented Dec 13, 2018

Codecov Report

Merging #24275 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24275      +/-   ##
==========================================
- Coverage   92.22%   92.22%   -0.01%     
==========================================
  Files         162      162              
  Lines       51787    51798      +11     
==========================================
+ Hits        47761    47771      +10     
- Misses       4026     4027       +1
Flag Coverage Δ
#multiple 90.62% <100%> (ø) ⬆️
#single 42.99% <0%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.37% <100%> (+0.06%) ⬆️
pandas/io/json/json.py 92.61% <0%> (-0.48%) ⬇️
pandas/util/testing.py 87.51% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 31a3512...d765dc3. Read the comment docs.

3 similar comments
@codecov
Copy link

codecov bot commented Dec 13, 2018

Codecov Report

Merging #24275 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24275      +/-   ##
==========================================
- Coverage   92.22%   92.22%   -0.01%     
==========================================
  Files         162      162              
  Lines       51787    51798      +11     
==========================================
+ Hits        47761    47771      +10     
- Misses       4026     4027       +1
Flag Coverage Δ
#multiple 90.62% <100%> (ø) ⬆️
#single 42.99% <0%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.37% <100%> (+0.06%) ⬆️
pandas/io/json/json.py 92.61% <0%> (-0.48%) ⬇️
pandas/util/testing.py 87.51% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 31a3512...d765dc3. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 13, 2018

Codecov Report

Merging #24275 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24275      +/-   ##
==========================================
- Coverage   92.22%   92.22%   -0.01%     
==========================================
  Files         162      162              
  Lines       51787    51798      +11     
==========================================
+ Hits        47761    47771      +10     
- Misses       4026     4027       +1
Flag Coverage Δ
#multiple 90.62% <100%> (ø) ⬆️
#single 42.99% <0%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.37% <100%> (+0.06%) ⬆️
pandas/io/json/json.py 92.61% <0%> (-0.48%) ⬇️
pandas/util/testing.py 87.51% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 31a3512...d765dc3. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 13, 2018

Codecov Report

Merging #24275 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24275      +/-   ##
==========================================
- Coverage   92.22%   92.22%   -0.01%     
==========================================
  Files         162      162              
  Lines       51787    51798      +11     
==========================================
+ Hits        47761    47771      +10     
- Misses       4026     4027       +1
Flag Coverage Δ
#multiple 90.62% <100%> (ø) ⬆️
#single 42.99% <0%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.37% <100%> (+0.06%) ⬆️
pandas/io/json/json.py 92.61% <0%> (-0.48%) ⬇️
pandas/util/testing.py 87.51% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 31a3512...d765dc3. Read the comment docs.

@codecov
Copy link

codecov bot commented Dec 13, 2018

Codecov Report

Merging #24275 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24275      +/-   ##
==========================================
+ Coverage   92.29%   92.29%   +<.01%     
==========================================
  Files         162      162              
  Lines       51808    51834      +26     
==========================================
+ Hits        47814    47840      +26     
  Misses       3994     3994
Flag Coverage Δ
#multiple 90.7% <100%> (ø) ⬆️
#single 42.98% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.32% <100%> (+0.01%) ⬆️
pandas/core/groupby/groupby.py 96.65% <0%> (-0.03%) ⬇️
pandas/util/testing.py 87.57% <0%> (-0.01%) ⬇️
pandas/core/generic.py 96.66% <0%> (ø) ⬆️
pandas/core/frame.py 96.91% <0%> (ø) ⬆️
pandas/core/series.py 93.71% <0%> (ø) ⬆️
pandas/core/reshape/merge.py 94.3% <0%> (+0.01%) ⬆️
pandas/core/base.py 97.66% <0%> (+0.01%) ⬆️
pandas/core/window.py 96.41% <0%> (+0.01%) ⬆️
pandas/core/groupby/generic.py 87.15% <0%> (+0.03%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6111f64...82859d9. Read the comment docs.

return self.from_codes(self._codes.copy(),
categories=new_categories,
ordered=self.ordered)
if isinstance(mapper, (dict, ABCSeries)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this a special case? use is_dict_like

except (AttributeError, KeyError, TypeError, ValueError):
new_value = np.nan

try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you try/except here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how / why can this fail

Copy link
Contributor Author

@JustinZhengBC JustinZhengBC Dec 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AttributeError: if mapper calls a method of the element (e.g. lambda x: x.lower())
KeyError: if mapper is a dict without a key for NaN
TypeError: if mapper expects some type other than a float
ValueError: if mapper tries converting float values to ints (e.g. lambda x: int(x))

if you mean the try/except below that, that was already there. from_codes raises a ValueError if the mapping isn't one-to-one

@@ -311,6 +311,37 @@ def test_map_with_categorical_series(self):
exp = pd.Index(["odd", "even", "odd", np.nan])
tm.assert_index_equal(a.map(c), exp)

@pytest.mark.parametrize('data, f', [[[1, 1, np.nan], pd.isna],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write this as

@pytest.mark.parametrize(
     'data',
    'f',
      [
         ......
      ]))
.....```

expected = pd.Index([False, False, True])
tm.assert_index_equal(result, expected)

@pytest.mark.parametrize('data, f', [[[1, 1, np.nan], {1: False}],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

pandas/tests/indexes/test_category.py Show resolved Hide resolved
@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Categorical Categorical Data Type labels Dec 14, 2018
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document the handling of missing values in the docstring?

pandas/core/arrays/categorical.py Outdated Show resolved Hide resolved
new_value = mapper[np.nan]
else:
new_value = mapper(np.nan)
except (AttributeError, KeyError, TypeError, ValueError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really comfortable with this. mapper is a user-defined function. Consider

def f(x):
    if isnan(x):
        raise TypeError
    ...

that TypeError would be swallowed by pandas.

pandas/core/arrays/categorical.py Outdated Show resolved Hide resolved
@JustinZhengBC
Copy link
Contributor Author

JustinZhengBC commented Dec 19, 2018

I decided to go with @TomAugspurger's suggestion to just change the documentation. I did leave in one change in the function, because previously calling np.take would cause NaN values (represented by -1 in self._codes) to take the last element of new_categories.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. just a small comment. ping on green.

pandas/core/arrays/categorical.py Outdated Show resolved Hide resolved
@jreback jreback added this to the 0.24.0 milestone Dec 19, 2018
@JustinZhengBC
Copy link
Contributor Author

@jreback green

@@ -1234,6 +1234,11 @@ def map(self, mapper):
categories=new_categories,
ordered=self.ordered)
except ValueError:
# NA values are represented in self._codes with -1
# np.take causes NA values to take final element in new_categories
if any(self._codes == -1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use np.any? I think any will short-circuit, but np.any will likely be faster.

@TomAugspurger
Copy link
Contributor

Thanks @JustinZhengBC!

@TomAugspurger TomAugspurger merged commit ff69f45 into pandas-dev:master Dec 20, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
* BUG-24241 make Categorical.map transform nans
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
* BUG-24241 make Categorical.map transform nans
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Series.apply on categorical with NaN has wrong behavior
4 participants