Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: GH2003 Series.isin for categorical dtypes #20522

Merged
merged 21 commits into from
Apr 25, 2018
Merged

PERF: GH2003 Series.isin for categorical dtypes #20522

merged 21 commits into from
Apr 25, 2018

Conversation

bourbaki
Copy link
Contributor

@bourbaki bourbaki commented Mar 28, 2018

I have added a branching for the categorical case in Series.isin function.
I have also added a test for the most crucial cases (nans).
closes #20003

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to move your categorical isin logic to a new method Categorical.isin.

Then in algorithms.py you can do

if is_categorical_dtype(values):
    values = getattr(values, '_values', values)  # extract Categorical from Series/Index
    # your code

Can you add an ASV for this?

@@ -3507,7 +3507,11 @@ def isin(self, values):
5 False
Name: animal, dtype: bool
"""
result = algorithms.isin(com._values_from_object(self), values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the _values_from_object can be moved to algorithms.isin? Then this could just be result = algorithms.isin(self, values)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes let's try to do this, @Ma3aXaKa can you make this change

@@ -1255,6 +1255,17 @@ def test_isin_empty(self, empty):
result = s.isin(empty)
tm.assert_series_equal(expected, result)

def test_isin_cats(self):
Copy link
Contributor

@TomAugspurger TomAugspurger Mar 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can go in pandas/tests/categorical/test_algos.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

@TomAugspurger
Copy link
Contributor

I'd prefer to move your categorical isin logic to a new method Categorical.isin.

To be a bit clearer about this. I'd imagine the Categorical.isin method doing the codes and categories stuff, to see which codes the items in values would map to. Then calling algorithms.value_counts(codes, code_values), where code_values is the code that each value maps to (if any).

@@ -345,6 +345,7 @@ Other Enhancements
``SQLAlchemy`` dialects supporting multivalue inserts include: ``mysql``, ``postgresql``, ``sqlite`` and any dialect with ``supports_multivalues_insert``. (:issue:`14315`, :issue:`8953`)
- :func:`read_html` now accepts a ``displayed_only`` keyword argument to controls whether or not hidden elements are parsed (``True`` by default) (:issue:`20027`)
- zip compression is supported via ``compression=zip`` in :func:`DataFrame.to_pickle`, :func:`Series.to_pickle`, :func:`DataFrame.to_csv`, :func:`Series.to_csv`, :func:`DataFrame.to_json`, :func:`Series.to_json`. (:issue:`17778`)
- Performance enhancement for :func:`Series.isin` in the case of categorical dtypes (:issue:`20003`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to the "Performance Improvements" section? (starts around line 783).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@bourbaki
Copy link
Contributor Author

bourbaki commented Mar 28, 2018

@TomAugspurger Why do I need to call algorithms.value_counts(codes, code_values) in Categorical.isin? Do you mean algorithms.isin?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 28, 2018 via email

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this have an associated issue? this would need asv benchmarks

@@ -403,8 +403,15 @@ def isin(comps, values):
if not isinstance(values, (ABCIndex, ABCSeries, np.ndarray)):
values = construct_1d_object_array_from_listlike(list(values))

comps, dtype, _ = _ensure_data(comps)
values, _, _ = _ensure_data(values, dtype=dtype)
if not is_categorical_dtype(comps):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverse this logic here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I am working on asv benchmark

@jreback jreback added Performance Memory or execution speed performance Categorical Categorical Data Type labels Mar 30, 2018
@codecov
Copy link

codecov bot commented Mar 30, 2018

Codecov Report

Merging #20522 into master will increase coverage by <.01%.
The diff coverage is 93.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20522      +/-   ##
==========================================
+ Coverage   91.77%   91.77%   +<.01%     
==========================================
  Files         153      153              
  Lines       49257    49270      +13     
==========================================
+ Hits        45207    45220      +13     
  Misses       4050     4050
Flag Coverage Δ
#multiple 90.17% <93.33%> (ø) ⬆️
#single 41.89% <93.33%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/core/algorithms.py 94.39% <100%> (+0.02%) ⬆️
pandas/core/indexes/base.py 96.63% <100%> (ø) ⬆️
pandas/core/series.py 93.99% <100%> (+0.09%) ⬆️
pandas/core/arrays/categorical.py 95.7% <90%> (-0.08%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7ec74e5...7b680cd. Read the comment docs.

@bourbaki
Copy link
Contributor Author

bourbaki commented Mar 30, 2018

@TomAugspurger I have moved the logic to Categorical.isin. But I still need to do more careful handling of values argument of that function. What are the possible argument types that can be used in Series.isin for example? Is using _sanitize_array enough to cover all basic cases?

And I cannot understand why checks on CircleCI have failed. The same tests on my laptop are passing.
And I think my changes do not affect JSON parsing 😄 .

@TomAugspurger
Copy link
Contributor

Merged in master. That'll hopefully fix the circleCI failure.

@TomAugspurger
Copy link
Contributor

@Ma3aXaKa did you run the ASV benchmarks? Or just a simple %timeit before / after? How does the performance look?

@TomAugspurger
Copy link
Contributor

What are the possible argument types that can be used in Series.isin for example?

It looks like it can be just about anything. Did you run into issues without using _sanitize_array?

@bourbaki
Copy link
Contributor Author

bourbaki commented Apr 2, 2018

Yep. I have run the asv benchmark. I got the positive result. Do I need to post them here?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 2, 2018 via email

@TomAugspurger
Copy link
Contributor

Can you also add a release note (doc/source/whatsnew/v0.23.0.txt) noting the performance improvement? Make sure to pull my commit first.

@bourbaki
Copy link
Contributor Author

bourbaki commented Apr 2, 2018

I've already added the line about the performance improvement to whatsnew. Do I need to add another one?

@TomAugspurger
Copy link
Contributor

Sorry, missed that.

@bourbaki
Copy link
Contributor Author

bourbaki commented Apr 2, 2018

That's what I got on my laptop from asv:

       before           after         ratio
     [fac2ef1b]       [d6c39534]
-         230±6ms      17.6±0.01ms     0.08  categoricals.IsIn.time_isin_categorical_strings

@@ -3564,7 +3564,10 @@ def isin(self, values):
5 False
Name: animal, dtype: bool
"""
result = algorithms.isin(com._values_from_object(self), values)
if is_categorical_dtype(self):
result = self._values.isin(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a more ducklike check would work here, something like

if hasattr(self._values, 'isin'):
    result = self._values.isin(values)
else:
 ......

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback What's the purpose? Is it better for performance?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its more generic. you don't have to know its a categorical here.

@bourbaki
Copy link
Contributor Author

bourbaki commented Apr 3, 2018

@jreback Changed the condition to hasattr

@bourbaki
Copy link
Contributor Author

bourbaki commented Apr 5, 2018

I would prefer to convert values to ndarray before this logic. Is there any function I could use to safely do it?

@jreback
Copy link
Contributor

jreback commented Apr 5, 2018

use np.asarray as
tom suggests

@pep8speaks
Copy link

pep8speaks commented Apr 7, 2018

Hello @Ma3aXaKa! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on April 25, 2018 at 10:02 Hours UTC

@bourbaki
Copy link
Contributor Author

bourbaki commented Apr 7, 2018

@TomAugspurger @jreback I have added the docs and fixed the problem with Python 2 (thanks to @TomAugspurger). Is there something else to be done?


def test_isin_cats():
cat = pd.Categorical(["a", "b", np.nan])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u add the issue number here as a comment

@@ -148,3 +148,18 @@ def time_rank_int_cat(self):

def time_rank_int_cat_ordered(self):
self.s_int_cat_ordered.rank()


class IsIn(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u make Isin

@@ -407,6 +407,12 @@ def isin(comps, values):
if not isinstance(values, (ABCIndex, ABCSeries, np.ndarray)):
values = construct_1d_object_array_from_listlike(list(values))

if is_categorical_dtype(comps):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if u change this to is_extension_type does everything still work?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, can you add a note: TODO(extension) here

@bourbaki
Copy link
Contributor Author

bourbaki commented Apr 9, 2018

@jreback I have renamed the benchmark class and a reference to the issue. Switching to is_extension_type doesn't work. It fails on SparseArray for example the class doesn't have isin method. I am assuming implementing these methods is the goal of #20617?

self.sample = np.random.choice(arr, sample_size)
self.ts = pd.Series(arr).astype('category')

def time_isin_categorical_strings(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are 4 cases in the original issue can you cover them

@@ -407,6 +407,12 @@ def isin(comps, values):
if not isinstance(values, (ABCIndex, ABCSeries, np.ndarray)):
values = construct_1d_object_array_from_listlike(list(values))

if is_categorical_dtype(comps):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, can you add a note: TODO(extension) here

raise TypeError("only list-like objects are allowed to be passed"
" to isin(), you passed a [{values_type}]"
.format(values_type=type(values).__name__))
from pandas.core.series import _sanitize_array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move the import to top of function

tm.assert_numpy_array_equal(expected, result)

result = cat.isin(["a", "c"])
expected = np.array([True, False, False], dtype=bool)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are a couple of tests in pandas/tests/test_algos.py that test cats for isin, can you move here

@bourbaki
Copy link
Contributor Author

bourbaki commented Apr 15, 2018

@jreback I've made requested changes except moving tests to a different file (see comments). But the Travis CI build is failing with an S3 error.

@bourbaki
Copy link
Contributor Author

@jreback There are still errors with S3. Is this OK?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 18, 2018 via email

@bourbaki
Copy link
Contributor Author

bourbaki commented Apr 24, 2018

@TomAugspurger Hi. Did you fix the issue with S3? Is there something left to do before merging the. ranch?

@jreback
Copy link
Contributor

jreback commented Apr 24, 2018

@Ma3aXaKa can you rebase

@jreback jreback added this to the 0.23.0 milestone Apr 24, 2018
@jreback
Copy link
Contributor

jreback commented Apr 25, 2018

rebased, so let's merge on green.

@jreback jreback merged commit 60fe82c into pandas-dev:master Apr 25, 2018
@jreback
Copy link
Contributor

jreback commented Apr 25, 2018

thanks @Ma3aXaKa what looked like an easy issue morphed into code refactoring! thanks for the patch and patience! keep em coming!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: isin() is slower for categorical data than for integers
5 participants