-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: GH2003 Series.isin for categorical dtypes #20522
Changes from 6 commits
19ac11a
54021b9
2514b45
80f687a
d6c3953
33e3b07
ceffccd
3247dce
2b7b1c4
4478a49
64fef49
b25da12
9f8e790
60ac658
50aca26
713712e
18c827d
993afd8
a2b70ee
fa7f0f1
7b680cd
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -148,3 +148,18 @@ def time_rank_int_cat(self): | |
|
||
def time_rank_int_cat_ordered(self): | ||
self.s_int_cat_ordered.rank() | ||
|
||
|
||
class IsIn(object): | ||
|
||
goal_time = 0.2 | ||
|
||
def setup(self): | ||
n = 5 * 10**5 | ||
sample_size = 100 | ||
arr = ['s%04d' % i for i in np.random.randint(0, n // 10, size=n)] | ||
self.sample = np.random.choice(arr, sample_size) | ||
self.ts = pd.Series(arr).astype('category') | ||
|
||
def time_isin_categorical_strings(self): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. there are 4 cases in the original issue can you cover them |
||
self.ts.isin(self.sample) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -39,6 +39,8 @@ | |
from pandas.util._decorators import ( | ||
Appender, cache_readonly, deprecate_kwarg, Substitution) | ||
|
||
import pandas.core.algorithms as algorithms | ||
|
||
from pandas.io.formats.terminal import get_terminal_size | ||
from pandas.util._validators import validate_bool_kwarg, validate_fillna_kwargs | ||
from pandas.core.config import get_option | ||
|
@@ -2216,6 +2218,15 @@ def _concat_same_type(self, to_concat): | |
def _formatting_values(self): | ||
return self | ||
|
||
def isin(self, values): | ||
from pandas.core.series import _sanitize_array | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you add a doc-string here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you move the import to top of function |
||
values = _sanitize_array(values, None, None) | ||
null_mask = isna(values) | ||
code_values = self.categories.get_indexer(values) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @TomAugspurger is there an EA version of this? (meaning API) (maybe should raise NotImplementedError which could be caught with a default implementation) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There isn't yet, but I was thinking about add it. I actually define it on IPArray, since an IP address can be "in" a network. I'll open an issue (shouldn't delay the changes here). |
||
code_values = code_values[null_mask | (code_values >= 0)] | ||
return algorithms.isin(self.codes, code_values) | ||
|
||
|
||
# The Series.cat accessor | ||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3564,7 +3564,10 @@ def isin(self, values): | |
5 False | ||
Name: animal, dtype: bool | ||
""" | ||
result = algorithms.isin(com._values_from_object(self), values) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wonder if the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes let's try to do this, @Ma3aXaKa can you make this change |
||
if is_categorical_dtype(self): | ||
result = self._values.isin(values) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think a more ducklike check would work here, something like
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jreback What's the purpose? Is it better for performance? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. its more generic. you don't have to know its a categorical here. |
||
else: | ||
result = algorithms.isin(com._values_from_object(self), values) | ||
return self._constructor(result, index=self.index).__finalize__(self) | ||
|
||
def between(self, left, right, inclusive=True): | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -47,3 +47,24 @@ def test_factorized_sort_ordered(): | |
|
||
tm.assert_numpy_array_equal(labels, expected_labels) | ||
tm.assert_categorical_equal(uniques, expected_uniques) | ||
|
||
|
||
def test_isin_cats(): | ||
cat = pd.Categorical(["a", "b", np.nan]) | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can u add the issue number here as a comment |
||
result = cat.isin(["a", np.nan]) | ||
expected = np.array([True, False, True], dtype=bool) | ||
tm.assert_numpy_array_equal(expected, result) | ||
|
||
result = cat.isin(["a", "c"]) | ||
expected = np.array([True, False, False], dtype=bool) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. there are a couple of tests in pandas/tests/test_algos.py that test cats for isin, can you move here |
||
tm.assert_numpy_array_equal(expected, result) | ||
|
||
|
||
@pytest.mark.parametrize("empty", [[], pd.Series(), np.array([])]) | ||
def test_isin_empty(empty): | ||
s = pd.Categorical(["a", "b"]) | ||
expected = np.array([False, False], dtype=bool) | ||
|
||
result = s.isin(empty) | ||
tm.assert_numpy_array_equal(expected, result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u make Isin