PERF: isin() is slower for categorical data than for integers #20003

Closed
vfilimonov opened this issue Mar 5, 2018 · 2 comments · Fixed by #20522
Labels
Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Categorical (Categorical Data Type), Performance (memory or execution speed performance)

@vfilimonov
Contributor

Problem description

For a long Series with many categories, 'Series.isin()' is slower for categorical data than for int64. If the categories are built from strings, the performance degradation is even larger.

import pandas as pd
import numpy as np

N = 3000000
Ncats = 100

cats = pd.Series(['abcdef%d'%_ for _ in range(Ncats)])

df = pd.DataFrame({'A': np.random.randn(N),
                   'B': np.random.randn(N),
                   'C': np.random.randint(0, Ncats, N),
                  })
df['D'] = cats.loc[df['C'].values].values
df['E'] = df['C'].astype('category')
df['F'] = df['D'].astype('category')

sel_codes = [1,2]
sel_cats = cats.loc[sel_codes].values

%timeit inds = df.C.isin(sel_codes)  # int64
%timeit inds = df.E.isin(sel_codes)  # category based on int64
%timeit inds = df.D.isin(sel_cats)  # object / string
%timeit inds = df.F.isin(sel_cats)  # category based on string

On my machine:

6.25 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
28.7 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
104 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
142 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

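Interestingly, when there are many values to compare against, the int64-based categorical is no longer slower than plain int64 (and both string-based columns are faster than the integer ones), e.g. for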

sel_codes = range(90)
sel_cats = cats.loc[sel_codes].values

%timeit inds = df.C.isin(sel_codes)  # int64
%timeit inds = df.E.isin(sel_codes)  # category based on int64
%timeit inds = df.D.isin(sel_cats)  # object / string
%timeit inds = df.F.isin(sel_cats)  # category based on string

the timings are:

441 ms ± 61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
422 ms ± 68.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
147 ms ± 7.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
171 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

P.S. I'm not sure whether such performance issues are worth filing.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: 0.7.1
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: 1.3.0
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0b10
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

Thanks for the report, these are absolutely worth filing.

In this case we'll want to get the index positions of the values in the categorical's categories (get_indexer should do the trick), then pass the codes to algos.isin. We'll just have to be careful with missing values: a missing entry in the codes and a value that get_indexer doesn't find are both -1 by default.
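The idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not necessarily how the eventual fix is implemented: categorical_isin is a hypothetical helper name, and the public np.isin stands in for the internal algos.isin.

```python
import numpy as np
import pandas as pd


def categorical_isin(cat_series, values):
    """Hypothetical helper: run the membership test on the integer codes."""
    categories = cat_series.cat.categories
    codes = cat_series.cat.codes.to_numpy()

    # Position of each requested value within the categories;
    # get_indexer returns -1 for values that are not categories.
    positions = categories.get_indexer(pd.Index(values))

    # Missing entries in the Series also carry code -1, so drop the
    # "not found" marker to avoid matching them by accident.
    positions = positions[positions != -1]

    # Integer comparison on the codes instead of comparing the raw values.
    mask = np.isin(codes, positions)

    # isin() matches NaN against NaN, so handle that case explicitly.
    if pd.isna(np.asarray(values, dtype=object)).any():
        mask |= codes == -1

    return pd.Series(mask, index=cat_series.index)


s = pd.Series(["a", "b", None, "c", "a"], dtype="category")
print(categorical_isin(s, ["a", "z"]).tolist())
print(categorical_isin(s, ["b", np.nan]).tolist())
```

The lookup against the categories touches only the (usually few) distinct categories, so the per-row work is reduced to an integer membership test on the codes.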

@TomAugspurger added the Performance, Algos, Difficulty Intermediate, and Categorical labels on Mar 5, 2018
@TomAugspurger modified the milestones: 0.23.0, Next Major Release on Mar 5, 2018
@bourbaki
Contributor

I'm looking at this issue.

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Apr 9, 2018
@jreback jreback modified the milestones: Next Major Release, 0.23.0 Apr 24, 2018