PERF: isin() is slower for categorical data than for integers #20003

vfilimonov · 2018-03-05T21:53:11Z

Problem description

For long Series with many categories, `Series.isin()` is slower on categorical data than on int64 data. If the categories are built from strings, the performance degradation is even larger.

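import numpy as np
import pandas as pd

N = 3000000  # number of rows
Ncats = 100  # number of distinct categories

# String labels 'abcdef0' ... 'abcdef99', indexed by their integer code
cats = pd.Series(['abcdef%d' % i for i in range(Ncats)])

df = pd.DataFrame({'A': np.random.randn(N),
                   'B': np.random.randn(N),
                   'C': np.random.randint(0, Ncats, N),  # integer codes
                  })
df['D'] = cats.loc[df['C'].values].values  # string label for each code
df['E'] = df['C'].astype('category')       # category based on int64
df['F'] = df['D'].astype('category')       # category based on string

# Values to test membership against, as codes and as string labels
sel_codes = [1, 2]
sel_cats = cats.loc[sel_codes].values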

%timeit inds = df.C.isin(sel_codes)  # int64
%timeit inds = df.E.isin(sel_codes)  # category based on int64
%timeit inds = df.D.isin(sel_cats)  # object / string
%timeit inds = df.F.isin(sel_cats)  # category based on string

On my machine:

6.25 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
28.7 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
104 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
142 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

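Interestingly, when there are many values to compare against, the penalty for categorical data largely disappears (the int64-based categorical is even marginally faster than plain int64), e.g. for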

sel_codes = range(90)
sel_cats = cats.loc[sel_codes].values

%timeit inds = df.C.isin(sel_codes)  # int64
%timeit inds = df.E.isin(sel_codes)  # category based on int64
%timeit inds = df.D.isin(sel_cats)  # object / string
%timeit inds = df.F.isin(sel_cats)  # category based on string

the timings are:

441 ms ± 61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
422 ms ± 68.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
147 ms ± 7.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
171 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

P.S. I'm not sure whether performance issues like this are worth filing.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: 0.7.1
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: 1.3.0
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0b10
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

TomAugspurger · 2018-03-05T23:00:17Z

Thanks for the report, these are absolutely worth filing.

In this case we'll want to get the index positions of the values in the categorical's categories (`get_indexer` should do the trick) and pass the codes to `algos.isin`. We'll just have to be careful with missing values: a missing value in the codes and a value not found by `get_indexer` are both -1 by default.
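The approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the eventual fix: `categorical_isin` is a hypothetical helper name, and `np.isin` stands in for the internal `algos.isin`.

```python
import numpy as np
import pandas as pd


def categorical_isin(cat, values):
    """Hypothetical helper: test membership on the integer codes
    of a Categorical instead of on the category values themselves."""
    # Position of each requested value within cat.categories;
    # get_indexer returns -1 for values that are not a category.
    positions = cat.categories.get_indexer(values)
    # Drop the -1 "not found" markers so they cannot collide with
    # the -1 code that marks missing entries in cat.codes.
    positions = positions[positions != -1]
    mask = np.isin(cat.codes, positions)
    # Missing entries match only if a missing value was asked for.
    if pd.isna(np.asarray(values, dtype=object)).any():
        mask |= cat.codes == -1
    return mask


ser = pd.Series(['a', 'b', None, 'c', 'a'], dtype='category')
mask = categorical_isin(ser.values, ['a', 'z'])
mask_with_nan = categorical_isin(ser.values, ['b', np.nan])
```

The lookup of the requested values runs against the (short) list of categories, so the per-row work is an integer comparison on the codes regardless of whether the categories are integers or strings.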

bourbaki · 2018-03-26T14:43:05Z

I'm looking at this issue.

TomAugspurger added the Performance, Algos, Difficulty Intermediate, and Categorical labels Mar 5, 2018

TomAugspurger modified the milestones: 0.23.0, Next Major Release Mar 5, 2018

bourbaki mentioned this issue Apr 9, 2018

PERF: GH2003 Series.isin for categorical dtypes #20522

Merged

jreback modified the milestones: Next Major Release, 0.23.0 Apr 9, 2018

jreback modified the milestones: Next Major Release, 0.23.0 Apr 24, 2018

jreback closed this as completed in #20522 Apr 25, 2018
