
GroupBy transform() is surprisingly slow #2121

Closed · bluefir opened this issue Oct 25, 2012 · 6 comments · Fixed by #3145
Labels: Performance (memory or execution speed performance)

bluefir commented Oct 25, 2012

I came across some strange slowness in the GroupBy transform() function. I put together a simple function to avoid using apply(), because apply() can be REALLY slow:

import pandas as pd
from pandas.core.groupby import DataFrameGroupBy


def apply_by_group(grouped, f):
    """
    Applies a function to each DataFrame in a DataFrameGroupBy object,
    concatenates the results and returns the resulting DataFrame.

    Parameters
    ----------
    grouped: DataFrameGroupBy
        The grouped DataFrame whose groups the function is applied to.
    f: callable
        Function to apply to each DataFrame.

    Returns
    -------
    DataFrame that results from applying the function to each DataFrame in
    the DataFrameGroupBy object and concatenating the results.
    """
    assert isinstance(grouped, DataFrameGroupBy)
    assert callable(f)

    data_frames = []
    for key, data_frame in grouped:
        data_frames.append(f(data_frame))
    return pd.concat(data_frames)

Now I observe the following timings for two equivalent ways of doing the same thing:

%timeit data.groupby(level=field_security_id).transform(lambda x: x.fillna())

1 loops, best of 3: 24.3 s per loop

%timeit apply_by_group(data.groupby(level=field_security_id), lambda x: x.fillna())

1 loops, best of 3: 2.72 s per loop

That was unexpected. Am I doing something wrong in using transform()?
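
For reference, the two code paths compute the same values. Here is a minimal, self-contained sketch (toy data and illustrative names, not the frame from this report) checking that they agree once the concatenated result is put back in index order:

import numpy as np
import pandas as pd

# Toy frame: a group column 'g' and a value column 'x' with some NaNs.
df = pd.DataFrame({'g': list('abab' * 3), 'x': np.random.randn(12)})
df.loc[df.index % 3 == 0, 'x'] = np.nan

grouped = df.groupby('g')['x']
f = lambda s: s.fillna(method='pad')  # newer pandas spells this s.ffill()

via_transform = grouped.transform(f)              # keeps the original order
via_loop = pd.concat([f(s) for _, s in grouped])  # group-major order

# Identical values once both are sorted back to the (unique) original index.
assert via_transform.sort_index().equals(via_loop.sort_index())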

bluefir commented Oct 25, 2012

OK, to be fair: the DataFrame comes out unsorted from my function, while transform() preserves the sort order of the original index:

data.index.lexsort_depth

2

data2 = data.groupby(level=field_security_id).transform(lambda x: x.fillna())
data2.index.lexsort_depth

2

data3 = apply_by_group(data.groupby(level=field_security_id), lambda x: x.fillna())
data3.index.lexsort_depth

0

Still, even after accounting for the extra index sort, transform() is much slower:

%timeit data3.sort_index()

1 loops, best of 3: 2.17 s per loop
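
An untested aside (not something tried in this thread): since the original index is known and unique here, the group-major result can also be put back in the original row order by reindexing rather than lexsorting, which may or may not be faster. A sketch with a toy frame standing in for `data`:

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(20121025, 'A'), (20121025, 'B'), (20121026, 'A'), (20121026, 'B')],
    names=['date', 'security_id'])
data = pd.DataFrame({'x': [1.0, np.nan, np.nan, 4.0]}, index=idx)

# Manual split-apply-concat, as in apply_by_group().
parts = [g.fillna(method='pad')
         for _, g in data.groupby(level='security_id')]
data3 = pd.concat(parts)

restored = data3.reindex(data.index)  # original order without a sort
assert restored.index.equals(data.index)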

wesm commented Jan 19, 2013

Can you produce a test data set so I can look into this?

bluefir commented Jan 20, 2013

Sure. Here it is, quick and dirty.

import numpy as np
import pandas as pd
from pandas import Index, MultiIndex, DataFrame
from pandas.core.groupby import SeriesGroupBy, DataFrameGroupBy


def apply_by_group(grouped, f):
    """
    Applies a function to each Series or DataFrame in a GroupBy object,
    concatenates the results and returns the resulting Series or DataFrame.

    Parameters
    ----------
    grouped: SeriesGroupBy or DataFrameGroupBy
    f: callable
        Function to apply to each Series or DataFrame in the grouped object.

    Returns
    -------
    Series or DataFrame that results from applying the function to each
    Series or DataFrame in the GroupBy object and concatenating the results.
    """
    assert isinstance(grouped, (SeriesGroupBy, DataFrameGroupBy))
    assert callable(f)

    groups = []
    for key, group in grouped:
        groups.append(f(group))
    return pd.concat(groups)

# Quick-and-dirty test data (Python 2 idioms throughout: xrange,
# list-returning map/range, and the since-removed DataFrame.set_value).
n_dates = 1000
n_securities = 2000
n_columns = 3
share_na = 0.1

# Business-day dates encoded as yyyymmdd integers.
dates = pd.date_range('1997-12-31', periods=n_dates, freq='B')
dates = Index(map(lambda x: x.year * 10000 + x.month * 100 + x.day, dates))

# Evenly spaced 8-digit hex security ids.
secid_min = int('10000000', 16)
secid_max = int('F0000000', 16)
step = (secid_max - secid_min) // (n_securities - 1)
security_ids = map(lambda x: hex(x)[2:10].upper(),
                   range(secid_min, secid_max + 1, step))

# Full (date, security_id) MultiIndex: every security on every date.
data_index = MultiIndex(levels=[dates.values, security_ids],
                        labels=[[i for i in xrange(n_dates) for _ in xrange(n_securities)],
                                range(n_securities) * n_dates],
                        names=['date', 'security_id'])
n_data = len(data_index)

columns = Index(['factor{}'.format(i) for i in xrange(1, n_columns + 1)])

data = DataFrame(np.random.randn(n_data, n_columns), index=data_index, columns=columns)

# Sprinkle NaNs into each column at regular intervals.
step = int(n_data * share_na)
for column_index in xrange(n_columns):
    index = column_index
    while index < n_data:
        data.set_value(data_index[index], columns[column_index], np.nan)
        index += step

grouped = data.groupby(level='security_id')
f_fillna = lambda x: x.fillna(method='pad')

data2 = grouped.transform(f_fillna)

data3 = apply_by_group(grouped, f_fillna)
data3.sort_index(inplace=True)

%timeit grouped.transform(f_fillna)

1 loops, best of 3: 8.7 s per loop

%timeit apply_by_group(grouped, f_fillna)

1 loops, best of 3: 1.97 s per loop

%timeit data3.sort_index(inplace=True)

1 loops, best of 3: 1.26 s per loop
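
For completeness: later pandas releases added a grouped forward-fill directly on the GroupBy object, which handles this particular fillna case without either transform() or a manual loop. A minimal sketch, assuming a pandas version that provides GroupBy.ffill() (it did not exist when this issue was filed):

import numpy as np
import pandas as pd

# Toy stand-in for the benchmark frame built above.
idx = pd.MultiIndex.from_tuples(
    [(20121025, 'A'), (20121025, 'B'), (20121026, 'A'), (20121026, 'B')],
    names=['date', 'security_id'])
data = pd.DataFrame({'factor1': [1.0, 2.0, np.nan, np.nan]}, index=idx)

# Forward-fill within each security_id group in a single call.
filled = data.groupby(level='security_id').ffill()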

jreback commented Mar 23, 2013

@wesm I created PR #3145 to solve this. Everything passes, but please validate for correctness.

jreback closed this as completed Mar 25, 2013

jreback commented Mar 25, 2013

Closed by #3145.

wesm commented Mar 25, 2013

Thanks, Jeff.
