
GroupBy transform() is surprisingly slow #2121

Closed · bluefir opened this issue Oct 25, 2012 · 6 comments · Fixed by #3145
Labels: Performance (memory or execution speed performance)

bluefir commented Oct 25, 2012

I came across some strange slowness in the GroupBy transform() function. I put together a simple function to avoid using apply(), because apply() can be REALLY slow:

import pandas as pd
from pandas.core.groupby import DataFrameGroupBy


def apply_by_group(grouped, f):
    """
    Applies a function to each DataFrame in a DataFrameGroupBy object,
    concatenates the results and returns the resulting DataFrame.

    Parameters
    ----------
    grouped: DataFrameGroupBy
        The grouped DataFrame whose groups the function is applied to.
    f: callable
        Function to apply to each DataFrame.

    Returns
    -------
    DataFrame that results from applying the function to each DataFrame in
    the DataFrameGroupBy object and concatenating the results.
    """
    assert isinstance(grouped, DataFrameGroupBy)
    assert callable(f)

    data_frames = []
    for key, data_frame in grouped:
        data_frames.append(f(data_frame))
    return pd.concat(data_frames)

Now I observe the following timings for two equivalent ways of doing the same thing:

%timeit data.groupby(level=field_security_id).transform(lambda x: x.fillna())

1 loops, best of 3: 24.3 s per loop

%timeit apply_by_group(data.groupby(level=field_security_id), lambda x: x.fillna())

1 loops, best of 3: 2.72 s per loop

That was unexpected. Am I doing something wrong in using transform()?
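
For reference, the two code paths compute the same values. Here is a minimal, self-contained sketch (toy data and illustrative names, not the frame from this report) checking that they agree once the concatenated result is put back in index order:

import numpy as np
import pandas as pd

# Toy frame: a group column 'g' and a value column 'x' with some NaNs.
df = pd.DataFrame({'g': list('abab' * 3), 'x': np.random.randn(12)})
df.loc[df.index % 3 == 0, 'x'] = np.nan

grouped = df.groupby('g')['x']
f = lambda s: s.fillna(method='pad')  # newer pandas spells this s.ffill()

via_transform = grouped.transform(f)              # keeps the original order
via_loop = pd.concat([f(s) for _, s in grouped])  # group-major order

# Identical values once both are sorted back to the (unique) original index.
assert via_transform.sort_index().equals(via_loop.sort_index())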

bluefir commented Oct 25, 2012

OK, to be fair: the DataFrame comes out unsorted from my function, while transform() preserves the sort order of the original index:

data.index.lexsort_depth

2

data2 = data.groupby(level=field_security_id).transform(lambda x: x.fillna())
data2.index.lexsort_depth

2

data3 = apply_by_group(data.groupby(level=field_security_id), lambda x: x.fillna())
data3.index.lexsort_depth

0

Still, even after accounting for the extra index sort, transform() is much slower:

%timeit data3.sort_index()

1 loops, best of 3: 2.17 s per loop
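
An untested aside (not something tried in this thread): since the original index is known and unique here, the group-major result can also be put back in the original row order by reindexing rather than lexsorting, which may or may not be faster. A sketch with a toy frame standing in for `data`:

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(20121025, 'A'), (20121025, 'B'), (20121026, 'A'), (20121026, 'B')],
    names=['date', 'security_id'])
data = pd.DataFrame({'x': [1.0, np.nan, np.nan, 4.0]}, index=idx)

# Manual split-apply-concat, as in apply_by_group().
parts = [g.fillna(method='pad')
         for _, g in data.groupby(level='security_id')]
data3 = pd.concat(parts)

restored = data3.reindex(data.index)  # original order without a sort
assert restored.index.equals(data.index)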

wesm commented Jan 19, 2013

Can you produce a test data set so I can look into this?

bluefir commented Jan 20, 2013

Sure. Here it is, quick and dirty.

import numpy as np
import pandas as pd
from pandas import Index, MultiIndex, DataFrame
from pandas.core.groupby import SeriesGroupBy, DataFrameGroupBy


def apply_by_group(grouped, f):
    """
    Applies a function to each Series or DataFrame in a GroupBy object,
    concatenates the results and returns the resulting Series or DataFrame.

    Parameters
    ----------
    grouped: SeriesGroupBy or DataFrameGroupBy
    f: callable
        Function to apply to each Series or DataFrame in the grouped object.

    Returns
    -------
    Series or DataFrame that results from applying the function to each
    Series or DataFrame in the GroupBy object and concatenating the results.
    """
    assert isinstance(grouped, (SeriesGroupBy, DataFrameGroupBy))
    assert callable(f)

    groups = []
    for key, group in grouped:
        groups.append(f(group))
    return pd.concat(groups)

# Quick-and-dirty test data (Python 2 idioms throughout: xrange,
# list-returning map/range, and the since-removed DataFrame.set_value).
n_dates = 1000
n_securities = 2000
n_columns = 3
share_na = 0.1

# Business-day dates encoded as yyyymmdd integers.
dates = pd.date_range('1997-12-31', periods=n_dates, freq='B')
dates = Index(map(lambda x: x.year * 10000 + x.month * 100 + x.day, dates))

# Evenly spaced 8-digit hex security ids.
secid_min = int('10000000', 16)
secid_max = int('F0000000', 16)
step = (secid_max - secid_min) // (n_securities - 1)
security_ids = map(lambda x: hex(x)[2:10].upper(),
                   range(secid_min, secid_max + 1, step))

# Full (date, security_id) MultiIndex: every security on every date.
data_index = MultiIndex(levels=[dates.values, security_ids],
                        labels=[[i for i in xrange(n_dates) for _ in xrange(n_securities)],
                                range(n_securities) * n_dates],
                        names=['date', 'security_id'])
n_data = len(data_index)

columns = Index(['factor{}'.format(i) for i in xrange(1, n_columns + 1)])

data = DataFrame(np.random.randn(n_data, n_columns), index=data_index, columns=columns)

# Sprinkle NaNs into each column at regular intervals.
step = int(n_data * share_na)
for column_index in xrange(n_columns):
    index = column_index
    while index < n_data:
        data.set_value(data_index[index], columns[column_index], np.nan)
        index += step

grouped = data.groupby(level='security_id')
f_fillna = lambda x: x.fillna(method='pad')

data2 = grouped.transform(f_fillna)

data3 = apply_by_group(grouped, f_fillna)
data3.sort_index(inplace=True)

%timeit grouped.transform(f_fillna)

1 loops, best of 3: 8.7 s per loop

%timeit apply_by_group(grouped, f_fillna)

1 loops, best of 3: 1.97 s per loop

%timeit data3.sort_index(inplace=True)

1 loops, best of 3: 1.26 s per loop
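
For completeness: later pandas releases added a grouped forward-fill directly on the GroupBy object, which handles this particular fillna case without either transform() or a manual loop. A minimal sketch, assuming a pandas version that provides GroupBy.ffill() (it did not exist when this issue was filed):

import numpy as np
import pandas as pd

# Toy stand-in for the benchmark frame built above.
idx = pd.MultiIndex.from_tuples(
    [(20121025, 'A'), (20121025, 'B'), (20121026, 'A'), (20121026, 'B')],
    names=['date', 'security_id'])
data = pd.DataFrame({'factor1': [1.0, 2.0, np.nan, np.nan]}, index=idx)

# Forward-fill within each security_id group in a single call.
filled = data.groupby(level='security_id').ffill()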

jreback commented Mar 23, 2013

@wesm I created PR #3145 to solve this. Everything passes, but please validate for correctness.

jreback closed this as completed Mar 25, 2013

jreback commented Mar 25, 2013

Closed by #3145.

wesm commented Mar 25, 2013

Thanks, Jeff.
