GroupBy on DatetimeIndex with float32 values VERY slow #2772

Closed
scottkidder opened this issue Jan 30, 2013 · 8 comments
Labels
Testing pandas testing functions or related to the test suite

@scottkidder

I have a DataFrame with a DatetimeIndex and two float32 columns.

In [35]: %timeit a.groupby(level=0).last()
1 loops, best of 3: 2.11 s per loop

In [36]: a = a.astype(np.float64)

In [37]: %timeit a.groupby(level=0).last()
1000 loops, best of 3: 911 us per loop

Either way, the result of the groupby is all float64s. I would like to preserve float32 dtypes if possible.

Other operations (resample, shift) are also very slow on float32 data; I'm pretty sure this is related.
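Until float32 is handled natively, one possible stopgap is to upcast, group, and cast back. This is only a sketch: `groupby_last_keep_float32` is a hypothetical helper name, not part of pandas, and it assumes unique column names and that the extra copies from the two casts are acceptable.

```python
import numpy as np
import pandas as pd


def groupby_last_keep_float32(frame):
    """Hypothetical helper: run groupby(level=0).last() on float64 data,
    then cast the columns that started as float32 back to float32."""
    # Remember which columns were float32 before upcasting.
    float32_cols = [c for c in frame.columns if frame[c].dtype == np.float32]
    # Upcast only those columns so the float64 code path is used.
    upcast = frame.astype({c: np.float64 for c in float32_cols})
    result = upcast.groupby(level=0).last()
    # last() only selects existing values, so casting back loses nothing.
    return result.astype({c: np.float32 for c in float32_cols})


idx = pd.DatetimeIndex(["2013-01-30", "2013-01-30", "2013-01-31"])
a = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}, index=idx)
a = a.astype(np.float32)
print(groupby_last_keep_float32(a).dtypes)
```

Because `last()` only picks values that were already representable as float32, the round trip through float64 should not change any numbers.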

@jreback
Contributor

jreback commented Jan 30, 2013

dtype support is not fully there yet; see #2708. Support for the algos (e.g. pad, backfill, take) is only there for float64, int64, int32, object, bool, and datetime64[ns]; they have to be defined for other types (which the PR will do, for the most part).

@jreback
Contributor

jreback commented Jan 30, 2013

Can you post code to generate your frame, a? I'll build in a test.

@scottkidder
Author

Okay, great. I will look into #2708.

Code to generate a:

In [29]: N  = 1000

In [30]: some_nums = lambda n: np.random.uniform(0, 2, size=n)

In [31]: a = pd.DataFrame({'a': some_nums(N), 'b': some_nums(N)}).astype(np.float32)

In [32]: %timeit a.groupby(level=0).last()
1 loops, best of 3: 201 ms per loop

In [33]: a = a.astype(np.float64)

In [34]: %timeit a.groupby(level=0).last()
1000 loops, best of 3: 244 us per loop
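For anyone who wants to rerun this outside IPython, the session above can be approximated as a standalone script. This is a rough sketch: `time_last` is a helper name made up here, the standard-library `timeit` module stands in for the `%timeit` magic, and the numbers will depend on the pandas version and machine. As in the session, the frame uses the default integer index.

```python
import timeit

import numpy as np
import pandas as pd

N = 1000
some_nums = lambda n: np.random.uniform(0, 2, size=n)

# Same frame as in the session, once as float32 and once as float64.
a32 = pd.DataFrame({"a": some_nums(N), "b": some_nums(N)}).astype(np.float32)
a64 = a32.astype(np.float64)


def time_last(frame, number=20):
    """Average seconds per call of frame.groupby(level=0).last()."""
    total = timeit.timeit(lambda: frame.groupby(level=0).last(), number=number)
    return total / number


t32 = time_last(a32)
t64 = time_last(a64)
print(f"float32: {t32 * 1e6:.0f} us per call")
print(f"float64: {t64 * 1e6:.0f} us per call")
```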

@jreback
Contributor

jreback commented Jan 30, 2013

In the dtypes branch (after groupby was cythonized for float32):

In [11]: a.astype('float64').groupby(level=0).last()
Out[11]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns:
a    1000  non-null values
b    1000  non-null values
dtypes: float64(2)

In [12]: %timeit a.astype('float64').groupby(level=0).last()
1000 loops, best of 3: 381 us per loop

In [13]: a.astype('float32').groupby(level=0).last()
Out[13]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns:
a    1000  non-null values
b    1000  non-null values
dtypes: float32(2)

In [14]: %timeit a.astype('float32').groupby(level=0).last()
1000 loops, best of 3: 378 us per loop
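The dtype behaviour shown in the output above can also be checked programmatically. The following is a small sketch, assuming a pandas version that includes float32 groupby support; on an older version the float32 assertion would be expected to fail.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "a": np.random.uniform(0, 2, size=1000),
        "b": np.random.uniform(0, 2, size=1000),
    }
)

for dtype in (np.float32, np.float64):
    typed = frame.astype(dtype)
    result = typed.groupby(level=0).last()
    # Every column should keep the dtype it had going in.
    assert all(dt == np.dtype(dtype) for dt in result.dtypes), result.dtypes
    # The index is unique, so last() should return the rows unchanged.
    np.testing.assert_array_equal(result.values, typed.values)
```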

@wesm
Member

wesm commented Jan 31, 2013

I am reopening because it needs a vbench to be written

@wesm wesm reopened this Jan 31, 2013
@jreback
Contributor

jreback commented Jan 31, 2013

This is done already in the dtypes branch :) There are group_first_float32 and group_last_float32 (I added some for reindexing as well).
https://travis-ci.org/jreback/pandas/jobs/4501738

@wesm
Member

wesm commented Jan 31, 2013

sweet thanks!

@jreback
Contributor

jreback commented Feb 15, 2013

closed via #2708 (vbench exists)

@jreback jreback closed this as completed Feb 15, 2013