GroupBy on DatetimeIndex with float32 values VERY slow #2772

scottkidder · 2013-01-30T03:48:49Z

I have a DataFrame with a DatetimeIndex and two float32 columns.

In [35]: %timeit a.groupby(level=0).last()
1 loops, best of 3: 2.11 s per loop

In [36]: a = a.astype(np.float64)

In [37]: %timeit a.groupby(level=0).last()
1000 loops, best of 3: 911 us per loop

Either way, the result of the groupby is all float64s. I would like to preserve float32 dtypes if possible.

Other operations (resample, shift) are also very slow on float32 data, but I'm pretty sure this is related.

jreback · 2013-01-30T04:01:12Z

dtype support is not fully there yet, see #2708. Support for the algos (e.g. pad, backfill, take) is only there for float64, int64, int32, object, bool, and datetime64[ns]; they have to be defined for the other types (which the PR will do, for the most part).
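Until the float32 paths exist, one possible workaround is to upcast to float64 for the aggregation and cast back afterwards. The sketch below is only an illustration: the helper name `last_per_group_keep_float32` and the sample frame are hypothetical and appear nowhere in this thread or in pandas.

```python
import numpy as np
import pandas as pd


def last_per_group_keep_float32(frame):
    """Hypothetical helper (not a pandas API): take the last row per group
    on index level 0, aggregating in float64 and casting back to float32."""
    float32_cols = list(frame.select_dtypes(include=["float32"]).columns)
    if not float32_cols:
        # Nothing to upcast, so aggregate directly.
        return frame.groupby(level=0).last()
    # Upcast only the float32 columns so the float64 code path is used.
    upcast = frame.astype({col: np.float64 for col in float32_cols})
    result = upcast.groupby(level=0).last()
    # float32 -> float64 -> float32 round-trips exactly, so no precision is lost.
    return result.astype({col: np.float32 for col in float32_cols})


# Small made-up frame with a repeated timestamp in its DatetimeIndex.
index = pd.to_datetime(["2013-01-30", "2013-01-30", "2013-01-31"])
frame = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}, index=index
).astype(np.float32)

result = last_per_group_keep_float32(frame)
print(result)
print(result.dtypes)
```

The extra casts cost two copies of the float32 columns, so this only pays off while the float32 aggregation itself is the bottleneck.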

jreback · 2013-01-30T04:03:59Z

can you post code to generate your frame, a, and i'll build in a test

scottkidder · 2013-01-30T17:27:46Z

Okay great, I will look into #2708

Code to generate a:

In [29]: N = 1000

In [30]: some_nums = lambda n: np.random.uniform(0, 2, size=n)

In [31]: a = pd.DataFrame({'a': some_nums(N), 'b': some_nums(N)}).astype(np.float32)

In [32]: %timeit a.groupby(level=0).last()
1 loops, best of 3: 201 ms per loop

In [33]: a = a.astype(np.float64)

In [34]: %timeit a.groupby(level=0).last()
1000 loops, best of 3: 244 us per loop
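For anyone reproducing this outside IPython, the same comparison can be sketched as a standalone script with the standard-library `timeit` module. The helper name `best_time`, the seed, and the repeat counts are assumptions made for this illustration, and the absolute timings will depend on the pandas version and machine.

```python
import timeit

import numpy as np
import pandas as pd

N = 1000
rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# Same shape as the frame above: two columns of uniform values in [0, 2).
frame32 = pd.DataFrame(
    {"a": rng.uniform(0, 2, size=N), "b": rng.uniform(0, 2, size=N)}
).astype(np.float32)
frame64 = frame32.astype(np.float64)


def best_time(frame, repeat=3, number=10):
    """Hypothetical helper: best per-call time in seconds for groupby(level=0).last()."""
    timer = timeit.Timer(lambda: frame.groupby(level=0).last())
    return min(timer.repeat(repeat=repeat, number=number)) / number


t32 = best_time(frame32)
t64 = best_time(frame64)
print(f"float32: {t32 * 1e6:.0f} us per call")
print(f"float64: {t64 * 1e6:.0f} us per call")
print(f"ratio float32 / float64: {t32 / t64:.1f}")
```

A ratio close to 1 suggests both dtypes take a comparable code path; a ratio in the hundreds matches the slowdown reported here.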

jreback · 2013-01-30T17:48:24Z

in dtypes branch (after groupby cythonized for float32)

In [11]: a.astype('float64').groupby(level=0).last()
Out[11]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns:
a    1000  non-null values
b    1000  non-null values
dtypes: float64(2)

In [12]: %timeit a.astype('float64').groupby(level=0).last()
1000 loops, best of 3: 381 us per loop

In [13]: a.astype('float32').groupby(level=0).last()
Out[13]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns:
a    1000  non-null values
b    1000  non-null values
dtypes: float32(2)

In [14]: %timeit a.astype('float32').groupby(level=0).last()
1000 loops, best of 3: 378 us per loop
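The dtype behaviour shown above can also be checked with a small regression-style script. This is a sketch that assumes a pandas version which includes the float32 groupby paths; the sample data is made up, and it uses a DatetimeIndex with repeated timestamps so that each group has more than one row.

```python
import numpy as np
import pandas as pd

# Four days, each timestamp repeated twice, so every group has two rows.
index = pd.date_range("2013-01-30", periods=4, freq="D").repeat(2)
frame = pd.DataFrame(
    {
        "a": np.arange(8, dtype=np.float32),
        "b": np.arange(8, 16, dtype=np.float32),
    },
    index=index,
)

result = frame.groupby(level=0).last()

# The aggregation should keep float32 rather than upcasting to float64.
assert (result.dtypes == np.float32).all()
# The last row of each pair is the odd-positioned one.
assert list(result["a"]) == [1.0, 3.0, 5.0, 7.0]
print(result.dtypes)
```

On a version without the float32 paths the first assertion is expected to fail, because the result comes back as float64.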

wesm · 2013-01-31T19:50:35Z

I am reopening because it needs a vbench to be written

jreback · 2013-01-31T19:56:39Z

this is done already (in the dtypes branch :) there are vbenches group_first_float32 and group_last_float32 (added some for reindexing as well)
https://travis-ci.org/jreback/pandas/jobs/4501738

wesm · 2013-01-31T19:59:00Z

sweet thanks!

jreback · 2013-02-15T00:15:54Z

closed via #2708 (vbench exists)

scottkidder closed this as completed Jan 30, 2013

wesm reopened this Jan 31, 2013

jreback closed this as completed Feb 15, 2013
