PERF: regression in DataFrame reduction ops performance #37081 #37118

Merged (2 commits) on Oct 17, 2020


Contributor

@ukarroum ukarroum commented Oct 14, 2020

Made the change proposed by @jorisvandenbossche in #35881 (comment)

I did a very quick comparison:

With self.dtypes (old version):

In [8]: values = np.random.randn(100000, 4)   
   ...: df = pd.DataFrame(values).astype("int") 
   ...: %timeit df.sum() 
714 µs ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

With self._iter_column_arrays() (new version):

In [4]: values = np.random.randn(100000, 4)   
   ...: df = pd.DataFrame(values).astype("int") 
   ...: %timeit df.sum() 
477 µs ± 8.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
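
For anyone wanting to repeat the comparison outside IPython, here is a minimal sketch using the standard-library timeit module. The helper name time_frame_sum and its parameters are hypothetical, and it reports the best run rather than the mean ± std. dev. that %timeit prints, so the numbers will not match the figures above exactly.

```python
import timeit

import numpy as np
import pandas as pd


def time_frame_sum(n_rows=100_000, n_cols=4, number=100, repeat=7):
    """Return the best per-call time (in seconds) of DataFrame.sum()."""
    # Same setup as the IPython session: random floats cast to integers.
    values = np.random.randn(n_rows, n_cols)
    df = pd.DataFrame(values).astype("int")
    # Each of the `repeat` runs calls df.sum() `number` times.
    timings = timeit.repeat(df.sum, number=number, repeat=repeat)
    return min(timings) / number


if __name__ == "__main__":
    print(f"{time_frame_sum() * 1e6:.1f} µs per loop (best of 7 runs)")
```

Running the script once on each branch (before and after the change) should give a rough before/after comparison on the same machine.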

@jreback jreback changed the title from "[PERF] Fixed issue #37081" to "PERF: regression in DataFrame reduction ops performance #37081" Oct 14, 2020
@jreback jreback added the Performance Memory or execution speed performance label Oct 14, 2020
@jreback jreback added this to the 1.1.4 milestone Oct 14, 2020
@jreback jreback added the Regression Functionality that used to work in a prior pandas version label Oct 14, 2020
@jreback jreback modified the milestones: 1.1.4, 1.2 Oct 14, 2020
Contributor

@jreback jreback left a comment


Do we have an asv that covers this case? If not, can you add one?

This doesn't need a note as it's on master.

@ukarroum
Contributor Author

Do we have an asv that covers this case? If not, can you add one?

This doesn't need a note as it's on master.

I believe we do:

https://pandas.pydata.org/speed/pandas/#stat_ops.FrameOps.time_op?p-op='sum'&p-dtype='int'
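
For readers unfamiliar with asv, a parametrized benchmark of this kind looks roughly like the sketch below. It follows asv's conventions (a class with params, param_names, setup, and time_* methods) but is not a copy of the pandas benchmark; the class name FrameReductionSketch, the parameter lists, and the frame size are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


class FrameReductionSketch:
    # asv runs every combination of these parameter values.
    params = [["sum", "mean"], ["float", "int"]]
    param_names = ["op", "dtype"]

    def setup(self, op, dtype):
        # Build the frame outside the timed section so only the
        # reduction itself is measured.
        values = np.random.randn(100_000, 4)
        self.df = pd.DataFrame(values).astype(dtype)
        self.df_func = getattr(self.df, op)

    def time_op(self, op, dtype):
        # Methods prefixed with time_ are what asv actually times.
        self.df_func()
```

With op='sum' and dtype='int', this exercises the same kind of call as the df.sum() timing in the PR description.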

@jorisvandenbossche
Member

Indeed. Jeff, see the issue; it was actually caught thanks to our asv suite.

any_object = np.array(
    [is_object_dtype(values.dtype) for values in self._iter_column_arrays()],
    dtype=bool,
).any()
Member


Let's only find the dtypes once (i.e. share with dtype_is_dt above):

own_dtypes = [arr.dtype for arr in self._iter_column_arrays()]
# or 
own_dtypes = [blk.dtype for blk in self._mgr.blocks]
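
A minimal standalone sketch of that suggestion: collect the column dtypes once, then derive both the per-column datetime mask and the any-object flag from the same list. The function name summarize_dtypes is hypothetical, and the public df.dtypes attribute is used here only as a stand-in for the private self._iter_column_arrays() call, since this sketch runs outside the DataFrame class. It is not the code that was merged.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_object_dtype


def summarize_dtypes(df):
    """Return (dtype_is_dt, any_object), computed from a single dtype pass."""
    # Assumption: inside pandas this list would come from the column
    # arrays or blocks; df.dtypes is the public equivalent used here.
    own_dtypes = list(df.dtypes)

    # Both checks below reuse own_dtypes instead of re-reading the dtypes.
    dtype_is_dt = np.array(
        [is_datetime64_any_dtype(dtype) for dtype in own_dtypes],
        dtype=bool,
    )
    any_object = bool(
        np.array([is_object_dtype(dtype) for dtype in own_dtypes], dtype=bool).any()
    )
    return dtype_is_dt, any_object
```

The point of the design is simply that the dtype lookup happens once per reduction call, and everything downstream works on the cached list.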

Contributor Author


done in 41827fb

Member

@jorisvandenbossche jorisvandenbossche left a comment


Thanks @ukarroum, looks good!

@ukarroum ukarroum requested a review from jreback October 17, 2020 09:55
@ukarroum
Contributor Author

Should I do something about the 2 failed Azure pipelines?
It looks like they're failing on master too, and ./ci/code_checks.sh returns no errors locally.

@jreback jreback merged commit 9fed16c into pandas-dev:master Oct 17, 2020
@jreback
Contributor

jreback commented Oct 17, 2020

thanks @ukarroum

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020