-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: implement scalar ops blockwise #29853
Conversation
Hello @jbrockmendel! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-12-27 16:35:39 UTC |
Resolved the issue with test_expressions behaving unexpectedly. Added an asv that times operations on a homogeneous-dtype DataFrame (rows=20k, cols=100) with a scalar. Not sure how many variants of these to do; could be easy to go overboard.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. need to go thru again, some minor commments.
@@ -372,6 +373,10 @@ def dispatch_to_series(left, right, func, str_rep=None, axis=None): | |||
right = lib.item_from_zerodim(right) | |||
if lib.is_scalar(right) or np.ndim(right) == 0: | |||
|
|||
array_op = get_array_op(func, str_rep=str_rep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment here on what is going on
@@ -411,7 +411,10 @@ def apply(self, f: str, filter=None, **kwargs): | |||
axis = obj._info_axis_number | |||
kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy) | |||
|
|||
applied = getattr(b, f)(**kwargs) | |||
if callable(f): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this strictly necessary? meaning happy to require only callables here (would require some changing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of our existing usages pass strings here to get at Block methods. i think @WillAyd had a suggestion about re-working Block.apply to do str vs callable handling there; that should be its own PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
k, yeah this whole section could use some TLC
pandas/core/ops/array_ops.py
Outdated
@@ -367,3 +368,13 @@ def fill_bool(x, left=None): | |||
res_values = filler(res_values) # type: ignore | |||
|
|||
return res_values | |||
|
|||
|
|||
def get_array_op(op, str_rep=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a doc-string / what this is doing
block = self.make_block(values=nv, placement=[loc]) | ||
nbs.append(block) | ||
return nbs | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be an elif here and re-assign to result, just to make the flow more natural. alt could make this into a method on BM. but for followon's
array_op = get_array_op(func, str_rep=str_rep) | ||
bm = left._data.apply(array_op, right=right) | ||
return type(left)(bm) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could just be an if (as you are returning), e.g. change the following elif to an if, but NBD
ok a couple of minor comments, rebase and looks ok to go. I suspect you will be refactoring things after this is in anyhow. |
rebased+green |
thanks @jbrockmendel |
Alright! This was a tough slog, thanks to all who helped along the way. Next up: dispatching for op(frame, series). |
…ndexing-1row-df * upstream/master: (333 commits) CI: troubleshoot Web_and_Docs failing (pandas-dev#30534) WARN: Ignore NumbaPerformanceWarning in test suite (pandas-dev#30525) DEPR: camelCase in offsets, get_offset (pandas-dev#30340) PERF: implement scalar ops blockwise (pandas-dev#29853) DEPR: Remove Series.compress (pandas-dev#30514) ENH: Add numba engine for rolling apply (pandas-dev#30151) [ENH] Add to_markdown method (pandas-dev#30350) DEPR: Deprecate pandas.np module (pandas-dev#30386) ENH: Add ignore_index for df.drop_duplicates (pandas-dev#30405) BUG: The setting xrot=0 in DataFrame.hist() doesn't work with by and subplots pandas-dev#30288 (pandas-dev#30491) CI: Fix GBQ Tests (pandas-dev#30478) Bug groupby quantile listlike q and int columns (pandas-dev#30485) ENH: Add ignore_index for df.sort_values and series.sort_values (pandas-dev#30402) TYP: Typing hints in pandas/io/formats/{css,csvs}.py (pandas-dev#30398) BUG: raise on non-hashable Index name, closes pandas-dev#29069 (pandas-dev#30335) Replace "foo!r" to "repr(foo)" syntax pandas-dev#29886 (pandas-dev#30502) BUG: preserve EA dtype in transpose (pandas-dev#30091) BLD: add check to prevent tempita name error, clsoes pandas-dev#28836 (pandas-dev#30498) REF/TST: method-specific files for test_append (pandas-dev#30503) marked unused parameters (pandas-dev#30504) ...
Hi @jbrockmendel! is there already an issue regarding this where we could track the progress? |
no, but i can tell you the answer. The four cases are: scalar, series(axis=0), series(axis=1), and frame. This PR handled the scalar case. Another PR handled the series(axis=1) case (except when that series is EA-backed). #32779 handles the frame case. |
Thank you for your reply!
by "handled" you mean, that this has already been resolved? If so in which pandas-Version? We are still seeing performance issues in e.g. import pandas as pd
df = pd.DataFrame(index=['A'], columns=range(1000), data=1.0)
s = pd.Series(index=df.columns, data=1.0)
%timeit x = df - s is much slower on 0.24.x all the way through to the pypy-available 1.0.3 than it was on 0.23.4 regards, |
Yes, handled as in resolved. Not sure off the top of my head when that was. Before my caffeine I'd guess 60/40 that it made it into 1.0
That case hasn't been addressed yet, will be next up after #32779. If you'd like to make a PR and improve it before I do, go for it. |
@jbrockmendel, here is a performance comparison of the different pandas versions using @huitrouge's benchmark:
hope this helps. |
Similar to #28583, but going through BlockManager.apply.