
ENH/PERF: enable column-wise reductions for EA-backed columns #32867

Conversation

jorisvandenbossche
Member

@jorisvandenbossche commented Mar 20, 2020

Currently, for reductions on a DataFrame, we convert the full DataFrame to a single "interleaved" array and then perform the operation. That's the default, but when numeric_only=True is specified, it is done block-wise.

Enabling column-wise reductions (or block-wise for EAs):

  • Gives better performance in common cases / no need to create a new 2D array (which goes through object dtype for nullable ints; see the snippet below)
  • Ensures that the reduction implementation of the EA itself is used, which can be more correct / more efficient than converting to an ndarray and using our generic nanops.
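
As a small illustration of the first point (a sketch, not part of the PR itself): interleaving nullable-integer columns into a single 2D array falls back to object dtype, because IntegerArray has no NA-aware 2D ndarray representation.

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, None], "b": [3, 4, 5]}, dtype="Int64")
    # The interleaved 2D array that the reduction would operate on is object dtype.
    print(df.values.dtype)  # object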

For illustration purposes, I added a column_wise keyword in this PR (not meant to keep this, just for testing), so we can compare a few cases:

In [9]: df_wide = pd.DataFrame(np.random.randint(1000, size=(1000,100))).astype("Int64").copy() 

In [15]: %timeit df_wide.mean()   
9.68 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [16]: %timeit df_wide.mean(numeric_only=True)
10.1 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [17]: %timeit df_wide.mean(column_wise=True)  
5.22 ms ± 29.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [18]: df_long = pd.DataFrame(np.random.randint(1000, size=(10000,10))).astype("Int64").copy()  

In [19]: %timeit df_long.mean()
7.77 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [20]: %timeit df_long.mean(numeric_only=True)         
2.07 ms ± 31.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [21]: %timeit df_long.mean(column_wise=True) 
1.04 ms ± 4.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So I experimented with two approaches:

  • First by iterating through the columns and calling the _reduce of the underlying EA (this path gets taken by using the temporary keyword column_wise=True)
  • Fixing the block-wise case for extension blocks (triggered by numeric_only=True) by changing it to also use _reduce of the EA (previously this failed because nanops functions were called on the EA directly)

The first gives better performance (it is simpler in implementation because it does not involve the blocks), but requires more new code (it reuses less of the existing machinery).

Ideally, for EA columns, we should always use their own reduction implementation (i.e. call EA._reduce), I think. So for both approaches, the question is how to trigger this behaviour.
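
For reference, a minimal sketch of what the first approach amounts to conceptually (a hypothetical helper, not the actual PR code): iterate over the columns and dispatch to each backing EA's own _reduce.

    import pandas as pd

    def column_wise_reduce(df, name, skipna=True):
        # Dispatch to each column's ExtensionArray._reduce (a private EA
        # method) instead of interleaving the frame into a 2D ndarray first.
        results = {
            label: col.array._reduce(name, skipna=skipna)
            for label, col in df.items()
        }
        return pd.Series(results)

With the df_wide frame above, column_wise_reduce(df_wide, "mean") takes roughly the path that the temporary column_wise=True keyword exercises.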

Closes #32651, closes #34520

@jorisvandenbossche added the Performance, Numeric Operations, and ExtensionArray labels Mar 20, 2020
@jorisvandenbossche
Member Author

cc @jbrockmendel

pandas/core/frame.py (outdated review comments, resolved)
@jbrockmendel
Member

Come to think of it, the place where this dispatch belongs may be in the relevant nanops functions

@jorisvandenbossche
Member Author

jorisvandenbossche commented Mar 20, 2020

Come to think of it, the place where this dispatch belongs may be in the relevant nanops functions

Personally, I prefer the nanops to be about ops on numpy arrays, and not deal with extension arrays

pandas/core/frame.py (outdated):
@@ -7898,6 +7915,19 @@ def _get_data(axis_matters):
                raise NotImplementedError(msg)
            return data

        def blk_func(values):
            if isinstance(values, ExtensionArray):
Member

with this inside blk_func, shouldn't the block-wise operation have the same performance bump as the column-wise?

Member Author

with this inside blk_func, shouldn't the block-wise operation have the same performance bump as the column-wise?

It didn't actually change anything performance-wise (it's the same function being called as before).
The reason the two paths perform differently is that re-assembling the results into a Series is more expensive for the block-wise path than for the column-wise one.

(it's possible that the block-wise way could be optimized to get rid of this difference though. The main thing is that the block results are not in order)
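
To make the ordering point concrete, a hypothetical illustration (not the pandas internals): block-wise results come back grouped per block, together with the original column positions each block covers, and have to be scattered back into column order before the result Series can be built. Column-wise results already arrive in order.

    import numpy as np
    import pandas as pd

    # Per-block reduction results and the (non-contiguous) column positions
    # each block covers, in the spirit of Block.mgr_locs.
    block_results = [np.array([1.0, 3.0]), np.array([2.0])]
    block_locs = [[0, 2], [1]]

    out = np.empty(3)
    for vals, locs in zip(block_results, block_locs):
        out[locs] = vals  # scatter back into original column order
    print(pd.Series(out, index=["a", "b", "c"]))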

Member

(it's possible that the block-wise way could be optimized to get rid of this difference though. The main thing is that the block results are not in order)

This would be really nice.

@jorisvandenbossche
Member Author

So the bigger question is: how can we get to use this by default (at least for EAs). Some ideas:

  • The column_wise route could, for example, be triggered by checking whether all columns are ExtensionBlock columns (we already have a BlockManager.any_extension_types and could also add an all_extension_types; see the sketch after this list).
    This way it would only be triggered in rare cases, though (certainly as long as we don't yet have a float extension type). We could also trigger it with any_extension_types: for the new nullable dtypes that would be possible (since those are new, they can introduce changes in behaviour), but not for the older extension dtypes (categorical, datetimelikes).

  • Currently, with numeric_only=None, we first try the op on the full values (DataFrame.values), and then fall back to frame_apply with ignore_failures=True.
    I suppose we could also do this block-wise: ignore failures there when numeric_only is None, and in that case skip the failing blocks when assembling the output (but we will need to check what exactly fails by doing this ..).
    If that works, we could get rid of the whole numeric_only is None block in _reduce.
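
A rough sketch of the check from the first bullet (hypothetical; all_extension_types does not exist yet, so this is phrased with the public dtype introspection helper):

    from pandas.api.types import is_extension_array_dtype

    def all_extension_types(df):
        # The column_wise route would only kick in when every column is
        # backed by an ExtensionArray.
        return all(is_extension_array_dtype(dtype) for dtype in df.dtypes)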

@jbrockmendel
Member

Currently, with numeric_only=None, we first try the op on the full values (DataFrame.values), and then fall back to frame_apply with ignore_failures=True.

I've got a branch that identifies the cases where frame_apply is used and does that before the .values call. Trying to 1) avoid a .values call and 2) de-nest _reduce. I'll move that branch up the priority-list.

@jbrockmendel
Member

Personally, I prefer the nanops to be about ops on numpy arrays, and not deal with extension arrays

Depends on the public-ness of those functions. ATM their docstrings have examples with Series, not sure if we have other docs or tests with those.

@jorisvandenbossche
Member Author

I've got a branch that identifies the cases where frame_apply is used and does that before the .values call. Trying to 1) avoid a .values call and 2) de-nest _reduce. I'll move that branch up the priority-list.

Note that my suggestion was to eliminate this part entirely (by using block-wise for all). So let's ensure we don't do duplicate / conflicting work. Does the branch already have something? (can you maybe push it to your fork?)

Depends on the public-ness of those functions. ATM their docstrings have examples with Series, not sure if we have other docs or tests with those.

nanops are not public (regardless of their docstrings). What I mainly meant is that (in my head) they are meant to work on numpy arrays (whether extracted from a Series first or not)

@jbrockmendel
Member

https://github.com/jbrockmendel/pandas/tree/cln-reduce

Note that my suggestion was to eliminate this part entirely (by using block-wise for all).

This would require making BlockManager._reduce handle numeric_only=None similar to how frame_apply handles ignore_failures=True, right? I've been reluctant to do that (because I don't like the ignoring-exceptions pattern, xref #28900), but it may end up being the best option (off the top of my head, this would make it easy to solve #28773).

@jorisvandenbossche
Member Author

Thanks!

because I don't like the ignoring-exceptions pattern,

Yes, but as long as we keep the numeric_only=None behaviour, ignoring exceptions is the whole point of it. So it would still be nice to unify the two code paths a bit more than what we have now.
Will try to take a look at it next week.
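
For context, a minimal sketch of what the numeric_only=None fallback semantics amount to (a hypothetical helper, not the pandas implementation): try the op on every column and silently drop the columns that raise.

    import pandas as pd

    def reduce_ignore_failures(df, op):
        out = {}
        for label, col in df.items():
            try:
                out[label] = op(col)
            except (TypeError, ValueError):
                continue  # failing columns are dropped, not raised
        return pd.Series(out)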

@jbrockmendel
Member

hmm, block-wise and column-wise with ignore_failures won't necessarily be equivalent for object-dtype (a single object block can hold several columns, so a failure would drop the whole block rather than just the offending column)

@jorisvandenbossche
Member Author

hmm, block-wise and column-wise with ignore_failures won't necessarily be equivalent for object-dtype

Ah, that's a good point. So for ObjectBlock, we would still need to do it column-wise if we want to get rid of the fallback in general. Will take a further look in the coming days to see whether that looks feasible.

@jorisvandenbossche added this to the 1.1 milestone Jul 11, 2020
@jorisvandenbossche
Member Author

jorisvandenbossche commented Jul 12, 2020

this is massively increasing the complexity here. Please find a way to do better.

Do you have something more concrete as feedback? I don't find this particularly complex. It adds some more code, for sure, to ensure we perform the reductions properly column-wise with the correct operation (which fixes bugs).

Note that there was still code that could be removed (so I was actually replacing something, not only adding). I have removed that now to make this clearer (there was a comment about it)

@jbrockmendel
Member

Do you have something more concrete as feedback?

A suggestion for the short term (i.e. to address #35112) is to change the inline-defined f to something like:

    def blk_func(values):
        if is_extension_array_dtype(values.dtype):
            return extract_array(values)._reduce(name, skipna=skipna, **kwds)
        else:
            return op(values, axis=axis, skipna=skipna, **kwds)

in a dedicated PR. I think that would address a subset of what this PR is doing and is orthogonal to the rest of this.

@jreback
Contributor

jreback commented Jul 13, 2020

this needs to wait for 1.2

@jreback modified the milestones: 1.1, 1.2 Jul 15, 2020
@jreback
Contributor

jreback commented Jul 15, 2020

moving to 1.2

@jorisvandenbossche
Member Author

I am fine with that, if we then include #35254 instead

@jbrockmendel
Member

@jorisvandenbossche I think #36076 has a bearing on this, thoughts?

@Dr-Irv
Contributor

Dr-Irv commented Sep 4, 2020

  • Reduction on empty object-dtype DataFrame, currently returns float for DataFrame but integer for Series:
    ...
    while on this PR the DataFrame behaviour follows Series to return integer series.

But in current behavior, a similar thing happens when the DataFrame has object dtype, but consists of int values and is not empty. Not sure what your PR does in this case.

>>> ddf=pd.DataFrame([[1,2,3]],columns=['a','b','c'],dtype=object)
>>> ddf
   a  b  c
0  1  2  3
>>> ddf.sum()
a    1.0
b    2.0
c    3.0
dtype: float64
>>> ddf.dtypes
a    object
b    object
c    object
dtype: object
>>> ddf['a'].sum()
1
>>> type(_)
<class 'int'>
>>>

@jorisvandenbossche
Member Author

@Dr-Irv ah, thanks, that's something we should then also test. And another thing to look into ;)

@github-actions
Contributor

github-actions bot commented Oct 9, 2020

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions bot added the Stale label Oct 9, 2020
@jbrockmendel
Member

@jorisvandenbossche I think this is closable, as it really isn't actionable until we get the #36076-and-similar inconsistencies fixed, and at that point we'll be able to go all blockwise (which for ArrayManager will be columnwise anyway)

@jreback removed this from the 1.2 milestone Nov 18, 2020
@jreback
Contributor

jreback commented Nov 18, 2020

looks ok, but not for 1.2

@jbrockmendel
Member

@jorisvandenbossche does this still have a perf impact given that we don't go through .values anymore for axis=0? Or is the perf impact all in re-assembling blockwise results?

We now always use EA._reduce I think, so that part of the motivation should no longer be relevant.

As @jreback referred to in his previous comment, we've gone to a lot of trouble to simplify DataFrame._reduce; I'm wary of re-complexifying it.

@jreback
Contributor

jreback commented Feb 11, 2021

@jorisvandenbossche pls update or close

@jreback closed this Feb 11, 2021
@jreback reopened this Feb 11, 2021
@jbrockmendel
Member

is this still relevant?

@simonjayhawkins
Member

@jorisvandenbossche closing as stale. reopen when ready.
