Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH/PERF: enable column-wise reductions for EA-backed columns #32867

Closed
Show file tree
Hide file tree
Changes from 19 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
a9ca0fa
ENH/PERF: enable column-wise reductions for EA-backed columns
jorisvandenbossche Mar 20, 2020
21aee0d
fix numeric_only for EAs
jorisvandenbossche Mar 20, 2020
9f83f6e
fix _reduce_columns call
jorisvandenbossche Mar 20, 2020
a9706e0
move EA._reduce call into blk_func
jorisvandenbossche Mar 20, 2020
07372e3
reuse blk_func for column-wise, inline _iter_arrays
jorisvandenbossche Mar 20, 2020
2d08450
temp
jorisvandenbossche Mar 20, 2020
c7f1dae
Merge remote-tracking branch 'upstream/master' into EA-column-wise-re…
jorisvandenbossche Mar 27, 2020
9e2a780
first attempts of going block-wise with numeric_only=None
jorisvandenbossche Mar 27, 2020
7f847e8
Merge remote-tracking branch 'upstream/master' into EA-column-wise-re…
jorisvandenbossche Apr 3, 2020
594d2b0
TEMP
jorisvandenbossche Apr 3, 2020
3d62e68
Merge remote-tracking branch 'upstream/master' into EA-column-wise-re…
jorisvandenbossche Apr 24, 2020
5b0370e
use iter_column_arrays
jorisvandenbossche May 29, 2020
6e3c287
Merge remote-tracking branch 'upstream/master' into EA-column-wise-re…
jorisvandenbossche May 29, 2020
7088cfc
intermediate clean-up: remove BM.reduce changes + do column-wise for …
jorisvandenbossche May 29, 2020
925d660
fixup
jorisvandenbossche May 29, 2020
852331e
fix dtype of empty result
jorisvandenbossche May 30, 2020
1c9a685
Merge remote-tracking branch 'upstream/master' into EA-column-wise-re…
jorisvandenbossche Jun 6, 2020
34731f2
clean-up
jorisvandenbossche Jun 6, 2020
a8e61d0
whitespace
jorisvandenbossche Jun 6, 2020
f233109
Merge remote-tracking branch 'upstream/master' into EA-column-wise-re…
jorisvandenbossche Jun 15, 2020
ec65c57
Merge remote-tracking branch 'upstream/master' into EA-column-wise-re…
jorisvandenbossche Jul 12, 2020
15ec9b6
add test case for GH34520, copied from GH35112
jorisvandenbossche Jul 12, 2020
2653d02
add test to ensure EA op is used for integer array
jorisvandenbossche Jul 12, 2020
64e0069
remove try except
jorisvandenbossche Jul 12, 2020
bb0a47b
remove unused code
jorisvandenbossche Jul 12, 2020
9323f0e
add test for GH32651, copied from GH34210
jorisvandenbossche Jul 12, 2020
eb33f86
remove check for EAs for block-wise path
jorisvandenbossche Jul 12, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
71 changes: 70 additions & 1 deletion pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -8481,6 +8481,22 @@ def _count_level(self, level, axis=0, numeric_only=False):
def _reduce(
self, op, name, axis=0, skipna=True, numeric_only=None, filter_type=None, **kwds
):
"""
Reduce DataFrame over axis with given operation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typing args a +

Parameters
----------
op : func
The reducing function to be called on the values.
name : str
The name of the reduction.
axis : int
numeric_only : bool, optional
filter_type : None or "bool"
Set to "bool" for ops that only work on boolean values.
skipna, **kwds : keywords to pass to the `op` function

"""

assert filter_type is None or filter_type == "bool", filter_type

Expand Down Expand Up @@ -8531,7 +8547,12 @@ def _get_data(axis_matters):
raise NotImplementedError(msg)
return data

if numeric_only is not None and axis in [0, 1]:
# special case for block-wise
if (
not self._mgr.any_extension_types
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we excluding any_extension_types here? blk_func looks like it should handle EAs appropriately

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It indeed handles EA correctly (that fix was split off from this PR a while ago), but the reason I put this check here: the motivation for this branch (do block-wise when possible because it is faster) is not the case for EAs, as block-wise is actually slower. And secondly, that also ensures there is only a single code path when having EAs.

Now, we could also only check for the nullable extension type (nullable int, float, bool etc, so it only checks for the new EAs), and not for all extension types in general (which then also includes things like datetimetz)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel I reverted this to keep the original check (so the same cases will take the block-wise code path as before this PR)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, ill take a fresh look

and numeric_only is not None
and axis in [0, 1]
):
df = self
if numeric_only is True:
df = _get_data(axis_matters=True)
Expand Down Expand Up @@ -8559,6 +8580,54 @@ def blk_func(values):
out[:] = coerce_to_dtypes(out.values, df.dtypes)
return out

def array_func(values):
if isinstance(values, ExtensionArray):
return values._reduce(name, skipna=skipna, **kwds)
else:
return op(values, skipna=skipna, **kwds)
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved

# all other options with axis=0 are done column-array-wise
if axis == 0:

def _constructor(df, result, index=None):
index = index if index is not None else df.columns
if len(result):
return df._constructor_sliced(result, index=index)
else:
# set correct dtype for empty result
dtype = "bool" if filter_type == "bool" else "float64"
return df._constructor_sliced(result, index=index, dtype=dtype)

def _reduce_columns(df, op):
result = [op(arr) for arr in df._iter_column_arrays()]
return _constructor(df, result)

df = self
if numeric_only is True:
df = _get_data(axis_matters=True)

if numeric_only is not None:
return _reduce_columns(df, array_func)
else:
# need to catch and ignore exceptions when numeric_only=None
try:
return _reduce_columns(df, array_func)
except TypeError:
# if column-wise fails and numeric_only was None, we try
jorisvandenbossche marked this conversation as resolved.
Show resolved Hide resolved
# again but removing those columns for which it fails
result = []
indices = []
for i, arr in enumerate(df._iter_column_arrays()):
try:
res = array_func(arr)
except Exception:
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved
pass
else:
result.append(res)
indices.append(i)

return _constructor(df, result, index=df.columns[indices])

if not self._is_homogeneous_type:
# try to avoid self.values call

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block (using frame_apply) could be removed now, since this is only called when axis=0, which is now already handled above.

We might want to limit the new code above to if axis == 0 and self._mgr.any_extension_types:, and then keep the block here for backwards compatibility (or eg only remove when doing a major version bump pandas 2.0).

Expand Down
2 changes: 1 addition & 1 deletion pandas/core/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -11397,7 +11397,7 @@ def stat_func(
if level is not None:
return self._agg_by_level(name, axis=axis, level=level, skipna=skipna)
return self._reduce(
func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only,
)

return set_function_name(stat_func, name, cls)
Expand Down
8 changes: 4 additions & 4 deletions pandas/core/internals/managers.py
Original file line number Diff line number Diff line change
Expand Up @@ -327,24 +327,24 @@ def _verify_integrity(self) -> None:
f"tot_items: {tot_items}"
)

def reduce(self, func, *args, **kwargs):
jorisvandenbossche marked this conversation as resolved.
Show resolved Hide resolved
def reduce(self, func):
# If 2D, we assume that we're operating column-wise
if self.ndim == 1:
# we'll be returning a scalar
blk = self.blocks[0]
return func(blk.values, *args, **kwargs)
return func(blk.values)

res = {}
for blk in self.blocks:
bres = func(blk.values, *args, **kwargs)
bres = func(blk.values)

if np.ndim(bres) == 0:
# EA
assert blk.shape[0] == 1
new_res = zip(blk.mgr_locs.as_array, [bres])
else:
assert bres.ndim == 1, bres.shape
assert blk.shape[0] == len(bres), (blk.shape, bres.shape, args, kwargs)
assert blk.shape[0] == len(bres), (blk.shape, bres.shape)
new_res = zip(blk.mgr_locs.as_array, bres)

nr = dict(new_res)
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/frame/test_analytics.py
Original file line number Diff line number Diff line change
Expand Up @@ -410,7 +410,7 @@ def test_stat_operators_attempt_obj_array(self, method):
for df in [df1, df2]:
assert df.values.dtype == np.object_
result = getattr(df, method)(1)
expected = getattr(df.astype("f8"), method)(1)
expected = getattr(df, method)(1)

if method in ["sum", "prod"]:
tm.assert_series_equal(result, expected)
Expand Down