Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: ignore_failures in BlockManager.reduce #35881

Merged
merged 42 commits into from
Oct 10, 2020
Merged
Show file tree
Hide file tree
Changes from 31 commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
4c5eddd
REF: remove unnecesary try/except
jbrockmendel Aug 21, 2020
c632c9f
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Aug 21, 2020
9e64be3
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Aug 21, 2020
42649fb
TST: add test for agg on ordered categorical cols (#35630)
mathurk1 Aug 21, 2020
47121dd
TST: resample does not yield empty groups (#10603) (#35799)
tkmz-n Aug 21, 2020
1decb3e
revert accidental rebase
jbrockmendel Aug 22, 2020
57c5dd3
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
jbrockmendel Aug 22, 2020
a358463
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
jbrockmendel Aug 23, 2020
ffa7ad7
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
jbrockmendel Aug 23, 2020
5281ce7
REF: implement Block.reduce
jbrockmendel Aug 24, 2020
cdcc1a0
remove outdated comment
jbrockmendel Aug 24, 2020
b04e023
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Aug 24, 2020
356cdf7
DOC: Updated aggregate docstring (#35042)
gurukiran07 Aug 24, 2020
e29283b
REF: BlockManager.reduce with ignore_failures
jbrockmendel Aug 24, 2020
4b52eda
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Aug 25, 2020
a58fdf0
de-duplicate
jbrockmendel Aug 25, 2020
313280f
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Aug 25, 2020
cdad23d
mypy fixup
jbrockmendel Aug 25, 2020
028a0b7
mypy fixup
jbrockmendel Aug 25, 2020
3467913
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Aug 27, 2020
2f10b72
comment
jbrockmendel Aug 27, 2020
8f2a047
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Aug 30, 2020
699b96b
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Sep 2, 2020
e128da8
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Sep 2, 2020
fd964d9
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Sep 2, 2020
253c0ea
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Sep 2, 2020
fbe67f4
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Sep 4, 2020
23e3a6a
update tested behavior
jbrockmendel Sep 4, 2020
37e2f99
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Sep 6, 2020
146b322
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Sep 7, 2020
ccea5a5
whatsnew
jbrockmendel Sep 7, 2020
c6b4e2c
rewords whatsnew
jbrockmendel Sep 10, 2020
1165129
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Sep 11, 2020
11126bc
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Sep 20, 2020
a5df009
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Sep 21, 2020
e1c1a5b
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Sep 22, 2020
d7099fe
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Sep 23, 2020
97376bf
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Sep 24, 2020
0a7fa6f
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Sep 25, 2020
c5de076
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Oct 3, 2020
6a30fcf
Merge branch 'master' of https://github.com/pandas-dev/pandas into re…
jbrockmendel Oct 6, 2020
f349ef7
Revert behavior-changing component
jbrockmendel Oct 6, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -244,7 +244,7 @@ Timezones

Numeric
^^^^^^^
-
- Bug in :class:`DataFrame` reductions incorrectly ignoring ``ExtensionArray`` behaviors (:issue:`35881`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ntoe is not very helpful for users I think. Can you list the cases we are aware of that will change? (which will also help reviewing this PR)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly for single-column EA-dtypes where the reduction on the array would raise. Tough to put into a succinct note bc of the dropping-failures behavior. suggestions welcome

-

Conversion
Expand Down
19 changes: 15 additions & 4 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -8625,6 +8625,7 @@ def _reduce(
cols = self.columns[~dtype_is_dt]
self = self[cols]

any_object = self.dtypes.apply(is_object_dtype).any()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is partly the culprit of the slowdown. See also the top post of #33252, which shows that self.dtypes.apply(..) is slower than the method that is used a few lines above for dtype_is_dt

# TODO: Make other agg func handle axis=None properly
axis = self._get_axis_number(axis)
labels = self._get_agg_axis(axis)
Expand Down Expand Up @@ -8654,7 +8655,13 @@ def _get_data(axis_matters: bool) -> DataFrame:
raise NotImplementedError(msg)
return data

if numeric_only is not None:
if numeric_only is not None or (
numeric_only is None and axis == 0 and not any_object
):
# For numeric_only non-None and axis non-None, we know
# which blocks to use and no try/except is needed.
# For numeric_only=None only the case with axis==0 and no object
# dtypes are unambiguous can be handled with BlockManager.reduce
df = self
if numeric_only is True:
df = _get_data(axis_matters=True)
Expand All @@ -8663,6 +8670,7 @@ def _get_data(axis_matters: bool) -> DataFrame:
axis = 0

out_dtype = "bool" if filter_type == "bool" else None
ignore_failures = numeric_only is None

def blk_func(values):
if isinstance(values, ExtensionArray):
Expand All @@ -8672,12 +8680,15 @@ def blk_func(values):

# After possibly _get_data and transposing, we are now in the
# simple case where we can use BlockManager.reduce
res = df._mgr.reduce(blk_func)
out = df._constructor(res,).iloc[0].rename(None)
res, indexer = df._mgr.reduce(blk_func, ignore_failures=ignore_failures)
out = df._constructor(res).iloc[0]
jreback marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my profiling, this getitem also seems to take a significant amount of the total time, although this cannot explain the recent perf degradation (but I am comparing my profile on master vs 1.1, where the iloc was not yet present)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The getitem being iloc[0]?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed


if out_dtype is not None:
out = out.astype(out_dtype)
if axis == 0 and is_object_dtype(out.dtype):
out[:] = coerce_to_dtypes(out.values, df.dtypes)
# GH#35865 careful to cast explicitly to object
nvs = coerce_to_dtypes(out.values, df.dtypes.iloc[np.sort(indexer)])
out[:] = np.array(nvs, dtype=object)
return out

assert numeric_only is None
Expand Down
45 changes: 39 additions & 6 deletions pandas/core/internals/managers.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
import itertools
from typing import (
Any,
Callable,
DefaultDict,
Dict,
List,
Expand Down Expand Up @@ -324,18 +325,46 @@ def _verify_integrity(self) -> None:
f"tot_items: {tot_items}"
)

def reduce(self: T, func) -> T:
def reduce(
self: T, func: Callable, ignore_failures: bool = False
) -> Tuple[T, np.ndarray]:
"""
Apply reduction function blockwise, returning a single-row BlockManager.

Parameters
----------
func : reduction function
ignore_failures : bool, default False
Whether to drop blocks where func raises TypeError.

Returns
-------
BlockManager
np.ndarray
Indexer of mgr_locs that are retained.
"""
# If 2D, we assume that we're operating column-wise
assert self.ndim == 2

res_blocks: List[Block] = []
for blk in self.blocks:
nbs = blk.reduce(func)
try:
nbs = blk.reduce(func)
except TypeError:
if ignore_failures:
continue
raise
res_blocks.extend(nbs)

index = Index([0]) # placeholder
new_mgr = BlockManager.from_blocks(res_blocks, [self.items, index])
return new_mgr
index = Index([None]) # placeholder
if res_blocks:
indexer = np.concatenate([blk.mgr_locs.as_array for blk in res_blocks])
new_mgr = self._combine(res_blocks, copy=False, index=index)
else:
indexer = []
new_mgr = type(self).from_blocks([], [Index([]), index])

return new_mgr, indexer

def operate_blockwise(self, other: "BlockManager", array_op) -> "BlockManager":
"""
Expand Down Expand Up @@ -693,7 +722,9 @@ def get_numeric_data(self, copy: bool = False) -> "BlockManager":
"""
return self._combine([b for b in self.blocks if b.is_numeric], copy)

def _combine(self: T, blocks: List[Block], copy: bool = True) -> T:
def _combine(
self: T, blocks: List[Block], copy: bool = True, index: Optional[Index] = None
) -> T:
""" return a new manager with the blocks """
if len(blocks) == 0:
return self.make_empty()
Expand All @@ -709,6 +740,8 @@ def _combine(self: T, blocks: List[Block], copy: bool = True) -> T:
new_blocks.append(b)

axes = list(self.axes)
if index is not None:
axes[-1] = index
axes[0] = self.items.take(indexer)

return type(self).from_blocks(new_blocks, axes)
Expand Down
6 changes: 3 additions & 3 deletions pandas/tests/frame/test_analytics.py
Original file line number Diff line number Diff line change
Expand Up @@ -1068,10 +1068,10 @@ def test_any_all_bool_only(self):
pytest.param(np.any, {"A": pd.Series([0, 1], dtype="m8[ns]")}, True,),
pytest.param(np.all, {"A": pd.Series([1, 2], dtype="m8[ns]")}, True,),
pytest.param(np.any, {"A": pd.Series([1, 2], dtype="m8[ns]")}, True,),
(np.all, {"A": pd.Series([0, 1], dtype="category")}, False),
(np.any, {"A": pd.Series([0, 1], dtype="category")}, True),
(np.all, {"A": pd.Series([0, 1], dtype="category")}, True),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is the bug fix? can you add a whatsnew note

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a bugfix per se, this is the behavior that changes if we declare that ser.to_frame().all() should be consistent with ser.all(), xref #36076

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok i think we need to deprecate this right (that was consensus?)

also i suppose ok to just change. this is not a very large case. cc @jorisvandenbossche

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated with whatsnew

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have another PR actually trying to deprecate this right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also in #36076, you commented a few days ago "the consensus seems to be that we should deprecate the current behavior in favor of matching the Series behavior". But so this PR is not doing that? Or is this PR not handling that case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closed that one. This is a giant PITA and we should just rip the bandaid off.

(np.any, {"A": pd.Series([0, 1], dtype="category")}, False),
(np.all, {"A": pd.Series([1, 2], dtype="category")}, True),
(np.any, {"A": pd.Series([1, 2], dtype="category")}, True),
(np.any, {"A": pd.Series([1, 2], dtype="category")}, False),
# Mix GH#21484
pytest.param(
np.all,
Expand Down