
REF: ignore_failures in BlockManager.reduce #35881

Merged: 42 commits, Oct 10, 2020

Conversation

jbrockmendel
Member

Moving towards collecting all of the ignore_failures code in one place.

The case where we have object dtypes is kept separate in this PR, will be handled in the next pass.
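For orientation, here is a minimal column-wise sketch of the ignore_failures idea (illustrative only; the PR moves this logic into BlockManager.reduce, which operates on blocks rather than columns, and the function name here is invented):

```python
import pandas as pd

def reduce_frame(df: pd.DataFrame, op: str, ignore_failures: bool = True) -> pd.Series:
    # Apply the reduction column by column; either re-raise or silently drop
    # a column whose dtype does not support the op.
    results = {}
    for col in df.columns:
        try:
            results[col] = getattr(df[col], op)()
        except TypeError:
            if not ignore_failures:
                raise
            # failing column is dropped, mirroring ignore_failures=True
    return pd.Series(results)
```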

jbrockmendel and others added 16 commits August 20, 2020 21:19
* DOC: Updated aggregate docstring

* Doc: updated aggregate docstring

* Update pandas/core/generic.py

Co-authored-by: Marco Gorelli <33491632+MarcoGorelli@users.noreply.github.com>

* Update generic.py

* Update generic.py

* Revert "Update generic.py"

This reverts commit 15ecaf7.

* Revert "Revert "Update generic.py""

This reverts commit cc231c8.

* Updated docstring of agg

* Trailing whitespace removed

* DOC: Updated docstring of agg

* Update generic.py

* Updated Docstring

Co-authored-by: Marco Gorelli <33491632+MarcoGorelli@users.noreply.github.com>
@jreback added the Internals (Related to non-user accessible pandas implementation) and Refactor (Internal refactoring of code) labels Aug 25, 2020
pandas/core/frame.py (outdated review thread, resolved)
pandas/core/frame.py (review thread, resolved)
@@ -1108,10 +1108,10 @@ def test_any_all_bool_only(self):
True,
marks=[td.skip_if_np_lt("1.15")],
),
(np.all, {"A": pd.Series([0, 1], dtype="category")}, False),
(np.any, {"A": pd.Series([0, 1], dtype="category")}, True),
(np.all, {"A": pd.Series([0, 1], dtype="category")}, True),
Contributor

so this is the bug fix? can you add a whatsnew note

Member Author

Not a bugfix per se, this is the behavior that changes if we declare that ser.to_frame().all() should be consistent with ser.all(), xref #36076
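Concretely, the consistency question is about a pair of calls like this (a sketch only; depending on the pandas version either path may raise, coerce the categorical to an ndarray, or silently drop the column):

```python
import pandas as pd

ser = pd.Series([0, 1], dtype="category")

for label, call in [("Series.all", ser.all), ("DataFrame.all", ser.to_frame().all)]:
    # The point of the discussion is that the Series path (Categorical._reduce)
    # and the DataFrame path should agree on what this returns or raises.
    try:
        print(label, call())
    except TypeError as err:
        print(label, "raised:", err)
```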

Contributor

ok i think we need to deprecate this, right? (that was the consensus?)

also i suppose it's ok to just change it. this is not a very large case. cc @jorisvandenbossche

Member Author

updated with whatsnew

Member

You have another PR actually trying to deprecate this right?

Member

Also in #36076, you commented a few days ago that "the consensus seems to be that we should deprecate the current behavior in favor of matching the Series behavior". So this PR is not doing that? Or is this PR just not handling that case?

Member Author

Closed that one. This is a giant PITA and we should just rip the bandaid off.

@jorisvandenbossche (Member) left a comment

Can you try to summarize the issue a bit more?

So with this PR, a categorical column will be skipped for any/all (does this also impact other reductions? Or other dtypes? Are there tests for this?).
This is because with this PR it now takes the route of BlockManager.reduce->Block.reduce where Categorical._reduce is called, which raises an error, and BlockManager.reduce catches this error and skips the column. Is that correct?
In master, on the other hand, for any/all (the boolean reductions) we convert the full dataframe to an array (at least with the default numeric_only=None), and the any/all operation then works on that numpy array.

But so, that also means that the behaviour with this PR depends on the presence of (another) object dtype column or not?

If we want to deprecate this, would a relatively clean option be: pass through the name of the op to Block.reduce, and let CategoricalBlock.reduce have a special case checking for the any/all op (in which case we can raise a warning and perform the op on the ndarray), and otherwise use the normal Block.reduce implementation ?
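A rough sketch of that suggestion (class and method names are invented here, not actual pandas internals):

```python
import warnings
import numpy as np
import pandas as pd

class CategoricalBlockSketch:
    # Hypothetical: Block.reduce receives the name of the op, and the
    # categorical block special-cases any/all with a deprecation warning while
    # every other op follows the normal Block.reduce path.
    def __init__(self, values: pd.Categorical):
        self.values = values

    def reduce(self, func, name: str):
        if name in ("any", "all"):
            warnings.warn(
                "any/all on a categorical column falls back to the underlying "
                "ndarray; this behaviour is deprecated",
                FutureWarning,
            )
            return getattr(np.asarray(self.values), name)()
        # everything else defers to the regular reduction machinery
        return func(self.values)
```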

@@ -244,7 +244,7 @@ Timezones

Numeric
^^^^^^^
-
- Bug in :class:`DataFrame` reductions incorrectly ignoring ``ExtensionArray`` behaviors (:issue:`35881`)
Member

This note is not very helpful for users, I think. Can you list the cases we are aware of that will change? (which will also help reviewing this PR)

Member Author

Mostly single-column EA dtypes where the reduction on the array would raise. Tough to put into a succinct note because of the dropping-failures behavior. Suggestions welcome.

@jbrockmendel
Member Author

> But so, that also means that the behaviour with this PR depends on the presence of (another) object dtype column or not?

I think it means the behavior in master depends on the presence of (another) object dtype column. Which is part of the cluster-frack that is the reason we should rip off the bandaid.
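An illustrative pair of frames for that point (results intentionally not asserted, since they differ between pandas versions and between master and this PR):

```python
import pandas as pd

cat = pd.Series([0, 1], dtype="category")

df_single = pd.DataFrame({"A": cat})
df_mixed = pd.DataFrame({"A": cat, "B": ["x", "y"]})

for df in (df_single, df_mixed):
    # On master the extra object column "B" can push the whole frame onto the
    # values-based (numpy) path, so the categorical column "A" may be treated
    # differently depending on which frame it sits in.
    try:
        print(df.columns.tolist(), df.all(axis=0).to_dict())
    except TypeError as err:
        print(df.columns.tolist(), "raised:", err)
```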

@jbrockmendel
Member Author

Reverted the behavior-changing component so that the bug can be fixed separately

@jreback
Contributor

jreback commented Oct 6, 2020

lgtm. i know you are trying to clean up this whole area, so good to go. @jorisvandenbossche any comments? (I am sure this is going to be simplified over time)

@jbrockmendel
Member Author

yah, the real simplifications don't come until we do the actual bugfixes, but this is a step in the right direction

@jreback merged commit 7d257c6 into pandas-dev:master Oct 10, 2020
@jreback
Contributor

jreback commented Oct 10, 2020

great. glad finally to get this in.

@jorisvandenbossche (Member) left a comment


Doing some profiling on the example from #37081, and reporting a few findings here.

@@ -8595,6 +8595,7 @@ def _reduce(
cols = self.columns[~dtype_is_dt]
self = self[cols]

any_object = self.dtypes.apply(is_object_dtype).any()
Member

I think this is partly the culprit of the slowdown. See also the top post of #33252, which shows that self.dtypes.apply(..) is slower than the method that is used a few lines above for dtype_is_dt
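To illustrate the kind of difference being pointed at (a rough sketch; the actual dtype_is_dt construction in frame.py is not reproduced here):

```python
import pandas as pd
from pandas.api.types import is_object_dtype

df = pd.DataFrame({"a": range(1000), "b": list("xy") * 500})

# Series.apply builds an intermediate boolean Series by calling the function
# element-wise, which carries more overhead ...
any_object_apply = df.dtypes.apply(is_object_dtype).any()

# ... than a plain Python pass over the (usually short) sequence of dtypes.
any_object_loop = any(is_object_dtype(dt) for dt in df.dtypes)

assert any_object_apply == any_object_loop
```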

-res = df._mgr.reduce(blk_func)
-out = df._constructor(res).iloc[0].rename(None)
+res, indexer = df._mgr.reduce(blk_func, ignore_failures=ignore_failures)
+out = df._constructor(res).iloc[0]
Member

Based on my profiling, this getitem also seems to take a significant amount of the total time, although it cannot explain the recent perf degradation (I am comparing my profile on master vs 1.1, where the iloc was not yet present).

Member Author

The getitem being iloc[0]?

Member

Indeed
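For readers following along, a small reconstruction of what that getitem does with the reduction result (shapes only, not the real internals):

```python
import pandas as pd

# Mimic the shape of the reduction result: one value per column, laid out as a
# single-row DataFrame (standing in for `df._constructor(res)`).
res = pd.DataFrame({"A": [3], "B": [7.5]})

# .iloc[0] is the positional getitem being profiled: it turns the one-row
# frame into a Series indexed by the column labels.
out = res.iloc[0]
print(out)
```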
