
REF: ignore_failures in BlockManager.reduce #35881

Merged: 42 commits, Oct 10, 2020

Conversation

jbrockmendel
Member

Moving towards collecting all of the ignore_failures code in one place.

The case where we have object dtypes is kept separate in this PR, will be handled in the next pass.
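For orientation, here is a minimal column-wise sketch of the ignore_failures idea (illustrative only; the PR moves this logic into BlockManager.reduce, which operates on blocks rather than columns, and the function name here is invented):

```python
import pandas as pd

def reduce_frame(df: pd.DataFrame, op: str, ignore_failures: bool = True) -> pd.Series:
    # Apply the reduction column by column; either re-raise or silently drop
    # a column whose dtype does not support the op.
    results = {}
    for col in df.columns:
        try:
            results[col] = getattr(df[col], op)()
        except TypeError:
            if not ignore_failures:
                raise
            # failing column is dropped, mirroring ignore_failures=True
    return pd.Series(results)
```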

jbrockmendel and others added 16 commits August 20, 2020 21:19
* DOC: Updated aggregate docstring

* Doc: updated aggregate docstring

* Update pandas/core/generic.py

Co-authored-by: Marco Gorelli <33491632+MarcoGorelli@users.noreply.github.com>

* Update generic.py

* Update generic.py

* Revert "Update generic.py"

This reverts commit 15ecaf7.

* Revert "Revert "Update generic.py""

This reverts commit cc231c8.

* Updated docstring of agg

* Trailing whitespace removed

* DOC: Updated docstring of agg

* Update generic.py

* Updated Docstring

Co-authored-by: Marco Gorelli <33491632+MarcoGorelli@users.noreply.github.com>
@jreback added the Internals (Related to non-user accessible pandas implementation) and Refactor (Internal refactoring of code) labels Aug 25, 2020
pandas/core/frame.py (outdated review thread, resolved)
pandas/core/frame.py (review thread, resolved)
@@ -1108,10 +1108,10 @@ def test_any_all_bool_only(self):
True,
marks=[td.skip_if_np_lt("1.15")],
),
(np.all, {"A": pd.Series([0, 1], dtype="category")}, False),
(np.any, {"A": pd.Series([0, 1], dtype="category")}, True),
(np.all, {"A": pd.Series([0, 1], dtype="category")}, True),
Contributor

so this is the bug fix? can you add a whatsnew note

Member Author

Not a bugfix per se, this is the behavior that changes if we declare that ser.to_frame().all() should be consistent with ser.all(), xref #36076
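Concretely, the consistency question is about a pair of calls like this (a sketch only; depending on the pandas version either path may raise, coerce the categorical to an ndarray, or silently drop the column):

```python
import pandas as pd

ser = pd.Series([0, 1], dtype="category")

for label, call in [("Series.all", ser.all), ("DataFrame.all", ser.to_frame().all)]:
    # The point of the discussion is that the Series path (Categorical._reduce)
    # and the DataFrame path should agree on what this returns or raises.
    try:
        print(label, call())
    except TypeError as err:
        print(label, "raised:", err)
```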

Contributor

ok i think we need to deprecate this, right? (that was the consensus?)

also i suppose it's ok to just change it. this is not a very large case. cc @jorisvandenbossche

Member Author

updated with whatsnew

Member

You have another PR actually trying to deprecate this right?

Member

Also in #36076, you commented a few days ago that "the consensus seems to be that we should deprecate the current behavior in favor of matching the Series behavior". So this PR is not doing that? Or is this PR just not handling that case?

Member Author

Closed that one. This is a giant PITA and we should just rip the bandaid off.

@jorisvandenbossche (Member) left a comment

Can you try to summarize the issue a bit more?

So with this PR, a categorical column will be skipped for any/all (does this also impact other reductions? Or other dtypes? Are there tests for this?).
This is because with this PR it now takes the route of BlockManager.reduce->Block.reduce where Categorical._reduce is called, which raises an error, and BlockManager.reduce catches this error and skips the column. Is that correct?
In master, on the other hand, for any/all (the boolean reductions) we convert the full dataframe to an array (at least with the default numeric_only=None), and the any/all operation then works on that numpy array.

But so, that also means that the behaviour with this PR depends on the presence of (another) object dtype column or not?

If we want to deprecate this, would a relatively clean option be: pass through the name of the op to Block.reduce, and let CategoricalBlock.reduce have a special case checking for the any/all op (in which case we can raise a warning and perform the op on the ndarray), and otherwise use the normal Block.reduce implementation ?
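A rough sketch of that suggestion (class and method names are invented here, not actual pandas internals):

```python
import warnings
import numpy as np
import pandas as pd

class CategoricalBlockSketch:
    # Hypothetical: Block.reduce receives the name of the op, and the
    # categorical block special-cases any/all with a deprecation warning while
    # every other op follows the normal Block.reduce path.
    def __init__(self, values: pd.Categorical):
        self.values = values

    def reduce(self, func, name: str):
        if name in ("any", "all"):
            warnings.warn(
                "any/all on a categorical column falls back to the underlying "
                "ndarray; this behaviour is deprecated",
                FutureWarning,
            )
            return getattr(np.asarray(self.values), name)()
        # everything else defers to the regular reduction machinery
        return func(self.values)
```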

@@ -244,7 +244,7 @@ Timezones

Numeric
^^^^^^^
-
- Bug in :class:`DataFrame` reductions incorrectly ignoring ``ExtensionArray`` behaviors (:issue:`35881`)
Member

This note is not very helpful for users, I think. Can you list the cases we are aware of that will change? (which will also help reviewing this PR)

Member Author

Mostly single-column EA dtypes where the reduction on the array would raise. Tough to put into a succinct note because of the dropping-failures behavior. Suggestions welcome.

@jbrockmendel
Member Author

> But so, that also means that the behaviour with this PR depends on the presence of (another) object dtype column or not?

I think it means the behavior in master depends on the presence of (another) object dtype column. Which is part of the cluster-frack that is the reason we should rip off the bandaid.
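An illustrative pair of frames for that point (results intentionally not asserted, since they differ between pandas versions and between master and this PR):

```python
import pandas as pd

cat = pd.Series([0, 1], dtype="category")

df_single = pd.DataFrame({"A": cat})
df_mixed = pd.DataFrame({"A": cat, "B": ["x", "y"]})

for df in (df_single, df_mixed):
    # On master the extra object column "B" can push the whole frame onto the
    # values-based (numpy) path, so the categorical column "A" may be treated
    # differently depending on which frame it sits in.
    try:
        print(df.columns.tolist(), df.all(axis=0).to_dict())
    except TypeError as err:
        print(df.columns.tolist(), "raised:", err)
```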

@jbrockmendel
Member Author

Reverted the behavior-changing component so that the bug can be fixed separately

@jreback
Contributor

jreback commented Oct 6, 2020

lgtm. i know you are trying to clean up this whole area, so good to go. @jorisvandenbossche any comments? (I am sure this is going to be simplified over time)

@jbrockmendel
Member Author

yah, the real simplifications don't come until we do the actual bugfixes, but this is a step in the right direction

@jreback merged commit 7d257c6 into pandas-dev:master Oct 10, 2020
@jreback
Contributor

jreback commented Oct 10, 2020

great. glad finally to get this in.

@jorisvandenbossche (Member) left a comment


Doing some profiling on the example from #37081, and reporting a few findings here.

@@ -8595,6 +8595,7 @@ def _reduce(
cols = self.columns[~dtype_is_dt]
self = self[cols]

any_object = self.dtypes.apply(is_object_dtype).any()
Member

I think this is partly the culprit of the slowdown. See also the top post of #33252, which shows that self.dtypes.apply(..) is slower than the method that is used a few lines above for dtype_is_dt
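To illustrate the kind of difference being pointed at (a rough sketch; the actual dtype_is_dt construction in frame.py is not reproduced here):

```python
import pandas as pd
from pandas.api.types import is_object_dtype

df = pd.DataFrame({"a": range(1000), "b": list("xy") * 500})

# Series.apply builds an intermediate boolean Series by calling the function
# element-wise, which carries more overhead ...
any_object_apply = df.dtypes.apply(is_object_dtype).any()

# ... than a plain Python pass over the (usually short) sequence of dtypes.
any_object_loop = any(is_object_dtype(dt) for dt in df.dtypes)

assert any_object_apply == any_object_loop
```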

-res = df._mgr.reduce(blk_func)
-out = df._constructor(res).iloc[0].rename(None)
+res, indexer = df._mgr.reduce(blk_func, ignore_failures=ignore_failures)
+out = df._constructor(res).iloc[0]
Member

Based on my profiling, this getitem also seems to take a significant amount of the total time, although it cannot explain the recent perf degradation (I am comparing my profile on master vs 1.1, where the iloc was not yet present).

Member Author

The getitem being iloc[0]?

Member

Indeed
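For readers following along, a small reconstruction of what that getitem does with the reduction result (shapes only, not the real internals):

```python
import pandas as pd

# Mimic the shape of the reduction result: one value per column, laid out as a
# single-row DataFrame (standing in for `df._constructor(res)`).
res = pd.DataFrame({"A": [3], "B": [7.5]})

# .iloc[0] is the positional getitem being profiled: it turns the one-row
# frame into a Series indexed by the column labels.
out = res.iloc[0]
print(out)
```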
