Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: groupby.agg/transform casts UDF results #40790

Merged
merged 24 commits into from
May 3, 2021

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Apr 5, 2021

This PR stops casting when the func is a callable since we can't tell when casting will be more harmful than helpful. I think this results in a more expected/consistent behavior for users.

@rhshadrach rhshadrach changed the title BUG: groupby.agg/transform downcasts UDF results BUG: groupby.agg/transform casts UDF results Apr 5, 2021
@rhshadrach rhshadrach added Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby labels Apr 5, 2021
doc/source/whatsnew/v1.3.0.rst Outdated Show resolved Hide resolved
…nt_cast_udfs

� Conflicts:
�	doc/source/whatsnew/v1.3.0.rst
@rhshadrach rhshadrach marked this pull request as draft April 10, 2021 14:17
@rhshadrach rhshadrach marked this pull request as ready for review April 24, 2021 12:02
…into dont_cast_udfs

� Conflicts:
�	pandas/_libs/lib.pyx
�	pandas/core/groupby/generic.py
�	pandas/tests/groupby/test_groupby.py
�	pandas/tests/resample/test_datetime_index.py
@rhshadrach
Copy link
Member Author

@jreback - change made and green. TODOs are added in regards to mean/median/var, opened #41137 and will be taken care of by #41139

may also need to update groupby.rst, we might have this documented

I looked through, there is no mention of return dtype for agg/transform/apply.

@jreback
Copy link
Contributor

jreback commented Apr 24, 2021

right i think we should try to start documenting return types for the udfs in the main docs with some examples (and ideally add to the doc strings too)

@rhshadrach
Copy link
Member Author

@jreback - docs added and green.

@jreback jreback added this to the 1.3 milestone Apr 30, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. can you rebase and some very minor doc-comments. ping on green.


.. ipython:: python

In [5]: df.groupby('key').agg(lambda x: x.sum())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need the prompt

2 3 4"""
2 3 4

The resulting dtype will reflect the return value of the aggregating function.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do a versionchanged tag here

See :ref:`groupby.aggregate.named` for more."""
See :ref:`groupby.aggregate.named` for more.

The resulting dtype will reflect the return value of the aggregating function.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versionchanged


>>> g[['B', 'C']].apply(lambda x: x.max() - x.min())
B C
>>> g[['B', 'C']].apply(lambda x: x.astype(float).max() - x.min())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versionchanged

The resulting dtype will reflect the return value of the transformation function,
for example:

>>> grouped[['C', 'D']].transform(lambda x: x.astype(int).max())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versioncahnged

@rhshadrach
Copy link
Member Author

@jreback - changes made and green. I added versionchanged to the Notes section as well as the Examples sections.

@jreback jreback merged commit 7833fdf into pandas-dev:master May 3, 2021
@jreback
Copy link
Contributor

jreback commented May 3, 2021

thanks @rhshadrach

@rhshadrach rhshadrach deleted the dont_cast_udfs branch May 3, 2021 13:59
@@ -1240,9 +1276,6 @@ def _python_agg_general(self, func, *args, **kwargs):
assert result is not None
key = base.OutputKey(label=name, position=idx)

if is_numeric_dtype(obj.dtype):
result = maybe_downcast_numeric(result, obj.dtype)

if self.grouper._filter_empty_groups:
mask = counts.ravel() > 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach can i get your help in this nearby piece of code? in all existing tests, when we get here, we have mask.all(). trying to come up with a case where this doesnt hold (or prove that it must always hold). any thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is now removed - guessing mask.all() always held.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep.

next thing to ask your help with : IIRC you've done a lot of work in core.apply, which DataFrameGroupBy.aggregate uses. id like to make SeriesGroupBy.aggregate and DataFrameGroupBy.aggregate share more code (or at least be more obviously-similar). can i get your thoughts on how to achieve this (and whether its the effort)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we make aggregate always aggregate (PoC implemented in #40275), we can greatly simplify these methods. However, in order to do that we need to separate the apply/agg paths: currently apply uses agg for list/dicts and agg also uses apply for UDFs. I make a couple of attempts to do this but kept running into issues with changing behaviors without having a clear way to deprecate. This was the motivation for #41112. I plan to start working on that, assuming that's a good approach, in 1.3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby
Projects
None yet
3 participants