-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: groupby.agg/transform casts UDF results #40790
Changes from all commits
477d813
d932c93
f2069a7
35c789f
93fa089
1cb216e
0cafcee
785ac9d
737a366
0b00aa7
de0f7b5
e95bb49
4ef6794
4f97288
ad7d990
11529e3
0ca49f6
a0a2640
eb1943a
180bc23
47d97ae
4a0978e
2b38e5c
6b80c10
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -158,14 +158,19 @@ class providing the base-class of operations. | |
side-effects, as they will take effect twice for the first | ||
group. | ||
|
||
.. versionchanged:: 1.3.0 | ||
|
||
The resulting dtype will reflect the return value of the passed ``func``, | ||
see the examples below. | ||
|
||
Examples | ||
-------- | ||
{examples} | ||
""", | ||
"dataframe_examples": """ | ||
>>> df = pd.DataFrame({'A': 'a a b'.split(), | ||
... 'B': [1,2,3], | ||
... 'C': [4,6, 5]}) | ||
... 'C': [4,6,5]}) | ||
>>> g = df.groupby('A') | ||
|
||
Notice that ``g`` has two groups, ``a`` and ``b``. | ||
|
@@ -183,13 +188,17 @@ class providing the base-class of operations. | |
|
||
Example 2: The function passed to `apply` takes a DataFrame as | ||
its argument and returns a Series. `apply` combines the result for | ||
each group together into a new DataFrame: | ||
each group together into a new DataFrame. | ||
|
||
.. versionchanged:: 1.3.0 | ||
|
||
>>> g[['B', 'C']].apply(lambda x: x.max() - x.min()) | ||
B C | ||
The resulting dtype will reflect the return value of the passed ``func``. | ||
|
||
>>> g[['B', 'C']].apply(lambda x: x.astype(float).max() - x.min()) | ||
B C | ||
A | ||
a 1 2 | ||
b 0 0 | ||
a 1.0 2.0 | ||
b 0.0 0.0 | ||
|
||
Example 3: The function passed to `apply` takes a DataFrame as | ||
its argument and returns a scalar. `apply` combines the result for | ||
|
@@ -210,12 +219,16 @@ class providing the base-class of operations. | |
|
||
Example 1: The function passed to `apply` takes a Series as | ||
its argument and returns a Series. `apply` combines the result for | ||
each group together into a new Series: | ||
each group together into a new Series. | ||
|
||
.. versionchanged:: 1.3.0 | ||
|
||
>>> g.apply(lambda x: x*2 if x.name == 'b' else x/2) | ||
The resulting dtype will reflect the return value of the passed ``func``. | ||
|
||
>>> g.apply(lambda x: x*2 if x.name == 'a' else x/2) | ||
a 0.0 | ||
a 0.5 | ||
b 4.0 | ||
a 2.0 | ||
b 1.0 | ||
dtype: float64 | ||
|
||
Example 2: The function passed to `apply` takes a Series as | ||
|
@@ -367,12 +380,17 @@ class providing the base-class of operations. | |
in the subframe. If f also supports application to the entire subframe, | ||
then a fast path is used starting from the second chunk. | ||
* f must not mutate groups. Mutation is not supported and may | ||
produce unexpected results. See :ref:`udf-mutation` for more details. | ||
produce unexpected results. See :ref:`gotchas.udf-mutation` for more details. | ||
|
||
When using ``engine='numba'``, there will be no "fall back" behavior internally. | ||
The group data and group index will be passed as numpy arrays to the JITed | ||
user defined function, and no alternative execution attempts will be tried. | ||
|
||
.. versionchanged:: 1.3.0 | ||
|
||
The resulting dtype will reflect the return value of the passed ``func``, | ||
see the examples below. | ||
|
||
Examples | ||
-------- | ||
|
||
|
@@ -402,6 +420,20 @@ class providing the base-class of operations. | |
3 3 8.0 | ||
4 4 6.0 | ||
5 3 8.0 | ||
|
||
.. versionchanged:: 1.3.0 | ||
|
||
The resulting dtype will reflect the return value of the passed ``func``, | ||
for example: | ||
|
||
>>> grouped[['C', 'D']].transform(lambda x: x.astype(int).max()) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. versioncahnged |
||
C D | ||
0 5 8 | ||
1 5 9 | ||
2 5 8 | ||
3 5 9 | ||
4 5 8 | ||
5 5 9 | ||
""" | ||
|
||
_agg_template = """ | ||
|
@@ -469,12 +501,16 @@ class providing the base-class of operations. | |
When using ``engine='numba'``, there will be no "fall back" behavior internally. | ||
The group data and group index will be passed as numpy arrays to the JITed | ||
user defined function, and no alternative execution attempts will be tried. | ||
{examples} | ||
|
||
Functions that mutate the passed object can produce unexpected | ||
behavior or errors and are not supported. See :ref:`udf-mutation` | ||
behavior or errors and are not supported. See :ref:`gotchas.udf-mutation` | ||
for more details. | ||
""" | ||
|
||
.. versionchanged:: 1.3.0 | ||
|
||
The resulting dtype will reflect the return value of the passed ``func``, | ||
see the examples below. | ||
{examples}""" | ||
|
||
|
||
@final | ||
|
@@ -1240,9 +1276,6 @@ def _python_agg_general(self, func, *args, **kwargs): | |
assert result is not None | ||
key = base.OutputKey(label=name, position=idx) | ||
|
||
if is_numeric_dtype(obj.dtype): | ||
result = maybe_downcast_numeric(result, obj.dtype) | ||
|
||
if self.grouper._filter_empty_groups: | ||
mask = counts.ravel() > 0 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @rhshadrach can i get your help in this nearby piece of code? in all existing tests, when we get here, we have There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks like this is now removed - guessing mask.all() always held. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yep. next thing to ask your help with : IIRC you've done a lot of work in core.apply, which DataFrameGroupBy.aggregate uses. id like to make SeriesGroupBy.aggregate and DataFrameGroupBy.aggregate share more code (or at least be more obviously-similar). can i get your thoughts on how to achieve this (and whether its the effort)? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Once we make aggregate always aggregate (PoC implemented in #40275), we can greatly simplify these methods. However, in order to do that we need to separate the apply/agg paths: currently apply uses agg for list/dicts and agg also uses apply for UDFs. I make a couple of attempts to do this but kept running into issues with changing behaviors without having a clear way to deprecate. This was the motivation for #41112. I plan to start working on that, assuming that's a good approach, in 1.3. |
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
versionchanged