BUG: `df.agg` call passes different things to a custom function depending on whether a unused kwarg is supplied or not #39169

pjireland · 2021-01-14T15:28:58Z

I have checked that this issue has not already been reported
- The issue could potentially be similar to that reported in BUG: Fails and or weird aggregation results when using agg with custom functions #33517, but I'm not sure.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd
import numpy as np
import scipy.stats


def circ_mean(data, dummy_kwarg=0):
    # print(data)
    return 180/np.pi*scipy.stats.circmean(data*np.pi/180)


def numpy_mean(data, dummy_kwarg=0):
    return np.mean(data)


@pd.api.extensions.register_dataframe_accessor("my")
class CircstatsAccessor(object):
    def __init__(self, pandas_obj):
        self._obj = pandas_obj
        
    def circ_mean(self, axis=0, level=None, **kwargs):
        df = self._obj
        if axis != 0 or level is not None:
            df = df.groupby(axis=axis, level=level)
        return df.agg(circ_mean, **kwargs)
    
    def numpy_mean(self, axis=0, level=None, **kwargs):
        df = self._obj
        if axis != 0 or level is not None:
            df = df.groupby(axis=axis, level=level)
        return df.agg(numpy_mean, **kwargs)


df = pd.DataFrame(
    data={
        "col1": [10, 11, 12, 13],
        "col2": [20, 21, 22, 23],
    },
    index=[1, 2, 3, 4]
)

# Compute results with the standard `df.mean` call
# I'd like my custom mean function to do a similar thing
df.mean(level=0, axis=0)

# If I don't pass in any kwargs, `df.my.circ_mean` behaves as expected
# Results approximately match those from `df.mean`
df.my.circ_mean(level=0, axis=0)

# If I pass in a kwarg that is not ever used, `df.my.circ_mean`
# returns unusual results - the returned values in `col1` are 
# identical to those in `col2`, whereas they were
# different before
df.my.circ_mean(level=0, axis=0, dummy_kwarg=0)

# If I call `df.my.numpy_mean`, results are identical
# with or without providing the kwarg
df.my.numpy_mean(level=0, axis=0)
df.my.numpy_mean(level=0, axis=0, dummy_kwarg=0)

Problem description

As discussed in the code comments above, my circ_mean function behaves differently depending on whether a dummy (unused) keyword argument is specified. Uncommenting the print call in circ_mean shows that df.agg passes different objects to the function depending on whether or not this keyword is provided.

I would expect no difference in behavior, since this keyword has no effect. Interestingly, the results are unaffected by the keyword, as expected, if I replace the more complicated circular mean call with a simple np.mean call inside my custom function (compare the circ_mean and numpy_mean functions).
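One way to make the difference visible without reading printed output is to record the type of each object the function receives. The following is a minimal sketch of that idea; `make_probe`, `seen_without`, and `seen_with` are hypothetical names introduced here for illustration, and which types show up may depend on the pandas version in use.

```python
import numpy as np
import pandas as pd


def make_probe(seen):
    """Return an aggregation function that logs the type name of each input."""
    def probe(data, dummy_kwarg=0):
        # Record what agg handed us (e.g. "Series" or "DataFrame").
        seen.append(type(data).__name__)
        # np.sum twice so the result is a scalar for either input type.
        return np.sum(np.sum(data))
    return probe


df = pd.DataFrame(
    data={
        "col1": [10, 11, 12, 13],
        "col2": [20, 21, 22, 23],
    },
    index=[1, 2, 3, 4],
)

seen_without, seen_with = [], []
df.groupby(level=0).agg(make_probe(seen_without))
df.groupby(level=0).agg(make_probe(seen_with), dummy_kwarg=0)

# Compare the kinds of objects passed in each case.
print(sorted(set(seen_without)))
print(sorted(set(seen_with)))
```

If the two printed lists differ, the function is being called on different kinds of objects in the two cases, which is the behavior described above.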

Expected Output

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.7.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.None

pandas : 1.2.0
numpy : 1.19.2
pytz : 2020.5
dateutil : 2.8.1
pip : 20.3.3
setuptools : 51.0.0.post20201207
Cython : 0.29.21
pytest : 6.2.1
hypothesis : None
sphinx : 3.4.1
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : 1.1
pymysql : 0.10.1
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.3
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.2
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 0.15.1
pyxlsb : None
s3fs : 0.4.2
scipy : 1.5.2
sqlalchemy : 1.3.20
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.2
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.51.2


rhshadrach · 2021-01-15T03:26:28Z

Thanks for the report, can you simplify the example to only include necessary details?

pjireland · 2021-01-15T16:20:34Z

> Thanks for the report, can you simplify the example to only include necessary details?

No problem. The example was the simplest one I was able to find that still showed the behavior described, but feel free to let me know if anything is confusing about it, or you see anything that seems like it could be simplified.

rhshadrach · 2021-01-16T15:27:50Z

It appears to me the entire class is unnecessary. You can just call circ_mean directly to demonstrate the issue.

simonjayhawkins · 2021-01-17T13:12:49Z

> Can just call circ_mean directly to demonstrate the issue.

>>> import numpy as np
>>> import pandas as pd
>>>
>>> def func(data, **kwargs):
...     return np.sum(np.sum(data)) # np.sum twice to ensure scalar result
...
>>> df = pd.DataFrame([[1,2], [3,4]])
>>> print(df.groupby(level=0).agg(func))
   0  1
0  1  2
1  3  4
>>> print(df.groupby(level=0).agg(func, foo=42))
   0  1
0  3  3
1  7  7
>>>

rhshadrach · 2021-01-17T20:02:44Z

Thanks @simonjayhawkins. In pandas.core.groupby.generic.aggregate, we take two different paths depending on whether args/kwargs are used. The path through _aggregate_frame computes the result on each group as a whole, then uses _wrap_frame_output, which in turn uses the original object's columns, so every column receives the same per-group value. The path through agg_list_like, on the other hand, aggregates column by column.

Further investigations and PRs to fix are welcome.
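To illustrate the two paths described above, here is a toy model in plain Python. It is not pandas' implementation: a "frame" is just a dict mapping column name to a list of values, and `agg_by_column` and `agg_by_frame` are hypothetical helpers standing in for the column-by-column and whole-group strategies. It reproduces the numbers from the minimal example.

```python
# Toy model of the two aggregation strategies; this is NOT pandas' code.
# A "frame" is a dict mapping column name -> list of values.


def func(data, **kwargs):
    """Sum everything we are handed, whether one column or a whole group."""
    if isinstance(data, dict):  # whole group (frame-like)
        return sum(sum(col) for col in data.values())
    return sum(data)  # single column (series-like)


def agg_by_column(groups, f, **kwargs):
    """Call f once per (group, column); each column gets its own result."""
    return {
        key: {name: f(col, **kwargs) for name, col in frame.items()}
        for key, frame in groups.items()
    }


def agg_by_frame(groups, f, **kwargs):
    """Call f once per group on the whole frame, then reuse that single
    value for every original column."""
    out = {}
    for key, frame in groups.items():
        value = f(frame, **kwargs)
        out[key] = {name: value for name in frame}
    return out


# Same data as pd.DataFrame([[1, 2], [3, 4]]) grouped by its index.
groups = {0: {0: [1], 1: [2]}, 1: {0: [3], 1: [4]}}

by_column = agg_by_column(groups, func)
by_frame = agg_by_frame(groups, func, foo=42)

print(by_column)  # {0: {0: 1, 1: 2}, 1: {0: 3, 1: 4}}
print(by_frame)   # {0: {0: 3, 1: 3}, 1: {0: 7, 1: 7}}
```

In this sketch the unused keyword does not change what `func` computes; the different output comes entirely from which strategy is chosen and therefore what object `func` is called on.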

pjireland added the Bug and Needs Triage labels Jan 14, 2021

rhshadrach added the Apply and Groupby labels and removed the Needs Triage label Jan 17, 2021

rhshadrach added this to the Contributions Welcome milestone Jan 17, 2021

rhshadrach mentioned this issue Jan 30, 2021: BUG: GroupBy Aggregation Behavior #39489 (Closed)

rhshadrach mentioned this issue Mar 6, 2021: POC: aggregate always aggregates #40275 (Closed)

rhshadrach mentioned this issue May 26, 2022: BUG: DataFrame.groupby.agg has inconsistent behaviour depending on DataFrame.groupby by's Iterable length and use of DataFrame.groupby.agg's *args/**kwargs #47092 (Closed)

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

rhshadrach mentioned this issue Jul 15, 2023: pandas groupby sum min_count misbehaves #23889 (Open)

rhshadrach linked a pull request Mar 2, 2024 that will close this issue: BUG: groupby.agg should always agg #57706 (Open)

rhshadrach mentioned this issue Mar 23, 2024: agg behaviour depends on the number of arguments of the function #33242 (Closed)

rhshadrach mentioned this issue Apr 5, 2024: BUG: incorrect aggregation of dataframe when using UDF with kwarg #58146 (Closed)

mroeschke mentioned this issue Apr 9, 2024: BUG: fix aggregation when using udf with kwarg #58170 (Closed)

rhshadrach mentioned this issue Apr 11, 2024: ENH: Should we support aggregating by-frame in DataFrameGroupBy.agg #58225 (Open)

nicholas-ys-tan mentioned this issue Apr 17, 2024: API: add numeric_only support to groupby agg #58132 (Closed)
