
BUG: incorrect aggregation of dataframe when using UDF with kwarg #58146

Status: Closed
Labels: Apply (Apply, Aggregate, Transform, Map), Bug, Groupby

nicholas-ys-tan opened this issue Apr 4, 2024 · 3 comments

nicholas-ys-tan commented Apr 4, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "B": [10, 20, 30, 40, 50],
        "groupby1": ["diamond", "diamond", "spade", "spade", "spade"]
    }
)

def user_defined_aggfunc(df, addition=0):
    return np.mean(df) + addition


out_df = df.groupby(by="groupby1").agg(func=user_defined_aggfunc, addition=5)
out_df



              A      B
groupby1
diamond   13.25  13.25
spade     27.00  27.00

Issue Description

I am currently working on a PR to add numeric_only support to DataFrameGroupBy.agg(), related to work on GeoPandas that I am trying to resolve (#56946, #58132).

I noted there is an edge case that is currently not working as intended. This edge case occurs under the following conditions:

  1. A single callable function is submitted as func
  2. A kwarg is included
  3. There is more than one column in the dataframe.

The first two conditions are what lead it to use DataFrameGroupBy._aggregate_frame.

The third condition causes the bug in DataFrameGroupBy._aggregate_frame because the aggfunc would be applied to the entire dataframe, as opposed to column by column.
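The effect can be demonstrated with plain numpy. The following sketch (my own illustration, using the "diamond" group's values from the reproducer) contrasts what _aggregate_frame effectively does with what the user expects:

```python
import numpy as np

# Values of the "diamond" group: columns A and B side by side.
diamond = np.array([[1.0, 10.0],
                    [2.0, 20.0]])

# What happens in the buggy path: one mean over the entire block,
# so both output columns get the same value.
whole_frame = np.mean(diamond) + 5       # (1 + 2 + 10 + 20) / 4 + 5 = 13.25

# What the user expects: one mean per column.
per_column = np.mean(diamond, axis=0) + 5  # [6.5, 20.0]
```

This reproduces exactly the 13.25 seen in both columns of the diamond row above.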

This has actually been touched upon in an existing test:

def test_pass_args_kwargs_duplicate_columns(tsframe, as_index):
    # go through _aggregate_frame with self.axis == 0 and duplicate columns
    tsframe.columns = ["A", "B", "A", "C"]
    gb = tsframe.groupby(lambda x: x.month, as_index=as_index)

    res = gb.agg(np.percentile, 80, axis=0)

    ex_data = {
        1: tsframe[tsframe.index.month == 1].quantile(0.8),
        2: tsframe[tsframe.index.month == 2].quantile(0.8),
    }
    expected = DataFrame(ex_data).T
    if not as_index:
        expected.insert(0, "index", [1, 2])
        expected.index = Index(range(2))

    tm.assert_frame_equal(res, expected)

Here the kwarg axis=0 was fed into agg(). Three comments I have on this:

  • it isn't necessarily obvious from the documentation that this argument is required.
  • omitting the argument raises neither an error nor a warning; it silently takes the 80th percentile across all columns of the group combined.
  • for a UDF, requiring such an argument in all cases would not be practical.
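The role of the axis kwarg in that test can be seen directly with numpy — without it, np.percentile flattens a 2-D input into a single value (the array below is illustrative, not the test's actual data):

```python
import numpy as np

arr = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])

# Without axis, the six values are flattened to one scalar.
flattened = np.percentile(arr, 80)          # -> 20.0

# With axis=0, the percentile is computed per column.
per_col = np.percentile(arr, 80, axis=0)    # -> [2.6, 26.0]
```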

The root issue is how the aggfunc is applied in DataFrameGroupBy._aggregate_frame() when there is more than one column. Without the axis=0 kwarg, it applies the function to all columns at once, producing a single value per group.
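As an aside, a workaround that appears to avoid the buggy path (my own suggestion, not something proposed in this thread) is to bind the kwarg before handing the callable to agg(), so agg() receives a bare callable with no extra kwargs and takes the column-by-column path:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "B": [10, 20, 30, 40, 50],
        "groupby1": ["diamond", "diamond", "spade", "spade", "spade"],
    }
)

def user_defined_aggfunc(s, addition=0):
    return np.mean(s) + addition

# Binding addition=5 inside a lambda means agg() sees no kwargs,
# so the UDF is applied per column as intended.
out = df.groupby("groupby1").agg(lambda s: user_defined_aggfunc(s, addition=5))
```

With this wrapping, the result matches the expected behavior shown below (diamond: 6.5 / 20.0, spade: 9.0 / 45.0).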

Below is a naive example of something that might work for _aggregate_frame:

    def _aggregate_frame(self, func, *args, **kwargs) -> DataFrame:
        if self._grouper.nkeys != 1:
            raise AssertionError("Number of keys must be 1")

        obj = self._obj_with_exclusions
        result_index = self._grouper.result_index
        # Build the result transposed (one row per input column) so that
        # duplicate column labels can be written without colliding in a dict.
        out = self.obj._constructor([], index=obj.columns, columns=result_index)
        for name, grp_df in self._grouper.get_iterator(obj):
            aggregated_row_data = []
            # Apply the UDF column by column, using positional (iloc)
            # indexing so duplicate column names are handled correctly.
            for counter in range(len(grp_df.columns)):
                fres = func(grp_df.iloc[:, counter], *args, **kwargs)
                aggregated_row_data.append(fres)
            out[name] = aggregated_row_data

        return out.T

There are a number of things that need to be accounted for:

  • We can have duplicate column names, so we cannot write results out to a dict, as dictionaries don't allow duplicate keys. Hence the result object is created at the start and written to inside the for loop.
  • Selecting dataframe[column_name] where there are duplicate names returns all matching columns, and the function would be applied to all of them at once. Hence iloc is used.
  • To ensure the function operates on only one column at a time, the code goes through a for loop.
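The duplicate-column point can be checked quickly (with a hypothetical three-column frame of my own, not from the issue's data):

```python
import pandas as pd

# Two columns deliberately share the label "A".
df = pd.DataFrame([[1, 2, 3]], columns=["A", "B", "A"])

# Label-based selection returns *every* column named "A" -> a DataFrame.
by_label = df["A"]

# Positional selection returns exactly one column -> a Series.
by_position = df.iloc[:, 0]
```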

This is just a naive approach to the issue, but I would like to confirm that this is considered an issue before giving serious thought to a more appropriate fix.

Expected Behavior

            A     B
groupby1           
diamond   6.5  20.0
spade     9.0  45.0

Installed Versions

INSTALLED VERSIONS

commit : 6f39c4f
python : 3.10.14.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-26-generic
Version : #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 3.0.0.dev0+666.g6f39c4fa1f.dirty
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
setuptools : 69.2.0
pip : 24.0
Cython : 3.0.10
pytest : 8.1.1
hypothesis : 6.100.0
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : 5.1.0
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 8.22.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.8
fastparquet : 2024.2.0
fsspec : 2024.3.1
gcsfs : 2024.3.1
matplotlib : 3.8.3
numba : 0.59.1
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pyarrow : 15.0.2
pyreadstat : 1.2.7
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.3.1
scipy : 1.12.0
sqlalchemy : 2.0.29
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.3.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

@nicholas-ys-tan nicholas-ys-tan added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 4, 2024
rhshadrach (Member) commented
Thanks for the report - I believe this is a duplicate of #39169. Would you agree?

I am optimistic that we will be able to get #57706 in for 3.0.

@rhshadrach rhshadrach added Groupby Apply Apply, Aggregate, Transform, Map and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 5, 2024
nicholas-ys-tan (Author) commented

> Thanks for the report - I believe this is a duplicate of #39169. Would you agree?
>
> I am optimistic that we will be able to get #57706 in for 3.0.

Yes, apologies, I hadn't picked up on the previous issue when I searched past issues. I must not have looked closely enough.

rhshadrach (Member) commented
Not a problem at all - with 3500+ issues, it can be quite difficult.

Also - the PR I linked to currently makes the changes to agg as breaking changes. I'm going to be looking at reworking it to instead introduce deprecations - so this bug will likely still exist in 3.0. All this is to say that you will need to work around this issue if we want to get #58132 into 3.0, and you shouldn't be waiting on #57706.
