
BUG: incorrect aggregation of dataframe when using UDF with kwarg #58146

Status: Closed
Labels: Apply (Apply, Aggregate, Transform, Map), Bug, Groupby

nicholas-ys-tan opened this issue Apr 4, 2024 · 3 comments

nicholas-ys-tan commented Apr 4, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "B": [10, 20, 30, 40, 50],
        "groupby1": ["diamond", "diamond", "spade", "spade", "spade"]
    }
)

def user_defined_aggfunc(df, addition=0):
    return np.mean(df) + addition


out_df = df.groupby(by="groupby1").agg(func=user_defined_aggfunc, addition=5)
out_df



              A      B
groupby1
diamond   13.25  13.25
spade     27.00  27.00

Issue Description

I am currently working on a PR to add numeric_only support to DataFrameGroupBy.agg(), related to work on GeoPandas that I am trying to resolve (#56946, #58132).

I noted there is an edge case that is currently not working as intended. This edge case occurs under the following conditions:

  1. A single callable function is submitted as func
  2. A kwarg is included
  3. There is more than one column in the dataframe.

The first two conditions are what lead it to use DataFrameGroupBy._aggregate_frame.

The third condition causes the bug in DataFrameGroupBy._aggregate_frame because the aggfunc would be applied to the entire dataframe, as opposed to column by column.
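The effect can be demonstrated with plain numpy. The following sketch (my own illustration, using the "diamond" group's values from the reproducer) contrasts what _aggregate_frame effectively does with what the user expects:

```python
import numpy as np

# Values of the "diamond" group: columns A and B side by side.
diamond = np.array([[1.0, 10.0],
                    [2.0, 20.0]])

# What happens in the buggy path: one mean over the entire block,
# so both output columns get the same value.
whole_frame = np.mean(diamond) + 5       # (1 + 2 + 10 + 20) / 4 + 5 = 13.25

# What the user expects: one mean per column.
per_column = np.mean(diamond, axis=0) + 5  # [6.5, 20.0]
```

This reproduces exactly the 13.25 seen in both columns of the diamond row above.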

This has actually been touched upon in an existing test:

def test_pass_args_kwargs_duplicate_columns(tsframe, as_index):
    # go through _aggregate_frame with self.axis == 0 and duplicate columns
    tsframe.columns = ["A", "B", "A", "C"]
    gb = tsframe.groupby(lambda x: x.month, as_index=as_index)

    res = gb.agg(np.percentile, 80, axis=0)

    ex_data = {
        1: tsframe[tsframe.index.month == 1].quantile(0.8),
        2: tsframe[tsframe.index.month == 2].quantile(0.8),
    }
    expected = DataFrame(ex_data).T
    if not as_index:
        expected.insert(0, "index", [1, 2])
        expected.index = Index(range(2))

    tm.assert_frame_equal(res, expected)

Here the kwarg axis=0 was fed into agg(). Three comments I have on this:

  • it isn't necessarily obvious from the documentation that this argument is required.
  • omitting the argument raises neither an error nor a warning; it silently takes the 80th percentile across all columns of the group combined.
  • for a UDF, requiring such an argument in all cases would not be practical.
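The role of the axis kwarg in that test can be seen directly with numpy — without it, np.percentile flattens a 2-D input into a single value (the array below is illustrative, not the test's actual data):

```python
import numpy as np

arr = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])

# Without axis, the six values are flattened to one scalar.
flattened = np.percentile(arr, 80)          # -> 20.0

# With axis=0, the percentile is computed per column.
per_col = np.percentile(arr, 80, axis=0)    # -> [2.6, 26.0]
```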

The root issue is how the aggfunc is applied in DataFrameGroupBy._aggregate_frame() when there is more than one column. Without the axis=0 kwarg, it applies the function to all columns at once, producing a single value per group.
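As an aside, a workaround that appears to avoid the buggy path (my own suggestion, not something proposed in this thread) is to bind the kwarg before handing the callable to agg(), so agg() receives a bare callable with no extra kwargs and takes the column-by-column path:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "B": [10, 20, 30, 40, 50],
        "groupby1": ["diamond", "diamond", "spade", "spade", "spade"],
    }
)

def user_defined_aggfunc(s, addition=0):
    return np.mean(s) + addition

# Binding addition=5 inside a lambda means agg() sees no kwargs,
# so the UDF is applied per column as intended.
out = df.groupby("groupby1").agg(lambda s: user_defined_aggfunc(s, addition=5))
```

With this wrapping, the result matches the expected behavior shown below (diamond: 6.5 / 20.0, spade: 9.0 / 45.0).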

Below is a naive example of something that might work for _aggregate_frame:

    def _aggregate_frame(self, func, *args, **kwargs) -> DataFrame:
        if self._grouper.nkeys != 1:
            raise AssertionError("Number of keys must be 1")

        obj = self._obj_with_exclusions
        result_index = self._grouper.result_index
        # Build the result transposed (one row per input column) so that
        # duplicate column labels can be written without colliding in a dict.
        out = self.obj._constructor([], index=obj.columns, columns=result_index)
        for name, grp_df in self._grouper.get_iterator(obj):
            aggregated_row_data = []
            # Apply the UDF column by column, using positional (iloc)
            # indexing so duplicate column names are handled correctly.
            for counter in range(len(grp_df.columns)):
                fres = func(grp_df.iloc[:, counter], *args, **kwargs)
                aggregated_row_data.append(fres)
            out[name] = aggregated_row_data

        return out.T

There are a number of things that need to be accounted for:

  • We can have duplicate column names, so we cannot write results out to a dict, as dictionaries don't allow duplicate keys. Hence the result object is created at the start and written to inside the for loop.
  • Selecting dataframe[column_name] where there are duplicate names returns all matching columns, and the function would be applied to all of them at once. Hence iloc is used.
  • To ensure the function operates on only one column at a time, the code goes through a for loop.
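The duplicate-column point can be checked quickly (with a hypothetical three-column frame of my own, not from the issue's data):

```python
import pandas as pd

# Two columns deliberately share the label "A".
df = pd.DataFrame([[1, 2, 3]], columns=["A", "B", "A"])

# Label-based selection returns *every* column named "A" -> a DataFrame.
by_label = df["A"]

# Positional selection returns exactly one column -> a Series.
by_position = df.iloc[:, 0]
```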

This is just a naive approach to the issue, but I would like to confirm that this is considered an issue before giving serious thought to a more appropriate fix.

Expected Behavior

            A     B
groupby1           
diamond   6.5  20.0
spade     9.0  45.0

Installed Versions

INSTALLED VERSIONS

commit : 6f39c4f
python : 3.10.14.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-26-generic
Version : #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 3.0.0.dev0+666.g6f39c4fa1f.dirty
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
setuptools : 69.2.0
pip : 24.0
Cython : 3.0.10
pytest : 8.1.1
hypothesis : 6.100.0
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : 5.1.0
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 8.22.2
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.8
fastparquet : 2024.2.0
fsspec : 2024.3.1
gcsfs : 2024.3.1
matplotlib : 3.8.3
numba : 0.59.1
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pyarrow : 15.0.2
pyreadstat : 1.2.7
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.3.1
scipy : 1.12.0
sqlalchemy : 2.0.29
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.3.0
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

@nicholas-ys-tan nicholas-ys-tan added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 4, 2024
rhshadrach (Member) commented
Thanks for the report - I believe this is a duplicate of #39169. Would you agree?

I am optimistic that we will be able to get #57706 in for 3.0.

@rhshadrach rhshadrach added Groupby Apply Apply, Aggregate, Transform, Map and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 5, 2024
nicholas-ys-tan (Author) commented

> Thanks for the report - I believe this is a duplicate of #39169. Would you agree?
>
> I am optimistic that we will be able to get #57706 in for 3.0.

Yes, apologies, I hadn't picked up on the previous issue when I searched past issues. I must not have looked closely enough.

rhshadrach (Member) commented
Not a problem at all - with 3500+ issues, it can be quite difficult.

Also - the PR I linked to currently makes the changes to agg as breaking changes. I'm going to be looking at reworking it to instead introduce deprecations - so this bug will likely still exist in 3.0. All this is to say that you will need to work around this issue if we want to get #58132 into 3.0, and you shouldn't be waiting on #57706.
