Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: inconsistent DataFrame.agg behavoir when passing as kwargs numeric_only=True #49352

Closed
2 of 3 tasks
arnaudlegout opened this issue Oct 27, 2022 · 16 comments
Closed
2 of 3 tasks
Assignees
Labels
Bug Closing Candidate May be closeable, needs more eyeballs Regression Functionality that used to work in a prior pandas version Warnings Warnings that appear or should be added to pandas
Milestone

Comments

@arnaudlegout
Copy link
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

Python 3.11.0 | packaged by conda-forge | (main, Oct 25 2022, 06:12:32) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.__version__
'1.5.1'
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': list('abc')})
>>> df.agg('mean', numeric_only=True)
a    2.0
dtype: float64
>>> df.agg(['mean', 'std'], numeric_only=True)
<stdin>:1: FutureWarning: ['b'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
        a
mean  2.0
std   1.0
>>> df.agg(pd.DataFrame.mean, numeric_only=True)
<stdin>:1: FutureWarning: Calling Series.mean with numeric_only=True and dtype object will raise a TypeError in the future
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\frame.py", line 9329, in aggregate
    result = op.agg()
             ^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\apply.py", line 773, in agg
    result = self.obj.apply(self.orig_f, axis, args=self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\frame.py", line 9555, in apply
    return op.apply().__finalize__(self, method="apply")
           ^^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\apply.py", line 746, in apply
    return self.apply_standard()
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\apply.py", line 873, in apply_standard
    results, res_index = self.apply_series_generator()
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\apply.py", line 889, in apply_series_generator
    results[i] = self.f(v)
                 ^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\apply.py", line 139, in f
    return func(x, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\generic.py", line 11847, in mean
    return NDFrame.mean(self, axis, skipna, level, numeric_only, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\generic.py", line 11401, in mean
    return self._stat_function(
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\generic.py", line 11353, in _stat_function
    return self._reduce(
           ^^^^^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\series.py", line 4812, in _reduce
    raise NotImplementedError(
NotImplementedError: Series.mean does not implement numeric_only.
>>>

Issue Description

I expect DataFrame.agg to accept kwargs arguments passed to the aggregation functions.
It works in a seemingly inconsistent way.

df.agg('mean', numeric_only=True) works as expected.

But passing multiple functions as string in a list does not work as expected as we still get a FutureWarning showing that the argument numeric_only=True was not correctly passed to either mean, or std, or both

df.agg(['mean', 'std'], numeric_only=True)
>>>
<stdin>:1: FutureWarning: ['b'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
        a
mean  2.0
std   1.0 

Even more surprising, using the function pd.DataFrame.mean raises a NotImplementedError for Series.mean

>>> df.agg(pd.DataFrame.mean, numeric_only=True)
<stdin>:1: FutureWarning: Calling Series.mean with numeric_only=True and dtype object will raise a TypeError in the future
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\frame.py", line 9329, in aggregate
    result = op.agg()
             ^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\apply.py", line 773, in agg
    result = self.obj.apply(self.orig_f, axis, args=self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\frame.py", line 9555, in apply
    return op.apply().__finalize__(self, method="apply")
           ^^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\apply.py", line 746, in apply
    return self.apply_standard()
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\apply.py", line 873, in apply_standard
    results, res_index = self.apply_series_generator()
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\apply.py", line 889, in apply_series_generator
    results[i] = self.f(v)
                 ^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\apply.py", line 139, in f
    return func(x, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\generic.py", line 11847, in mean
    return NDFrame.mean(self, axis, skipna, level, numeric_only, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\generic.py", line 11401, in mean
    return self._stat_function(
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\generic.py", line 11353, in _stat_function
    return self._reduce(
           ^^^^^^^^^^^^^
  File "C:\Users\alegout\Miniconda3\envs\pandas1.5\Lib\site-packages\pandas\core\series.py", line 4812, in _reduce
    raise NotImplementedError(
NotImplementedError: Series.mean does not implement numeric_only.

Expected Behavior

Using kwargs arguments with DataFrame.agg must be correcly passed to all aggregation function, whatever the way they are given (string name or method reference)

Installed Versions

pd.show_versions()
INSTALLED VERSIONS


commit : 91111fd
python : 3.11.0.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : fr_FR.cp1252

pandas : 1.5.1
numpy : 1.23.4
pytz : 2022.5
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

@arnaudlegout arnaudlegout added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 27, 2022
@thlautenschlaeger
Copy link

take

@rhshadrach rhshadrach added Regression Functionality that used to work in a prior pandas version Warnings Warnings that appear or should be added to pandas and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 3, 2022
@rhshadrach rhshadrach added this to the 1.5.2 milestone Nov 3, 2022
@rhshadrach
Copy link
Member

rhshadrach commented Nov 3, 2022

Marking as a regression since users shouldn't be seeing a FutureWarning in this case.

Thanks for the report! The way df.agg currently works is it splits the DataFrame up into a collection of Series and applies the function(s) provided to each Series. I believe this is what gives rise to your errors here.

We should be applying numeric_only=True in df.agg prior to splitting the DataFrame into Series. We then also need to not pass numeric_only=True to the functions when they are called on the Series. This should then also fix the issue with passing DataFrame.mean.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Nov 4, 2022

@rhshadrach I'm not sure that we intend to support DataFrame.mean as the function passed to DataFrame.agg() .

We support functions that take a Series or DataFrame as arguments. DataFrame.mean() does not take either. It computes the mean of the elements within the enclosing DataFrame.

The documentation for DataFrame.agg() says that the function argument should be:

Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.

Accepted combinations are:

  • function
  • string function name
  • list of functions and/or function names, e.g. [np.sum, 'mean']
  • dict of axis labels -> functions, function names or list of such.

Not sure we could detect that DataFrame.mean is inappropriate at runtime (we can do that in the typing stubs), but I'm also not sure we should be fixing this particular behavior with respect to the kwargs, if we agree that the argument here is incorrect.

@thlautenschlaeger
Copy link

@Dr-Irv this issue is about the warning that gets thrown after passing a list of functions.
Passing DataFrame.mean was just a help to reproduce the actual error message that occurs when passing a list of functions e.g. ["std", "mean"] together with the keyword use_numeric=True.

Correct me if I'm wrong with something but I think handling that keyword seems to be appropriate behavior and should be fixed.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Nov 4, 2022

Passing DataFrame.mean was just a help to reproduce the actual error message that occurs when passing a list of functions e.g. ["std", "mean"] together with the keyword use_numeric=True.

Correct me if I'm wrong with something but I think handling that keyword seems to be appropriate behavior and should be fixed.

So the thing to work on is that df.agg(['mean', 'std'], numeric_only=True) is raising a warning, and it shouldn't. The DataFrame.mean example is a red herring.

@arnaudlegout
Copy link
Contributor Author

From a user perspective, the doc of agg is confusing and the behavior of agg accepting function names is even more confusing.

The current doc for DataFrame.agg does not say that the passed function can accept Series (but only DataFrame). So I would not expect Series.mean to work from the doc. Also, passing np.mean is expected to return a scalar (which is now the planned future behavior). It must be clear from the doc, which object will be passed to the function.

If agg passes one Series per column to the function, it must be clearly stated in the doc (but I suspect that the current implementation is more tricky).

Concerning using names for functions, this is even more confusing because we have no idea which function is actually called, so no idea which parameter we can pass to this function. I never found a satisfactory documentation explaining which heuristic agg is using to find the function from a name.

This is a bit orthogonal to the current bug report, but the issue I reported has been discovered by making tests to understand how arguments are passed to functions, without much success.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Nov 4, 2022

From a user perspective, the doc of agg is confusing and the behavior of agg accepting function names is even more confusing.

I agree with you here. Would you mind opening up a separate issue regarding this documentation/behavior of agg ?

@arnaudlegout
Copy link
Contributor Author

If I got it correctly, from core.appy._try_aggregate_string_function it seems function names are lookup along the inheritance tree using the MRO. So for DataFrame.agg('mean') the matching function will be DataFrame.mean. My understanding is that if the behavior for ["std", "mean"] is fixed, it should be also fixed if we pass directly [DataFrame.std, DataFrame.mean]. The documentation should also be fixed to explain that:

The passed function must accept either a Series (and return a scalar) or a DataFrame (and returns a Series). If function names are passed, the called function will be the first function found according to the MRO from a DataFrame object.

@Dr-Irv yes, I will open an issue for this documentation clarification.

@arnaudlegout
Copy link
Contributor Author

Just opened a documentation issue for the discussed subject here #49528

@rhshadrach
Copy link
Member

Thanks @Dr-Irv - I agree users shouldn't be relying DataFrame.mean taking a Series and producing a scalar, but it does, and I don't think we should add any special logic - neither to forbid nor support it - to the aggregation code.

Regarding df.agg(['mean'], numeric_only=True), I think it's reasonable for a user to expect this to be the same as df.mean(numeric_only=True) and should be supported. However, the current implementation of agg makes this difficult. In particular, df.agg('mean') calls DataFrame.mean whereas df.agg(['mean']) splits up the DataFrame into Series and calls Series.mean. This is why these two results are transposed:

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(df.agg('mean'))
# a    2.0
# b    5.0
# dtype: float64
print(df.agg(['mean']))
#         a    b
# mean  2.0  5.0

I've been of the opinion that this transpose should not happen - rather that df.agg(['mean']) should return the same result as df.agg('mean') except as a DataFrame instead of a Series. Along with resolving this issue, it would also make df.agg(['mean']) more performant. But this is a pretty big change that would likely break a lot of user code (which is why I previously tried to pursue doing it behind an option in #45557)

If we keep the same results as they are today, then I only see two options (but happy to consider others):

  • Intercept numeric_only arguments and apply them to the DataFrame prior to splitting the data up into Series
  • Add numeric_only as an argument to DataFrame.agg itself.

@arnaudlegout
Copy link
Contributor Author

@rhshadrach what is annoying with your two options (if I get them correctly) is that you will create a specific handling of numeric_only in agg instead of passing the arg to the called function. The consequence is that if a new parameter is added to a DataFrame function, but not to the Series function, we will face again the same issue.

Pandas has so many complexity and inconstancy due to legacy, that I would favor breaking backward compatibility than adding one more special case in the code to work around legacy (admittedly easier to claim than to do).

I am not sure how calling the DataFrame function (if it exists) instead of splitting the DataFrame in Series and calling the Series function would break user code. Do you have specific examples in mind. Would a FutureWarning be an acceptable solution?

@rhshadrach
Copy link
Member

rhshadrach commented Nov 6, 2022

with your two options (if I get them correctly) is that you will create a specific handling of numeric_only in agg instead of passing the arg to the called function. The consequence is that if a new parameter is added to a DataFrame function, but not to the Series function, we will face again the same issue.

I agree, a better fix would be to call the DataFrame function, but it would at least fix the issue that the change to numeric_only created. This would be for 1.5.2. For this, I would lean toward intercepting rather than adding a new argument to DataFrame.agg.

Pandas has so many complexity and inconstancy due to legacy, that I would favor breaking backward compatibility

This seems to me to be a common use of agg, and it would change the results in every case - not just some instances. Because of how much user code I anticipate it would break, I disagree. But that is very much opinion based.

I am not sure how calling the DataFrame function (if it exists) instead of splitting the DataFrame in Series and calling the Series function would break user code. Do you have specific examples in mind. Would a FutureWarning be an acceptable solution?

Back when agg could have partial failure, it wasn't feasible to call the DataFrame methods and get the correct ordering of columns in the result. This has been deprecated, and I'm now working to enforce that deprecation. Once removed, I think it will be possible to call the DataFrame methods and reshape the result so that is it backward compatible. I'll take a look at this after the deprecation is enforced for 2.0.

Perhaps a FutureWarning would be an acceptable solution, but I am very hesitant with it. We could tell users what their result would change to, but they would have no way to adopt future behavior and silence the warning.

As for an example of how the output "naturally" differs - when using .agg('mean') vs .agg(['mean']) you currently get a transpose of the result (see the output above). If we were to switch to the DataFrame method (without reshaping it afterwards), the result of .agg(['mean']) would be nearly identical to .agg('mean'), except the former a DataFrame and the latter a Series.

@arnaudlegout
Copy link
Contributor Author

I agree, a better fix would be to call the DataFrame function, but it would at least fix the issue that the change to numeric_only created. This would be for 1.5.2. For this, I would lean toward intercepting rather than adding a new argument to DataFrame.agg.

I agree intercepting is better than adding an argument.

Back when agg could have partial failure, it wasn't feasible to call the DataFrame methods and get the correct ordering of columns in the result. This has been deprecated, and I'm now working to enforce that deprecation. Once removed, I think it will be possible to call the DataFrame methods and reshape the result so that is it backward compatible. I'll take a look at this after the deprecation is enforced for 2.0.

that would be great.

Perhaps a FutureWarning would be an acceptable solution, but I am very hesitant with it. We could tell users what their result would change to, but they would have no way to adopt future behavior and silence the warning.

I see.

@rhshadrach
Copy link
Member

Previously I failed to realize that we don't pass any args / kwargs through when a list is provided:

def foo(x, a=1, b=2):
    print(a, b)
    return x

df = pd.DataFrame({'x': [1, 2, 3]})
# 0 is for the axis argument; 5 is the first (and only) in args
print(df.agg([foo], 0, 5, b=6))
# 1 2
# 1 2
# 1 2

I think this is a bug, but not something to be fixed in 1.5.x. I've opened #50624 for this.

Given that we don't pass through any args / kwargs, I don't think we should make an exception for numeric_only. With that, the current warning is the correct one to give: the user must select the valid columns prior to calling agg.

FutureWarning: ['b'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.

@rhshadrach rhshadrach added the Closing Candidate May be closeable, needs more eyeballs label Jan 7, 2023
@datapythonista datapythonista modified the milestones: 1.5.3, 1.5.4 Jan 18, 2023
@datapythonista datapythonista removed this from the 1.5.4 milestone Feb 27, 2023
@datapythonista datapythonista added this to the 2.0 milestone Feb 27, 2023
@MarcoGorelli
Copy link
Member

moving off the 2.0 milestone as it's a regression from 1.5

@MarcoGorelli MarcoGorelli modified the milestones: 2.0, 2.1 Mar 27, 2023
@rhshadrach
Copy link
Member

I'm going to close this - the only remaining part of #49352 (comment) is captured in #45658. If there is any opposition, please reopen or comment below.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Closing Candidate May be closeable, needs more eyeballs Regression Functionality that used to work in a prior pandas version Warnings Warnings that appear or should be added to pandas
Projects
None yet
6 participants