BUG: inconsistent DataFrame.agg behavior when passing as kwargs numeric_only=True
#49352
Comments
take |
Marking as a regression since users shouldn't be seeing a FutureWarning in this case. Thanks for the report! The way We should be applying |
@rhshadrach I'm not sure that we intend to support We support functions that take a The documentation for Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. Accepted combinations are:
Not sure we could detect that |
@Dr-Irv this issue is about the warning that gets thrown after passing a list of functions. Correct me if I'm wrong with something but I think handling that keyword seems to be appropriate behavior and should be fixed. |
So the thing to work on is that |
From a user perspective, the doc of The current doc for If Concerning using names for functions, this is even more confusing because we have no idea which function is actually called, so no idea which parameter we can pass to this function. I never found a satisfactory documentation explaining which heuristic This is a bit orthogonal to the current bug report, but the issue I reported has been discovered by making tests to understand how arguments are passed to functions, without much success. |
I agree with you here. Would you mind opening up a separate issue regarding this documentation/behavior of |
If I got it correctly, from The passed function must accept either a Series (and return a scalar) or a DataFrame (and return a Series). If function names are passed, the called function will be the first function found according to the MRO from a DataFrame object. @Dr-Irv yes, I will open an issue for this documentation clarification. |
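To make the "Series in, scalar out" case from the comment above concrete, here is a minimal sketch. The frame `df`, the function `spread`, and its `scale` keyword are hypothetical names chosen for illustration; the sketch assumes a single callable is applied column by column, with keyword arguments forwarded to it.

```python
import pandas as pd

# Hypothetical data: two integer columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 40]})


def spread(col, scale=1):
    """Take one column (a Series) and return a scalar: its range times `scale`."""
    return (col.max() - col.min()) * scale


# A single callable receives each column in turn; `scale=2` is forwarded to it.
result = df.agg(spread, scale=2)
print(result)
```

The result is a Series indexed by column name, one scalar per column.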
Just opened a documentation issue for the discussed subject here #49528 |
Thanks @Dr-Irv - I agree users shouldn't be relying Regarding
I've been of the opinion that this transpose should not happen - rather that If we keep the same results as they are today, then I only see two options (but happy to consider others):
|
@rhshadrach what is annoying with your two options (if I get them correctly) is that you will create a specific handling of Pandas has so much complexity and inconsistency due to legacy that I would favor breaking backward compatibility over adding one more special case in the code to work around legacy (admittedly easier to claim than to do). I am not sure how calling the DataFrame function (if it exists) instead of splitting the DataFrame into Series and calling the Series function would break user code. Do you have specific examples in mind? Would a FutureWarning be an acceptable solution? |
I agree, a better fix would be to call the DataFrame function, but it would at least fix the issue that the change to
This seems to me to be a common use of agg, and it would change the results in every case - not just some instances. Because of how much user code I anticipate it would break, I disagree. But that is very much opinion based.
Back when agg could have partial failure, it wasn't feasible to call the DataFrame methods and get the correct ordering of columns in the result. This has been deprecated, and I'm now working to enforce that deprecation. Once removed, I think it will be possible to call the DataFrame methods and reshape the result so that it is backward compatible. I'll take a look at this after the deprecation is enforced for 2.0. Perhaps a FutureWarning would be an acceptable solution, but I am very hesitant with it. We could tell users what their result would change to, but they would have no way to adopt future behavior and silence the warning. As for an example of how the output "naturally" differs - when using |
I agree intercepting is better than adding an argument.
that would be great.
I see. |
Previously I failed to realize that we don't pass any args / kwargs through when a list is provided:
I think this is a bug, but not something to be fixed in 1.5.x. I've opened #50624 for this. Given that we don't pass through any args / kwargs, I don't think we should make an exception for
|
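Since keyword arguments are not forwarded when a list is given, one way to get numeric-only results without relying on `numeric_only=True` is to select the numeric columns first. This is a sketch of a possible workaround, not something proposed in the thread; the frame `df` is hypothetical.

```python
import pandas as pd

# Hypothetical frame: one numeric column and one string column.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": ["x", "y", "z"]})

# Drop non-numeric columns up front, then aggregate with a list of names.
numeric = df.select_dtypes(include="number")
result = numeric.agg(["mean", "std"])
print(result)
```

Here `result` has one row per function name and only the numeric column `a`.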
moving off the 2.0 milestone as it's a regression from 1.5 |
I'm going to close this - the only remaining part of #49352 (comment) is captured in #45658. If there is any opposition, please reopen or comment below. |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
I expect DataFrame.agg to accept kwargs arguments and pass them to the aggregation functions. It works in a seemingly inconsistent way.
df.agg('mean', numeric_only=True) works as expected.
But passing multiple functions as strings in a list does not work as expected: we still get a FutureWarning, showing that the argument numeric_only=True was not correctly passed to either mean, or std, or both.
Even more surprising, using the function pd.DataFrame.mean raises a NotImplementedError for Series.mean.
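The original reproducible example did not survive in this copy of the thread. As an illustration only, the three call forms described above can be sketched as follows. The frame `df` and the helper `try_agg` are hypothetical names, and the warnings or exceptions raised depend on the installed pandas version, so the sketch reports them rather than assuming a particular outcome.

```python
import warnings

import pandas as pd

# Hypothetical frame: one numeric column and one string column.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": ["x", "y", "z"]})


def try_agg(label, func):
    """Call df.agg(func, numeric_only=True) and report what happened."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        try:
            result = df.agg(func, numeric_only=True)
            outcome = f"returned {type(result).__name__}"
        except Exception as exc:  # version-dependent, so catch broadly
            outcome = f"raised {type(exc).__name__}"
    categories = sorted({w.category.__name__ for w in caught})
    print(f"{label}: {outcome}; warnings: {categories}")
    return outcome


# Single function name: the keyword reaches DataFrame.mean.
single = df.agg("mean", numeric_only=True)
print(single)

# The two forms the report describes as misbehaving.
list_outcome = try_agg("list of names", ["mean", "std"])
method_outcome = try_agg("method reference", pd.DataFrame.mean)
```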
Expected Behavior
Kwargs arguments given to DataFrame.agg must be correctly passed to all aggregation functions, however they are given (string name or method reference).
Installed Versions
commit : 91111fd
python : 3.11.0.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : fr_FR.cp1252
pandas : 1.5.1
numpy : 1.23.4
pytz : 2022.5
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None