ENH: Add numeric_only to groupby frame ops #46524
Comments
I like the idea. Have you checked that there would then be consistency between
Makes sense to add the keyword for consistency with DataFrame ops, and it also makes sense to make the default False again for consistency. If the reason is to allow users to operate on object columns with numeric data, then we should probably also be consistent about the return type across the different ops, as there currently seems to be an inconsistency: either keep the object columns with numeric data as object dtype, as quantile does, or cast them, e.g. to float, as mean does.
Side-note: quantile's pre-processing is a bit of a mess. That doesn't invalidate anything said about it here, but I'm optimistic at least the clarity of the situation will improve. Generally +1 on the OP's idea to make things more consistent, but I wonder whether the eventual default should be numeric_only=False, to match effectively everything else?
Thanks - I was under the impression that these ops would always work as numeric_only=False, but I am seeing some odd behavior. I've opened #46560 to track and plan to investigate this after.
Agreed, I've added this to #46560
I think we're on the same page here - the plan is to default to True in 1.5 (so as to make it non-breaking) with a FutureWarning that they will default to False in 2.0.
Once #46072 is implemented, many groupby ops will default to numeric_only=False in 2.0. However, there are a number of groupby ops which can only ever work on numeric data. For API consistency, I believe these ops should raise when a user tries to operate on non-numeric columns. Consider the example
which gives the output
If a user has a numeric column that accidentally ends up as object dtype, the result will be silently missing expected columns. This is why I think we should run the op with all provided data, regardless of whether it is numeric or not.
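To make the silent-drop failure mode concrete, here is a minimal sketch. The frame and its column names are made up for illustration and are not the example from the original post; mean is used only because it already accepts numeric_only.

```python
import pandas as pd

# Illustrative data: "y" holds integers but has accidentally ended up as object dtype.
df = pd.DataFrame(
    {
        "key": ["a", "a", "b"],
        "x": [1.0, 2.0, 3.0],
        "y": pd.Series([10, 20, 30], dtype=object),
    }
)

# With numeric_only=True the object column is dropped without any error.
result = df.groupby("key").mean(numeric_only=True)
print(list(result.columns))  # ['x']

# Once the dtype is repaired, the column shows up in the result again.
fixed = df.astype({"y": "int64"}).groupby("key").mean(numeric_only=True)
print(list(fixed.columns))  # ['x', 'y']
```

Nothing in the first result tells the user that "y" was expected but skipped, which is the motivation for running the op on all provided columns and raising where that is not possible.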
The following groupby ops have no numeric_only argument and act like numeric_only=True, but only make sense on numeric data.
The following groupby ops have no numeric_only argument and act like numeric_only=True, but make sense on non-numeric data.
For both groups of ops, I propose we add the numeric_only argument defaulting to True in 1.5, which emits a warning message that it will default to False in the future. The warning would only be emitted if setting numeric_only to True/False would give rise to different output; i.e. if there are non-numeric columns that could have been operated on.
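The conditional warning described above could look roughly like the sketch below. This is not how pandas implements it; resolve_numeric_only, the _NO_DEFAULT sentinel, and the warning text are all hypothetical names chosen for illustration, and the check simply looks for non-key columns with a non-numeric dtype.

```python
import warnings

import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical sentinel meaning "the caller did not pass numeric_only".
_NO_DEFAULT = object()


def resolve_numeric_only(df, keys, numeric_only=_NO_DEFAULT):
    """Hypothetical helper: return the effective numeric_only value.

    Warns only when the caller relied on the default AND there are
    non-numeric value columns, i.e. when True and False could differ.
    """
    if numeric_only is not _NO_DEFAULT:
        # An explicit choice is respected and never warns.
        return bool(numeric_only)
    non_numeric = [
        col
        for col in df.columns
        if col not in keys and not is_numeric_dtype(df[col])
    ]
    if non_numeric:
        warnings.warn(
            "The default of numeric_only will change to False in a future "
            f"version; columns {non_numeric} are currently dropped. "
            "Pass numeric_only explicitly to silence this warning.",
            FutureWarning,
            stacklevel=2,
        )
    return True


mixed = pd.DataFrame({"key": ["a", "b"], "x": [1, 2], "s": ["u", "v"]})
all_numeric = mixed[["key", "x"]]

resolve_numeric_only(all_numeric, ["key"])               # no warning
resolve_numeric_only(mixed, ["key"], numeric_only=True)  # no warning
resolve_numeric_only(mixed, ["key"])                     # emits FutureWarning
```

Using a sentinel rather than True as the declared default is what lets the helper distinguish "user asked for True" from "user said nothing", so only the second case warns.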
It's not ideal to add an argument and deprecate its default value in the same minor release (assuming 1.5 is the last minor release in the 1.x series); however, I believe it will be of minor impact to users. The alternatives would be not carrying out the deprecation of numeric_only=True, or leaving these ops behaving as if numeric_only=True (with no numeric_only argument). Both of these seem like worse alternatives to me.
cc @jreback @jbrockmendel @jorisvandenbossche @simonjayhawkins @Dr-Irv