
Updating _resolve_numeric_only function of GroupBy #43154

Merged: 38 commits merged into pandas-dev:master on Sep 9, 2021

Conversation

@kurchi1205 (Contributor) commented Aug 21, 2021

I am absolutely new to contributing to open source, so please guide me through it.

@pep8speaks commented Aug 21, 2021

Hello @kurchi1205! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-09-09 16:40:13 UTC

Updated PEP8 issues in groupby.py
Solved PEP8 issues in test_aggregate
@@ -113,7 +113,19 @@ def test_groupby_aggregation_mixed_dtype():
g = df.groupby(["by1", "by2"])
result = g[["v1", "v2"]].mean()
tm.assert_frame_equal(result, expected)
expected2 = DataFrame(
Contributor

make a new test

Contributor Author

I will work on that tomorrow

Contributor Author

I have developed a new test case.
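
For context, a minimal sketch of the kind of test under discussion (illustrative names and values, not the actual test added in this PR):

import pandas as pd
import pandas._testing as tm

def test_groupby_min_keeps_object_column():
    # Mixed-dtype frame: one numeric and one object column
    df = pd.DataFrame(
        {"key": ["a", "a", "b"], "num": [1, 2, 3], "txt": ["x", "y", "z"]}
    )
    # With numeric_only=False the object column takes part in the
    # reduction instead of being silently dropped
    result = df.groupby("key").min(numeric_only=False)
    expected = pd.DataFrame(
        {"num": [1, 3], "txt": ["x", "z"]},
        index=pd.Index(["a", "b"], name="key"),
    )
    tm.assert_frame_equal(result, expected)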

@@ -1119,7 +1120,24 @@ def _resolve_numeric_only(self, numeric_only: bool | lib.NoDefault) -> bool:
# i.e. not explicitly passed by user
if self.obj.ndim == 2:
# i.e. DataFrameGroupBy
numeric_only = True
# Checking if the dataframe has non-numeric features
Contributor

this is very complicated, see if you can do this in a simpler way

Contributor Author

OK, working on it.

Contributor Author

I have updated a simpler version

Member

I'll need to look at the linked issue more closely, but I'm very skeptical that this method is the right place for a fix.

Contributor Author

Explicitly passing numeric_only=False solves the error, so I looked into the _resolve_numeric_only function.
It only checked for a DataFrameGroupBy object and set numeric_only to True, which forced the non-numeric columns out of the aggregation. That's why I included checks for a mixed-dtype DataFrame.
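
To make the behaviour being discussed concrete, a small illustration (not code from the PR): with numeric_only=True non-numeric columns are dropped from the result, while numeric_only=False lets them take part in reductions that support them.

import pandas as pd

df = pd.DataFrame(
    {"key": ["a", "a", "b"], "num": [1, 2, 3], "txt": ["x", "y", "z"]}
)
gb = df.groupby("key")

# numeric_only=True: the object column "txt" is excluded from the result
print(gb.min(numeric_only=True).columns.tolist())   # ['num']

# numeric_only=False: "txt" is aggregated as well
print(gb.min(numeric_only=False).columns.tolist())  # ['num', 'txt']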

Member

The point of numeric_only is to force non-numeric columns out of the aggregation. Changing the behavior here "fixed" the case you tested, but it also changed a bunch of other behavior and broke a bunch of other tests.

Contributor Author

OK, I get your point. I will check for numeric columns; only if there are none should it set numeric_only to False. That should solve the issue and not violate the documentation.

Member

I'm not sure you do.

numeric_only is an argument passed by the user, or defaulting to True if not explicitly passed. We should never be changing it based on what dtypes are actually present.
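
For readers unfamiliar with the mechanism being described: pandas uses a no_default sentinel so a method can tell "not passed by the user" apart from an explicit True or False. A simplified, generic sketch of that pattern (not the actual pandas implementation):

class _NoDefault:
    """Sentinel meaning 'the caller did not pass this argument'."""

    def __repr__(self) -> str:
        return "<no_default>"


no_default = _NoDefault()


def resolve_numeric_only(numeric_only, is_frame: bool) -> bool:
    # Substitute a default only when the user passed nothing;
    # an explicit True/False is always honoured as given.
    if numeric_only is no_default:
        return is_frame  # frame-like groupbys default to True
    return numeric_only


print(resolve_numeric_only(no_default, is_frame=True))  # True (default applied)
print(resolve_numeric_only(False, is_frame=True))       # False (explicit value wins)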

Contributor Author

But why can't the default value be set by checking the datatypes? Why does it have to always be True?

@@ -10,6 +10,7 @@ class providing the base-class of operations.

from contextlib import contextmanager
import datetime
import numpy as np
Member

this is already imported below

@jbrockmendel (Member)

Commented on #43108, I think this is correct as-is.

Checks for the presence of numeric features in columns to be aggregated
Adding a new test case of only non_numeric features
Checking for empty dataframes passed in groupby
@jbrockmendel (Member)

Changing the default is something we can do, but that is an API change, not a bugfix, and would require a deprecation cycle.

@kurchi1205 (Contributor Author)

OK, so how do I go about that? I don't know how that works.

@jbrockmendel (Member)

> OK, so how do I go about that? I don't know how that works.

First there needs to be consensus that we want to change the behavior. That discussion seems to be in #43108. Then a PR that doesn't change the behavior, but does issue a warning that the behavior will change in a future version, e.g. #42738
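
As a rough illustration of what "issue a warning" means here (a generic sketch under assumed names, not the pandas code): the current behaviour is kept, but a FutureWarning tells users that the default will change.

import warnings

import pandas as pd


def mean_with_deprecation_warning(df: pd.DataFrame, numeric_only=None):
    # Behaviour is unchanged for now, but users are told the default will change.
    if numeric_only is None:
        warnings.warn(
            "The default value of numeric_only will change in a future version; "
            "pass numeric_only explicitly to silence this warning.",
            FutureWarning,
            stacklevel=2,
        )
        numeric_only = True
    return df.mean(numeric_only=numeric_only)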

@kurchi1205 (Contributor Author)

OK, so I will create a different pull request and issue a warning in groupby.py.

@simonjayhawkins (Member)

@kurchi1205 I've closed #43108 as a duplicate of #42395.

but it's okay to have tests for both issue reports. Can you also add the code sample from #42395?

@simonjayhawkins (Member) left a comment

@kurchi1205 can you add a release note.

@@ -1102,14 +1102,11 @@ def _wrap_applied_output(self, data, keys, values, not_indexed_same: bool = Fals
def _resolve_numeric_only(self, numeric_only: bool | lib.NoDefault) -> bool:
"""
Determine subclass-specific default value for 'numeric_only'.

Member

can you restore this whitespace. docstrings should start with a one-liner summary distinct from the expanded description.
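
That is, the numpydoc convention of a short summary line, a blank line, then the extended description, e.g. (reassembled from the diff shown in this thread):

def _resolve_numeric_only(self, numeric_only):
    """
    Determine subclass-specific default value for 'numeric_only'.

    For SeriesGroupBy we want the default to be False (to match Series behavior).
    For DataFrameGroupBy we want it to be True (for backwards-compat).
    """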

Contributor

I"ve done this in 45f54d6

For SeriesGroupBy we want the default to be False (to match Series behavior).
For DataFrameGroupBy we want it to be True (for backwards-compat).

Member

restore this one too.

Contributor

I"ve done this in 45f54d6

Parameters
----------
numeric_only : bool or lib.no_default

Member

and this

Contributor

I"ve done this in 45f54d6


# error: Incompatible return value type (got "Union[bool, NoDefault]",
# expected "bool")
return numeric_only # type: ignore[return-value]
Member

can you restore this too. The mypy error needs to be ignored as mypy does not know that lib.no_default is a singleton.
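
The underlying typing issue, sketched generically (illustrative, not the pandas source): the parameter is annotated as a union with the sentinel type, and after the identity check mypy does not narrow it to bool, because it cannot know there is only one NoDefault instance.

from typing import Union


class NoDefault:
    pass


no_default = NoDefault()


def resolve(flag: Union[bool, NoDefault]) -> bool:
    if flag is no_default:
        return True
    # mypy still sees Union[bool, NoDefault] here, hence the ignore.
    return flag  # type: ignore[return-value]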

Contributor Author

> @kurchi1205 can you add a release note.

I assume the release note should be added in v1.4.0.rst

@Dr-Irv (Contributor) Sep 9, 2021

> can you restore this too. The mypy error needs to be ignored as mypy does not know that lib.no_default is a singleton.

I"ve done this in 45f54d6

Contributor

> @kurchi1205 can you add a release note.

> I assume the release note should be added in v1.4.0.rst

We discussed at pandas dev meeting today, and put this in v1.3.3.rst

pandas/core/groupby/groupby.py (review thread resolved)
pandas/tests/groupby/test_function.py (outdated, review thread resolved)
@Dr-Irv (Contributor) commented Sep 9, 2021

> @kurchi1205 can you add a release note.

I"ve done this in 45f54d6

@Dr-Irv (Contributor) commented Sep 9, 2021

@jbrockmendel @simonjayhawkins @jreback Based on discussion at today's meeting, I updated the PR based on previous comments. Please review to see if it's good to go for 1.3.3.

@jreback (Contributor) commented Sep 9, 2021

lgtm. In theory this might have a small penalty, but it's OK as we need the correct results.

@Dr-Irv (Contributor) commented Sep 9, 2021

> lgtm. In theory this might have a small penalty, but it's OK as we need the correct results.

There's a failure https://github.com/pandas-dev/pandas/pull/43154/checks?check_run_id=3551554045#step:7:70 that needs to be investigated before we can merge.

@Dr-Irv (Contributor) commented Sep 9, 2021

> There's a failure https://github.com/pandas-dev/pandas/pull/43154/checks?check_run_id=3551554045#step:7:70 that needs to be investigated before we can merge.

I put something in to address this failure related to looking for the FutureWarning. See d46091e
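
For context, warning-related assertions in pandas tests are typically written with pandas._testing.assert_produces_warning; a minimal, generic sketch of that kind of check (not the actual change in d46091e):

import warnings

import pandas._testing as tm


def test_emits_future_warning():
    # The block fails if no FutureWarning is raised inside it.
    with tm.assert_produces_warning(FutureWarning):
        warnings.warn("the default will change", FutureWarning)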

@Dr-Irv (Contributor) commented Sep 9, 2021

@jreback I think this is good to go. Two CI failures. One is an issue in building the docs due to a connection failure. The other is the Windows timeout.

@jreback (Contributor) left a comment

lgtm thanks @kurchi1205 and @Dr-Irv

@jreback jreback merged commit 7d790cf into pandas-dev:master Sep 9, 2021
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Sep 9, 2021
jreback pushed a commit that referenced this pull request Sep 9, 2021 (#43481)

Co-authored-by: Prerana Chakraborty <40196782+kurchi1205@users.noreply.github.com>
@jreback (Contributor) commented Sep 9, 2021

@simonjayhawkins IIRC there were some duplicates of the original issue; we can now check whether they are solved. Can you link them?

@Dr-Irv (Contributor) commented Sep 9, 2021

> @simonjayhawkins IIRC there were some duplicates of the original issue; we can now check whether they are solved. Can you link them?

There was the original issue #42395 and a "duplicate" that I created #43108. @kurchi1205 added tests from both of those. Not aware of any others.

@simonjayhawkins (Member)

> @simonjayhawkins IIRC there were some duplicates of the original issue; we can now check whether they are solved. Can you link them?

> There was the original issue #42395 and a "duplicate" that I created #43108. @kurchi1205 added tests from both of those. Not aware of any others.

There was also #43209 that was a regression from the same commit #41706, but it doesn't look like the change in this PR has fixed that one.

@simonjayhawkins (Member)

this test was added in #41706

sorry this was changed... from #43154 (comment)

(there was also a change to test_cython_agg_nothing_to_agg which I'm not sure about. It looks like a case that used to raise now returns an empty DataFrame; I haven't checked whether that is a bugfix or an API change, so I'm considering that one out of scope for now)

I'll open a new issue rather than continue the discussion here.
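
For readers, the kind of case referred to above is roughly the following (an assumed reproduction, not taken from the actual test): a groupby reduction over a frame with no numeric columns at all.

import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "txt": ["x", "y", "z"]})

# Nothing here is numeric, so with numeric_only=True there is nothing to
# aggregate; depending on the pandas version this either raises or returns
# an empty result indexed by the group keys.
result = df.groupby("key").sum(numeric_only=True)
print(result)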

@simonjayhawkins (Member)

> I'll open a new issue rather than continue the discussion here.

#43501

Labels: Groupby, Nuisance Columns (Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply)
Projects: None yet
7 participants