Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: groupby with as_index=False shouldn't modify grouping columns #34012

Merged
merged 5 commits into from
May 27, 2020

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented May 5, 2020

This makes any groupby function that uses _GroupBy._make_wrapper to act on self._obj_with_exclusions rather than self._selected_obj, in parallel with the cython paths. Also similar to the cython paths, the grouping columns are added back onto the result in _wrap_applied_output. Similar remarks apply to nunique, which does not go through the _make_wrapper.

After finishing this, I found PR #29131. This PR does not change the behavior of calling apply() itself, only the previous code paths that would internally go through apply. Also, the change made there does not have any impact when as_index is False.

@rhshadrach rhshadrach force-pushed the as_index_false branch 3 times, most recently from cf8c847 to 4fd773c Compare May 10, 2020 19:08
pandas/tests/groupby/test_function.py Outdated Show resolved Hide resolved
@@ -803,6 +803,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrame.groupby` where a ``ValueError`` would be raised when grouping by a categorical column with read-only categories and ``sort=False`` (:issue:`33410`)
- Bug in :meth:`GroupBy.first` and :meth:`GroupBy.last` where None is not preserved in object dtype (:issue:`32800`)
- Bug in :meth:`Rolling.min` and :meth:`Rolling.max`: Growing memory usage after multiple calls when using a fixed window (:issue:`30726`)
- Bug in :meth:`DataFrame.groupby` when using ``as_index=False`` would modify the grouping column when used with ``idxmax``, ``idxmin``, ``mad``, ``nunique``, and ``skew`` (:issue:`21090`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make an api breaking section that shows a before / after (pick a function like nunique) to show what has changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don’t need this note as u have the one above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - missed this. Thanks for catching it.

@jreback jreback added Groupby API - Consistency Internal Consistency of API/Behavior labels May 12, 2020
@rhshadrach
Copy link
Member Author

@jreback Changes made, checks are green.

@jreback jreback added this to the 1.1 milestone May 13, 2020
def _python_apply_general(self, f):
keys, values, mutated = self.grouper.apply(f, self._selected_obj, self.axis)
def _python_apply_general(self, f, obj):
keys, values, mutated = self.grouper.apply(f, obj, self.axis)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u add a doc string and types here (at least for the added args)

Copy link
Member Author

@rhshadrach rhshadrach May 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While adding the docstring, I realized that the name "obj" here is very confusing as it is not the same as self.obj. I renamed to data, this is the name used in grouper.apply.

@rhshadrach
Copy link
Member Author

@jreback changes made and green.

@@ -1898,6 +1908,9 @@ def groupby_series(obj, col=None):

if not self.as_index:
results.index = ibase.default_index(len(results))
if results.ndim == 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this hit anywhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it is not. obj_with_exclusions is always a frame, so I've removed the series path in this function.


Using :meth:`DataFrame.groupby` with ``as_index=False`` and the function ``idxmax``, ``idxmin``, ``mad``, ``nunique``, or ``skew`` would modify the grouping column. Now, the grouping column remains unchanged. (:issue:`21090`)

.. ipython:: python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also changes nunique for as_index=True, can you put that example here as well (first)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do this in the same note as its very confusing to read this otherwise.

@rhshadrach
Copy link
Member Author

@jreback Changes made and tests pass. It seemed best to me to add a new api breaking section for nunique when as_index=True. While the code changes to nunique were the same for both issues, from an api standpoint I think they are different cases. If you'd like me to combine into one section though, I can do that.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good a few more comments on the doc-note.


Using :meth:`DataFrame.groupby` with ``as_index=False`` and the function ``idxmax``, ``idxmin``, ``mad``, ``nunique``, or ``skew`` would modify the grouping column. Now, the grouping column remains unchanged. (:issue:`21090`)

.. ipython:: python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do this in the same note as its very confusing to read this otherwise.

pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment otherwise lgtm.

@jreback jreback merged commit 333db4b into pandas-dev:master May 27, 2020
@jreback
Copy link
Contributor

jreback commented May 27, 2020

thanks @rhshadrach fixing groupby bugs sometimes are non-trivial in time, thanks for sticking with it and keep em coming!

@rhshadrach rhshadrach deleted the as_index_false branch July 11, 2020 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[QST] Clarify when agg() will return groups as index Different return type when using groupby with nunique
2 participants